What Is a Bloom Filter? Working, Functions, and Applications
Bloom filters check if an element is most definitely not in a dataset through hashing functions and an array of bits.
- A Bloom filter is a data structure used to check whether an element is definitely not in a dataset by using one or more hash functions and an array of bits. It is called a filter because it acts as a preliminary test for data entry.
- This type of algorithm is widely used in scenarios where false positives are acceptable but false negatives are not. For instance, it is okay (if cumbersome) to raise a false alarm for a security event that has not happened but unacceptable to overlook a real event (i.e., false negative).
- This article explains how Bloom filters work and their key applications in areas such as cybersecurity, P2P networking, and everyday UX and authentication operations.
What Is a Bloom Filter?
A Bloom filter is a data structure used to check whether an element is definitely not in a dataset by using one or more hash functions and an array of bits. It is called a filter because it acts as a preliminary test for data entry.
Figure: 3-Step Bloom Filter Process: Hashing and Insertion, Lookup, and Search Result (Source: ResearchGate)
In numerous situations in computer science, we look for a small quantity of data stored in an enormous reservoir. The task of a software engineer is to optimize this search. They constantly look for new data structures, technologies, and processes to make the search process work with minimal latency and high throughput. A Bloom filter assists in optimizing the search operation in specific use cases.
Let’s assume you are setting up a new account on a social media website to communicate with your peers. When you input a username, a message saying, “Sorry, that username is already in use,” appears. You add your date of birth to the username, but to no effect. Here, a Bloom filter algorithm comes into play.
It checks whether the username might already be taken. If the filter answers NO, the username is definitely new; otherwise, the same username, or one that collides with it, may have been entered before.
A Bloom filter is a space-efficient probabilistic data structure used to determine whether an item is a member of a set. This makes it well suited to detecting duplicate events. Checking the availability of a username is an example of a set membership problem, wherein the set consists of all registered usernames.
In the realm of big data, content is generated at a rate that makes it difficult to process it efficiently. Using algorithms like Bloom filters, we can rapidly identify and eliminate identical events or information, making datasets more manageable.
To understand Bloom filters better, let’s first look at the concept of hashing.
A hash is similar to a data fingerprint. A hash function accepts data of any length as input and returns an identifier that is generally shorter and of fixed length, which can be used to index, compare, or recognize the data.
In other words, hashing algorithms are processes that produce an output of fixed length (the hash, or hash value) from a specified input (the data being hashed). The hash value becomes a compact representation of the data itself.
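As a small illustration of the fingerprint idea, the Python sketch below hashes a short and a long input. The choice of SHA-256 from the standard hashlib module and the helper name fingerprint are assumptions made for this example; the article does not prescribe a particular hash function.

```python
import hashlib

def fingerprint(data: str) -> str:
    """Return a fixed-length hexadecimal digest for input of any length."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

short = fingerprint("John Doe")
long = fingerprint("John Doe" * 1000)

# Both digests have the same length regardless of the input size.
print(len(short), len(long))  # 64 64
# Hashing the same input again yields the same digest.
print(fingerprint("John Doe") == short)  # True
```

The fixed output length is what lets a Bloom filter turn arbitrary data into a position in a fixed-size array.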
A Bloom filter algorithm maps the hash value to a position in a bit array of fixed size and sets that bit, thereby “remembering” that the value was entered. When the user runs a lookup operation, the algorithm checks whether the same position was set before. It returns a definite NO only when the data is completely new; otherwise, it reports that the data was possibly entered before.
Bloom filters can be of various types:
- Compressed Bloom filters
- Spectral Bloom filters
- Space code Bloom filters
- Decaying Bloom filters
The development of Bloom filters
Burton Howard Bloom, a developer, designed Bloom filters in 1970. Bloom, an MIT Computer Science graduate, designed the filters to serve as a space-efficient probabilistic data structure that helps you determine whether a piece of data is an element of a set.
The objective was to build a membership test by applying hashing algorithms. The resulting structure can respond with certainty when the element being examined is not a member of the set, and otherwise reports that it has a chance of being a member.
Pros and cons of Bloom filters
The algorithm can detect duplicate occurrences across various databases and data categories. Let’s examine a few advantages offered by a Bloom filter.
- During insertion and lookup, the time complexity of a Bloom filter is O(k), where k is the number of hash functions used. Time complexity describes how the time required to execute an algorithm grows with its input; here, it does not depend on how many elements are already stored.
- Bloom filters have a space complexity of O(m), where m is the number of bits in the array. Space complexity describes the memory an algorithm or program requires. For a Bloom filter, it is fixed when the filter is created and does not grow as elements are added.
- Unlike a typical hash table, which uses a single hash function, a Bloom filter employs several hash functions to reduce the impact of hash collisions. However, this is not a failsafe.
Nevertheless, there are a few major drawbacks to using Bloom filters:
- There are incorrect outcomes in the form of false positives. The filter cannot always accurately determine whether an element exists in the collection: it may report that an absent element is present. It never produces a false negative, however.
- Only a probable membership answer can be retrieved from the array, not the original data.
- The greater the number of hash functions, the slower the Bloom filter. However, if you use too few, you may experience excessive false positives.
Inflexibility is a further disadvantage. Regardless of whether the Bloom filter is just a few bits or hundreds of thousands of bits, its size must be fixed when it is created. Once the size has been set, the filter will not shrink or expand. For the Bloom filter to be effective, the amount of data that will be added must be known or estimated in advance.
Therefore, if that amount is unknown, the Bloom filter may be created with too few bits, making it less effective at handling the data. Or an enormous Bloom filter may be created for a small quantity of data, wasting storage space.
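To make the sizing problem concrete, here is a hedged Python sketch that estimates a bit-array size and hash count from an expected number of items and a target false positive rate. It relies on the commonly cited approximations m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, which are not stated in this article; the function name suggest_parameters is hypothetical.

```python
import math

def suggest_parameters(expected_items: int, target_fp_rate: float) -> tuple[int, int]:
    """Suggest a bit-array size m and a hash count k.

    Assumes the standard approximations:
      m = -n * ln(p) / (ln 2)^2   and   k = (m / n) * ln 2
    """
    if expected_items <= 0 or not 0.0 < target_fp_rate < 1.0:
        raise ValueError("expected_items must be positive and 0 < target_fp_rate < 1")
    m = math.ceil(-expected_items * math.log(target_fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m / expected_items * math.log(2)))
    return m, k

# One million expected items with a 1% target false positive rate.
m, k = suggest_parameters(1_000_000, 0.01)
print(m, k)
```

For these inputs, the estimate comes to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions, which shows why a poor guess at the item count leads either to wasted space or to a high false positive rate.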
How Does a Bloom Filter Work?
Let us unpack the workings of a Bloom filter. Under the surface, a Bloom filter is nothing more than an array of bits wherein all bits are initially set to zero. Assume a Bloom filter with a length of 19 bits. The Bloom filter allows two types of actions as part of its functionality: insertion and lookup.
Here are the steps involved in the working of a Bloom filter:
1. Accept the input
The first step is to accept the input. In our example, let’s assume that the input is a string containing the text “John Doe.”
2. Calculate the hash value
Next, the algorithm performs hashing to convert John Doe into a corresponding numerical value. For the sake of our example, let’s assume that the value is 1355. The actual value is computed as per hashing algorithms, which vary in complexity.
3. Mod the hash by the array length
The next step is to mod the hash value by the length of the array (mod returns the remainder of a division). In programming, mod is commonly denoted by %. When we apply the mod operation to the hash of John Doe, 1355, we get an index within the bounds of the bit array.
1355 % 19 = 6
4. Insert the hash
We do not store the hash itself. Instead, we use the mod result as an index into the array and set that bit. Therefore, the bit at index 6 in the array goes from 0 to 1.
5. Search for the value (i.e., lookup)
Steps 2 and 3 are performed again as part of the lookup process. This time, the algorithm checks the bit in the array at the index given by the mod result. If the bit is 0, the input cannot belong to the set. If the bit is 1, the input may be an element of the set. The operation (e.g., setting a password or creating an email ID) is allowed outright only when the bit is 0.
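The five steps can be sketched in a few lines of Python. This is a minimal illustration with a single hash function and a 19-bit array, as in the walkthrough. The use of SHA-256 and the helper names are assumptions, so the real hash of “John Doe” will differ from the illustrative value 1355.

```python
import hashlib

ARRAY_LENGTH = 19            # array size used in the walkthrough
bits = [0] * ARRAY_LENGTH    # all bits start at zero

def hash_value(text: str) -> int:
    """Step 2: convert the input into a number (SHA-256 is an assumption)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def index_for(text: str) -> int:
    """Step 3: mod the hash by the array length to get a valid index."""
    return hash_value(text) % ARRAY_LENGTH

def insert(text: str) -> None:
    """Step 4: set the bit at the computed index to 1."""
    bits[index_for(text)] = 1

def might_contain(text: str) -> bool:
    """Step 5: a 0 bit means definitely absent; a 1 bit means possibly present."""
    return bits[index_for(text)] == 1

# The article's illustrative hash value of 1355 lands at index 6.
print(1355 % ARRAY_LENGTH)  # 6

insert("John Doe")  # Step 1: accept the input
print(might_contain("John Doe"))  # True: an inserted item is never reported absent
```

With only 19 bits and one hash function, unrelated inputs will often share an index, which is exactly how false positives arise.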
How do false positives in Bloom filters work?
A Bloom filter is a data structure that is both space- and time-efficient. However, this efficiency comes at the expense of a probabilistic answer.
A false positive occurs when the filter reports that a key may be present even though the key was never inserted. It means that looking for an element that does not exist can return an incorrect result. Nevertheless, the filter will never report that an inserted key is absent; it is completely devoid of false negatives.
False positives occur because of collisions. A collision is a coincidental match that occurs when a hashing algorithm generates an identical hash value for two different data elements, or when two different hash values map to the same array index after the mod operation. Multiple hash functions can be used to reduce the effect of collisions. Instead of setting a single bit for a single input, several bits are set, and a lookup reports a possible match only if all of them are 1. However, this can slow down the algorithm.
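The sketch below shows one way to set several bits per input. It simulates k hash functions by prefixing the item with a different salt before applying SHA-256. Both the salting scheme and the class name BloomFilter are assumptions for illustration, not a reference implementation.

```python
import hashlib

class BloomFilter:
    """A small Bloom filter sketch with k salted SHA-256 hash functions."""

    def __init__(self, size: int, num_hashes: int) -> None:
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _indexes(self, item: str) -> list[int]:
        # Derive k indexes by prefixing the item with a per-function salt.
        indexes = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            indexes.append(int.from_bytes(digest[:8], "big") % self.size)
        return indexes

    def add(self, item: str) -> None:
        for index in self._indexes(item):
            self.bits[index] = 1

    def might_contain(self, item: str) -> bool:
        # Every one of the k bits must be set; a single 0 proves absence.
        return all(self.bits[index] for index in self._indexes(item))


bloom = BloomFilter(size=1000, num_hashes=3)
for name in ["alice", "bob", "carol"]:
    bloom.add(name)

print(bloom.might_contain("alice"))    # True: inserted items are always found
print(bloom.might_contain("mallory"))  # usually False, but a false positive is possible
```

A lookup now fails as soon as any one of the k bits is 0, so an unrelated input must collide on all k positions to produce a false positive.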
Functions of a Bloom Filter
Bloom filters are data structures offering only two capabilities:
1. Insert an element into a set
To add an element, it is hashed with one or more hash functions. As explained in the previous section, each hash value is converted into an array index, and the bit at that index is set to 1.
2. Query whether an element is in a set
When a query is posed to determine whether a specific data item exists, the item is hashed in the same way, and the bits at the resulting indexes are examined. This is called the lookup process.
The distinguishing characteristic of Bloom filters is that when the response to a query is “YES,” it may still be inaccurate. However, answers of “NO” are always correct. The incorrect “YES” responses are a matter of probability. That probability can be expressed as a function of the number of elements in the collection, the size of the Bloom filter, and a parameter k, the number of hash functions.
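As a rough guide to how these three quantities interact, the Python sketch below evaluates the widely used approximation p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted elements, m the number of bits, and k the number of hash functions. The article does not give this formula, and the function name false_positive_rate is hypothetical.

```python
import math

def false_positive_rate(num_items: int, num_bits: int, num_hashes: int) -> float:
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes

# Same filter (10,000 bits, 3 hash functions) at increasing load.
for n in (100, 1_000, 10_000):
    print(n, round(false_positive_rate(n, num_bits=10_000, num_hashes=3), 4))
```

The estimate grows quickly as the filter fills up: with 1,000 items it is under 2%, while with 10,000 items in the same 10,000 bits most lookups for absent items would return “YES.”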
In addition to the two functions we discussed, Bloom filters have certain properties that determine their functionalities:
- Unlike a hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements. This is a benefit of this type of algorithm.
- Bloom filters never yield false negative results. They only generate false positives. This makes them useful for applications such as cybersecurity, where one would always rather err on the side of caution.
- Adding an element never fails for lack of space. However, as elements are added, the rate of false positives increases until all bits within the filter are set to 1. After this point, every query will return a positive result.
- They can save resources. Numerous well-known databases implement Bloom filters to reduce expensive disk lookups for nonexistent rows or columns. PostgreSQL, Apache Cassandra, Cloud Bigtable, etc., use this technique.
- Most structures, such as a simple array or a linked list, necessitate storing the item itself, which consumes memory. Bloom filters do not retain the data item at all. They calculate hash values and record presence or absence as 1 or 0 in the array.
- We cannot remove an element from a standard Bloom filter, because clearing a bit might also erase evidence of other elements that share it. Hash maps, tries, simple arrays, and linked lists are better suited for deleting items.
- Bloom filters rely on hashing schemes available in five varieties: the ideal hash function, double hashing, partitioned hashing, multiple hashing, and basic hash functions.
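Of the hashing schemes listed, double hashing is easy to sketch. One common reading of the term, assumed here, is that all k indexes are derived from just two base hash values as (h1 + i * h2) mod m, instead of computing k independent hashes. Taking both base values from a single SHA-256 digest is a further assumption made for brevity.

```python
import hashlib

def double_hash_indexes(item: str, num_hashes: int, size: int) -> list[int]:
    """Derive num_hashes indexes from two base hashes: (h1 + i * h2) mod size."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    # Make h2 odd so it is never zero; a zero step would map every index to h1.
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % size for i in range(num_hashes)]

print(double_hash_indexes("John Doe", num_hashes=4, size=19))
```

The appeal of this approach is cost: only one digest is computed per item, no matter how many indexes the filter needs.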
Applications of Bloom Filters
Now that we know how Bloom filters work and their advantages and limitations, let us explore the use cases. The top eight applications of Bloom filters include:
1. Checking for email ID availability
Let’s assume you are setting up a new Gmail account. Google must determine whether or not the ID you have provided is available. Now, there are specific methods for doing this.
It could examine all the existing email addresses in its data repository (tens of thousands of datasets and cache servers) to determine whether or not a particular ID already exists. Imagine, however, that Gmail already stores billions of email addresses. Is it practicable to scan countless servers for every new email address? Bloom filters permit the system to quickly estimate whether an ID is already taken.
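A hedged sketch of this pattern follows. The Bloom filter answers most “is this ID free?” questions on its own, and the authoritative store is consulted only when the filter says “possibly taken.” The in-memory set standing in for the database, the parameters, and all function names are assumptions for illustration and do not describe how Gmail actually works.

```python
import hashlib

SIZE, NUM_HASHES = 10_000, 4   # illustrative parameters, not tuned
bits = bytearray(SIZE)          # one byte per bit, for readability
registered = set()              # stand-in for the authoritative database
stats = {"db_lookups": 0}       # counts how often the "database" is consulted

def indexes(item: str) -> list[int]:
    """Compute NUM_HASHES array indexes using salted SHA-256 (an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % SIZE
        for i in range(NUM_HASHES)
    ]

def register(user_id: str) -> None:
    registered.add(user_id)
    for index in indexes(user_id):
        bits[index] = 1

def is_available(user_id: str) -> bool:
    # A single 0 bit proves the ID was never registered: no database access needed.
    if not all(bits[index] for index in indexes(user_id)):
        return True
    # All bits are set: the ID is possibly taken, so confirm with the database.
    stats["db_lookups"] += 1
    return user_id not in registered

register("john.doe")
print(is_available("john.doe"))       # False: the ID is taken
print(is_available("jane.roe.1990"))  # True: never registered
print(stats["db_lookups"])
```

Because a false positive only triggers an extra database check, the final answer stays correct while most lookups for unused IDs never reach the database.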
2. Ensuring the security level of a suspicious URL
Imagine you’re using a cloud-based security system that prevents you from viewing malicious URLs. This service could store an archive of billions of potentially hazardous URLs and process several million requests every minute worldwide. In this situation, looking up every web address in the database or cache is impractical. Bloom filters provide a probabilistic way to rapidly determine whether a URL is safe (i.e., definitely not stored in the database).
3. Recommending new content
Each blog post on a website like Medium has a unique identifier and is stored in a database table. However, the table is too large and too frequently accessed to fit on a single machine. Therefore, before a particular story is recommended to the user, the algorithm must determine whether it has already been suggested or read. The Bloom filter comes into play at this point.
4. Saving storage space on social media platforms
Facebook employs Bloom filters to prevent what is known as a “flash in the pan” or a one-hit wonder. One-hit wonders are web objects that users request only once. For instance, results for a popular query such as “coding” tend to be kept in local storage. However, if something is requested only once, such as “giraffe,” it shouldn’t be kept locally, given that it is a classic instance of a one-hit wonder. By using a Bloom filter to recognize a web object’s second request and storing the object only then, one can keep one-hit wonders out of local storage.
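The “store it only after the second request” rule can be sketched as follows. This is an illustrative toy, not a description of any production cache: the dictionary standing in for local storage, the placeholder fetch_from_origin function, and the filter parameters are all assumptions.

```python
import hashlib

SIZE, NUM_HASHES = 10_000, 4    # illustrative parameters, not tuned
seen_once = bytearray(SIZE)      # Bloom filter of keys requested at least once
cache: dict[str, str] = {}       # stand-in for local storage

def indexes(key: str) -> list[int]:
    """Compute NUM_HASHES array indexes using salted SHA-256 (an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()[:8], "big") % SIZE
        for i in range(NUM_HASHES)
    ]

def fetch_from_origin(key: str) -> str:
    """Hypothetical placeholder for the expensive fetch from the backend."""
    return f"content for {key}"

def handle_request(key: str) -> str:
    if key in cache:
        return cache[key]
    content = fetch_from_origin(key)
    positions = indexes(key)
    if all(seen_once[p] for p in positions):
        # The key was probably requested before: admit it to local storage.
        cache[key] = content
    else:
        # First request: remember only that the key was seen, not its content.
        for p in positions:
            seen_once[p] = 1
    return content

handle_request("coding")
handle_request("coding")   # second request: now stored locally
handle_request("giraffe")  # one-hit wonder: stays out unless a false positive occurs
print(sorted(cache))
```

A false positive here merely admits an object one request early, so the filter trades a small amount of wasted cache space for not having to track every key exactly.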
5. Detecting weak passwords
Here, a system may maintain a Bloom filter built from a list of insecure passwords. When a new user is added, the password is checked against the Bloom filter, and whenever a potential match is found, the user is notified. When a new user inputs a password or a current user modifies their password, the list can be updated. Since the filter holds only bits derived from hashed passwords, user passwords remain protected even if the Bloom filter is made public.
6. Synchronizing cryptocurrency wallets
Bitcoin, a renowned cryptocurrency, employs the Bloom filter because of its exceptional performance. Additionally, it reduces the probability of distributed denial of service (DDoS) attacks in crypto.
In Bitcoin, all block information circulates between nodes. The size of this data slows the system down. The problem is that almost all of the received data is discarded. Consequently, Bloom filters are used to predict whether specific information will be discarded by the recipient, and the decision to send the data is made accordingly. This Bloom filter application is similar to Facebook’s data storage use case.
7. Tracing IP addresses
Identifying the device from which a transmission came is one of the difficulties of tracing internet protocol (IP) addresses. Even when there is no attempt to conceal the source, packet forwarding techniques make this extremely difficult. One answer is to employ a hash-based method to preserve audit trails that may be used to locate the source machine. Due to the tremendous scale of the internet, Bloom filters are used to store these trails compactly.
8. Supporting P2P networks
Bloom filters have become widely used in P2P settings for an assortment of tasks, including storing keyword-led queries and indices in a compressed manner, synchronizing collections over the network, and aggregating content. P2P networks require the transfer of keyword lists and additional metadata between nodes. This is a key application of Bloom filters.
Takeaway
Even though we don’t always recognize it, Bloom filters help perform a number of the functions we use every day. They are widely used in recommendation engines like Netflix, social media platforms like Facebook, and nearly every database management system. Knowing how to write and run Bloom filter algorithms can help optimize software, particularly backend data operations. As the world becomes increasingly data-driven, structures like the Bloom filter will be fundamental in improving our data experiences.