What Is a Bloom Filter? Working, Functions, and Applications

Bloom filters check if an element is most definitely not in a dataset through hashing functions and an array of bits.

September 25, 2023

  • A Bloom filter is a data structure used to check whether an element is most definitely not in a dataset by using one or more hashing functions and an array of bits. It is called a filter, as it acts as a preliminary test for data entry.
  • This type of data structure is widely used in scenarios where false positives are acceptable but false negatives are not. For instance, it is okay (if cumbersome) to raise a false alarm for a security event that has not happened but unacceptable to overlook a real event (i.e., a false negative).
  • This article explains how Bloom filters work and their key applications in areas such as cybersecurity, P2P networking, and everyday UX and authentication operations.

What Is a Bloom Filter?

A Bloom filter is a data structure used to check whether an element is most definitely not in a dataset by using one or more hashing functions and an array of bits. It is called a filter, as it acts as a preliminary test for data entry.

3-Step Bloom Filter Process: Hashing and Insertion, Lookup, and Search Result

Source: ResearchGate

In numerous situations in computer science, we look for a small quantity of data stored in an enormous reservoir. The task of a software engineer is to optimize this search. They constantly look for new data structures, technologies, and processes to make the search process work with minimal latency and high throughput. A Bloom filter assists in optimizing the search operation in specific use cases.

Let’s assume you are setting up a new account on a social media website to communicate with your peers. When you input a username, a message saying, “Sorry, that username is already in use,” appears. You add your date of birth to the username, but to no effect. Here, a Bloom filter algorithm comes into play.

It checks whether the username might already be taken. If the answer is a definite NO, the name is free. Otherwise, a matching entry has possibly been recorded before, and the username is treated as taken.

A Bloom filter is a space-efficient probabilistic data structure used to determine whether an item is a member of a set. This makes it well suited to detecting duplicate events. Checking the availability of a username is an example of a set membership problem, wherein the set consists of all registered usernames.

In the realm of big data, content is generated at a rate that makes it difficult to process it efficiently. Using algorithms like Bloom filters, we can rapidly identify and eliminate identical events or information, making datasets more manageable.

To understand Bloom filters better, let’s first look at the concept of hashing.

A hash is similar to a data fingerprint. A hash function accepts data of any length as input and returns a value that is generally shorter and of fixed length, which can be used to index, compare, or identify the data.

In other words, hashing algorithms are processes that produce an output of fixed length (the hash, or hash value) from a given input (often called the key or message). The hash value becomes a compact representation of the data itself.
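To make the fingerprint idea concrete, here is a minimal Python sketch. It uses SHA-256 from the standard hashlib module purely as an example hash function; nothing above prescribes a particular one, and the helper name fingerprint is our own.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


short = fingerprint(b"John Doe")
long = fingerprint(b"John Doe" * 10_000)

# Inputs of very different lengths map to digests of the same length.
print(len(short), len(long))  # 64 64

# The same input always yields the same digest.
print(fingerprint(b"John Doe") == short)  # True
```

A Bloom filter does not need a cryptographic hash; faster non-cryptographic functions are a common choice in practice. SHA-256 is used here only because it ships with Python.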

A Bloom filter algorithm does not store the hash value itself. It uses the hash to pick positions in a bit array of fixed size and sets those bits to 1, thereby “remembering” that the value was entered. When the user runs a lookup operation, the algorithm checks whether the same positions are set, which tells it whether the value was definitely not or possibly entered before, and it returns a NO result only when the data is completely new.

Bloom filters can be of various types:

  • Compressed Bloom filters
  • Spectral Bloom filters
  • Space code Bloom filters
  • Decaying Bloom filters

The development of Bloom filters

Burton Howard Bloom, a developer, introduced Bloom filters in 1970. Bloom, an MIT Computer Science graduate, designed the filter as a space-efficient probabilistic data structure that helps you determine whether a piece of data is a member of a set.

The objective was to build a membership-testing tool that applies hashing algorithms to each item and records the result compactly. In return, the algorithm can respond with certainty when the item being examined is not a member of the set, and otherwise report that it may be a member.


Pros and cons of Bloom filters

The algorithm can detect duplicate occurrences across various databases and data categories. Let’s examine a few advantages offered by a Bloom filter.

  • During insertion and lookup, the time complexity of a Bloom filter is O(k), where k is the number of hash functions used. Time complexity describes how the time required to execute an algorithm grows with its input. Here, it is independent of how many elements are already stored.
  • Bloom filters have a space complexity of O(m), wherein m is the length of the bit array. Space complexity describes the memory an algorithm or program requires. Here, it does not depend on the size of the stored elements.
  • Unlike a typical hash table, which uses a single hash function, Bloom filters employ several hash functions to reduce the effect of hash collisions. However, this is not a failsafe.

Nevertheless, there are a few major drawbacks to using Bloom filters:

  • False positives occur. The filter may report that an element is possibly in the collection when it was never added, so a YES answer cannot be fully trusted. It never produces a false negative, however.
  • Only a probable membership answer can be retrieved from the array, not the original data.
  • The greater the number of hash functions, the slower the Bloom filter, and the faster the array fills up. However, if you have too few, you may experience excessive false positives.

Inflexibility is a further disadvantage. Regardless of whether the Bloom filter is just a few bits or hundreds of thousands of bits long, its size must be fixed when it is created. Once the size has been set, the filter will not shrink or expand. For the Bloom filter to be effective, the amount of data that will be added must be known or estimated in advance.

If that amount is unknown, the Bloom filter may be created with too few bits and fill up quickly, producing a high rate of false positives. Alternatively, an enormous Bloom filter may be created for a small quantity of data, resulting in wasted storage space.
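How large should the filter be? The standard textbook estimates, which assume ideal and independent hash functions, are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n expected elements and a target false-positive rate p. The Python sketch below applies them. The function name and the sample figures are illustrative assumptions, not values from a specific system.

```python
import math


def suggest_parameters(expected_items: int, target_fp_rate: float) -> tuple:
    """Return (bits, hash_count) from the standard Bloom filter estimates.

    Assumes ideal, independent hash functions; real results will differ a little.
    """
    bits = math.ceil(
        -expected_items * math.log(target_fp_rate) / (math.log(2) ** 2)
    )
    hash_count = max(1, round(bits / expected_items * math.log(2)))
    return bits, hash_count


# One million expected items with a 1% false-positive target:
# roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.
bits, hash_count = suggest_parameters(1_000_000, 0.01)
print(bits, hash_count)
```

If the number of items is underestimated, the real false-positive rate will be higher than the target, which is the inflexibility described above.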


How Does a Bloom Filter Work?

Let us unpack the workings of a Bloom filter. Under the surface, a Bloom filter is nothing more than an array of bits, all initially set to zero. Assume a Bloom filter with a length of 19 bits. The Bloom filter allows two types of actions as part of its functionality: insert and lookup.

Here are the steps involved in the working of a Bloom filter:

How a Bloom filter works


1. Accept the input

The first step is to accept the input. In our example, let’s assume that the input is a string containing the text “John Doe.”

2. Calculate the hash value

Next, the algorithm performs hashing to convert John Doe into a corresponding numerical value. For the sake of our example, let’s assume that the value is 1355. The actual value is computed as per hashing algorithms, which vary in complexity.

3. Mod the hash by the array length

The next step is to mod the hash value by the length of the array (mod is how you find the remainder of a division). Mod in programming is denoted by %. When we apply the mod operation to the hash of John Doe, 1355, we get an index within the bounds of the bit array.

1355%19 = 6

4. Insert the hash

We record the hash by setting the bit at the index given by the mod result. Therefore, the bit at index 6 in the array goes from 0 to 1.

5. Search for the value (i.e., lookup)

Steps 2 and 3 are performed again as part of the lookup process. This time, the algorithm checks the bit at the index given by the mod result. If the bit is 0, the input cannot belong to the set. If the bit is 1, the input may be an element of the set. The operation (e.g., setting a password or creating an email ID) is allowed only when the bit is 0.
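The five steps can be traced in a few lines of Python. This is a sketch of the single-hash example only: the hash value 1355 for “John Doe” is the assumed figure from step 2, the other hash values are made up to show a miss and a collision, and the function names are our own.

```python
ARRAY_LENGTH = 19
bits = [0] * ARRAY_LENGTH  # every bit starts at 0


def index_for(hash_value: int) -> int:
    """Step 3: mod the hash by the array length to get a valid index."""
    return hash_value % ARRAY_LENGTH


def insert(hash_value: int) -> None:
    """Step 4: flip the bit at the computed index to 1."""
    bits[index_for(hash_value)] = 1


def might_contain(hash_value: int) -> bool:
    """Step 5: 0 means definitely absent, 1 means possibly present."""
    return bits[index_for(hash_value)] == 1


JOHN_DOE_HASH = 1355  # assumed hash value from step 2

print(index_for(JOHN_DOE_HASH))      # 6
insert(JOHN_DOE_HASH)
print(might_contain(JOHN_DOE_HASH))  # True: possibly present
print(might_contain(1356))           # False: index 7 is still 0
print(might_contain(1374))           # True: 1374 % 19 is also 6, a false positive
```

The last line shows why a 1 only means “possibly”: a different input whose hash lands on the same index is indistinguishable from the one that was inserted.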

How do false positives in Bloom filters work?

A Bloom filter is a data structure that is both space- and time-efficient. However, this efficiency comes at the expense of a probabilistic nature.

A false positive is a lookup that reports a key as possibly present even though the key was never added to the array. It means that looking for an element that does not exist can return an incorrect result. Nevertheless, the array will never report a key as absent if it was added; it is completely devoid of false negatives.

False positives occur because of hash collisions. A collision is a chance match that occurs when a hashing algorithm generates an identical hash value (or, after the mod step, an identical index) for two different data elements. Multiple hash functions can be used to lower the false-positive rate. Instead of setting a single bit for a single input, several bits are set, and a lookup reports a possible match only if all of them are 1. However, this can slow down the algorithm.
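A multi-hash filter can be sketched in Python as follows. This is one possible implementation, not a reference one: the class and method names are our own, and the k indices are derived from a single SHA-256 digest with the double-hashing trick (index_i = h1 + i * h2), which is a common shortcut rather than k truly independent hash functions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch with num_hashes indices per item."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the code readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _indexes(self, item: str) -> list:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for index in self._indexes(item):
            self.bits[index] = 1

    def might_contain(self, item: str) -> bool:
        # A single 0 bit proves the item was never added.
        return all(self.bits[index] for index in self._indexes(item))


names = BloomFilter(num_bits=1024, num_hashes=3)
print(names.might_contain("John Doe"))  # False: the filter is still empty
names.add("John Doe")
print(names.might_contain("John Doe"))  # True: added items are always found
```

A lookup for a name that was never added will almost always return False at this load, but it can return True if all three of its bits happen to have been set by other items.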


Functions of a Bloom Filter

Bloom filters are data structures offering only two capabilities:

1. Insert an element into a set

To add an element, it is hashed with each of the filter’s hash functions. As explained in the previous section, each hash value is converted into an index, and the bit at that index in the Bloom filter is set to 1.

2. Query whether an element is in a set

When a query is posed to determine whether a specific data item exists, the item is hashed in the same way and the bits at the resulting indices are examined. This is called the lookup process.

The distinguishing characteristic of Bloom filters is that when the response to a query is “YES,” it may still be inaccurate. However, answers of “NO” are always correct. The incorrect “YES” responses depend on probability. That probability can be expressed as a function of the number of elements in the collection, the size of the Bloom filter, and a parameter k, the number of hash functions.
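The usual approximation for that probability is p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted elements, m is the number of bits, and k is the number of hash functions. It assumes ideal hash functions, so treat it as an estimate. A small Python sketch, with a function name of our own choosing:

```python
import math


def false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Approximate false-positive probability, assuming ideal hash functions."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# 10 bits per element with 7 hash functions gives a rate of roughly 0.008.
print(false_positive_rate(1_000, 10_000, 7))

# The rate climbs as more elements are packed into the same array.
for n in (1_000, 2_000, 5_000, 10_000):
    print(n, false_positive_rate(n, 10_000, 7))
```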

In addition to the two functions we discussed, Bloom filters have certain properties that determine their functionalities:

  • Unlike a hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements. This is a benefit of this type of algorithm.
  • Bloom filters never yield false negative results. They only generate false positives. This makes them useful for applications such as cybersecurity, where one would always rather err on the side of caution.
  • Adding an element never fails. However, as elements are added, the rate of false positives increases until all bits within the filter are set to 1. After this juncture, every query will return a positive result.
  • It can save resources. Numerous well-known databases implement Bloom filters to reduce expensive disk lookups for nonexistent rows or columns. PostgreSQL, Apache Cassandra, Cloud Bigtable, etc., use this technique.
  • Most structures, such as a simple array or a linked list, necessitate the storage of the item itself, which consumes memory. The data item isn’t retained by Bloom filters at all. They calculate hash values and record them as 1s in the array.
  • We can’t remove an element from a standard Bloom filter. Hashmaps, tries, simple arrays, and linked lists are better suited for deleting items.
  • Bloom filters rely on hashing functions, which are available in five varieties: ideal hash functions, double hashing, partitioned hashing, multiple hashing, and basic hash functions.


Applications of Bloom Filters

Now that we know how Bloom filters work and their advantages and limitations, let us explore the use cases. The top eight applications of Bloom filters include:

1. Checking for email ID availability

Let’s assume you are setting up a new Gmail account. Google must determine whether or not the ID you have provided is available. There are several methods for doing this.

It could examine all the existing email addresses in its data repository (tens of thousands of datasets and cache servers) to determine whether or not a particular ID already exists. Imagine, however, that Gmail already stores billions of email addresses. Is it practicable to scan countless servers for every new email address? Bloom filters let the system quickly rule out IDs that are definitely not taken and flag those that may already exist.

2. Ensuring the security level of a suspicious URL

Imagine you’re using a cloud-based security system that prevents you from viewing malicious URLs. This service could store an archive of billions of potentially hazardous URLs and process several million requests every minute worldwide. In this situation, looking up every web address in the database or cache is impractical. Bloom filters provide a probabilistic way to rapidly determine if a URL is safe (i.e., definitely not stored in the database).

3. Recommending new content

Each blog post on a website like Medium has a unique identifier and is stored in a database table. However, the table is too large and too frequently accessed to be accommodated on a single machine. When a particular story is about to be recommended to a user, the algorithm must determine whether it has already been suggested or read. The Bloom filter comes into play at this point.

4. Saving storage space on social media platforms

Facebook employs Bloom filters to avoid storing what is known as a “flash in the pan” or a one-hit wonder. One-hit wonders are online objects that users request only once. For instance, results for a frequent query such as “coding” tend to be kept in local storage. However, if something is searched for only once, such as “giraffe,” it shouldn’t be kept locally, given that it is a classic instance of a one-hit wonder. By using a Bloom filter to recognize a web object’s second request and storing the object only then, one can keep one-hit wonders out of local storage.
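The “store only on the second request” idea can be sketched in Python as follows. This is an illustration of the general technique, not a description of any company’s system: the class names, the slow_lookup stand-in, and the filter size are all assumptions.

```python
import hashlib


class BloomFilter:
    """Compact Bloom filter used to remember keys that were requested once."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _indexes(self, key: str) -> list:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for index in self._indexes(key):
            self.bits[index] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[index] for index in self._indexes(key))


class SecondHitCache:
    """Caches a value only when its key is requested for the second time."""

    def __init__(self) -> None:
        self.seen_once = BloomFilter()
        self.store = {}

    def get(self, key: str, fetch) -> str:
        if key in self.store:
            return self.store[key]
        value = fetch(key)
        if self.seen_once.might_contain(key):
            self.store[key] = value  # seen before (or a false positive): cache it
        else:
            self.seen_once.add(key)  # first request: remember the key only
        return value


def slow_lookup(key: str) -> str:
    """Hypothetical stand-in for an expensive backend call."""
    return f"results for {key}"


cache = SecondHitCache()
cache.get("giraffe", slow_lookup)  # requested once
cache.get("coding", slow_lookup)
cache.get("coding", slow_lookup)   # requested twice
print(sorted(cache.store))
```

A false positive here is harmless: it only means an object is cached one request earlier than intended.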

5. Detecting weak passwords

Here, a system may maintain a Bloom filter-driven list of insecure passwords. When a new user is added, the password is evaluated against the Bloom filter, and whenever a potential match is found, the user is notified. When a new user inputs a password or a current user modifies their password, the list can be updated. Since the filter stores only bits derived from hashes, not the passwords themselves, user passwords are not directly exposed even if the Bloom filter is made public.
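A weak-password check along these lines might look like the Python sketch below. The password list, the filter size, and the function names are illustrative assumptions, and the bit array is held in a single Python integer to keep the example short.

```python
import hashlib

NUM_BITS = 4096  # illustrative size
NUM_HASHES = 5   # illustrative hash count


def bit_positions(password: str) -> list:
    """Derive NUM_HASHES positions from one SHA-256 digest (double hashing)."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_filter(passwords) -> int:
    """Return an integer whose set bits encode the given passwords."""
    bitset = 0
    for password in passwords:
        for position in bit_positions(password):
            bitset |= 1 << position
    return bitset


def is_possibly_weak(bitset: int, password: str) -> bool:
    """True means 'possibly on the list'; False means 'definitely not'."""
    return all((bitset >> position) & 1 for position in bit_positions(password))


KNOWN_WEAK = ["123456", "password", "qwerty", "letmein"]  # sample list
weak_filter = build_filter(KNOWN_WEAK)

print(is_possibly_weak(weak_filter, "password"))  # True: it is on the list
print(is_possibly_weak(build_filter([]), "password"))  # False: empty filter
```

Because a match is only “possible,” a system built this way would occasionally warn about a password that is not actually on the list, which is usually an acceptable cost for this use case.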

6. Synchronizing cryptocurrency wallets

Bitcoin, a renowned cryptocurrency, employs the Bloom filter because of its performance. Additionally, it reduces the probability of distributed denial of service (DDoS) attacks in crypto.

In Bitcoin, all block information circulates between nodes. The size of this data slows the system down. The problem is that a node discards almost all of the data it receives. Consequently, Bloom filters are used to determine in advance whether specific information would be discarded, and the data is transferred only if it would not be. This Bloom filter application is similar to Facebook’s data storage use case.

7. Tracing IP addresses

Identifying the device from which a transmission came is one of the difficulties of tracing internet protocol (IP) traffic. Even when there is no attempt to conceal the source, packet forwarding techniques make this extremely difficult. The answer is to employ a hash-based method to preserve audit trails that may be used to locate the source machine. Due to the tremendous scale of the internet, Bloom filters are used to keep these records compact.

8. Supporting P2P networks

Bloom filters have become widely used in P2P settings for an assortment of tasks, including storing keyword-led queries and indices in a compressed manner, synchronizing collections over the network, and aggregating content. P2P networks require the transfer of keyword lists and additional metadata between nodes. This is a key application of Bloom filters.


Takeaway

Even though we don’t always recognize it, Bloom filters help perform a number of the functions we use every day. They are widely used in recommendation engines like Netflix, social media platforms like Facebook, and many database management systems. Knowing how to write and run Bloom filter algorithms can help optimize software, particularly backend data operations. As the world becomes increasingly data-driven, structures like the Bloom filter will be fundamental in improving our data experiences.


Image source: Shutterstock


Chiradeep BasuMallick
Chiradeep is a content marketing professional, a startup incubator, and a tech journalism specialist. He has over 11 years of experience in mainline advertising, marketing communications, corporate communications, and content marketing. He has worked with a number of global majors and Indian MNCs, and currently manages his content marketing startup based out of Kolkata, India. He writes extensively on areas such as IT, BFSI, healthcare, manufacturing, hospitality, and financial analysis & stock markets. He studied literature, has a degree in public relations and is an independent contributor for several leading publications.