BloomFilter in SearchEngine (Solr)
“Never tell me the odds!” — Star Wars
“Is the username lucky already taken?” Or “I just want to check whether this productID exists in our search index?”. To answer such questions we have a suitable data structure called BloomFilter.
Bloom filters are space-efficient probabilistic data structure. It tells us that the element either definitely is not in the set or maybe in the set.
For understanding bloom filters, we must be familiar with hashing.
Hashing and Hash Function
In this part will cover a just of hashing, we cover in detail in another post.
In cooking, to hash something means to chop it and then mix it. A hash function is where a computer takes an input of any length and content (e.g. letters, numbers, and symbols) and uses a mathematical formula to chop it, mix it up, and produce an output of a specific length.
We will be using the non-cryptographic hash function Murmur3 in the Bloom Filter
Working of Bloom Filter
An empty bloom filter is merely a Bit Vector storing bits with all set to value as 0.
To add an element to the Bloom filter, we simply hash it by using our hash function and set the bits in the bit vector at the index of those hashes to 1.
To test for membership, you simply hash the string with the same hash functions, then see if those values are set in the bit vector. If the bits are 1 then the element is probably present, but if zero then the word is definitely not present.
Why this uncertainty in Bloom Filters
The question is why we mentioned it as “probably present”?
Let’s understand this with an example. Suppose we want to check whether “cat” is present or not. We’ll calculate hashes using h1, h2 and h3
h1(“cat”) % 10 = 1
h2(“cat”) % 10 = 3
h3(“cat”) % 10 = 7
If we check the bit array, bits at these indices are set to 1 but we know that “cat” was never added to the filter. The bit at index 1 and 7 was set when we added “geeks” and bit 3 was set we added “nerd”.
So, because bits at calculated indices are already set by some other item, bloom filter erroneously claim that “cat” is present and generating a false-positive result.
But as we mentioned, Bloom Filter gives confidence that this value does not occur in the set.
Metrics
Probability of False positivity
Let m be the size of the bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter, then the probability of false-positive p can be calculated as:
Size of Bit Array
It’s a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have less false positives, and a smaller one more.
If the expected number of elements n is known and desired false positive probability is p then the size of bit array m can be calculated as :
The optimum number of hash functions
The more hash functions you have, the slower your bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives. It seems a difficult optimization problem, but fortunately, given an m and an n, we have a function to choose the optimal value of k
Bloom Filter in Search Engine
At Walmart ASDA we recently moved from using Endeca Search Engine to Solr for our Search. There we found a couple of cases where the Bloom Filter fits the best.
Case 1
While moving to search there is a case when the user can search the product id too in the same pipeline.
We have thought through various approaches to solve the problem efficiently:
1. Somehow determine its an ID-based search using regexes and redirect to a separate query creating a pipeline.
2. Add productID also as a searchable field and let it go with the common flow as of search which includes determining Search Rules, Faceting, Sorting and ReRanking and many more.
Approach 2 seems good to us, but the only drawback is that it will unnecessarily go through the complete flow, which is not even required.
So to identify whether it is a product id or not we choose bloom filter.
Case 2
In a normal search flow, when we have a lot of searchable fields to query upon in a collection It makes the query bulkier and most of the fields might be useless as they are not relevant for that token.
So to choose the relevant fields for a specific term we choose bloom filter.
Bloom filter variants
There are many variants/implementations of bloom filter has been present. Few of them are:
1. Google Guava
2. Orestes-Bloom-filter
3. Sangupta- Bloom-filter
4. Greplin-bloom-filter
We choose Sangupta Bloomfilter over Google’s Guava implementation
The Google Guava
bloom filter for few reasons cannot be persisted well, does not provide for a disk-backed bit array implementation, missing a counting bloom filter and last not the least the size of the payload.
This library, i.e Sangupta-Bloom-filter is inspired by the Guava
bloom filter implementation and uses a similar approach, with more extensions baked in.
Features
1. Uses pure Java murmur hash implementation as default hash function.
2. Various persisting methodologies — Java serialization, disk file etc.
Leveraging Bloomfilter inside Solr
To leverage the Bloom filter inside Solr, we create one component to hold the bloom filter which can be updated and queried from any pipeline.
Now to add the data in the Bloom filter we use the searcher and goes through the fields and read the data from the index and populates our bloom filter, as shown below.
We added the listeners on the new search and the first search event such that our bloom filter will always be updated once the index changes.
I have shared my code for Solr Plugin on Github.
Wrap up
- What is the Bloom filter?
- What are its use cases in search where the bloom filter fits in Search Domain?
- Various implementations of the bloom filter.