With the growing scale of molecular screening, which now involves searching through billions of chemical structures, the processing times for querying extensive compound datasets have significantly increased. To address this, Bloom filters can compact any database for the sole purpose of membership verification. The Bloom filter, a space-efficient, probabilistic data structure, was designed to ascertain whether an element belongs to a specific set. First proposed by Burton H. Bloom, this data structure has demonstrated exceptional value for large datasets, where traditional set-membership testing methods would be excessively time-consuming. At its core, the Bloom filter uses a fixed-size (m) bit array to represent n elements, employing k hash functions to map each element to k positions within the array. This allows Bloom filters to conduct set-membership tests with low false-positive rates while using less time and space than traditional data-retrieval techniques. Originally applied in dictionaries and spell checkers, Bloom filters allowed for the quick identification of words within a given vocabulary; the only significant drawback was false positives, where misspelled words were occasionally labeled as correct.

Over time, the scope of their applications broadened to encompass web searches, such as Google Chrome's former implementation of a Bloom filter to detect malicious URLs, among other use cases. Concrete examples of Bloom filters in chemistry workflows include exploring the chemical space while asserting commercial availability, or excluding patented chemicals, without the need for memory-intensive databases or external server dependencies. A real-life example can be found in ChemCrow, where Bloom filters are used in a molecule-recommendation setting to ensure that recommended chemicals are purchasable without an intensive memory requirement. As underscored by the Bloom filter principle: "Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated."

Traditionally, molecules have been represented using structure-based fingerprints. In this study, we built different Bloom filters using the COCONUT database to compare the effectiveness of structure-based hashing with string hashing in the Bloom filter; we demonstrate that string hashing consistently outperforms its counterpart. To provide further context, Table 1 presents well-known chemistry databases, their approximate number of compounds, the storage size required for a text (SMILES) representation, and a comparison with a Bloom filter designed to store an equivalent number of molecules.

This study explores the use of Bloom filters in molecular databases. Additionally, we refer to alternative data structures that offer functionalities that can either mitigate some limitations of Bloom filters or serve entirely different objectives. For example, Cuckoo filters provide the capability for dynamic item insertion and deletion, a feature absent in conventional Bloom filters. Other alternatives, such as Quotient filters and Count-Min sketches, also offer unique advantages and can be found in the literature. On a different note, Locality-Sensitive Hashing (LSH) serves the specialized purpose of maximizing hash collisions to facilitate similarity searches. However, LSH techniques often grapple with computational challenges as data scales, leading to memory requirements that can quickly exceed available main memory. In contrast, Bloom filter indices, even for extensive databases like ZINC, can comfortably reside in the main memory of everyday household devices, such as smartwatches or cellphones.

A Bloom filter is initialized as an m-length bit vector with all positions set to zero, and employs k independent hash functions. These hash functions generate k values ranging from 0 to m−1, which correspond to the positions in the bit vector where a "1" will be assigned. The hash functions must exhibit the following characteristics: (1) quick computation; (2) an avalanche effect, where minor input changes result in substantial and unpredictable output alterations; and (3) the generation of integers between 0 and m−1.

Bloom filters enable the addition of new members but do not support individual removals. The filter can be queried to determine whether a particular element has been added previously. However, this simplicity comes with certain drawbacks, such as the potential for two or more elements to be hashed to the same position in the Bloom filter (i.e., collisions). As a result, removing an element (by changing its positions from one to zero) could inadvertently affect other members with overlapping positions. This issue underscores the importance of randomness in hash functions, often referred to as the avalanche effect. Figures 1 and 2 illustrate the workings of a Bloom filter and the storage of molecules within such filters.
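The mechanics described above (an m-bit vector, k hash functions, insertion by setting bits, and membership queries) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: deriving the k index functions from two base hashes (Kirsch–Mitzenmacher double hashing) is one common choice among several, and the SMILES strings are arbitrary examples.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                          # number of bits in the vector
        self.k = k                          # number of hash functions
        self.bits = bytearray((m + 7) // 8) # all positions start at zero

    def _positions(self, item):
        # Derive k indices in [0, m-1] from two base hashes
        # (double hashing: h_i = h1 + i * h2 mod m).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set the k positions to "1"; elements cannot be removed later.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

bf = BloomFilter(m=10_000, k=7)
bf.add("CCO")          # ethanol, as a SMILES string
bf.add("c1ccccc1")     # benzene
print("CCO" in bf)     # a stored molecule is always found
print("CC(=O)O" in bf) # an unstored one is absent, barring a rare false positive
```

Note that the query for `"CC(=O)O"` illustrates the probabilistic contract: a negative answer is definitive, while a positive answer only holds up to the filter's false-positive rate.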
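The memory footprint claims above can be checked with the classical sizing formulas for Bloom filters, m = −n ln p / (ln 2)² bits and k = (m/n) ln 2 hash functions, for n elements at a target false-positive rate p. The billion-compound count (roughly the scale of databases like ZINC) and the 1% target rate below are illustrative assumptions, not figures from Table 1.

```python
import math

def bloom_parameters(n, p):
    """Classical optimal sizing: bit count m and hash count k
    for n elements at target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

# A billion-compound database at a 1% false-positive rate:
n, p = 1_000_000_000, 0.01
m, k = bloom_parameters(n, p)
print(round(m / 8 / 1e9, 2))  # filter size in GB
print(k)                      # number of hash functions
```

At these assumed parameters the filter occupies on the order of a gigabyte, compared with tens of gigabytes for the corresponding raw SMILES text, which is what makes in-memory indices of billion-scale databases feasible on commodity devices.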