布隆过滤器

布隆过滤器(英语:Bloom Filter)是1970年由伯顿·霍华德·布隆(Burton Howard Bloom)提出的。[1]它实际上是一个很长的二进制向量和一系列随机映射函数。布隆过滤器可以用于检索一个元素是否在一个集合中。它的优点是空间效率和查询时间都远远超过一般的算法,缺点是有一定的误识别率和删除困难。
基本概念
[编辑]如果想判断一个元素是不是在一个集合里,一般想到的是将集合中所有元素保存起来,然后通过比较确定。链表、树、散列表(又叫哈希表,Hash table)等等数据结构都是这种思路。但是随着集合中元素的增加,我们需要的存储空间越来越大。同时检索速度也越来越慢,上述三种结构的检索时间复杂度分别为。
布隆过滤器的原理是,当一个元素被加入集合时,通过K个散列函数将这个元素映射成一个位数组中的K个点,把它们置为1。检索时,我们只要看看这些点是不是都是1就(大约)知道集合中有没有它了:如果这些点有任何一个0,则被检元素一定不在;如果都是1,则被检元素很可能在。这就是布隆过滤器的基本思想。所以布隆过滤器可能会产生假阳性(误报),但不会产生假阴性(漏报)。
算法描述
[编辑]
一个“空布隆过滤器”是一个由m位组成的位数组(英语:bit array),所有位都被设置为0。它配备了k个不同的散列函数,这些函数将集合元素映射到m个可能的数组位置之一。为了达到最佳效果,散列函数应为均匀分布且独立。通常,k是一个小的常量,它取决于期望的假阳性(误报)率ε,而m与k和要添加的元素数量成正比。
要“添加”一个元素,将其分别输入到k个散列函数中,以获得k个数组位置。将所有获得的位置的位设置为1。
要“检验”一个元素是否在集合中,将其输入到每个k散列函数中,以获得k个数组位置。如果这些位置“存在”位为0的位置,则该元素一定不在集合中;如果它在集合中,那么当它被插入时,所有位都应该已经是1。如果所有位都已为1,说明该元素可能在集合中,又或许这些位是在插入其他元素时碰巧被设置为1,从而导致假阳性。就一个简单的布隆过滤器而言,并不能区分这两种情况,但更先进的技术可以解决这个问题。
设计k个不同的独立散列函数的要求对于大的k可能是难以实现的。对于一个具有宽输出的良好散列函数,这种散列的不同位域之间应该几乎没有相关性,因此这种散列类型可以用于通过将其输出切片成多个位域来生成多个“不同”的散列函数。或者,可以将k个不同的初始值(例如0, 1, ..., k − 1)传递给一个接受初始值的散列函数;或者将这些值加入(或追加)到键。对于较大的m和/或k,散列函数之间的独立性可以放宽,而假阳性率的增加可以忽略不计。[2](具体而言,Dillinger & Manolios (2004b)展示了使用增强双重散列和三重散列(双散列的变体,实际上是用两个或三个散列值播种的简单随机数生成器)导出k个索引的有效性。)
这样简单的布隆过滤器无法移除元素,因为无法得知它映射到的k位中的哪些位应该被移除。虽然将这些k位中的任何一位设置为零足以移除该元素,但它也会移除任何恰好映射到该位的其他元素。由于简单的算法没有提供任何方法来确定是否已添加任何其他影响要移除元素的位的元素,因此清除任何位都会引入假阴性(漏报)的可能性。
若要模拟从布隆过滤器中一次性移除元素的操作,可以引入一个辅助布隆过滤器(“移除过滤器”),用于存储已移除的元素。 然而,第二个过滤器中的假阳性会变成复合过滤器(“原过滤器”与“移除过滤器”的联合体)中的假阴性,这是不被希望遇到的。在这种方法中,却无法重新添加先前被移除的元素,因为还须将其从“移除过滤器”中移除,这又会回到最初的问题。
常见的情况是,所有键(待过滤的所有元素)都能够被获取(可用),但枚举它们的代价较高(例如,需要多次的硬盘读取)。当假阳性率变得太高时,可以重新生成过滤器;但此类事件应该是相对罕见的。
优点
[编辑]相比于其它的数据结构,布隆过滤器在空间和时间方面都有巨大的优势。布隆过滤器存储空间和插入/查询时间都是常数()。另外,散列函数相互之间没有关系,方便由硬件并行实现。布隆过滤器不需要存储元素本身,在某些对保密要求非常严格的场合有优势。
布隆过滤器可以表示全集,其它任何数据结构都不能;
和相同,使用同一组散列函数的两个布隆过滤器的交并[来源请求]运算可以使用位操作进行。
缺点
[编辑]但是布隆过滤器的缺点和优点一样明显。误算率是其中之一。随着存入的元素数量增加,误算率随之增加。但是如果元素数量太少,则使用散列表足矣。
另外,一般情况下不能从布隆过滤器中删除元素。我们很容易想到把位数组变成整数数组,每插入一个元素相应的计数器加1, 这样删除元素时将计数器减掉就可以了。然而要保证安全地删除元素并非如此简单。首先我们必须保证删除的元素的确在布隆过滤器里面。这一点单凭这个过滤器是无法保证的。另外计数器回绕也会造成问题。
在降低误算率方面,有不少工作,使得出现了很多布隆过滤器的变种。
参考
[编辑]引用
[编辑]文献
[编辑]- Agarwal, Sachin; Trachtenberg, Ari. Approximating the number of differences between remote sets. 2006 IEEE Information Theory Workshop (PDF). Punta del Este, Uruguay. 2006: 217. CiteSeerX 10.1.1.69.1033
. ISBN 978-1-4244-0035-5. S2CID 2048278. doi:10.1109/ITW.2006.1633815.
- Ahmadi, Mahmood; Wong, Stephan, A Cache Architecture for Counting Bloom Filters, 15th international Conference on Networks (ICON-2007): 218, 2007, CiteSeerX 10.1.1.125.2470
, ISBN 978-1-4244-1229-7, S2CID 2967865, doi:10.1109/ICON.2007.4444089
- Almeida, Paulo; Baquero, Carlos; Preguica, Nuno; Hutchison, David, Scalable Bloom Filters (PDF), Information Processing Letters, 2007, 101 (6): 255–261, doi:10.1016/j.ipl.2006.10.007, hdl:1822/6627
- Apache Software Foundation, 11.6. Schema Design, The Apache HBase Reference Guide, Revision 0.94.27, 2012
- Bloom, Burton H., Space/Time Trade-offs in Hash Coding with Allowable Errors, Communications of the ACM, 1970, 13 (7): 422–426, CiteSeerX 10.1.1.641.9096
, S2CID 7931252, doi:10.1145/362686.362692
- Blustein, James; El-Maazawi, Amal, optimal case for general Bloom filters, Bloom Filters — A Tutorial, Analysis, and Survey, Dalhousie University Faculty of Computer Science: 1–31, 2002
- Boldi, Paolo; Vigna, Sebastiano, Mutable strings in Java: design, implementation and lightweight text-search algorithms, Science of Computer Programming, 2005, 54 (1): 3–23, doi:10.1016/j.scico.2004.05.003
- Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George, An Improved Construction for Counting Bloom Filters, Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science 4168: 684–695, 2006, ISBN 978-3-540-38875-3, doi:10.1007/11841036_61
- Broder, Andrei; Mitzenmacher, Michael, Network Applications of Bloom Filters: A Survey (PDF), Internet Mathematics, 2005, 1 (4): 485–509, S2CID 1560675, doi:10.1080/15427951.2004.10129096
- Byers, John W.; Considine, Jeffrey; Mitzenmacher, Michael; Rost, Stanislav, Informed content delivery across adaptive overlay networks, IEEE/ACM Transactions on Networking, 2004, 12 (5): 767, CiteSeerX 10.1.1.207.1563
, S2CID 47088273, doi:10.1109/TNET.2004.836103
- Calderoni, Luca; Palmieri, Paolo; Maio, Dario, Location privacy without mutual trust: The spatial Bloom filter (PDF), Computer Communications, 2015, 68: 4–16, ISSN 0140-3664, doi:10.1016/j.comcom.2015.06.011, hdl:10468/4762
- Calderoni, Luca; Palmieri, Paolo; Maio, Dario, Probabilistic Properties of the Spatial Bloom Filters and Their Relevance to Cryptographic Protocols, IEEE Transactions on Information Forensics and Security, 2018, 13 (7): 1710–1721, ISSN 1556-6013, S2CID 3693354, doi:10.1109/TIFS.2018.2799486, hdl:10468/5767
- Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson; Wallach, Deborah; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber, Robert, Bigtable: A Distributed Storage System for Structured Data, Seventh Symposium on Operating System Design and Implementation, 2006
- Charles, Denis Xavier; Chellapilla, Kumar, Bloomier filters: A second look, Halperin, Dan; Mehlhorn, Kurt (编), Algorithms: ESA 2008, 16th Annual European Symposium, Karlsruhe, Germany, September 15–17, 2008, Proceedings, Lecture Notes in Computer Science 5193, Springer: 259–270, 2008, ISBN 978-3-540-87743-1, S2CID 643445, arXiv:0807.0928
, doi:10.1007/978-3-540-87744-8_22
- Chazelle, Bernard; Kilian, Joe; Rubinfeld, Ronitt; Tal, Ayellet, The Bloomier filter: an efficient data structure for static support lookup tables, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF): 30–39, 2004
- Cohen, Saar; Matias, Yossi, Spectral Bloom Filters, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (PDF): 241–252, 2003 [2019-10-24], ISBN 978-1581136340, S2CID 1058187, doi:10.1145/872757.872787, (原始内容 (PDF)存档于2021-03-10)
- Deng, Fan; Rafiei, Davood, Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters, Proceedings of the ACM SIGMOD Conference (PDF): 25–36, 2006
- Dharmapurikar, Sarang; Song, Haoyu; Turner, Jonathan; Lockwood, John, Fast packet classification using Bloom filters, Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems (PDF): 61–70, 2006, CiteSeerX 10.1.1.78.9584
, ISBN 978-1595935809, S2CID 7848110, doi:10.1145/1185347.1185356, (原始内容 (PDF)存档于2007-02-02)
- Dietzfelbinger, Martin; Pagh, Rasmus, Succinct data structures for retrieval and approximate membership, Aceto, Luca; Damgård, Ivan; Goldberg, Leslie Ann; Halldórsson, Magnús M.; Ingólfsdóttir, Anna; Walukiewicz, Igor (编), Automata, Languages and Programming: 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7–11, 2008, Proceedings, Part I, Track A: Algorithms, Automata, Complexity, and Games, Lecture Notes in Computer Science 5125, Springer: 385–396, 2008, ISBN 978-3-540-70574-1, S2CID 1699996, arXiv:0803.3693
, doi:10.1007/978-3-540-70575-8_32
- Dillinger, Peter C.; Manolios, Panagiotis, Fast and Accurate Bitstate Verification for SPIN, Proceedings of the 11th International Spin Workshop on Model Checking Software, Springer-Verlag, Lecture Notes in Computer Science 2989, 2004a
- Dillinger, Peter C.; Manolios, Panagiotis, Bloom Filters in Probabilistic Verification, Proceedings of the 5th International Conference on Formal Methods in Computer-Aided Design, Springer-Verlag, Lecture Notes in Computer Science 3312, 2004b
- Donnet, Benoit; Baynat, Bruno; Friedman, Timur, Retouched Bloom Filters: Allowing Networked Applications to Flexibly Trade Off False Positives Against False Negatives, CoNEXT 06 – 2nd Conference on Future Networking Technologies, 2006, (原始内容存档于2009-05-17)
- Eppstein, David; Goodrich, Michael T., Space-efficient straggler identification in round-trip data streams via Newton's identities and invertible Bloom filters, Algorithms and Data Structures, 10th International Workshop, WADS 2007, Lecture Notes in Computer Science 4619, Springer-Verlag: 637–648, 2007, Bibcode:2007arXiv0704.3313E, arXiv:0704.3313
- Fan, Bin; Andersen, Dave G.; Kaminsky, Michael; Mitzenmacher, Michael D., Cuckoo filter: Practically better than Bloom, Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies: 75–88, 2014, ISBN 9781450332798, doi:10.1145/2674005.2674994
. Open source implementation available on github.
- Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei, Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol (PDF), IEEE/ACM Transactions on Networking, 2000, 8 (3): 281–293 [2018-07-30], CiteSeerX 10.1.1.41.1487
, S2CID 4779754, doi:10.1109/90.851975, (原始内容 (PDF)存档于2017-09-22). A preliminary version appeared at SIGCOMM '98.
- Goel, Ashish; Gupta, Pankaj, Small subset queries and bloom filters using ternary associative memories, with applications (PDF), ACM SIGMETRICS Performance Evaluation Review, 2010, 38: 143, CiteSeerX 10.1.1.296.6513
, doi:10.1145/1811099.1811056
- Graf, Thomas Mueller; Lemire, Daniel, Xor Filters, ACM Journal of Experimental Algorithmics, 2020, 25: 1–16, Bibcode:2019arXiv191208258M, S2CID 209405019, arXiv:1912.08258
, doi:10.1145/3376122
- Grandi, Fabio, On the analysis of Bloom filters (PDF), Information Processing Letters, 2018, 129: 35–39, doi:10.1016/j.ipl.2017.09.004
- Kirsch, Adam; Mitzenmacher, Michael, Less Hashing, Same Performance: Building a Better Bloom Filter, Azar, Yossi; Erlebach, Thomas (编), Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science 4168, Springer-Verlag, Lecture Notes in Computer Science 4168: 456–467, 2006, ISBN 978-3-540-38875-3, doi:10.1007/11841036, (原始内容 (PDF)存档于2009-01-31)
- Koucheryavy, Y.; Giambene, G.; Staehle, D.; Barcelo-Arroyo, F.; Braun, T.; Siris, V., Traffic and QoS Management in Wireless Multimedia Networks, COST 290 Final Report, 2009: 111
- Kubiatowicz, J.; Bindel, D.; Czerwinski, Y.; Geels, S.; Eaton, D.; Gummadi, R.; Rhea, S.; Weatherspoon, H.; et al, Oceanstore: An architecture for global-scale persistent storage (PDF), ACM SIGPLAN Notices, 2000: 190–201 [2011-12-01], (原始内容 (PDF)存档于2012-03-11)
- Maggs, Bruce M.; Sitaraman, Ramesh K., Algorithmic nuggets in content delivery (PDF), ACM SIGCOMM Computer Communication Review, July 2015, 45 (3): 52–66, CiteSeerX 10.1.1.696.9236
, S2CID 65760, doi:10.1145/2805789.2805800, (原始内容 (PDF)存档于2021-08-14)
- Mitzenmacher, Michael; Upfal, Eli, Probability and computing: Randomized algorithms and probabilistic analysis, Cambridge University Press: 107–112, 2005, ISBN 9780521835404
- Mortensen, Christian Worm; Pagh, Rasmus; Pătrașcu, Mihai, On dynamic range reporting in one dimension, Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing: 104–111, 2005, ISBN 978-1581139600, S2CID 56473, arXiv:cs/0502032
, doi:10.1145/1060590.1060606
- Mullin, James K., Optimal semijoins for distributed database systems, IEEE Transactions on Software Engineering, 1990, 16 (5): 558–560, doi:10.1109/32.52778
- Pagh, Anna; Pagh, Rasmus; Rao, S. Srinivasa, An optimal Bloom filter replacement, Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF): 823–829, 2005
- Palmieri, Paolo; Calderoni, Luca; Maio, Dario, Spatial Bloom Filters: Enabling Privacy in Location-Aware Applications, Proc. 10th International Conference on Information Security and Cryptology (Inscrypt 2014) 8957, Springer-Verlag, Lecture Notes in Computer Science: 16–36, 2014, CiteSeerX 10.1.1.471.4759
, ISBN 978-3-319-16744-2, doi:10.1007/978-3-319-16745-9_2
- Porat, Ely, An optimal Bloom filter replacement based on matrix solving, Frid, Anna E.; Morozov, Andrey; Rybalchenko, Andrey; Wagner, Klaus W. (编), Computer Science, Theory and Applications: Fourth International Computer Science Symposium in Russia, CSR 2009, Novosibirsk, Russia, August 18–23, 2009, Proceedings, Lecture Notes in Computer Science 5675, Springer: 263–273, 2009, ISBN 978-3-642-03350-6, S2CID 3205108, arXiv:0804.1845
, doi:10.1007/978-3-642-03351-3_25
- Pournaras, E.; Warnier, M.; Brazier, F. M. T., A generic and adaptive aggregation service for large-scale decentralized networks, Complex Adaptive Systems Modeling, 2013, 1 (19): 19, doi:10.1186/2194-3206-1-19
. Prototype implementation available on github.
- Putze, F.; Sanders, P.; Singler, J., Cache-, Hash- and Space-Efficient Bloom Filters, Demetrescu, Camil (编), Experimental Algorithms, 6th International Workshop, WEA 2007 (PDF), Lecture Notes in Computer Science 4525, Springer-Verlag, Lecture Notes in Computer Science 4525: 108–121, 2007 [2007-07-18], ISBN 978-3-540-72844-3, doi:10.1007/978-3-540-72845-0, (原始内容 (PDF)存档于2007-06-23)
- Rottenstreich, Ori; Kanizo, Yossi; Keslassy, Isaac, The Variable-Increment Counting Bloom Filter, 31st Annual IEEE International Conference on Computer Communications, 2012, Infocom 2012 (PDF): 1880–1888, 2012, CiteSeerX 10.1.1.174.7165
, ISBN 978-1-4673-0773-4, doi:10.1109/INFCOM.2012.6195563
- Sethumadhavan, Simha; Desikan, Rajagopalan; Burger, Doug; Moore, Charles R.; Keckler, Stephen W., Scalable hardware memory disambiguation for high ILP processors, 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, MICRO-36 (PDF): 399–410, 2003, CiteSeerX 10.1.1.229.1254
, ISBN 978-0-7695-2043-8, S2CID 195881068, doi:10.1109/MICRO.2003.1253244, (原始内容 (PDF)存档于2007-01-14)
- Starobinski, David; Trachtenberg, Ari; Agarwal, Sachin, Efficient PDA Synchronization (PDF), IEEE Transactions on Mobile Computing, 2003, 2 (1): 40, CiteSeerX 10.1.1.71.7833
, doi:10.1109/TMC.2003.1195150
- Stern, Ulrich; Dill, David L., A New Scheme for Memory-Efficient Probabilistic Verification, Proceedings of Formal Description Techniques for Distributed Systems and Communication Protocols, and Protocol Specification, Testing, and Verification: IFIP TC6/WG6.1 Joint International Conference, Chapman & Hall, IFIP Conference Proceedings: 333–348, 1996, CiteSeerX 10.1.1.47.4101
- Wessels, Duane, 10.7 Cache Digests, Squid: The Definitive Guide 1st, O'Reilly Media: 172, January 2004, ISBN 978-0-596-00162-9,
Cache Digests are based on a technique first published by Pei Cao, called Summary Cache. The fundamental idea is to use a Bloom filter to represent the cache contents.
- Tarkoma, Sasu; Rothenberg, Christian Esteve; Lagerspetz, Eemil, Theory and practice of bloom filters for distributed systems, IEEE Communications Surveys & Tutorials, no. 1. (PDF) 14: 131–155, 2012
- Zhiwang, Cen; Jungang, Xu; Jian, Sun, A multi-layer Bloom filter for duplicated URL detection, Proc. 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE 2010) 1: V1–586–V1–591, 2010, ISBN 978-1-4244-6539-2, S2CID 3108985, doi:10.1109/ICACTE.2010.5578947
外部链接
[编辑]- Hash和Bloom Filter介绍 (页面存档备份,存于互联网档案馆)
- 布隆过滤器可视化页面: 可视化的位数组添加,直观理解布隆过滤器原理。