With the rapid development of information technology, data volumes continue to grow exponentially, and extracting value from that data is difficult. This poses significant challenges to the efficient management and control of every stage of the data life cycle, such as data collection, cleaning, storage, and sharing. A sketch uses a hash table, matrix, or bit vector to track core characteristics of data, such as frequency, cardinality, and membership. This mechanism makes the sketch itself a form of metadata, and sketches have been widely used in sharing, transmission, update, and other scenarios. The fast-flowing nature of big data has given rise to dynamic sketches. Existing dynamic sketches can expand or shrink their capacity with the size of the data stream by dynamically maintaining a list of probabilistic data structures organized as a chain or a tree. However, they suffer from excessive space overhead and from time overhead that increases with the dataset cardinality. This paper designs a dynamic sketch for big data governance based on advanced jump consistent hash. The method simultaneously achieves space overhead that grows linearly with the dataset cardinality and constant time overhead for data processing and analysis, effectively supporting demanding big data processing and analysis tasks for big data governance. The validity and efficiency of the proposed method are verified through comparison with traditional methods on various synthetic and natural datasets.
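As background for the hashing primitive named above, the following is a minimal Python sketch of the standard jump consistent hash of Lamping and Veach. It is not the paper's "advanced" variant and not the proposed dynamic sketch itself; the function name `jump_consistent_hash` and the demo values are illustrative assumptions. It shows the property that makes this hash attractive for structures that grow: when the number of buckets increases from n to n + 1, a key either stays in its bucket or moves to the new bucket n.

```python
# Standard jump consistent hash (Lamping and Veach), shown for illustration only.
# Assumption: keys are already 64-bit unsigned integers (e.g. a hash of the item).

MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned wrap-around arithmetic


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket index in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # Linear congruential step that drives the pseudo-random jumps.
        key = (key * 2862933555777941757 + 1) & MASK64
        # Jump forward to the next bucket count at which this key would move.
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


if __name__ == "__main__":
    # Count how many of 10,000 keys move when a 5th bucket is added.
    keys = range(10_000)
    moved = sum(
        1 for k in keys
        if jump_consistent_hash(k, 4) != jump_consistent_hash(k, 5)
    )
    print(f"keys moved when growing from 4 to 5 buckets: {moved} of {len(keys)}")
```

Roughly one key in five should move in the demo, and every moved key lands in the new bucket; this is what allows capacity to grow without rehashing all stored items.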
Pengtao Fu, Lailong Luo, Deke Guo, Xiang Zhao, Shangsen Li, Huaimin Wang. Jump Filter: A Dynamic Sketch for Big Data Governance. International Journal of Software and Informatics, 2023, 13(1): 57–85.