2023, 13(1):1-4. DOI: 10.21655/ijsi.1673-7288.00293
2023, 13(1):5-26. DOI: 10.21655/ijsi.1673-7288.00294
Abstract:Government data governance is undergoing a transition from "physical data aggregation" to "logical semantic unification". The long-term "autonomy" of government information silos has led to a wide spectrum of metadata curation issues, such as attributes with the same name but different meanings, or with different names but the same meaning. Instead of rebuilding or modifying legacy information systems, or physically aggregating data from government information silos, logical semantic unification solves this problem by unifying the semantic expression of the metadata in government information silos, thereby achieving standardized metadata governance. This paper focuses on logical semantic unification, which semantically aligns the metadata in each government information silo with existing standard metadata. Specifically, the names of the standard metadata are abstracted as semantic labels, and the column projections of silo relational data are semantically recognized to align column names with the standard metadata and ultimately achieve standardized governance of silo metadata. Existing semantic recognition techniques based on column projection fail to capture the column order-independent features of relational data and the correlation features among attributes and semantic labels. To address this problem, we propose a two-phase model consisting of a prediction phase and a correction phase. In the prediction phase, a Co-occurrence-Attribute-Interaction (CAI) model is proposed to guarantee the column order-independent property by employing a parallelized self-attention mechanism; in the correction phase, a correction mechanism is introduced to optimize the prediction results of the CAI model by exploiting the co-occurrence of semantic labels.
Experiments are conducted on a government benchmark dataset and several public English datasets, such as Magellan, and the results show that the two-phase model with the correction mechanism outperforms the current state-of-the-art model in macro-average and weighted-average scores by up to 20.03% and 13.36%, respectively.
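The column order-independent property that the CAI model obtains from self-attention can be illustrated with a minimal pure-Python sketch (all names here are hypothetical, and this single-head attention is a didactic stand-in for the paper's parallelized mechanism): because no positional encoding is added, each column's output depends only on the *set* of column embeddings, so permuting the columns merely permutes the outputs.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(cols):
    """Single-head self-attention over column embeddings without positional
    encoding: each output row is a score-weighted sum over all columns, so
    the result is equivariant to column order."""
    out = []
    for q in cols:
        # dot-product scores of this column against every column
        scores = softmax([sum(a * b for a, b in zip(q, k)) for k in cols])
        out.append([sum(w * v[i] for w, v in zip(scores, cols))
                    for i in range(len(q))])
    return out
```

Permuting the input columns permutes the output rows identically, which is exactly the order-independence a relational table requires (attribute order carries no meaning).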
2023, 13(1):27-55. DOI: 10.21655/ijsi.1673-7288.00295
Abstract:With the emergence and accumulation of massive data, data governance has become an important means of improving data quality and maximizing data value. In particular, data error detection is a crucial step in improving data quality and has attracted wide attention from both industry and academia. At present, various detection methods tailored to a single data source have been proposed. However, in many real-world scenarios, data are not centrally stored or managed. Highly correlated data from different sources can be employed to improve the accuracy of error detection. Unfortunately, due to privacy and security concerns, cross-source data are often not allowed to be integrated centrally. To this end, this paper proposes FeLeDetect, a cross-source data error detection method based on federated learning, which improves error detection accuracy by using cross-source data information while preserving data privacy. First, a graph-based error detection model, GEDM, is presented to capture sufficient data features from each data source. On this basis, the paper designs a federated co-training algorithm, FCTA, to collaboratively train GEDM across different data sources without leaking private data. Furthermore, the paper designs a series of optimization methods to reduce the communication cost of federated learning and the manual labeling effort. Finally, extensive experiments on three real-world datasets demonstrate that (1) GEDM achieves an average improvement of 10.3% and 25.2% in F1 score in the local and centralized scenarios, respectively, outperforming all five existing state-of-the-art error detection methods; and (2) the F1 score of FeLeDetect is 23.2% higher on average than that of GEDM in the local scenario.
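The federated co-training idea can be sketched in a few lines of toy Python (this is an illustrative simplification, not FCTA itself: the "model" is a hypothetical one-dimensional threshold classifier, and all function names are invented). The key point is that parties exchange only model-produced pseudo-labels on a shared unlabeled pool, never their raw records.

```python
def train_threshold(pairs):
    """Toy local model: pick the threshold t minimizing errors of the
    rule `value >= t -> positive` on (value, label) pairs."""
    best_t, best_err = 0.0, float("inf")
    for t, _ in pairs:
        err = sum((v >= t) != y for v, y in pairs)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def federated_co_train(parties, unlabeled, rounds=3):
    """Each party trains on its own labeled data plus jointly produced
    pseudo-labels; only predictions on the unlabeled pool are shared,
    so raw local records never leave their owner."""
    pseudo = []
    for _ in range(rounds):
        models = [train_threshold(p + pseudo) for p in parties]
        # strict majority vote of the party models yields pseudo-labels
        pseudo = [(x, sum(x >= m for m in models) * 2 > len(models))
                  for x in unlabeled]
    return models
```

Real federated co-training additionally encrypts or perturbs the exchanged predictions; this sketch only shows the data-flow pattern that avoids centralizing the sources.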
2023, 13(1):57-85. DOI: 10.21655/ijsi.1673-7288.00296
Abstract:With the rapid development of information technology, data volumes keep growing exponentially, and the value of data is hard to mine. This brings significant challenges to the efficient management and control of each stage of the data life cycle, such as data collection, cleaning, storage, and sharing. A sketch uses a hash table, matrix, or bit vector to track core characteristics of data, such as frequency, cardinality, and membership. This mechanism makes the sketch itself metadata, which has been widely used in sharing, transmission, update, and other scenarios. The rapidly streaming nature of big data has spawned dynamic sketches. Existing dynamic sketches can expand or shrink their capacity with the size of the data stream by dynamically maintaining a list of probabilistic data structures in a chain or tree structure; however, they suffer from excessive space overhead and a time overhead that grows with the dataset cardinality. This paper designs a dynamic sketch for big data governance based on jump consistent hashing. The method simultaneously achieves space overhead that grows linearly with the dataset cardinality and constant time overhead for data processing and analysis, effectively supporting the demanding processing and analysis tasks of big data governance. The validity and efficiency of the proposed method are verified by comparison with traditional methods on various synthetic and real-world datasets.
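Jump consistent hashing (Lamping and Veach's published algorithm, shown below; how the paper's sketch builds on it is not detailed in the abstract) is what makes such capacity changes cheap: it maps a 64-bit key to one of `n` buckets in O(log n) time and O(1) space, and when the bucket count grows from `n` to `n+1`, only about 1/(n+1) of the keys move, all of them into the new bucket.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: returns a bucket in [0, num_buckets).
    The sequence of candidate jumps depends only on the key, so growing
    the bucket count never reshuffles keys among existing buckets."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step
        key = (key * 2862933555777941757 + 1) % (1 << 64)
        # jump forward; the expected jump length doubles the position
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b
```

That stability property is exactly what a dynamic sketch needs when it expands with the dataset cardinality: already-placed items keep their slots, so lookups stay constant-time.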
2023, 13(1):87-115. DOI: 10.21655/ijsi.1673-7288.00297
Abstract:Big data has become a national basic strategic resource, and the opening and sharing of data is at the core of China's big data strategy. Cloud-native technology and the lake-house architecture are reconstructing the big data infrastructure and promoting data sharing and value dissemination. The development of the big data industry and technology requires stronger data security and data sharing capabilities. However, data security in open environments has become a bottleneck that restricts the development and utilization of big data technology. Data security and privacy protection issues have become increasingly prominent in both the open-source big data ecosystem and commercial big data systems. Dynamic data protection systems in the open big data environment face challenges regarding data availability, processing efficiency, and system scalability. This paper proposes BDMasker, a dynamic data protection system for the open big data environment. Through precise query analysis and query rewriting based on a query dependency model, it accurately perceives but does not change the original business request, so the entire dynamic masking process has zero impact on the business. Furthermore, its multi-engine-oriented unified security strategy framework realizes the vertical expansion of dynamic data protection capabilities and horizontal expansion across multiple computing engines. The distributed computing capability of the big data execution engine is leveraged to improve the system's data protection performance. The experimental results show that the precise SQL analysis and rewriting technology proposed in BDMasker is effective, and that the system has good scalability and performance, with overall performance fluctuating within 3% in the TPC-DS and YCSB benchmark tests.
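The rewriting idea behind such dynamic masking can be sketched as follows (a deliberately naive regex-based toy, assuming a hypothetical `POLICY` mapping; BDMasker itself works on a full query dependency model, not string matching): sensitive columns in a `SELECT` list are replaced by masking expressions, so the engine computes already-masked results while the client's request text is never altered.

```python
import re

# Hypothetical masking policy: column name -> SQL masking expression template
POLICY = {
    "phone": "CONCAT(SUBSTR({c}, 1, 3), '****')",
    "id_card": "'***'",
}

def rewrite_select(sql: str) -> str:
    """Rewrite `SELECT col, ... FROM t` so that policy-protected columns
    are wrapped in masking expressions (aliased back to their original
    names, keeping the result schema unchanged for the caller)."""
    m = re.match(r"(?is)select\s+(.*?)\s+from\s+(.*)", sql.strip())
    if not m:
        return sql  # not a simple SELECT; leave untouched in this toy
    cols = [c.strip() for c in m.group(1).split(",")]
    out = [POLICY[c].format(c=c) + f" AS {c}" if c in POLICY else c
           for c in cols]
    return "SELECT " + ", ".join(out) + " FROM " + m.group(2)
```

Because masking happens inside the rewritten query, it executes with the engine's own distributed operators, which is what lets a system like BDMasker inherit the engine's scalability.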
2023, 13(1):117-137. DOI: 10.21655/ijsi.1673-7288.00298
Abstract:Recently, many countries and regions have enacted data security policies, such as the General Data Protection Regulation proposed by the EU. The release of related laws and regulations has aggravated the problem of data silos, making it difficult to share data among various data owners. Data federation is a possible solution to this problem: multiple data owners jointly perform query tasks without leaking their original data, using privacy computing technologies such as secure multi-party computation. This concept has become a research trend in recent years, and a series of representative systems have been proposed, such as SMCQL and Conclave. However, for join queries, which are central to relational database systems, existing data federation systems still have the following problems. First, only a single type of join query is supported, which falls short of the query requirements under complex join conditions. Second, there is large room for performance improvement, because existing systems often invoke secure computation libraries directly, incurring high runtime and communication overhead. Therefore, this paper proposes a join algorithm under data federation to address the above issues. The main contributions of this paper are as follows. First, multi-party federated security operators are designed and implemented that support a variety of operations. Second, a federated θ-join algorithm and an optimization strategy are proposed to significantly reduce the cost of secure computation. Finally, the performance of the proposed algorithm is verified on the TPC-H benchmark dataset. The experimental results show that, compared with the existing data federation systems SMCQL and Conclave, the proposed algorithm reduces runtime and communication overhead by 61.33% and 95.26%, respectively.
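What a federated θ-join must compute can be illustrated with a toy sketch (all names invented; the order-preserving masking below is a weak didactic stand-in for the paper's secure comparison operators, not its actual MPC protocol): two owners apply a shared secret strictly increasing map to their join keys, and a semi-honest broker evaluates the θ condition on masked values only, never seeing raw data.

```python
import random

def theta_join_via_masking(a_vals, b_vals, theta, seed=2023):
    """Toy federated theta-join on two owners' value lists.
    Both owners share a secret increasing affine map r*x + s (r > 0),
    which preserves <, =, > while hiding raw magnitudes from the broker.
    Returns matching (index_in_a, index_in_b) pairs."""
    rng = random.Random(seed)           # shared secret between the owners
    r, s = rng.randint(1, 10**6), rng.randint(0, 10**6)
    mask = lambda x: r * x + s          # order-preserving masking
    ma = [(i, mask(x)) for i, x in enumerate(a_vals)]
    mb = [(j, mask(y)) for j, y in enumerate(b_vals)]
    # broker side: compares masked values only
    ops = {"<": lambda u, v: u < v,
           "=": lambda u, v: u == v,
           ">": lambda u, v: u > v}
    return [(i, j) for i, u in ma for j, v in mb if ops[theta](u, v)]
```

A real system replaces the affine mask with cryptographic secure comparison (order-preserving masking leaks the ordering itself); the sketch only shows why general θ conditions need comparison, not just equality, across parties.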