2021, 11(1):1-4. DOI: 10.21655/ijsi.1673-7288.00244
Abstract: In recent years, data management and analysis techniques supporting Artificial Intelligence (AI) have become a hot topic in the fields of big data and AI. Using and developing the theory and technology of data management and analysis provides basic support for improving the efficiency and effectiveness of the AI system life cycle, and will further promote the development of big-data-based AI technology and its wider application. In particular, AI technology represented by machine learning extracts knowledge by modeling data, and a single training process includes multiple sub-processes such as data selection, feature extraction, algorithm selection, hyper-parameter tuning, and effect evaluation. After the evaluation result is obtained at the end of training, it is usually necessary to analyze the model's performance manually to mine its relationship with the data, features, and algorithms, and the training sub-processes are then adjusted and iterated over multiple rounds based on this analysis and human experience. Machine learning tasks are thus far more complicated than the query and analysis tasks of database systems. Because machine learning involves many training sub-processes and iterative adjustments, and many sub-processes require manual participation, the training process remains task-oriented: the sub-processes are customized and optimized according to the features of each task. This approach incurs a high labor cost and cannot reuse resources such as data, features, and models across tasks, leading to high cost, low efficiency, and high energy consumption. How to reduce the management cost of AI computing processes such as machine learning, and thereby improve intelligent computing efficiency, has therefore become a core challenge in this field.
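The iterative training loop sketched in the abstract (data selection, feature extraction, algorithm and hyper-parameter selection, effect evaluation, then another round) can be illustrated with a stdlib-only toy; the synthetic data, the k-NN learner, and all names are illustrative assumptions, not from the paper:

```python
import random

# Synthetic two-class data: 4 features, class centers at 0.0 and 3.0.
random.seed(0)
data = [([random.gauss(c, 1.0) for _ in range(4)], label)
        for label, c in [(0, 0.0), (1, 3.0)] for _ in range(40)]
random.shuffle(data)
train, test = data[:60], data[60:]          # data selection

def knn_predict(x, k, feats):
    # k-NN vote using only the selected feature subset
    dist = lambda item: sum((item[0][f] - x[f]) ** 2 for f in feats)
    votes = [lbl for _, lbl in sorted(train, key=dist)[:k]]
    return max(set(votes), key=votes.count)

def evaluate(k, feats):                     # effect evaluation
    hits = sum(knn_predict(x, k, feats) == y for x, y in test)
    return hits / len(test)

# Feature selection and hyper-parameter tuning written as an explicit
# search loop; in practice each round is adjusted manually, as noted above.
score, k, feats = max((evaluate(k, feats), k, feats)
                      for k in (1, 3, 5)
                      for feats in ((0, 1), (0, 1, 2, 3)))
print("best accuracy:", score)
```

Each pass through `evaluate` corresponds to one training round; the outer search stands in for the manual adjust-and-iterate process whose cost the abstract discusses.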
2021, 11(1):5-28. DOI: 10.21655/ijsi.1673-7288.00242
Abstract: Compared with conventional graph data analysis methods, graph embedding algorithms provide a new strategy for graph data analysis. They aim to encode graph nodes into vectors so that graph data can be mined or analyzed more effectively with neural-network-related technologies. Graph embedding methods have significantly improved some classic tasks, such as node classification, link prediction, and traffic flow prediction. Although former researchers have made substantial breakthroughs in graph embedding, the node embedding problem over temporal graphs has seldom been studied. In this study, we propose an adaptive temporal graph embedding (ATGED) method, attempting to encode temporal graph nodes into vectors by combining previous research with the information propagation characteristics of the graph. First, an adaptive clustering method is proposed to handle the fact that the activity frequency of nodes varies across different types of graphs. Then, a new node walk strategy is designed to preserve the time order between nodes, and the walking lists are stored in a bidirectional multi-tree during the walking process so that complete walking lists can be obtained quickly. Last, based on the basic walking characteristics and the graph topology, an important-node sampling strategy is proposed to train a satisfactory neural network as quickly as possible. Sufficient experiments demonstrate that the proposed method surpasses existing embedding methods in node clustering, reachability prediction, and node classification on temporal graphs.
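The core of a time-respecting walk strategy like the one described above is that each step may only follow edges whose timestamps do not decrease, so the walk list preserves temporal order. A minimal sketch (the edge list and function names are illustrative; the adaptive clustering, bidirectional multi-tree storage, and sampling strategy are omitted):

```python
import random

# Hypothetical temporal edge list: (src, dst, timestamp).
edges = [("a", "b", 1), ("b", "c", 2), ("b", "d", 3), ("c", "e", 4), ("d", "e", 5)]

def temporal_walk(start, length, edges, seed=0):
    """Walk that only follows edges with non-decreasing timestamps, so the
    resulting walk list preserves the time order between nodes."""
    rng = random.Random(seed)
    walk, node, t = [start], start, float("-inf")
    for _ in range(length - 1):
        # candidate next hops must occur at or after the current time
        nxt = [(v, ts) for u, v, ts in edges if u == node and ts >= t]
        if not nxt:
            break
        node, t = rng.choice(nxt)
        walk.append(node)
    return walk

walk = temporal_walk("a", 5, edges)
print(walk)
```

Walks like these can then be fed to a skip-gram-style embedding model, with the temporal constraint replacing the unordered neighborhoods of static random walks.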
2021, 11(1):29-54. DOI: 10.21655/ijsi.1673-7288.00239
Abstract: As the basis of data management and analysis, data quality issues have increasingly become a research hotspot in related fields, contributing to the optimization of big data and artificial intelligence technology. Generally, physical failures or technical defects in data collectors and recorders cause anomalies in the collected data. These anomalies strongly impact subsequent data analysis and artificial intelligence processes; thus, data should be processed and cleaned accordingly before application. Existing repairing methods based on smoothing cause a large number of originally correct data points to be over-repaired into wrong values, while constraint-based methods such as sequential dependency and SCREEN cannot accurately repair data under complex conditions because their constraints are relatively simple. This study proposes a time series data repairing method under multi-speed constraints, based on the principle of minimum repairing, and uses dynamic programming to find the optimal repair for anomalous data. Specifically, multiple speed intervals are set to constrain the time series data, and a series of candidate repairing points is formed for each data point according to the speed constraints; the optimal repair is then selected from these candidates by dynamic programming. To study the feasibility of the method, an artificial dataset, two real datasets, and another real dataset with real anomalies are employed in experiments with different anomaly rates and data sizes. Experimental results demonstrate that, compared with existing methods based on smoothing or constraints, the proposed method performs better in terms of RMS error and time cost. In addition, an investigation of clustering and classification accuracy on several datasets reveals the impact of data quality on subsequent data analysis and artificial intelligence tasks. The proposed method can improve the quality of data analysis and artificial intelligence results.
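The candidate-plus-dynamic-programming idea can be sketched in heavily simplified form. Here a single symmetric speed bound and a small fixed offset grid of candidate values stand in for the paper's multiple speed intervals; the example series and all names are illustrative:

```python
def repair(values, times, smax, offsets=(-3, -2, -1, 0, 1, 2, 3)):
    """Choose one candidate per point so every consecutive speed
    |x[i]-x[i-1]| / (t[i]-t[i-1]) stays within smax, while minimizing the
    total absolute change from the original series (minimum-repair idea)."""
    INF = float("inf")
    n, m = len(values), len(offsets)
    # candidate repairing points: each original value plus a small offset
    cands = [[v + o for o in offsets] for v in values]
    cost = [[abs(cands[0][j] - values[0]) if i == 0 else INF for j in range(m)]
            for i in range(n)]
    back = [[0] * m for _ in range(n)]
    for i in range(1, n):
        dt = times[i] - times[i - 1]
        for j in range(m):
            for k in range(m):
                if abs(cands[i][j] - cands[i - 1][k]) <= smax * dt:
                    c = cost[i - 1][k] + abs(cands[i][j] - values[i])
                    if c < cost[i][j]:
                        cost[i][j], back[i][j] = c, k
    # trace the cheapest feasible candidate sequence back to the start
    j = min(range(m), key=lambda col: cost[n - 1][col])
    out = []
    for i in range(n - 1, -1, -1):
        out.append(cands[i][j])
        j = back[i][j]
    return out[::-1]

series = [1.0, 1.1, 5.0, 1.3, 1.4]        # one obvious spike at index 2
repaired = repair(series, [0, 1, 2, 3, 4], smax=1.0)
print(repaired)
```

Note how only the spike is moved (to the nearest speed-feasible candidate) while the correct points stay untouched, in contrast to smoothing, which would perturb every point.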
2021, 11(1):55-67. DOI: 10.21655/ijsi.1673-7288.00240
Abstract: Recently, unsupervised hashing has attracted much attention in the machine learning and information retrieval communities due to its low storage cost and high search efficiency. Most existing unsupervised hashing methods rely on the local semantic structure of the data as guiding information and require that this semantic structure be preserved in the Hamming space. Thus, how to precisely represent the local structure of the data and the hash codes becomes the key to success. This study proposes a novel hashing method based on self-supervised learning. Specifically, it utilizes contrastive learning to acquire a compact and accurate feature representation for each sample, from which a semantic structure matrix is constructed to represent the similarity between samples. Meanwhile, inspired by the recently proposed instance discrimination method, a new loss function is designed to preserve the semantic information and improve the discriminative ability in the Hamming space. The proposed framework is end-to-end trainable. Extensive experiments on two large-scale image retrieval datasets show that the proposed method significantly outperforms current state-of-the-art methods.
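For intuition about the binarization step, the toy sketch below maps (assumed pre-learned) feature vectors to binary codes with a fixed sign projection. This is an LSH-style stand-in, not the paper's learned hash function or its contrastive training; the vectors and projection matrix are invented for illustration:

```python
# Toy feature vectors; in the paper these would come from the
# self-supervised (contrastive) encoder, which is omitted here.
feats = {"cat1": [0.9, 0.1, 0.8], "cat2": [0.85, 0.15, 0.75],
         "car": [0.1, 0.9, 0.05]}

# Fixed 8-row sign-projection matrix (one bit per row), purely illustrative.
proj = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, -1, 0],
        [1, 0, -1], [0, 1, -1], [1, 1, -1], [1, -1, 1]]

def hash_code(x):
    # one bit per projection row: the sign of the inner product
    return [1 if sum(w * xi for w, xi in zip(row, x)) > 0 else 0 for row in proj]

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

codes = {k: hash_code(v) for k, v in feats.items()}
# near-duplicate cats collide; the car stays far away in Hamming space
print(hamming(codes["cat1"], codes["cat2"]), hamming(codes["cat1"], codes["car"]))
```

The goal of the learned method is precisely to make Hamming distances between codes reproduce the similarity structure that such feature vectors encode, but with far better precision than a random or fixed projection.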
2021, 11(1):69-90. DOI: 10.21655/ijsi.1673-7288.00241
Abstract: With the development of big data applications, the demand for fused management and analysis of large-scale structured and unstructured data is becoming increasingly prominent. However, the differences in the management, processing, and retrieval of structured and unstructured data pose challenges for fused management and analysis. This study proposes an extended property graph model for heterogeneous data fusion management and semantic computing, and defines the related property operators and query syntax. Based on this extended property graph model, the study implements PandaDB, an intelligent fusion management system for heterogeneous data, and describes its architecture, storage mechanism, query mechanism, property co-storage, AI algorithm scheduling, and distributed architecture. Experiments and case studies show that the co-storage mechanism and distributed architecture of PandaDB deliver good performance acceleration and can be applied in fused-data intelligent management scenarios such as entity disambiguation in academic knowledge graphs.
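The idea of an extended property graph, where a node carries unstructured properties alongside structured ones and queries can apply semantic operators to them, can be sketched minimally. The classes, the bag-of-words similarity operator, and the data are hypothetical stand-ins, not PandaDB's actual model or API:

```python
class Node:
    """Property graph node whose properties may be structured values
    or unstructured blobs such as free text."""
    def __init__(self, nid, **props):
        self.nid, self.props = nid, props

def semantic_sim(a, b):
    # toy semantic operator on unstructured text properties (Jaccard overlap)
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

nodes = [
    Node(1, name="Alice", bio="graph database systems research"),
    Node(2, name="A. Liu", bio="research on graph database systems"),
    Node(3, name="Bob", bio="protein folding simulation"),
]

# entity-disambiguation-style query: node pairs whose unstructured
# property is semantically close, despite different structured names
pairs = [(x.nid, y.nid) for i, x in enumerate(nodes) for y in nodes[i + 1:]
         if semantic_sim(x.props["bio"], y.props["bio"]) > 0.5]
print(pairs)
```

In a real system the similarity operator would be backed by an AI model scheduled by the database, which is exactly the kind of property operator the extended model introduces.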
2021, 11(1):91-116. DOI: 10.21655/ijsi.1673-7288.00243
Abstract: Knowledge graphs are an important cornerstone of artificial intelligence and currently have two main data models: RDF graphs and property graphs. Several query languages exist for these two data models, including SPARQL on RDF graphs and Cypher on property graphs. Over the last decade, various communities have developed different data management methods for RDF graphs and property graphs, and the inconsistent data models and query languages hinder the wider application of knowledge graphs. In this paper, we propose a knowledge graph database (KGDB) system with a unified data model and query language. (1) We work out a unified storage scheme based on the relational model that supports the efficient storage of both RDF graphs and property graphs, catering to the smooth storage and querying of knowledge graph data. (2) Characteristic-set-based clustering is used in KGDB for the storage of typeless entities. (3) KGDB realizes the interoperability of SPARQL and Cypher by enabling them to operate on the same knowledge graph. Extensive experiments on real-world and synthetic datasets reveal that KGDB is more efficient than existing knowledge graph database management systems in both storage management and query efficiency. KGDB saves 30% of storage space on average compared with gStore and Neo4j, and is two orders of magnitude faster than gStore and Neo4j in queries on the real-world datasets, as shown by experiments on basic graph pattern matching queries.
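The unified-storage idea can be sketched with a tiny relational schema that serves both views: edges double as RDF triples, while node labels and properties support Cypher-style patterns. The schema and data below are illustrative assumptions, not KGDB's actual (characteristic-set-clustered) layout:

```python
import sqlite3

# One relational schema holding both RDF-style triples and property-graph
# data, so SPARQL-like and Cypher-like lookups hit the same tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE nodes(id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE props(node_id INTEGER, key TEXT, value TEXT);
CREATE TABLE edges(src INTEGER, type TEXT, dst INTEGER);
""")
db.execute("INSERT INTO nodes VALUES (1,'Person'),(2,'Person')")
db.execute("INSERT INTO props VALUES (1,'name','Alice'),(2,'name','Bob')")
db.execute("INSERT INTO edges VALUES (1,'knows',2)")

# RDF view: each edge row is read as a (subject, predicate, object) triple
triples = db.execute("SELECT src, type, dst FROM edges").fetchall()

# Cypher-like pattern MATCH (a:Person)-[:knows]->(b) RETURN b.name, as SQL
rows = db.execute("""
    SELECT p.value FROM edges e
    JOIN nodes a ON a.id = e.src AND a.label = 'Person'
    JOIN props p ON p.node_id = e.dst AND p.key = 'name'
    WHERE e.type = 'knows'
""").fetchall()
print(triples, rows)
```

Because both query styles compile to plans over the same tables, the two languages can interoperate on one stored knowledge graph, which is the interoperability property the paper claims for KGDB.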