Zhiyong Peng, Yunjun Gao, Guoliang Li, Jianqiu Xu
2024, 14(1):1-4. DOI: 10.21655/ijsi.1673-7288.00317
Abstract: Preface
Shuangshuang Cui, Xian Wu, Hongzhi Wang, Hao Wu
2024, 14(1):5-29. DOI: 10.21655/ijsi.1673-7288.00319
Abstract: In the cloud-edge-device collaboration architecture, data types are diverse and the storage and computing resources available at each tier differ, which poses new challenges for data management. Existing data models, or simple combinations of them, cannot meet the requirements of multimodal data management and collaborative management across the cloud, edge, and devices, so multimodal data modeling for cloud-edge-device collaboration has become an important research problem. Its core question is how to efficiently obtain, from the cloud-edge-device architecture, the query results that an application needs. Starting from the data types found at the three tiers, in this paper we propose a multimodal data modeling technique for cloud-edge-device collaboration, give a tuple-based definition of the multimodal data model, and design six base classes to achieve a unified representation of multimodal data. We also propose the basic data operation architecture for cloud-edge-device collaborative queries to meet the query requirements of cloud-edge-device business scenarios, and we specify the integrity constraints of the multimodal data model, which lays a theoretical foundation for query optimization. Finally, we present a demonstration application of the multimodal data model for cloud-edge-device collaboration and evaluate the proposed storage method in terms of data storage time, storage space, and query time. The experimental results show that the proposed scheme can effectively represent multimodal data in the cloud-edge-device collaboration architecture.
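To make the tuple-based representation concrete, the sketch below shows how records of different modalities from different tiers could be wrapped in a common tuple structure. The class names and fields are hypothetical illustrations only; the paper's six base classes are not reproduced here.

```python
# A minimal sketch of a tuple-based multimodal record wrapper.
# The class names (MultimodalTuple, TimeSeriesTuple) and their fields are
# hypothetical placeholders for illustration, not the paper's base classes.
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

@dataclass
class MultimodalTuple:
    """Generic tuple: (source_tier, modality, key, payload, metadata)."""
    source_tier: str          # "cloud", "edge", or "device"
    modality: str             # e.g. "relational", "time_series", "text", "image"
    key: Tuple[Any, ...]      # primary key components
    payload: Any              # modality-specific content
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class TimeSeriesTuple(MultimodalTuple):
    """Time-series specialization: payload is a list of (timestamp, value) pairs."""
    def window(self, start: int, end: int):
        return [(t, v) for t, v in self.payload if start <= t < end]

# Usage: a device-level sensor reading represented uniformly.
reading = TimeSeriesTuple(
    source_tier="device",
    modality="time_series",
    key=("sensor-42",),
    payload=[(1700000000, 21.5), (1700000060, 21.7)],
    metadata={"unit": "celsius"},
)
print(reading.window(1700000000, 1700000060))
```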
Wendi He, Tianrui Xia, Shaoxu Song, Xiangdong Huang, Jianmin Wang
2024, 14(1):31-56. DOI: 10.21655/ijsi.1673-7288.00322
Abstract: Time-series data are widely used in industrial manufacturing, meteorology, ships, electric power, vehicles, finance, and other fields, which has driven the rapid development of time-series database management systems. Faced with larger data volumes and more diverse data modalities, storing and managing the data efficiently is critical, and data encoding and compression have become increasingly important and worth studying. Existing encoding methods and systems do not fully consider the characteristics of data in different modalities, and some time-series analysis techniques have not yet been applied to data encoding. In this paper, we give a comprehensive introduction to the multimodal data encoding methods and their system implementation in the Apache IoTDB time-series database system, with a focus on Internet of Things application scenarios. Our encoding method covers data in multiple modalities, including timestamp, numerical, Boolean, frequency-domain, and text data, and fully exploits the characteristics of each modality, especially the approximately regular intervals in the timestamp modality, to design targeted encodings. The encoding algorithms also take into account the data quality issues that may arise in practical applications. Experimental evaluation and analysis at both the algorithm level and the system level over multiple datasets validate the effectiveness of our encoding method and its system implementation.
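As an illustration of why approximately regular timestamp intervals encode well, the following is a generic second-order delta (delta-of-delta) sketch; it is not Apache IoTDB's actual encoder, only a simplified example of the underlying idea.

```python
# Second-order delta (delta-of-delta) encoding for timestamps: regular
# sampling intervals yield long runs of zeros, which compress well.
# A generic sketch, not Apache IoTDB's encoder implementation.
from typing import List

def encode_timestamps(ts: List[int]) -> List[int]:
    """Return [first, first_delta, dd_2, dd_3, ...] where dd_i are
    second-order deltas between consecutive timestamps."""
    if not ts:
        return []
    out = [ts[0]]
    prev_delta = None
    for prev, cur in zip(ts, ts[1:]):
        delta = cur - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev_delta = delta
    return out

def decode_timestamps(codes: List[int]) -> List[int]:
    """Invert encode_timestamps by re-accumulating the deltas."""
    if not codes:
        return []
    ts = [codes[0]]
    delta = 0
    for i, c in enumerate(codes[1:]):
        delta = c if i == 0 else delta + c
        ts.append(ts[-1] + delta)
    return ts

# Usage: a 10-second sampling interval with one small jitter.
raw = [1700000000, 1700000010, 1700000020, 1700000031, 1700000041]
codes = encode_timestamps(raw)
print(codes)                          # [1700000000, 10, 0, 1, -1]
assert decode_timestamps(codes) == raw
```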
Yupeng Xie, Yuyu Luo, Jianhua Feng
2024, 14(1):57-72. DOI: 10.21655/ijsi.1673-7288.00321
Abstract: With the advent of the big data era, data analysis has become increasingly important, as it can uncover valuable insights from vast datasets and thereby improve users' decision-making. Nonetheless, the data analysis workflow faces three dominant challenges: high coupling within the workflow, a plethora of interactive interfaces, and a time-intensive exploratory analysis process. To address these challenges, this paper introduces Navi, a data analysis system powered by natural language interaction. Navi adopts a modular design that abstracts three core functional modules from mainstream data analysis workflows: data querying, visualization generation, and visualization exploration, which effectively reduces the coupling of the system. Meanwhile, Navi uses natural language as a unified interactive interface and integrates the functional modules through a task scheduler, ensuring their effective collaboration. Moreover, to address the exponential search space and ambiguous user intent in visualization exploration, we propose an automated visualization exploration approach based on Monte Carlo tree search, together with a pruning algorithm and a composite reward function that incorporate visualization domain knowledge to improve search efficiency and result quality. Finally, we validate the effectiveness of Navi through both quantitative experiments and user studies.
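The following is a minimal Monte Carlo tree search skeleton over a toy visualization search space, sketching the selection, expansion, simulation, and backpropagation loop the abstract refers to; the state space and reward function are made-up stand-ins, not Navi's actual scoring model or pruning rules.

```python
# MCTS over a toy visualization spec space (chart type x aggregate).
# The reward below is a hypothetical stand-in for a composite reward.
import math
import random
from typing import List, Optional

CHART_TYPES = ["bar", "line", "scatter"]
AGGREGATES = ["sum", "avg", "count"]

class Node:
    def __init__(self, spec: dict, parent: Optional["Node"] = None):
        self.spec, self.parent = spec, parent
        self.children: List["Node"] = []
        self.visits, self.value = 0, 0.0

    def untried_moves(self) -> List[dict]:
        if "chart" not in self.spec:
            cand = [{**self.spec, "chart": c} for c in CHART_TYPES]
        elif "agg" not in self.spec:
            cand = [{**self.spec, "agg": a} for a in AGGREGATES]
        else:
            return []                                # complete spec: leaf
        tried = [c.spec for c in self.children]
        return [m for m in cand if m not in tried]

def reward(spec: dict) -> float:
    """Hypothetical reward; a real system would combine perceptual
    effectiveness and data-fit heuristics into a composite score."""
    if spec.get("chart") == "line" and spec.get("agg") == "avg":
        return 1.0
    return random.random() * 0.5

def ucb(child: Node, c: float = 1.4) -> float:
    return child.value / child.visits + c * math.sqrt(
        math.log(child.parent.visits) / child.visits)

def mcts(iterations: int = 200) -> dict:
    root = Node({})
    for _ in range(iterations):
        node = root
        # Selection: descend fully expanded nodes by UCB.
        while not node.untried_moves() and node.children:
            node = max(node.children, key=ucb)
        # Expansion: add one unexplored child if any remain.
        moves = node.untried_moves()
        if moves:
            child = Node(random.choice(moves), parent=node)
            node.children.append(child)
            node = child
        # Simulation: complete the spec randomly, then score it.
        spec = dict(node.spec)
        spec.setdefault("chart", random.choice(CHART_TYPES))
        spec.setdefault("agg", random.choice(AGGREGATES))
        r = reward(spec)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children, key=lambda n: n.visits).spec

print(mcts())   # most-visited first-level choice, e.g. {'chart': 'line'}
```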
Tianming Zhang, Shan Zhang, Xi Liu, Bin Cao, Jing Fan
2024, 14(1):73-96. DOI: 10.21655/ijsi.1673-7288.00318
Abstract: As a crucial subtask in Natural Language Processing (NLP), Named Entity Recognition (NER) aims to extract important information from text, which helps downstream tasks such as machine translation, text generation, knowledge graph construction, and multimodal data fusion to deeply understand the complex semantics of the text and complete these tasks effectively. In practice, because of time and labor costs, NER suffers from a scarcity of annotated data, known as few-shot NER. Although text-based few-shot NER methods have achieved good generalization performance, the semantic information the model can extract remains limited by the small number of samples, leading to poor predictions. To this end, in this paper we propose a few-shot NER model based on multimodal data fusion, which for the first time provides additional semantic information through multimodal data to support model prediction and further improve the effect of multimodal data fusion and modeling. The method converts image information into text as auxiliary modality information, which effectively addresses the poor modality alignment caused by the inconsistent granularity of semantic information in text and images. To account for label dependencies in few-shot NER, we adopt the CRF framework and introduce state-of-the-art meta-learning methods as the emission module and the transition module. To alleviate the negative impact of noisy samples in the auxiliary modality, we propose a general denoising network based on meta-learning, which measures the variability of the samples and estimates how much each sample benefits the model. Finally, we conduct extensive experiments on real unimodal and multimodal datasets. The results demonstrate the outstanding generalization performance of the proposed method, which outperforms the state-of-the-art methods by 10 F1 points in the 1-shot scenario.
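The sketch below shows how a CRF-style decoder combines emission and transition scores via Viterbi decoding, which is the role played by the emission and transition modules mentioned above; the label set and score values are toy examples, and the real model would produce such scores with meta-learned networks.

```python
# Viterbi decoding over emission and transition scores, as used in a
# CRF-style tagger. The labels and score matrices are toy values.
from typing import List

LABELS = ["O", "B-PER", "I-PER"]

def viterbi(emissions: List[List[float]],
            transitions: List[List[float]]) -> List[str]:
    """emissions[t][j]: score of label j at token t;
    transitions[i][j]: score of moving from label i to label j."""
    n = len(LABELS)
    score = emissions[0][:]            # best score of a path ending in each label
    back: List[List[int]] = []         # backpointers per time step
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back the highest-scoring label sequence.
    best = max(range(n), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [LABELS[j] for j in reversed(path)]

# Usage: three tokens, e.g. "Alice likes Bob".
emissions = [[0.1, 2.0, 0.0],      # token 0 leans B-PER
             [1.5, 0.2, 0.3],      # token 1 leans O
             [0.2, 1.8, 0.1]]      # token 2 leans B-PER
transitions = [[0.5, 0.5, -2.0],   # O -> {O, B-PER, I-PER}
               [0.2, -1.0, 0.8],   # B-PER -> ...
               [0.2, -1.0, 0.5]]   # I-PER -> ...
print(viterbi(emissions, transitions))   # ['B-PER', 'O', 'B-PER']
```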
Zijun Chen, Delong Ma, Yishu Wang, Ye Yuan
2024, 14(1):97-117. DOI: 10.21655/ijsi.1673-7288.00320
Abstract: Personalized PageRank, a fundamental algorithm in large-scale graph analysis, is widely used in search engines, social recommendation, community detection, and other fields, and has long attracted researchers' attention. Existing distributed personalized PageRank algorithms assume that all data reside in the same geographic location and that the network environment among the computing nodes is uniform. In the real world, however, the data may be spread across multiple data centers on different continents, and these geo-distributed data centers are connected by WANs characterized by heterogeneous network bandwidth, large hardware differences, and high communication costs. Moreover, distributed personalized PageRank requires multiple iterations and random walks over the global graph. As a result, existing distributed personalized PageRank algorithms are not suitable for geo-distributed environments. To address this problem, this paper proposes the GPPR (Geo-distributed Personalized PageRank) algorithm. GPPR first preprocesses the big graph data in the geo-distributed environment and maps the graph data with a heuristic algorithm to reduce the impact of bandwidth heterogeneity on the algorithm's iteration speed. Second, GPPR improves the random walk approach with a probability-based push algorithm that reduces the bandwidth load of data transmission between working nodes and thereby further lowers the number of iterations required. We implement GPPR on the Spark framework and build a real geo-distributed environment on AliCloud to compare GPPR with several representative distributed personalized PageRank algorithms on eight open-source big graph datasets. The results show that, in the geo-distributed environment, GPPR reduces the volume of communicated data by 30% on average compared with the other algorithms, and in terms of running efficiency it is on average 2.5 times faster.
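The following is a single-machine forward-push sketch for personalized PageRank on a toy adjacency list, illustrating the kind of push primitive that a probability-based push algorithm builds on; it is not the distributed, geo-aware GPPR implementation itself.

```python
# Forward push for approximate personalized PageRank on a small directed
# graph. A single-machine sketch of the push primitive, not GPPR.
from collections import defaultdict
from typing import Dict, List

def forward_push(graph: Dict[int, List[int]], source: int,
                 alpha: float = 0.15, eps: float = 1e-4) -> Dict[int, float]:
    """Approximate PPR from `source`: repeatedly push residual mass until
    every node's residual falls below eps times its out-degree."""
    reserve = defaultdict(float)                  # settled PPR estimate
    residual = defaultdict(float, {source: 1.0})  # mass still to distribute
    frontier = [source]
    while frontier:
        u = frontier.pop()
        out = graph.get(u, [])
        if not out or residual[u] < eps * len(out):
            continue                              # nothing worth pushing
        r = residual[u]
        residual[u] = 0.0
        reserve[u] += alpha * r                   # keep alpha fraction here
        share = (1 - alpha) * r / len(out)        # spread the rest to neighbors
        for v in out:
            residual[v] += share
            if residual[v] >= eps * max(len(graph.get(v, [])), 1):
                frontier.append(v)
    return dict(reserve)

# Usage: a small directed graph, personalized to node 0.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
scores = forward_push(graph, source=0)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```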