2021, 11(2):117-119. DOI: 10.21655/ijsi.1673-7288.00245
Abstract: Preface: Special Issue on Advances in System Software
Fengjuan Gao, Yu Wang, Tianjiao Chen, Lingyun Situ, Linzhang Wang, Xuandong Li
2021, 11(2):121-147. DOI: 10.21655/ijsi.1673-7288.00246
Abstract: With the rapid development of mobile computing, IoT, cloud computing, and artificial intelligence, many new programming languages and compilers are emerging. Nevertheless, C/C++ remains one of the most popular languages, and the array is one of the most important data structures of the C language. When an array element is accessed, it is necessary to check whether the index lies within the bounds of the array; otherwise, an array index out-of-bounds error occurs. Array index out-of-bounds defects can cause serious errors during execution, such as system crashes. Worse still, such defects open the door for attackers to take control of a server and execute arbitrary malicious code by carefully constructing inputs and hijacking the control flow of the program. Existing static methods for array bounds checking cannot achieve high accuracy or handle complex constraints and expressions, leading to massive false positives and increasing the burden on developers. In this study, a static checking method based on taint analysis is proposed. First, a flow-sensitive, context-sensitive, and on-demand pointer analysis is proposed to analyze the range of array lengths. Then, an on-demand taint analysis is performed for all array indices and array length expressions. Finally, rules are defined for checking array index out-of-bounds defects, and the checking is realized based on backward data flow analysis. During the analysis, complex constraints and expressions are handled by invoking a constraint solver to check the satisfiability of the conditions. If no statement preventing the array index from going out of bounds is found in the program, an array index out-of-bounds warning is reported. An automatic static analysis tool, Carraybound, has been implemented, and the experimental results show that Carraybound works effectively and efficiently.
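To make the solver-assisted checking step concrete, below is a minimal, illustrative sketch (not Carraybound's actual implementation) of how a constraint solver can decide whether an index expression may fall outside the array bounds under a collected path condition; the program fragment and variable names are hypothetical.

```python
# Illustrative sketch, not Carraybound's implementation: ask a constraint
# solver (Z3) whether "index < 0 or index >= length" is satisfiable under
# the path condition gathered by the backward data flow analysis.
from z3 import Int, Solver, Or, sat

def may_be_out_of_bounds(path_condition, index, length):
    """Return True if an out-of-bounds access is feasible, i.e., a warning
    should be reported because no condition rules it out."""
    s = Solver()
    s.add(path_condition)
    s.add(Or(index < 0, index >= length))
    return s.check() == sat

# Hypothetical fragment "if (i < n) a[i + 1] = 0;" with array length n:
# the guard i < n does not exclude i + 1 == n, so a warning is reported.
i, n = Int("i"), Int("n")
print(may_be_out_of_bounds(i < n, i + 1, n))   # True
```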
Mengwei Xu, Yuanqiang Liu, Kang Huang, Xuanzhe Liu, Gang Huang
2021, 11(2):149-168. DOI: 10.21655/ijsi.1673-7288.00247
Abstract: How to efficiently deploy machine learning models on mobile devices has drawn much attention in both academia and industry, and model training is a critical part of it. However, with growing public attention to data privacy and recently adopted laws and regulations, it has become harder for developers to collect training data from users, and thus they cannot train high-quality models. Researchers have been exploring approaches to training neural networks on decentralized data; this work summarizes those efforts and points out their limitations. To this end, this work presents a novel neural network training paradigm on mobile devices, which keeps all training computations associated with private data on local devices and requires no data to be uploaded in any form. This training paradigm is named autonomous learning. To deal with the two main challenges of autonomous learning, i.e., limited data volume and insufficient computing power on mobile devices, this paper designs and implements the first autonomous learning system, AutLearn. It incorporates a cloud (public data, pre-training) and client (private data, transfer learning) cooperation methodology together with data augmentation techniques to ensure model convergence on mobile devices. Furthermore, through optimization techniques such as model compression, a neural network compiler, and runtime cache reuse, AutLearn significantly reduces the on-client training cost. Two classical autonomous learning scenarios are implemented based on AutLearn, and a set of experiments is carried out. The results show that AutLearn can train neural networks with comparable or even higher accuracy than traditional centralized/federated training modes while preserving privacy. AutLearn also remarkably cuts the computational and energy cost of neural network training on mobile devices.
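The cloud-client division of labor described above can be illustrated with a minimal transfer-learning sketch; the model choice, frozen backbone, and small trainable head below are assumptions for illustration, not AutLearn's code.

```python
# Minimal sketch of the cloud (pre-training on public data) / client
# (transfer learning on private data) split; not AutLearn's implementation.
import torch.nn as nn
from torch.optim import SGD
from torchvision import models

def build_on_device_model(num_classes):
    model = models.mobilenet_v2(pretrained=True)   # backbone pre-trained in the cloud
    for p in model.parameters():
        p.requires_grad = False                    # freeze to cut on-device training cost
    # Small trainable head adapted to the local task.
    model.classifier[1] = nn.Linear(model.last_channel, num_classes)
    return model

def local_training_step(model, batch, optimizer, loss_fn=nn.CrossEntropyLoss()):
    x, y = batch                                   # private data never leaves the device
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

model = build_on_device_model(num_classes=10)
optimizer = SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)
```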
Dingji Li, Zeyu Mi, Baodong Wu, Xun Chen, Yongwang Zhao, Zuohua Ding, Haibo Chen
2021, 11(2):169-193. DOI: 10.21655/ijsi.1673-7288.00248
Abstract: The increasing deployment of artificial intelligence has placed unprecedented demands on the computing power of cloud computing. Cloud service providers have therefore integrated accelerators with massive parallel computing units into their data centers, and these accelerators need to be combined with existing virtualization platforms to partition computing resources. The current mainstream accelerator virtualization solution is PCI passthrough, which, however, does not support fine-grained resource provisioning. Some manufacturers have also started to provide time-sliced multiplexing schemes that use drivers cooperating with specific hardware to divide resources and time slices among virtual machines, but these schemes suffer from poor portability and flexibility. An alternative and promising approach is API forwarding, which forwards the virtual machine's requests to a back-end driver for processing through a split driver model. Yet the communication incurred by API forwarding can easily become the performance bottleneck. This paper proposes Wormhole, an accelerator virtualization framework based on the client/server (C/S) architecture that supports rapid delegated execution across virtual machines. It aims to provide upper-level users with an efficient and transparent API-forwarding-based accelerator virtualization scheme while ensuring strong isolation between multiple users. By leveraging hardware virtualization features, the framework minimizes performance degradation through exitless inter-VM control flow switches. Experimental results show that the Wormhole prototype achieves up to a 5x performance improvement over traditional open-source virtualization solutions such as GVirtuS when training classic models.
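As a rough illustration of the API forwarding pattern that Wormhole accelerates, the sketch below forwards a call from a guest-side stub to a backend worker over a pipe; it shows only the concept (serialization, boundary crossing, replay), not Wormhole's exitless inter-VM switch, and the backend "runtime" is a hypothetical stand-in.

```python
# Conceptual sketch of API forwarding; not Wormhole's mechanism.
import pickle
from multiprocessing import Pipe, Process

def backend(conn):
    # Back-end driver side: receives forwarded calls and replays them
    # against the real accelerator runtime (a stand-in dict here).
    runtime = {"matmul": lambda a, b: [[sum(x * y for x, y in zip(row, col))
                                        for col in zip(*b)] for row in a]}
    while True:
        api, args = pickle.loads(conn.recv_bytes())
        if api == "shutdown":
            break
        conn.send_bytes(pickle.dumps(runtime[api](*args)))

if __name__ == "__main__":
    guest_end, host_end = Pipe()
    Process(target=backend, args=(host_end,), daemon=True).start()
    # Guest-side stub: every forwarded call crosses the VM boundary,
    # which is why the cost of each control flow switch matters.
    guest_end.send_bytes(pickle.dumps(("matmul", ([[1, 2]], [[3], [4]]))))
    print(pickle.loads(guest_end.recv_bytes()))    # [[11]]
    guest_end.send_bytes(pickle.dumps(("shutdown", ())))
```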
Hongjun Zhang, Yanjun Wu, Heng Zhang, Libo Zhang
2021, 11(2):195-216. DOI: 10.21655/ijsi.1673-7288.00249
Abstract: Hash tables, a type of data indexing structure that provides efficient data access based on key values, are widely used in various computer applications, especially in system software, databases, and high-performance computing, which demand extremely high performance. In networking, cloud computing, and IoT services, hash tables have become core components of cache systems. However, with the rapid growth of data volume, performance bottlenecks have gradually emerged in hash table designs centered on multi-core CPUs, and there is an urgent need to further improve the performance and scalability of hash tables. With the increasing popularity of general-purpose Graphics Processing Units (GPUs) and the substantial improvement of hardware computing capability and concurrency, various system software tasks centered on parallel computing have been optimized on GPUs and have achieved considerable performance improvements. Due to the sparseness and randomness of hash table accesses, however, directly using existing parallel hash table structures on GPUs inevitably brings high-frequency memory accesses and frequent bus data transfers, which limit hash table performance on GPUs. This study analyzes the memory accesses, hit ratios, and index overheads of hash table indexes in cache systems, and proposes a hybrid-access cache indexing framework adapted to GPUs, CCHT (Cache Cuckoo Hash Table). CCHT provides cache strategies suited to different requirements on hit ratio and index overhead, allows concurrent execution of write and query operations, makes full use of the computing performance and concurrency of GPU hardware, and reduces memory access and bus transfer overhead. GPU implementation and experimental verification show that CCHT outperforms other cache indexing hash tables while maintaining cache hit ratios.
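The cuckoo hashing idea underlying CCHT bounds every lookup to two candidate buckets, which is what keeps memory accesses predictable; the sketch below is a plain CPU-side illustration of that idea with hypothetical parameters, not the GPU CCHT implementation.

```python
# CPU-side sketch of two-choice cuckoo hashing; not the GPU CCHT code.
class CuckooHashTable:
    def __init__(self, capacity=1024, max_kicks=32):
        self.tables = [[None] * capacity, [None] * capacity]
        self.capacity, self.max_kicks = capacity, max_kicks

    def _index(self, which, key):
        salt = (0, 0x9E3779B9)[which]          # two independent hash functions
        return hash((key, salt)) % self.capacity

    def get(self, key):
        # At most two bucket probes per lookup.
        for which in (0, 1):
            slot = self.tables[which][self._index(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def put(self, key, value):
        entry = (key, value)
        for _ in range(self.max_kicks):
            for which in (0, 1):
                i = self._index(which, entry[0])
                slot = self.tables[which][i]
                if slot is None or slot[0] == entry[0]:
                    self.tables[which][i] = entry
                    return True
                # Evict the occupant and try to re-place it elsewhere.
                self.tables[which][i], entry = entry, slot
        return False   # insertion failed; a real table would resize/rehash
```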
Guanyu Liang, Yanjun Wu, Jingzheng Wu, Chen Zhao
2021, 11(2):217-241. DOI: 10.21655/ijsi.1673-7288.00250
Abstract: Software reliability is one of the research hotspots in the field of software engineering, and failure rate analysis is a typical research method for software reliability. However, the way software is constructed has evolved from a standalone mode to a large-scale collaborative model represented by open source software. As a representative product, an operating system comprises open source software connected through composition and dependency relations, forming a supply network of tens of thousands of nodes. Typical methods do not take supply relationships into account and therefore cannot accurately identify and evaluate the reliability issues introduced as a result. This paper extends the concept of the supply chain to the field of open source software and proposes a knowledge-based method for managing software supply reliability in the collaborative model. An ontology is first designed for the open source software ecosystem, and then a knowledge graph of open source software is constructed to support the extraction, storage, and management of knowledge. Driven by this knowledge and combined with traditional supply chain management methods, a set of reliability management methods for the open source software supply chain is proposed, constituting a management system for the open source software supply chain. Taking the construction of a Linux operating system distribution as an example, the experiment demonstrates how the open source software supply chain supports the reliability of the operating system. The results show that the open source software supply chain helps to clarify and evaluate the reliability risks of large, complex system software.
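The supply-network view can be made concrete with a small graph query: given "depends on" relations extracted into a knowledge graph, which components are exposed when one node carries a reliability risk? The sketch below uses networkx and hypothetical package names; it illustrates the idea, not the paper's knowledge graph system.

```python
# Hedged sketch of a supply-network reliability query; package names are
# hypothetical and this is not the paper's knowledge graph implementation.
import networkx as nx

supply = nx.DiGraph()
# An edge A -> B means "A depends on B" in the supply network.
supply.add_edges_from([
    ("distro-image", "nginx"),
    ("distro-image", "glibc"),
    ("nginx", "openssl"),
    ("nginx", "glibc"),
])

def exposed_to(graph, risky_package):
    """All packages that transitively depend on the risky package."""
    return nx.ancestors(graph, risky_package)

print(exposed_to(supply, "openssl"))   # {'nginx', 'distro-image'}
```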
Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou
2021, 11(2):243-258. DOI: 10.21655/ijsi.1673-7288.00251
Abstract: In recent years, the Convolutional Neural Network (CNN) has received widespread attention in the field of machine learning due to its high accuracy in character recognition and image classification. Nevertheless, the compute-intensive and memory-intensive characteristics of CNNs pose huge challenges to general-purpose processors, which must support various workloads. Therefore, a large number of CNN-specific hardware accelerators have emerged to improve efficiency. Though significantly more efficient, previous accelerators are not flexible enough. In this study, classical CNN models are analyzed and a domain-specific instruction set of 10 matrix instructions, called RV-CNN, is designed on top of the promising RISC-V architecture. By abstracting CNN computation into instructions, the proposed design provides sufficient flexibility for CNNs and possesses higher code density than general-purpose ISAs. On this basis, a code-to-instruction mapping mechanism is proposed. Building different CNN models with RV-CNN on the Xilinx ZC702, this paper finds that, compared with x86 processors, RV-CNN achieves on average 141 times the energy efficiency and 8.91 times the code density, and compared with a GPU, 1.25 times the energy efficiency and 1.95 times the code density. In addition, compared with previous CNN accelerators, the design supports typical CNN models while maintaining high energy efficiency.
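As a rough illustration of the code-to-instruction mapping idea, the sketch below lowers one convolutional layer (via im2col) into a short sequence of matrix instructions; the mnemonics and operand syntax are hypothetical placeholders, not the actual RV-CNN encoding defined in the paper.

```python
# Hedged sketch of lowering a conv layer to matrix instructions; the
# MLOAD/MMUL/MSTORE mnemonics are hypothetical, not the RV-CNN encoding.
def lower_conv_layer(in_ch, out_ch, k, out_h, out_w):
    # im2col turns convolution into an (out_ch x K) * (K x N) matrix multiply.
    K = in_ch * k * k
    N = out_h * out_w
    return [
        f"MLOAD  m1, weights, {out_ch}x{K}",   # filter matrix
        f"MLOAD  m2, im2col,  {K}x{N}",        # unfolded input patches
        f"MMUL   m3, m1, m2",                  # one matrix-multiply instruction
        f"MSTORE m3, output,  {out_ch}x{N}",   # feature map write-back
    ]

for insn in lower_conv_layer(in_ch=3, out_ch=16, k=3, out_h=32, out_w=32):
    print(insn)
```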