
Yanjun Wu , Tao Xie , Rui Hou , Ke Zhang , Wei Song , Mingjie Xing
2025, 15(3):277-281. DOI: 10.21655/ijsi.1673-7288.00349
Abstract: Preface
Xuezheng Xu , Deheng Yang , Lu Wang , Tao Wang , Anwen Huang , Qiong Li
2025, 15(3):283-305. DOI: 10.21655/ijsi.1673-7288.00350
Abstract: The memory consistency model defines constraints on memory access orders for parallel programs on multi-core systems and is an important architectural specification jointly followed by software and hardware. Sequential consistency (SC) per location is one of the classic axioms of memory consistency models, specifying that all memory operations to the same address in a multi-core system follow sequential consistency. It is widely used in the memory consistency axiom models of classic architectures such as x86-TSO, Power, and ARM, and plays an important role in chip memory consistency verification, system software, and parallel program development. As an open-source architectural specification, the memory consistency model of RISC-V is defined by global memory order, preserved program order, and three axioms (the load value axiom, the atomicity axiom, and the progress axiom). It does not directly include SC per location as an axiom, which poses challenges for existing memory consistency verification tools and system software development. In this paper, we formalize SC per location as a theorem based on the defined axioms and rules of the RISC-V memory consistency model. The proof abstracts the construction of arbitrary same-address memory access sequences into a deterministic finite automaton for an inductive proof. This research is a theoretical supplement to the formal methods of RISC-V memory consistency.
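As a concrete illustration of the property this paper proves (not code from the paper), the sketch below checks SC per location on a proposed global order of same-address memory events: the global order must embed each hart's program order, and every load must return the value of the most recent store (memory assumed initialized to 0). The event encoding is hypothetical.

```python
# Each event is (eid, hart, op, value), op in {"ST", "LD"}; all events
# target the same address. Memory is assumed initialized to 0.

def sc_per_location(global_order, program_orders):
    pos = {eid: i for i, (eid, _, _, _) in enumerate(global_order)}
    # (1) the global order must preserve each hart's program order
    for events in program_orders.values():
        idxs = [pos[eid] for (eid, _, _, _) in events]
        if idxs != sorted(idxs):
            return False
    # (2) load value axiom restricted to one location: every load
    # returns the most recently stored value in the global order
    last = 0
    for _, _, op, value in global_order:
        if op == "ST":
            last = value
        elif value != last:
            return False
    return True
```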
Yijin Li , Shaomin Du , Jiacheng Zhao , Xueying Wang , Yongquan Zha , Huimin Cui
2025, 15(3):307-328. DOI: 10.21655/ijsi.1673-7288.00351
Abstract: Instruction-level parallelism is a fundamental challenge in processor architecture research. Very long instruction word (VLIW) architecture is widely used in the field of digital signal processing to enhance instruction-level parallelism. In VLIW architecture, the instruction issue order is determined by the compiler, making performance highly dependent on the compiler's instruction scheduling. To explore the potential of the RISC-V VLIW architecture and further enrich the RISC-V ecosystem, this study focuses on optimizing instruction scheduling algorithms for the RISC-V VLIW architecture. For a single scheduling region, integer linear programming (ILP) scheduling can achieve optimal solutions but suffers from high computational complexity, whereas list scheduling offers lower complexity at the cost of potentially suboptimal solutions. To leverage the strengths of both approaches, this paper proposes a hybrid instruction scheduling algorithm: an IPC theoretical model locates the scheduling regions where list scheduling has not reached the optimal solution, and ILP scheduling then further processes those regions. The theoretical model is based on data flow analysis, accounts for both instruction dependencies and hardware resources, and provides a theoretical upper bound on IPC with linear complexity. The accuracy of the IPC theoretical model is critical to the success of hybrid scheduling; it reaches 95.74% in this study. On the given benchmark, the IPC model identifies that 94.62% of scheduling regions have already reached the optimal solution with list scheduling, leaving only 5.38% requiring further refinement with ILP scheduling. The proposed hybrid scheduling algorithm achieves the scheduling quality of ILP scheduling while maintaining a complexity comparable to that of list scheduling.
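The core hybrid decision described above can be sketched as follows (a hypothetical simplification, not the paper's model): a linear-time cycle lower bound derived from the dependence-graph critical path and the issue width yields an IPC upper bound; a region is handed to the ILP scheduler only if list scheduling missed that bound.

```python
import math

# Hypothetical sketch of the hybrid-scheduling decision: a linear-time
# IPC upper bound from (a) the critical path of the dependence graph and
# (b) the machine's issue width.

def ipc_upper_bound(n_insts, critical_path_cycles, issue_width):
    # A region cannot finish faster than its critical path, nor faster
    # than issuing issue_width instructions per cycle.
    min_cycles = max(critical_path_cycles, math.ceil(n_insts / issue_width))
    return n_insts / min_cycles

def needs_ilp(n_insts, critical_path_cycles, issue_width, list_sched_cycles):
    # If list scheduling already meets the cycle lower bound, ILP
    # scheduling cannot improve it; otherwise escalate the region.
    lower = max(critical_path_cycles, math.ceil(n_insts / issue_width))
    return list_sched_cycles > lower
```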
Jinchi Han , Zhidong Wang , Hao Ma , Wei Song
2025, 15(3):329-348. DOI: 10.21655/ijsi.1673-7288.00352
Abstract: Cache simulators are indispensable tools for exploring cache architectures and researching cache side channels. Spike, the standard implementation of the RISC-V instruction set, offers a comprehensive environment for RISC-V-based cache research. However, its cache model suffers from limitations such as coarse simulation granularity and notable discrepancies with the cache structures of real processors. To address these limitations, this paper introduces FlexiCAS (flexible cache architecture simulator), a modified and extended version of Spike's cache model. The modified simulator, referred to as Spike-FlexiCAS, supports a wide range of cache architectures with flexible configuration and easy extensibility. It enables arbitrary combinations of cache features, including coherence protocols and implementation methods. In addition, FlexiCAS can simulate cache behavior independently of Spike. Performance evaluations demonstrate that FlexiCAS significantly outperforms the cache model of ZSim, the fastest execution-driven simulator available.
Chuandong Li , Ran Yi , Yingwei Luo , Xiaolin Wang , Zhenlin Wang
2025, 15(3):349-367. DOI: 10.21655/ijsi.1673-7288.00353
Abstract: Memory virtualization, a core component of virtualization technology, directly impacts the overall performance of virtual machines. Current memory virtualization approaches often involve a tradeoff between the overhead of two-dimensional address translation and that of page table synchronization. Traditional shadow paging employs an additional software-maintained page table to achieve address translation performance comparable to native systems. However, synchronization of shadow page tables relies on write protection, frequently causing VM-exits that significantly degrade system performance. In contrast, the nested paging approach leverages hardware-assisted virtualization, allowing the guest page table and nested page table to be directly loaded into the MMU. While this eliminates page table synchronization, the two-dimensional page table traversal seriously degrades address translation performance. This paper proposes lazy shadow paging (LSP), which reduces page table synchronization overhead while retaining the high efficiency of shadow page tables. Leveraging the privilege model and hardware features of the RISC-V architecture, LSP analyzes the access patterns of guest OS page tables and binds synchronization to translation lookaside buffer (TLB) flushes, deferring synchronization costs until the first access to the affected page and thereby minimizing VM-exits. In addition, it introduces a fast path for handling VM-exits, exploiting the fine-grained TLB interception and privilege-level features of RISC-V to further optimize performance. Experimental results demonstrate that under the baseline RISC-V architecture, LSP reduces VM-exits by up to 50% compared to traditional shadow paging in micro-benchmark tests. For typical applications in the SPEC2006 benchmark suite, LSP reduces VM-exits by up to 25% compared to traditional shadow paging and decreases memory accesses per TLB miss by 12 compared to nested paging.
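The "bind synchronization to TLB flushes" idea can be sketched abstractly as follows (a hypothetical model, not LSP's implementation): guest page-table writes are merely recorded rather than trapped individually via write protection, and the shadow page table is resynchronized in one batch when the guest issues a TLB flush.

```python
# Toy model of lazy shadow-page-table synchronization. Guest PTE writes
# are recorded without a VM-exit; one intercepted TLB flush (e.g. the
# guest's sfence.vma) syncs every dirty entry into the shadow table.

class LazyShadow:
    def __init__(self):
        self.guest_pt = {}    # guest-virtual page -> guest-physical page
        self.shadow_pt = {}   # guest-virtual page -> host-physical page
        self.dirty = set()    # guest PTEs changed since the last flush
        self.vm_exits = 0

    def guest_write_pte(self, vpage, gpage):
        self.guest_pt[vpage] = gpage
        self.dirty.add(vpage)          # deferred: no VM-exit per write

    def tlb_flush(self, gpa_to_hpa):
        self.vm_exits += 1             # one exit amortizes all dirty PTEs
        for vpage in self.dirty:
            self.shadow_pt[vpage] = gpa_to_hpa(self.guest_pt[vpage])
        self.dirty.clear()
```

With per-write trapping, three PTE writes would cost three VM-exits; here they cost one.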
Liutong Han , Hongbin Zhang , Mingjie Xing , Yanjun Wu , Chen Zhao
2025, 15(3):369-395. DOI: 10.21655/ijsi.1673-7288.00354
Abstract: The performance of high-performance libraries on CPUs can be accelerated by leveraging SIMD hardware through vectorization. Implementing vectorization requires programming methods tailored to the target SIMD hardware, which vary significantly across different SIMD extensions. To avoid redundant implementations of algorithm optimizations on various platforms and to enhance the maintainability of algorithm libraries, a hardware abstraction layer (HAL) is often introduced. However, most existing HAL designs are based on fixed-length vector registers, aligning with the fixed-length nature of conventional SIMD extension instruction sets. This design fails to accommodate the variable-length vector registers introduced by the RISC-V vector extension: treating them as fixed-length vectors within traditional HAL designs results in unnecessary overhead and performance degradation. To address this problem, this paper proposes a HAL design method compatible with both variable-length vector extensions and fixed-length SIMD extensions. Using this approach, the universal intrinsic functions in the OpenCV library are redesigned and optimized to better support RISC-V vector extension devices while maintaining compatibility with existing SIMD platforms. Performance comparisons between the optimized and original OpenCV libraries reveal that the redesigned universal intrinsic functions efficiently integrate RISC-V vector extensions into the HAL optimization framework, achieving a 3.93-fold performance improvement in core modules. These results validate the effectiveness of the proposed method, significantly enhancing the execution performance of high-performance libraries on RISC-V devices. In addition, the proposed approach has been open-sourced and integrated into the OpenCV repository, demonstrating its practicality and application value.
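The fixed- vs variable-length distinction the abstract turns on can be illustrated with the strip-mining pattern the RISC-V vector extension enables (analogous to vsetvl-driven loops); the sketch below is a hypothetical scalar model of it, where the inner loop stands in for a single vector instruction and VLMAX is a hardware parameter unknown at compile time, unlike the fixed lane count a traditional SIMD HAL assumes.

```python
# Length-agnostic vector addition: the loop never hard-codes a lane
# count; each iteration processes whatever active length the hardware
# grants (here modeled as min(vlmax, remaining)), including the tail.

def vadd(dst, a, b, n, vlmax):
    i = 0
    while i < n:
        vl = min(vlmax, n - i)        # "vsetvl": hardware picks the length
        for lane in range(vl):        # stands in for one vector instruction
            dst[i + lane] = a[i + lane] + b[i + lane]
        i += vl
    return dst
```

Note there is no separate scalar remainder loop: the tail (n mod vlmax elements) is handled by the same code path, which is what a fixed-length HAL abstraction cannot express directly.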
