• Volume 15, Issue 2, 2025 Table of Contents
    • Preface to the Special Issue on Software Quality Assurance in the Era of Large Language Models

      2025, 15(2):139-144. DOI: 10.21655/ijsi.1673-7288.00362


      Abstract: Preface

    • Insights and Analysis of Open-source License Violation Risks in LLM-generated Code

      2025, 15(2):145-173. DOI: 10.21655/ijsi.1673-7288.00363


      Abstract: The field of software engineering has been significantly influenced by the rapid development of large language models (LLMs). These models, pre-trained on vast amounts of code from open-source repositories, can efficiently accomplish tasks such as code generation and code completion. However, many programs in open-source repositories are constrained by open-source licenses, exposing large models to potential license violation risks. This paper focuses on license violation risks between code generated by LLMs and open-source repositories. Based on code clone detection technology, a framework is developed that traces the source of model-generated code and identifies copyright infringement issues. Using this framework, 135,000 Python code fragments generated by 9 mainstream code LLMs are traced to their sources in the open-source community and checked for license compatibility. Three research questions are investigated to explore the impact of LLM code generation on the open-source software ecosystem: (1) To what extent is code generated by large models cloned from open-source repositories? (2) Does the generated code carry open-source license violation risks? (3) Do LLM-generated code fragments included in real open-source software carry such risks? The experimental results indicate that among the 43,130 and 65,900 Python code fragments longer than six lines generated from functional descriptions and method signatures, respectively, 68.5% and 60.9% of the programs are traced to cloned open-source code segments. The CodeParrot and CodeGen series models have the highest clone ratios, while GPT-3.5-Turbo has the lowest. Moreover, 92.7% of the code files generated from functional descriptions lack a license declaration. Compared with the licenses of the traced code fragments, 81.8% of the code files carry open-source license violation risks. Furthermore, among 229 LLM-generated program files collected from GitHub, 136 code samples are traced to open-source code segments, of which 38 are Type-1 or Type-2 clones and 30 carry open-source license violation risks. These issues have been reported to the developers as issue reports; so far, eight developers have responded.
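      The final step of a framework like this, comparing a generated file's license declaration against the licenses of the open-source fragments it was traced to, can be illustrated with a minimal sketch. Everything below (the compatibility table, the `violation_risks` helper, and the treatment of missing declarations) is an illustrative assumption, not the paper's actual implementation:

```python
# Minimal sketch of license-compatibility checking between a generated
# code file and the open-source fragments it was traced to.
# The table below is a tiny illustrative subset, not an authoritative mapping.

# Maps a source-fragment license to the file licenses that may
# incorporate code under it (hypothetical, simplified).
COMPATIBLE_WITH = {
    "MIT":        {"MIT", "BSD-3-Clause", "Apache-2.0", "GPL-3.0"},
    "Apache-2.0": {"Apache-2.0", "GPL-3.0"},
    "GPL-3.0":    {"GPL-3.0"},
}

def violation_risks(file_license, traced_fragment_licenses):
    """Return the traced licenses the file's declaration may violate.

    A missing declaration (file_license is None) is treated as
    all-rights-reserved, which conflicts with every traced license
    that requires notice retention or copyleft.
    """
    risks = []
    for src in traced_fragment_licenses:
        allowed = COMPATIBLE_WITH.get(src, set())
        if file_license is None or file_license not in allowed:
            risks.append(src)
    return risks

# A file with no license declaration that clones GPL-3.0 code is flagged:
violation_risks(None, ["MIT", "GPL-3.0"])
```

      This mirrors the abstract's finding that files lacking any license declaration (92.7% of those generated from functional descriptions) are the dominant source of violation risk.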

    • Exploration and Improvement of Capabilities of LLMs in Code Refinement Task

      2025, 15(2):175-203. DOI: 10.21655/ijsi.1673-7288.00364


      Abstract: As a crucial part of automated code review, the code refinement task is of great significance for improving development efficiency and code quality. Since large language models (LLMs) have shown far better performance than traditional small-scale pre-trained models in software engineering, this paper explores the performance of both types of models in automated code refinement to evaluate the overall advantages of LLMs. Traditional code quality evaluation metrics (e.g., BLEU, CodeBLEU, edit progress) are used to evaluate four mainstream LLMs and four representative small-scale pre-trained models on the code refinement task. Findings indicate that the refinement quality of LLMs in the pre-review code refinement subtask is inferior to that of small-scale pre-trained models. Since existing code quality metrics struggle to explain this phenomenon, this work proposes Unidiff-based code refinement evaluation metrics that quantify the change operations performed during refinement, in order to explain the inferiority and reveal the models' tendencies in change operations: (1) the pre-review code refinement task is rather difficult, the accuracy of models in performing correct change operations is extremely low, and LLMs are more "aggressive" than small-scale pre-trained models, that is, they tend to perform more code change operations, resulting in poorer performance; (2) compared with small-scale pre-trained models, LLMs tend to perform more ADD and MODIFY change operations, and the average number of inserted code lines in ADD operations is larger, further evidencing their "aggressive" nature. To alleviate the disadvantages of LLMs in the pre-review refinement task, this work introduces LLM-Voter, a method based on LLMs and ensemble learning with two sub-schemes, Inference-based and Confidence-based, which integrates the strengths of different base models to improve refinement quality. On this basis, a refinement determination mechanism is further introduced to enhance the decision stability and reliability of the model. Experimental results demonstrate that the Confidence-based LLM-Voter significantly improves the exact match (EM) rate and achieves refinement quality better than that of all base models, effectively alleviating the disadvantages of LLMs.
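      The idea of counting ADD/DELETE/MODIFY change operations from a unified diff can be sketched with Python's standard difflib. The hunk grouping below (a run of consecutive +/- lines counts as one operation) is a simplifying assumption, not the paper's exact metric definition:

```python
import difflib

def change_operations(before, after):
    """Classify each run of changed lines in a unified diff as ADD,
    DELETE, or MODIFY. Simplified sketch: consecutive +/- lines form
    one change operation (an assumption, not the paper's metric)."""
    ops = []
    # Skip the '---' / '+++' file headers (first two yielded lines, if any).
    diff = list(difflib.unified_diff(before, after, lineterm=""))[2:]
    adds = dels = 0

    def flush():
        nonlocal adds, dels
        if adds and dels:
            ops.append("MODIFY")
        elif adds:
            ops.append("ADD")
        elif dels:
            ops.append("DELETE")
        adds = dels = 0

    for line in diff:
        if line.startswith("@@") or line.startswith(" "):
            flush()  # a hunk header or context line ends the current run
        elif line.startswith("+"):
            adds += 1
        elif line.startswith("-"):
            dels += 1
    flush()
    return ops

# "b" replaced by "x" (MODIFY), "d" appended (ADD):
change_operations(["a", "b", "c"], ["a", "x", "c", "d"])
```

      Under metrics of this shape, the "aggressive" behavior the abstract describes would show up as a higher count of ADD and MODIFY operations per refinement.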

    • Detection of Resource Leaks in Java Programs: Effectiveness Analysis of Traditional Models and Language Models

      2025, 15(2):205-232. DOI: 10.21655/ijsi.1673-7288.00365


      Abstract: Resource leaks, defects caused by failing to close limited system resources promptly and properly, are widespread in programs of various languages and can be hard to detect. Traditional defect detection methods usually predict resource leaks based on rules and heuristic search. In recent years, deep-learning-based defect detection methods have captured semantic information in code through different code representations, using techniques such as recurrent neural networks and graph neural networks. Recent studies show that language models perform outstandingly in tasks such as code understanding and generation; however, the advantages and limitations of large language models (LLMs) in the specific task of resource leak detection have not been fully evaluated. This paper studies the effectiveness of detection methods based on traditional models, small models, and LLMs in the resource leak detection task, and explores improvement methods such as few-shot learning, fine-tuning, and the combination of static analysis with LLMs. Specifically, taking the JLeaks and DroidLeaks datasets as experimental subjects, the performance of different models is analyzed along multiple dimensions, including the root causes of resource leaks, resource types, and code complexity. The experimental results show that fine-tuning can significantly improve the detection performance of LLMs for resource leaks, but most models still struggle to identify leaks caused by third-party libraries. In addition, code complexity has a greater influence on the traditional-model-based detection methods.
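      A rule-based check of the traditional kind described above can be sketched as a naive pattern match over Java source: flag a snippet that instantiates a closeable resource but neither closes it nor uses try-with-resources. The resource list and matching rules below are illustrative assumptions only:

```python
import re

# Deliberately naive, illustrative subset of closeable Java resource types.
RESOURCE_TYPES = ("FileInputStream", "FileOutputStream", "BufferedReader", "Socket")

def may_leak(java_src):
    """Return True if the snippet opens a known resource but shows no
    sign of closing it (no .close() call, no try-with-resources).
    A real detector would be path-sensitive; this is a toy rule."""
    opens = any(re.search(r"new\s+" + t + r"\s*\(", java_src)
                for t in RESOURCE_TYPES)
    if not opens:
        return False
    closed = ".close(" in java_src
    try_with_resources = re.search(r"try\s*\(", java_src) is not None
    return not (closed or try_with_resources)

leaky = 'FileInputStream in = new FileInputStream("a.txt"); in.read();'
safe  = 'try (FileInputStream in = new FileInputStream("a.txt")) { in.read(); }'
```

      Rules like this illustrate why third-party library resources are hard for most models in the study: a type not in the known-resource list is simply never flagged.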

    • Large-language-model-based Decomposition of Long Methods

      2025, 15(2):233-250. DOI: 10.21655/ijsi.1673-7288.00366


      Abstract: Long methods, along with other types of code smells, prevent software applications from reaching their optimal readability, reusability, and maintainability. Consequently, automated detection and decomposition of long methods have been widely studied. Although existing approaches have greatly facilitated decomposition, their solutions often differ substantially from the optimal ones. To address this, the automatable portion of a publicly available dataset of real-world long methods is investigated. Based on the findings, this paper proposes Lsplitter, a method based on large language models (LLMs) for automatically decomposing long methods. For a given long method, Lsplitter decomposes it into a series of shorter methods according to heuristic rules and LLMs. Since LLMs often split out highly similar methods, Lsplitter applies a location-based algorithm that merges physically contiguous and highly similar methods back into a longer method, and finally ranks the candidate results. Experiments on 2,849 long methods from real Java projects show that, compared with traditional methods combined with a modularity matrix, Lsplitter improves the hit rate by 142%, and compared with purely LLM-based methods, by 7.6%.
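      The location-based merging step, recombining physically contiguous, highly similar extracted methods, can be sketched as a single pass over the ordered list of method bodies. The similarity measure (difflib's SequenceMatcher) and the 0.8 threshold are assumptions for illustration, not Lsplitter's actual algorithm:

```python
import difflib

def merge_similar_neighbors(methods, threshold=0.8):
    """Merge physically contiguous method bodies whose text similarity
    meets the threshold, in the spirit of the location-based merging
    described above. `methods` is an ordered list of code strings."""
    if not methods:
        return []
    merged = [methods[0]]
    for body in methods[1:]:
        sim = difflib.SequenceMatcher(None, merged[-1], body).ratio()
        if sim >= threshold:
            merged[-1] = merged[-1] + "\n" + body  # fold into the previous method
        else:
            merged.append(body)
    return merged

# Two near-duplicate neighbors collapse into one; the distinct third survives:
bodies = ["x = a + b\nreturn x", "y = a + c\nreturn y", "print(done)"]
merge_similar_neighbors(bodies)
```

      Because the pass only compares physical neighbors, it cannot merge similar methods that the LLM placed far apart, which matches the "location-based" framing in the abstract.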

    • LLM-powered Datalog Code Translation and Incremental Program Analysis Framework

      2025, 15(2):251-276. DOI: 10.21655/ijsi.1673-7288.00367


      Abstract: Datalog, a declarative logic programming language, is widely applied in various fields. In recent years, growing interest in Datalog from both academia and industry has led to the design and development of multiple Datalog engines and corresponding dialects. However, code implemented in one Datalog dialect generally cannot be executed on the engine of another, so when a new Datalog engine is adopted, existing Datalog code must be translated into the new dialect. Current Datalog code translation techniques fall into two categories, manually rewriting the code and manually designing translation rules, both of which are time-consuming, involve a large amount of repetitive work, and lack flexibility and scalability. In this work, a Datalog code translation technique empowered by large language models (LLMs) is proposed. Leveraging the powerful code understanding and generation capabilities of LLMs, it combines a divide-and-conquer translation strategy, prompt engineering based on few-shot and chain-of-thought prompts, and an iterative error-correction mechanism based on check-feedback-correction to achieve high-precision code translation between different Datalog dialects, reducing the workload of repeatedly developing translation rules. Based on this translation technique, a general Datalog-based declarative incremental program analysis framework is designed and implemented. The proposed LLM-powered translation technique is evaluated on different Datalog dialect pairs, and the results verify its effectiveness. An experimental evaluation of the incremental program analysis framework further verifies the speedup achieved by incremental analysis based on the proposed translation technique.
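      The divide-and-conquer plus check-feedback-correction loop can be sketched with toy dialects. Here `llm_translate` is a deterministic stand-in for the real LLM call, and both the dialect difference (`:-` vs. `<-`) and the checker rules are invented for illustration:

```python
def check(rule):
    """Toy target-dialect checker: rules must use '<-' and end with '.'.
    Returns an error message (the feedback), or None if the rule passes."""
    if ":-" in rule:
        return "uses ':-'; target dialect expects '<-'"
    if not rule.endswith("."):
        return "missing terminating '.'"
    return None

def llm_translate(rule, feedback=None):
    """Stand-in for the LLM call. A real system would build a few-shot,
    chain-of-thought prompt and include the checker feedback verbatim;
    here we apply the fix the feedback names, deterministically."""
    out = rule
    if feedback and "'<-'" in feedback:
        out = out.replace(":-", "<-")
    if feedback and "terminating" in feedback:
        out = out + "."
    return out

def translate_program(rules, max_rounds=3):
    """Divide and conquer: translate each rule independently, iterating
    check -> feedback -> correct until the checker accepts the result."""
    translated = []
    for rule in rules:
        out = llm_translate(rule)
        for _ in range(max_rounds):
            err = check(out)
            if err is None:
                break
            out = llm_translate(out, feedback=err)
        translated.append(out)
    return translated

translate_program(["path(x,y) :- edge(x,y)."])
```

      Translating rule by rule keeps each prompt small, and the bounded correction loop is what lets checker feedback repair multiple independent errors across rounds.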