Software reliability remains a fundamental challenge in modern software engineering. The rapid rise of AI-assisted coding has dramatically improved development productivity but introduces a critical problem: AI-generated code cannot be inherently trusted. While AI coding tools accelerate development, they may produce code with subtle bugs, security vulnerabilities, and legal compliance issues, amplifying an already costly maintenance burden where developers spend the majority of their time fixing bugs. This unreliability of AI-generated code poses systemic risks to software quality, security, and compliance.
Addressing this challenge requires a two-pronged approach. First, we need reliable detection methods to identify AI-generated code, enabling targeted quality reviews and ensuring compliance with licensing requirements. Second, we need effective Automated Program Repair (APR) techniques to automatically localize faults and synthesize patches, reducing the growing burden of bug fixing in rapidly produced code. However, progress in both areas has been constrained by a critical limitation: the lack of comprehensive datasets and benchmarks, particularly for C and C++, which underpin most safety-critical systems. Moreover, current repair approaches, including recent large language models (LLMs), lack the semantic reasoning abilities necessary for complex bug-fixing tasks, often relying on pattern matching rather than genuine program understanding.
This thesis addresses these interconnected challenges through four major contributions. First, we conduct the first comprehensive study of AI-generated code detection, evaluating thirteen detectors on over two million samples of code and natural language, and propose fine-tuning–based approaches that substantially improve detection accuracy. Second, we construct and release Defects4C, the first large-scale, executable benchmark for C/C++ bugs, curated from millions of real-world commits and designed to enable reproducible evaluation for bug detection and repair. Third, we propose a dual deep learning–based APR framework, integrating BiLSTM-based fault localization with a retrieval-augmented transformer for patch generation, and conduct the first large-scale evaluation of LLM-based APR on C/C++, revealing significant performance gaps compared to Java benchmarks and highlighting the limitations of current models in semantic understanding and code reasoning. Finally, we design a semantic-enhancement framework for LLMs, incorporating dynamic semantic signals such as code execution traces into training and inference, and demonstrate improvements in program repair and general code generation.
These contributions advance the foundations of trustworthy, semantically grounded automated program repair, providing new datasets, empirical insights, and methodological innovations that will guide the future development of reliable AI-driven software engineering.
