Recent advances in large-scale code generation models, trained in a self-supervised manner on extensive unlabeled code corpora, have led to notable progress in generating high-quality code. Despite their success in generative tasks, these decoder-only models often underperform on code understanding tasks such as code search and clone detection, due to the generation-oriented nature of their training objectives. While training a large encoder-only model from scratch on massive code data may enhance understanding performance, this approach is typically resource-intensive and time-consuming. In this paper, we explore a more efficient alternative by transferring knowledge from pre-trained decoder-only code generation models to code understanding tasks. We investigate effective strategies for enabling decoder-only architectures to learn meaningful code representations suitable for comprehension. To this end, we propose CL4D, a contrastive learning framework tailored to strengthen the representation capabilities of decoder-only models. Extensive experiments on benchmark datasets demonstrate that our approach achieves competitive or superior performance compared to existing methods on tasks such as code search and clone detection. The results indicate that CL4D improves the semantic alignment of code representations by reducing the distance between semantically similar code snippets.
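The abstract does not spell out the training objective of CL4D, but the stated goal, reducing the embedding distance between semantically similar code snippets while separating dissimilar ones, is what standard contrastive objectives achieve. The following is a minimal, hypothetical sketch of an InfoNCE-style in-batch contrastive loss of the kind such frameworks typically use; the function name, temperature value, and NumPy implementation are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """Illustrative InfoNCE-style contrastive loss (not CL4D's exact objective).

    anchors, positives: (batch, dim) arrays where row i of each matrix is a
    semantically equivalent pair (e.g., two clones of the same function);
    all other rows in the batch act as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair sits on the diagonal; minimizing this loss pulls
    # similar snippets together and pushes dissimilar ones apart.
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss drives the representation space toward the property the abstract reports: semantically similar code snippets end up closer together, which directly benefits retrieval-style tasks such as code search and clone detection.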