AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of GitHub code. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often costly and difficult to scale. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code that is not always available. Reference-free evaluation remains largely unaddressed, with ICE-Score among the few existing alternatives. This paper introduces MATCH, a novel reference-free metric. MATCH employs contrastive learning to produce meaningful embeddings for code and natural language task descriptions, enabling similarity scores that reflect how well generated code implements the described task. We demonstrate that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across several programming languages.
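The abstract only outlines the approach, so the sketch below is a minimal, illustrative dual-encoder setup rather than the paper's implementation: two small encoders map task descriptions and code into a shared embedding space, a contrastive (InfoNCE-style) objective pulls matched description/code pairs together, and the reference-free score is simply the cosine similarity between the two embeddings. The encoder architecture, the specific loss, and names such as `Encoder`, `info_nce_loss`, and `match_score` are assumptions for illustration; MATCH presumably builds on pretrained language/code encoders rather than the toy MLPs used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a dual-encoder contrastive setup. Tiny MLPs stand in for
# pretrained description/code encoders and operate on placeholder features.

class Encoder(nn.Module):
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products equal cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def info_nce_loss(desc_emb: torch.Tensor, code_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (description, code) pairs are positives,
    all other pairs in the batch serve as negatives."""
    logits = desc_emb @ code_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def match_score(desc_emb: torch.Tensor, code_emb: torch.Tensor) -> torch.Tensor:
    """Reference-free score: cosine similarity between a task-description
    embedding and a generated-code embedding (higher = better alignment)."""
    return (desc_emb * code_emb).sum(dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    desc_encoder, code_encoder = Encoder(in_dim=512), Encoder(in_dim=512)
    params = list(desc_encoder.parameters()) + list(code_encoder.parameters())
    opt = torch.optim.AdamW(params, lr=1e-4)

    # Placeholder batch of 16 paired (description, code) feature vectors.
    desc_feats, code_feats = torch.randn(16, 512), torch.randn(16, 512)
    loss = info_nce_loss(desc_encoder(desc_feats), code_encoder(code_feats))
    loss.backward()
    opt.step()

    # After training, score a single candidate program against its task.
    score = match_score(desc_encoder(desc_feats[:1]), code_encoder(code_feats[:1]))
    print(f"contrastive loss: {loss.item():.3f}, alignment score: {score.item():.3f}")
```

One design point worth noting: because the embeddings are L2-normalized, the dot product used in the training objective and the cosine similarity used at scoring time are the same quantity, so the metric directly reflects what the contrastive loss optimizes.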