EMNLP 2025

November 05, 2025

Suzhou, China


We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Matrix Language Identification, Token-level Language Identification, POS Tagging, Named Entity Recognition, and Machine Translation. Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations with strong inter-annotator agreement (Fleiss' kappa ≥ 0.81). The rigorously preprocessed and filtered dataset covers both Devanagari and Roman scripts, and spans diverse domains such as social media text, news, and formal communications, ensuring real-world linguistic coverage. Evaluation reveals that closed-source LLMs significantly outperform traditional tools and open-source models. Notably, one-shot prompting consistently boosts performance across tasks, especially in structure-sensitive predictions like POS and NER, highlighting the effectiveness of prompt-based adaptation in code-mixed, low-resource settings. COMI-LINGUA is publicly available at: https://anonymous.4open.science/r/CodeMixing/
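
The agreement statistic quoted above, Fleiss' kappa, measures how far a fixed number of annotators agree beyond what chance alone would produce. The following is a minimal sketch of the standard formula, not the authors' evaluation code. The function name `fleiss_kappa`, the `hi`/`en` labels, and the toy data are all assumptions made for illustration, with three annotators per item as in the abstract.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Every item must be labeled by the same number of annotators (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of annotations")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / pairs for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_e = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_e == 1.0:
        # Only one category was ever used, so agreement is trivially perfect.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical token-level language labels from three annotators.
toy = [
    ["hi", "hi", "hi"],
    ["en", "en", "en"],
    ["hi", "hi", "en"],  # one annotator disagrees
    ["en", "en", "en"],
]
print(round(fleiss_kappa(toy), 3))  # 0.657
```

On this toy sample a single disagreement among four tokens already lowers kappa to about 0.66, which gives a rough sense of how much agreement a reported value of 0.81 or higher implies.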


