India

Recent advances in singing voice synthesis (SVS) have attracted substantial attention from both academia and industry. With the advent of large language models and novel generative paradigms, producing controllable, high‑fidelity singing voices has become an attainable goal. Yet the field still lacks a comprehensive survey that systematically analyzes deep‑learning‑based singing voice systems and their enabling technologies.
To address this gap, this survey first categorizes existing systems by task type and then organizes current architectures into two major paradigms: cascaded and end-to-end approaches. We then provide an in-depth analysis of core technologies, covering singing modeling and control techniques. Finally, we review the datasets, annotation tools, and evaluation benchmarks that support training and assessment. In the appendix, we introduce training strategies and offer further discussion of SVS. This survey provides an up-to-date review of the literature on SVS models, which we hope will serve as a useful reference for both researchers and engineers. Related materials are available at https://github.com/David-Pigeon/SyntheticSingers.

IJCNLP-AACL 2025

Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches

singing resources

audio representations

singing voice synthesis

style transfer

technical paper

### Welcome to IJCNLP-AACL 2025! 
 It is a great honor to host this joint conference in Mumbai, India, from December 20 to 24, 2025. The joint conferences of IJCNLP and AACL are organized with alternating leadership in the Asia-Pacific region. The event is run by the Asian Federation of Natural Language Processing (AFNLP) in odd years, and by AACL in even years, while it is organized solely by ACL when the annual ACL meeting is held in the region. This year, the conference is primarily organized by AFNLP. 
*Kentaro Inui
MBZUAI, UAE
General Chair, IJCNLP-AACL 2025* 
Read full message and download the Conference Handbook [**here**](https://drive.google.com/file/d/1UTwxkAqSqI-GAoJC3wE1zZt5VP1Y8GX0/view?usp=sharing).

The 14th IJCNLP & 4th AACL will be held in Mumbai, India from December 20th to December 24th, 2025.

Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.

Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?

Recent advances in speech-enabled AI, including Google's NotebookLM and OpenAI's speech-to-speech API, are driving widespread interest in voice interfaces across sectors such as finance, health, agritech, legal services, and call centers in the global north and south. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa's linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities, and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversations drawn from various open African accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR excels in spontaneous speech contexts but degrades on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Smaller models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment. By releasing this benchmark, we empower practitioners and researchers to select voice technologies suited to African use cases, fostering inclusive voice applications for underserved communities.

AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR

Clinical trials are designed in natural language, and the task of matching them to patients, who are represented via both structured and unstructured textual data, benefits from the knowledge aggregation and reasoning abilities of LLMs. With their ability to consolidate distributed knowledge, LLMs hold the potential to build a more general solution than classical approaches that employ trial-specific heuristics. Yet adoption of LLMs in critical domains such as clinical research comes with many challenges, including the availability of public benchmarks, the dimensions of evaluation, and data sensitivity. In this survey, we contextualize emerging LLM-based approaches in clinical trial recruitment. We examine the main components of the clinical trial recruitment process, discuss existing challenges in adopting LLM technologies in clinical research, and outline future directions.

A Survey on LLM-Assisted Clinical Trial Recruitment

Predictive modeling of hospital patient data is challenging due to its structured format, irregular timing of measurements, and variation in data representation across institutions. While traditional models often struggle with such inconsistencies, Large Language Models (LLMs) offer a flexible alternative. In this work, we propose a method for verbalizing structured Electronic Health Records (EHRs) into a format suitable for LLMs and systematically examine how to include time-stamped clinical observations—such as lab tests and vital signs—from previous time points in the prompt. We study how different ways of structuring this temporal information affect predictive performance, and whether fine-tuning alone enables LLMs to effectively reason over such data. Evaluated on two real-world hospital datasets and MIMIC-IV, our approach achieves strong in-hospital and cross-hospital performance, laying the groundwork for more generalizable clinical modeling.

Decode Like a Clinician: Enhancing LLM Fine-Tuning with Temporal Structured Data Representation

Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs' capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three prompting approaches inspired by divide-and-conquer strategies across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs' ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.

Can AI Validate Science? Benchmarking LLMs on Claim → Evidence Reasoning in AI Papers

We present LITMUS++, a prototype demo system that predicts model performance for queries of the form “How will a Model perform on a Task in a Language?” even when no benchmark exists. The system replaces static suites and opaque LLM-as-judge setups with an agentic, auditable workflow: a Directed Acyclic Graph (DAG) of specialized Thought Agents that hypothesize, retrieve multilingual evidence, select features, and train lightweight regressors with calibrated uncertainty. The web interface offers a chat entry point, a live reasoning view, and an evidence panel with citations and exportable reports. Experiments across six tasks and five multilingual scenarios show that LITMUS++ delivers accurate predictions with interpretable reasoning.

LITMUS++: An Agentic System for Predictive Analysis of Low-Resource Languages Across Tasks and Models

We present a modular, interactive system, SPORTSQL, for natural language querying and visualization of dynamic sports data, with a focus on the English Premier League (EPL). The system translates user questions into executable SQL over a live, temporally indexed database constructed from real-time Fantasy Premier League (FPL) data. It supports both tabular and visual outputs, leveraging symbolic reasoning capabilities of Large Language Models (LLMs) for query parsing, schema linking, and visualization selection. To evaluate system performance, we introduce the Dynamic Sport Question Answering Benchmark (DSQABENCH), comprising 1,700+ queries annotated with SQL programs, gold answers, and database snapshots. Our demo highlights how non-expert users can seamlessly explore evolving sports statistics through a natural, conversational interface.

SPORTSQL: An Interactive System for Real-Time Sports Reasoning and Visualization

Large Language Models (LLMs) have been positioned as having the potential to expand access to health information in the Global South, yet their evaluation remains heavily dependent on benchmarks designed around Western norms. We present insights from a preliminary benchmarking exercise with a chatbot for sexual and reproductive health (SRH) for an underserved community in India. We evaluated the chatbot using HealthBench, a benchmark for conversational health models by OpenAI. We extracted 637 SRH queries from the dataset and evaluated on the 330 single-turn conversations. Responses were evaluated using HealthBench's rubric-based automated grader, which rated responses consistently low. However, qualitative analysis by trained annotators and public health experts revealed that many responses were actually culturally appropriate and medically accurate. We highlight recurring issues, particularly a Western bias in legal framing and norms (e.g., breastfeeding in public), diet assumptions (e.g., fish safe to eat during pregnancy), and costs (e.g., insurance models). Our findings demonstrate the limitations of current benchmarks in capturing the effectiveness of systems built for different cultural and healthcare contexts. We argue for the development of culturally adaptive evaluation frameworks that meet quality standards while recognizing the needs of diverse populations.

Beyond the Rubric: Cultural Misalignment in LLM Benchmarks for Sexual and Reproductive Health

Large language models (LLMs) are now used in scientific peer review, but their judgments can still be influenced by how information is presented.
We study how the style of a paper’s title affects the way LLMs score scientific work.
To control for content variation, we build the TitleTrap benchmark using abstracts generated by a language model for common research topics in computer vision and NLP.
Each abstract is paired with three titles: a branded colon style, a plain descriptive style, and an interrogative style, while the abstract text remains fixed.
We ask GPT-4o and Claude to review these title–abstract pairs under the same instructions.
Our results show that title style alone can change the scores: branded titles often receive higher ratings, while interrogative titles sometimes lead to lower assessments of rigor.
These findings reveal a presentation bias in LLM-based peer review and suggest the need for better methods to reduce such bias and support fairer automated evaluation.

TitleTrap: Probing Presentation Bias in LLM-Based Scientific Reviewing

Like human intelligence, which is highly complex in nature, large language models become especially challenging to evaluate when they move beyond well-defined, STEM-style tasks into socially and culturally rich domains. The first part of this talk focuses on assessing social intelligence in LLMs, exploring their ability to handle phenomena where ambiguity, cultural difference, and subjectivity make “correctness” difficult to define, and where capabilities beyond text are required, such as omni-modal sensory understanding. The talk then examines the role of LLMs as evaluators, considering their reliability, biases, and prompt sensitivity, and concludes with reflections on building more robust and socially grounded evaluation frameworks.
