Singapore

Users should not be systemically disadvantaged by the language they use to interact with LLMs; that is, users should receive responses of similar quality irrespective of the language used. In this work, we evaluate whether responses to real-world, open-ended questions vary by language, specifically whether answer quality depends on the language used to query the model. We also investigate how language and culture are entangled in LLMs, such that the choice of language changes the cultural information and context used in the response. To investigate this, we evaluate LLMs on a translated subset of the CulturalBench benchmark across multiple languages. Our evaluations reveal that LLMs consistently provide lower-quality answers to open-ended questions in low-resource languages. We find that language significantly impacts the cultural context used by the model, and that this difference in context impacts the quality of the downstream answer.

AAAI 2026

Language Models Entangle Language and Culture

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Interpretability and robustness remain major challenges for modern Large Language Models (LLMs), especially in settings where conventional evaluation or auditing tools are limited. To address this, we propose Inverse Language Modeling (ILM), a unified training framework that jointly enhances robustness to adversarial perturbations and enables a novel form of gradient-based interpretability. Rather than reconstructing exact input prompts, ILM encourages LLMs to develop gradient-aligned internal representations that allow the model to approximate plausible input patterns underlying a given output. This approximate inversion provides a new mechanism for analyzing model behavior, identifying potential triggers for unsafe generations, and supporting lightweight governance and red-teaming workflows. Our results show that ILM can simultaneously improve robustness and produce meaningful inversion signals, laying a foundation for LLMs that are not only more resilient but also more transparent and analyzable.

Inverse Language Modeling towards Robust and Grounded LLMs

Recent multilingual language models promise support for “100+ languages,” yet speakers of Indigenous and other underrepresented languages still often do not see themselves in these advances. In this work, we take a deliberately simple, secondary-benchmark perspective: rather than proposing a new model or dataset, we re-evaluate an off-the-shelf multilingual natural language inference (NLI) model on public benchmarks that explicitly include Indigenous languages of the Americas. Concretely, we use the AmericasNLI benchmark for ten Indigenous languages and XNLI for English and Spanish, and we evaluate the widely used joeddav/xlm-roberta-large-xnli model under a fixed, zero-shot protocol. Our goal is to answer three questions: (i) How large is the performance gap between high-resource and underrepresented languages under the same model and task? (ii) Are these gaps consistent across languages, or do some communities fare systematically worse than others? (iii) What kinds of qualitative errors arise, and what do they suggest about cultural and linguistic mismatch? Our experiments reveal a striking discrepancy: while English and Spanish reach almost perfect accuracy on XNLI (around 99.8% on our runs), the same model averages only about 43% accuracy across ten Indigenous languages in AmericasNLI, with none exceeding 47%. We also show qualitative NLI failures in Quechua that point to difficulties with morphology, idioms, and discourse-level inference. We argue that even such a simple re-analysis can serve as a low-cost yet high-impact tool for making inequities in multilingual NLP visible, especially for communities that rarely appear in headline benchmarks.

Advancing NLP Equity: A Secondary Benchmark Evaluation of Multilingual Language Models for Underrepresented Languages

Guardian models monitor and regulate the outputs of user-facing AI systems. However, current guardian models fall short in two key ways. First, they are predominantly Western-centric and optimized for high-resource languages, leaving low-resource African languages vulnerable to evolving harms, cross-lingual safety failures, and cultural misalignment. Second, most guardian models rely on rigid, predefined safety categories that do not generalize across diverse linguistic and sociocultural contexts. Ensuring robust safety requires flexible, runtime-enforceable policies and benchmarks that reflect local norms, harm scenarios, and cultural expectations. We introduce UbuntuGuard, the first African policy-based safety benchmark built from adversarial queries authored by 155 domain experts across sensitive fields, including healthcare, education, government, and finance. From these expert-crafted queries, we derive context-specific safety policies and reference responses that capture culturally grounded risk signals, enabling policy-aligned evaluation of guardian models. We evaluate six state-of-the-art guardian models, including static, dynamic, and multilingual variants, under multiple scenarios. Our findings reveal that existing English-centric benchmarks overestimate real-world multilingual safety, cross-lingual transfer provides partial but insufficient coverage, and dynamic models, while better equipped to leverage policies at inference time, still struggle in fully localized African-language contexts. These findings highlight the urgent need for multilingual, culturally grounded safety benchmarks to enable the development of reliable and equitable guardian models for low-resource languages.


UbuntuGuard: A Policy-Based Safety Benchmark for Low-Resource African Languages

Static leaderboards and single-turn judgments correlate weakly with deployment outcomes, especially in multilingual and resource-constrained settings. This position paper argues that credible evaluation hinges on verifiability: ex ante specifications that permit observable checks, repeatable scoring, and auditable evidence. We propose a minimal standard that makes verifiability first-class while remaining compatible with existing workflows. The standard has four artifacts: a task schema, a validator entry point, a run card, and required reporting fields. We ground the proposal in prior work on coverage and transparency and on specification-based checks. We present a prototype evaluation task for schema-constrained instruction following with robustness probes and a multilingual protocol, and we attach measurement and governance procedures that link scores to validity arguments. The goal is to replace generic win rates with verifiable claims about task success that better predict real use across languages and contexts.

Beyond Static Leaderboards: A Roadmap to Naturalistic, Functional Evaluation of LLMs
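As a rough illustration of how the four artifacts named in the abstract above (task schema, validator entry point, run card, required reporting fields) might fit together, here is a minimal sketch. Every class, function, and field name below is hypothetical and chosen for illustration only; the abstract does not specify the actual format of the standard.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TaskSchema:
    """Hypothetical task schema: the fields a model response must contain."""
    task_id: str
    required_fields: Dict[str, type]


def validate(schema: TaskSchema, raw_output: str) -> bool:
    """Hypothetical validator entry point: an observable, repeatable check.

    Returns True only if the output parses as a JSON object and every
    required field is present with the expected type.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(
        isinstance(parsed.get(name), expected)
        for name, expected in schema.required_fields.items()
    )


@dataclass
class RunCard:
    """Hypothetical run card: records what was run so scores can be audited."""
    model_name: str
    language: str
    seed: int
    results: List[bool] = field(default_factory=list)

    def report(self) -> Dict[str, Any]:
        """Hypothetical reporting fields derived from the recorded results."""
        n = len(self.results)
        pass_rate: Optional[float] = sum(self.results) / n if n else None
        return {
            "model_name": self.model_name,
            "language": self.language,
            "seed": self.seed,
            "n_items": n,
            "pass_rate": pass_rate,
        }


# Toy usage with made-up outputs (not results from the paper).
demo_schema = TaskSchema("demo-task", {"answer": str, "confidence": float})
demo_outputs = [
    '{"answer": "x", "confidence": 0.5}',  # valid
    "not json at all",                     # fails to parse
    '{"answer": "y", "confidence": 0.9}',  # valid
]
demo_card = RunCard(model_name="placeholder-model", language="en", seed=0)
demo_card.results = [validate(demo_schema, out) for out in demo_outputs]
demo_report = demo_card.report()
```

The point of the sketch is that the pass rate is tied to a check anyone can rerun against the same schema, rather than to an unspecified judgment.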

Underserved and extremely low-resource languages challenge current language technologies, especially when lexical borrowing and synonymy undermine exact-match assumptions. We study Bahnaric-Vietnamese lexical mapping as a step toward meaning-preserving sentence translation. Unlike prior work based on static embeddings and Mean Squared Error (MSE) alignment, we learn sentence-aware word representations with a small multilingual transformer pretrained on Vietnamese, adapt it with Low-Rank Adaptation (LoRA) for parameter efficiency, and align Bahnaric-Vietnamese pairs using a two-layer projection trained with InfoNCE contrastive loss. We exploit a new community-sourced lexicon of approximately 10,000 Bahnaric-Vietnamese pairs collected with local partners, capturing one-to-one, one-to-many, and many-to-one anchor relations as well as extensive lexical borrowing. Experiments evaluate retrieval-style alignment with Precision at K (P@K) and Mean Reciprocal Rank (MRR), as well as sentence translation using top-1 accuracy, Bilingual Evaluation Understudy (BLEU), and Character n-gram F-score (chrF). On the ∼1k lexicon, our best model attains P@1 ≈ 0.53 and MRR ≈ 0.62, substantially improving over a static-embedding MSE baseline, while on the richer ∼10k community lexicon it reaches comparable sentence-level top-1 accuracy despite slightly lower BLEU and chrF, highlighting both the benefits of the expanded resource and the remaining challenges of synonym-rich, low-frequency vocabulary.

Sentence-Aware Bahnaric-Vietnamese Lexical Mapping with Contrastive Contextual Representations
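The retrieval-style metrics mentioned in the abstract above, P@K and MRR, can be sketched as follows. This is a minimal illustration that assumes the convention common in lexicon-induction work, where P@K counts a query as correct if any gold translation appears among the top K candidates; the paper's exact definition may differ. The function names and the placeholder tokens are hypothetical and are not Bahnaric or Vietnamese data.

```python
from typing import Dict, List, Set


def hit_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold target appears in the top-k ranked candidates, else 0.0."""
    return 1.0 if any(candidate in gold for candidate in ranked[:k]) else 0.0


def reciprocal_rank(ranked: List[str], gold: Set[str]) -> float:
    """1 / rank of the first gold target in the list, or 0.0 if none is retrieved."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def evaluate(
    predictions: Dict[str, List[str]],
    references: Dict[str, Set[str]],
    k: int = 1,
) -> Dict[str, float]:
    """Average both metrics over all source words in the reference lexicon."""
    queries = list(references)
    p_at_k = sum(hit_at_k(predictions[q], references[q], k) for q in queries)
    mrr = sum(reciprocal_rank(predictions[q], references[q]) for q in queries)
    return {"p_at_k": p_at_k / len(queries), "mrr": mrr / len(queries)}


# Toy lexicon with placeholder tokens; "src_b" has two acceptable targets,
# mimicking a one-to-many entry.
toy_predictions = {
    "src_a": ["tgt_1", "tgt_2", "tgt_3"],
    "src_b": ["tgt_4", "tgt_5", "tgt_6"],
}
toy_references = {
    "src_a": {"tgt_1"},
    "src_b": {"tgt_5", "tgt_9"},
}
toy_scores = evaluate(toy_predictions, toy_references, k=1)
```

Using a set of gold targets per source word lets one-to-many entries count as correct when any acceptable translation is retrieved, which matters when synonymy undermines exact-match scoring.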

We present ENLIVEN-1000, a unified framework for endangered and low-resource language revitalization that integrates broad-coverage language identification (LID), machine translation (MT), and LLM-generated synthetic data, aimed at expanding safe, equitable NLP support for communities historically excluded from mainstream tools. We compile a text corpus for 1154 languages (1069 endangered or low-resource) from public sources and train a fastText-based LID model covering this vast set. The LID system achieves high detection quality with F1 ≈ 0.99 and FPR ≈ 3×10⁻⁶, substantially broadening reliable coverage beyond existing solutions. Focusing on five diverse endangered languages (Carpathian Romani, Chuj, Sunwar, Kapingamarangi, and Inuktitut), we fine-tune a 600M-parameter NLLB-200 model for translation. Our fine-tuned models outperform zero-shot baselines and even proxy models trained on related, high-resource languages, in both directions (endangered -> English and English -> endangered). We further use GPT-4o to generate synthetic parallel data, demonstrating that augmenting limited real data with LLM-generated text yields substantial MT improvements. These results illustrate a practical path toward scaling NLP support to hundreds of under-resourced languages. We discuss implications for language revitalization and ethical considerations in working with endangered language communities.



ENLIVEN-1000: A Comprehensive Revitalization Framework for 1000+ Endangered Languages via Broad-Coverage LID and LLM-Augmented MT

Large language models (LLMs) enable scalable conversational support for postpartum depression (PPD), yet current systems insufficiently account for intra-lingual cultural variation even within high-resource languages such as Chinese. Dialectal phrasing, local idioms, and culturally embedded expressions (e.g., Northeastern Mandarin "zhabayue de teng" (humorous discomfort) or the Southern Min "xin-gua-a-tia" (deep sorrow)) often produce misinterpretation, safety-critical ambiguity, or emotionally inappropriate responses in PPD-related dialogues. We introduce CAMA (Culturally Adaptive Multi-Agent Co-Design Framework), a lightweight cultural-sensitivity detection and alignment framework that identifies dialect-specific linguistic cues and supplements LLMs with contextual socio-cultural grounding without performing clinical diagnosis. Our approach integrates culturally aware prompting and intervention logic to enhance empathy, safety, relevance, and user trust. This work highlights that cultural fairness in mental-health LLMs must consider intra-language diversity, not only cross-lingual disparity. CAMA provides a practical pathway towards culturally aligned, safe, and trustworthy mental-health dialogue systems.

CAMA: A Culturally Adaptive Multi-Agent Framework for Postpartum Depression Support in Multilingual and Low-Resource Settings

Tokenization serves as a crucial preprocessing step in multilingual language models, affecting performance in both high-resource and low-resource languages. However, current tokenizers appear to inherit language biases from unbalanced training datasets, leaving them poorly optimized for underrepresented languages. This research examines the impact of balanced multilingual datasets on the performance of tokenizers trained with the Byte Pair Encoding, WordPiece, and Unigram Language Model algorithms. We build balanced corpora from various sources to study the impact of vocabulary size on 15k, 30k, and 50k dataset scales. The trained tokenizers are assessed through intrinsic metrics, including Subword Fertility and Normalized Sequence Length, as well as through extrinsic performance on downstream tasks such as Part-of-Speech tagging, Named Entity Recognition, and Machine Translation. We build custom datasets along with customized evaluation pipelines to enable consistent comparisons across nine languages using models built into standard NLP frameworks. Our observations reinforce the importance of a balanced dataset when training tokenizers and, in turn, advance the development of equitable and robust multilingual NLP systems.

From Bias to Balance: How Multilingual Dataset Composition Affects Tokenizer Performance Across Languages
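The two intrinsic metrics named in the abstract above can be sketched as follows. This is a minimal illustration that assumes common definitions: subword fertility as the average number of tokens produced per whitespace-delimited word, and Normalized Sequence Length as the per-sentence ratio of a tokenizer's token count to a baseline tokenizer's token count, averaged over sentences. The paper may define or normalize them differently. The toy tokenizers and function names are hypothetical stand-ins, not BPE, WordPiece, or Unigram implementations.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Baseline stand-in: one token per whitespace-delimited word."""
    return text.split()


def chunk_tokenizer(text: str, size: int = 3) -> List[str]:
    """Toy subword stand-in: splits each word into chunks of up to `size` characters."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


def subword_fertility(tokenize: Tokenizer, sentences: List[str]) -> float:
    """Average number of tokens per word over the whole corpus (1.0 is the floor)."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words


def normalized_sequence_length(
    tokenize: Tokenizer, baseline: Tokenizer, sentences: List[str]
) -> float:
    """Mean per-sentence ratio of token counts relative to a baseline tokenizer."""
    ratios = [len(tokenize(s)) / len(baseline(s)) for s in sentences]
    return sum(ratios) / len(ratios)


# Placeholder corpus, not real language data.
toy_corpus = ["abc defgh", "ab"]
toy_fertility = subword_fertility(chunk_tokenizer, toy_corpus)
toy_nsl = normalized_sequence_length(chunk_tokenizer, whitespace_tokenizer, toy_corpus)
```

Higher fertility for one language than another under the same tokenizer means that language's text is split into more pieces per word, which is one way an unbalanced training corpus can show up in intrinsic evaluation.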

Neural Machine Translation (NMT) for low-resource and underserved languages remains challenging due to the severe lack of parallel corpora, linguistic tools, and evaluation resources. The issue is evident in Vietnam, where the ethnolinguistic minority languages Tày (Tai–Kadai) and Bahnar (Austroasiatic) hold cultural significance but remain digitally under-represented. Data Augmentation (DA) offers a cost-effective remedy; however, most existing techniques were designed for high-resource analytic languages and are often applied heuristically without assessing their linguistic compatibility. In this work, we present the first systematic study of DA for two minority language pairs, Tày–Vietnamese and Bahnar–Vietnamese, within a three-stage language model pipeline consisting of Vietnamese-based initialization, monolingual adaptation, and supervised fine-tuning. We train two independent encoder–decoder NMT systems to isolate augmentation effects and analyze how linguistic typology shapes augmentation behavior. Our experiments show that meaning-preserving DA methods consistently improve translation adequacy and linguistic faithfulness, whereas several widely used techniques introduce semantic or structural degradation. Through quantitative evaluation and typology-aware linguistic analysis, we derive practical guidelines for selecting DA strategies in extremely low-resource and typologically diverse settings. We additionally release newly digitized high-quality bilingual corpora and trained models to facilitate future research and community-centered NLP development.

Not All Data Augmentation Works: A Typology-Aware Study for Low-Resource Neural Machine Translation in Vietnamese Ethnic Minority Languages

Evaluating the safety and alignment of AI systems remains a critical challenge as foundation models grow increasingly sophisticated. Traditional evaluation methods rely heavily on human expert review, creating bottlenecks that cannot scale with the rapid pace of AI development. We introduce Jo.E (Joint Evaluation), a novel multi-agent collaborative framework that combines large language model evaluators, specialized AI agents, and strategic human expert involvement to conduct comprehensive safety assessments. Our framework employs a five-phase evaluation pipeline that systematically identifies vulnerabilities across multiple safety dimensions including adversarial robustness, fairness, ethics, and accuracy. Through extensive experiments on state-of-the-art models including GPT-4o, GPT-5, Llama 3.2, Phi 3, and Claude Sonnet 4, we demonstrate that Jo.E achieves approximately 22% improvement in vulnerability detection while reducing human expert time requirements by 54% compared to traditional evaluation approaches. Our results show that automated collaborative evaluation can significantly enhance both the efficiency and effectiveness of AI safety assessment without sacrificing rigor or comprehensive coverage.
