United States

This paper presents LLaMA-Berry, an advanced mathematical reasoning framework to enhance the problem-solving ability of large language models (LLMs). The framework combines Monte Carlo Tree Search with Self-Refine (SR-MCTS) to optimize the reasoning paths and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, our SR-MCTS overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms, enabling a more efficient exploration of solution spaces. To guide the search process, we propose the Pairwise Preference Reward Model (PPRM), which predicts pairwise preferences between solutions through instruction-following capabilities trained by Reinforcement Learning from Human Feedback (RLHF). Finally, the Enhanced Borda Count (EBC) method is adopted to synthesize pairwise preferences into global quantile scores for evaluations. This approach mitigates the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior search efficiency and performance compared to existing open-source and closed-source methods, particularly in complex Olympiad-level benchmarks, including AIME24 and AMC23.
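The step that turns pairwise preferences into global scores can be illustrated with a plain Borda count. The abstract does not spell out the "enhanced" parts of EBC, so the sketch below is only the classical baseline: the function name `borda_quantile_scores` and the boolean matrix `prefers` are assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def borda_quantile_scores(prefers: List[List[bool]]) -> List[float]:
    """Plain Borda count over pairwise preferences (illustrative only).

    prefers[i][j] is True when solution i is preferred over solution j.
    Returns, for each solution, the fraction of the other solutions that
    have strictly fewer pairwise wins (a simple quantile-style score).
    """
    n = len(prefers)
    if n == 0:
        return []
    if n == 1:
        return [1.0]

    # Borda count: number of pairwise comparisons each solution wins.
    wins = [
        sum(1 for j in range(n) if j != i and prefers[i][j])
        for i in range(n)
    ]

    # Quantile score: share of the other solutions with fewer wins.
    return [
        sum(1 for j in range(n) if j != i and wins[j] < wins[i]) / (n - 1)
        for i in range(n)
    ]


# Hypothetical preferences: solution 0 beats 1 and 2, solution 1 beats 2.
example = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
]
scores = borda_quantile_scores(example)
```

Here the three hypothetical solutions receive scores of 1.0, 0.5, and 0.0. In the framework described above, the preference matrix would come from the reward model's pairwise predictions rather than being written by hand.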

NAACL 2025

LLaMA-Berry: Pairwise Optimization for Olympiad-level Mathematical Reasoning via O1-like Monte Carlo Tree Search

mcts

poster

### Welcome to the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics

Welcome to the 2025 meeting of the Nations of the Americas Chapter of the Association for Computational Linguistics! I am proud to help organize the first NAACL conference to carry the new name of our organization, one that emphasizes inclusion for all of the Americas. I am also pleased to welcome you to Albuquerque, New Mexico, a state whose unique blend of cultural influences will make for an excellent backdrop for NAACL 2025, especially with this year’s special theme on NLP in a Multicultural World. 
**[Continue reading...](https://drive.google.com/file/d/1jX-qGhqVSZZCIrAnJaz798pflu5Irrdn/view?usp=sharing)**

*- Colin Cherry, Google, NAACL 2025 General Chair* 



In this work, we systematically investigate how specific attributes of preference datasets affect the alignment and downstream performance of LLMs in instruction-following tasks. We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts with combinations of 23 verifiable constraints that enable fine-grained and automated quality assessments of model responses. With our synthetic prompts, we use rejection sampling (RS) and Monte Carlo Tree Search (MCTS) to obtain preference pairs. Then, we perform experiments investigating the effects of (1) the presence of shared prefixes between the chosen and rejected responses, (2) the contrast and quality of the chosen and rejected responses, and (3) the complexity of the training prompts. Our experiments reveal that shared prefixes provide marginal but consistent improvements and greater stability across challenging training configurations. While high-contrast preference pairs generally outperform low-contrast pairs, combining both often yields the best performance. Additionally, training on prompts of moderate difficulty leads to better generalization across different tasks. Our findings provide actionable insights into optimizing preference data curation for instruction-following tasks, offering a scalable and effective framework for enhancing LLM training and alignment.
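The distinction between high-contrast and low-contrast preference pairs can be sketched as follows. This is a minimal illustration under stated assumptions: each sampled response is assumed to already carry a quality score in [0, 1] (for example, the fraction of verifiable constraints it satisfies), and the function name `contrast_pairs` and the selection rule are hypothetical, not the paper's pipeline.

```python
from typing import Dict, List, Optional, Tuple

Scored = Tuple[str, float]  # (response text, quality score in [0, 1])


def contrast_pairs(samples: List[Scored]) -> Optional[Dict[str, Tuple[str, str]]]:
    """Pick one high-contrast and one low-contrast (chosen, rejected) pair.

    Assumed rule (for illustration only): the chosen response is the
    top-scoring sample; the high-contrast pair rejects the lowest-scoring
    sample, the low-contrast pair rejects the nearest lower-scoring one.
    """
    if not samples:
        return None
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    best = ranked[0]
    # Only strictly lower-scoring responses can serve as "rejected".
    lower = [s for s in ranked if s[1] < best[1]]
    if not lower:
        return None  # all samples tie, so no preference signal exists
    return {
        "high_contrast": (best[0], lower[-1][0]),
        "low_contrast": (best[0], lower[0][0]),
    }


# Hypothetical scored samples for a single prompt.
pairs = contrast_pairs([("a", 0.5), ("b", 1.0), ("c", 0.0)])
```

With these hypothetical scores, the high-contrast pair is `("b", "c")` and the low-contrast pair is `("b", "a")`. A training set that combines both kinds of pairs would simply keep both entries for each prompt.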

A Systematic Examination of Preference Learning through the Lens of Instruction-Following

We introduce DateLogicQA, a human-curated benchmark of 190 questions specifically designed to understand temporal bias in Large Language Models (LLMs). Covering seven date formats across past, present, and future contexts, DateLogicQA examines four reasoning types: commonsense, factual, conceptual, and numerical. Through human-led evaluations of 12 state-of-the-art LLMs, we identify Representation-Level Bias, arising from suboptimal embeddings that distort date semantics, and Logical-Level Bias, manifesting when correct date tokens yield flawed temporal reasoning. Our findings underscore persistent challenges in handling various date formats and temporal contexts, revealing the need for more robust pretraining data, targeted post-training methods, and precise tokenization strategies. By illuminating these biases, we provide actionable insights to guide the development of LLMs for accurate temporal reasoning across diverse real-world applications.

DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Large Language Models (LLMs) have recently demonstrated remarkable generative abilities, producing coherent text even with long input sequences. 
In the medical domain, LLMs have opened new research questions about their ability to interpret complex medical documents and extract useful information for a variety of purposes. 
Recent work has shown promising LLM performance on generation and summarization of clinical reports, doctor-patient interactions, patient progress notes, etc.
In this paper, we evaluate a suite of open-source models ranging from 1B to 70B parameters on a recent discharge report summarization dataset through in-context learning and fine-tuning. 
We show that zero-shot 8B parameter models outperform past zero-shot work on this dataset, and that efficient fine-tuning of small 3B parameter models on fewer than 100 examples performs as well as large 70B parameter models used few-shot.

Efficient LLM Adaptation for Long Clinical Text Summarization

Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness, or the ability to generate responses strictly supported by the context, is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document supplied in context before the computationally expensive answer generation by LLMs. This will potentially reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs such as Llama3 8B and GPT4 in groundedness detection while reducing inference latency by orders of magnitude.

Small Encoders Can Rival Large Decoders in Detecting Groundedness

Structured data, such as tables, graphs, and databases, play a critical role in many NLP tasks such as question answering and dialogue systems. Recently, inspired by Vision-Language Models, Graph Neural Networks (GNNs) have been introduced as an additional modality into the input of Large Language Models (LLMs) to improve their performance on Structured Knowledge Grounding (SKG) tasks. However, those GNN-enhanced LLMs have the following limitations: (1) They employ diverse GNNs to model varying types of structured data, rendering them unable to uniformly process various forms of structured data. (2) The pretraining of GNNs is coupled with specific LLMs, which prevents GNNs from fully aligning with the textual space and limits their adaptability to other LLMs. To address these issues, we propose **L**arge **L**anguage and **S**tructured Data **A**ssistant (LLaSA), a general framework for enhancing LLMs' ability to handle structured data. Specifically, we represent various types of structured data in a unified hypergraph format, and use self-supervised learning to pretrain a hypergraph encoder and a G-Former that compresses encoded hypergraph representations with cross-attention. The compressed hypergraph representations are appended to the serialized inputs during the training and inference stages of LLMs. Experimental results on multiple SKG tasks show that our pretrained hypergraph encoder can adapt to various LLMs and enhance their ability to process different types of structured data. In addition, LLaSA with LoRA fine-tuning outperforms previous SOTA methods that use full-parameter tuning.

LLaSA: Large Language and Structured Data Assistant

General-purpose language models (LMs) are aligned to diverse user intents, but fall short when it comes to specific applications. While finetuning is the default method for customized alignment, human annotations are often unavailable in various customization scenarios. Based on the observation that one of the main issues of LM customization is constraint adherence, we investigate the feasibility of using constraints as a bridge from general LMs to customized ones. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute the constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and automatically collects preference labels based on their CSR. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby achieving task performance comparable to or approaching that of finetuning with labeled data.
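The verifier-to-preference-label step described above can be sketched in a few lines. The sketch assumes the simplest case, where every verifier inspects only the response text; the abstract mentions three constraint classes distinguished by their arguments, which are not modeled here. The names `constraint_satisfaction_rate` and `preference_pairs` and the two toy verifiers are hypothetical.

```python
from typing import Callable, List, Tuple

# Assumption: a verifier looks only at the response and returns pass/fail.
Verifier = Callable[[str], bool]


def constraint_satisfaction_rate(response: str, verifiers: List[Verifier]) -> float:
    """Fraction of constraints that the response satisfies."""
    if not verifiers:
        return 0.0
    return sum(1 for v in verifiers if v(response)) / len(verifiers)


def preference_pairs(
    responses: List[str], verifiers: List[Verifier]
) -> List[Tuple[str, str]]:
    """Return (preferred, dispreferred) pairs for responses whose CSR differs."""
    scored = [(r, constraint_satisfaction_rate(r, verifiers)) for r in responses]
    pairs = []
    for i, (ri, si) in enumerate(scored):
        for rj, sj in scored[i + 1:]:
            if si > sj:
                pairs.append((ri, rj))
            elif sj > si:
                pairs.append((rj, ri))
            # Equal CSR gives no preference signal, so the pair is skipped.
    return pairs


# Toy constraints: at most three words, and ends with a period.
toy_verifiers: List[Verifier] = [
    lambda r: len(r.split()) <= 3,
    lambda r: r.endswith("."),
]
toy_responses = ["Short answer.", "A much longer answer without a period", "Tiny"]
labels = preference_pairs(toy_responses, toy_verifiers)
```

The resulting pairs could then feed any ranking-based objective; which objective ACT uses is not stated in the abstract, so none is shown here.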

Aligning to Constraints for Data-Efficient Language Model Customization

In clinical trial design, baseline feature selection is one of the crucial tasks for characterizing study cohorts and ensuring accurate study outcomes. Large Language Models (LLMs) show promise in automating this process by analyzing trial data and identifying key features. To assess the capabilities of LLMs in generating appropriate baseline features for clinical trials, we create two datasets: *CT-Repo*, which contains baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and *CT-Pub*, a curated subset of 100 clinical trials with more detailed baseline features extracted from published studies. In this paper, we consider GPT-4o and LLaMa3-70B-Instruct models in three configurations: zero-shot, three-shot with a fixed set of examples, and three-shot using an adaptive set of examples based on a Retrieval-Augmented Generation (RAG) approach. We evaluate the model performance of baseline feature generation using the *LLM-as-a-Judge* framework. We further validate the LLM-as-a-Judge evaluation on the CT-Pub dataset using assessments from human experts in a clinical trial. The results indicate that the RAG-based three-shot learning approach significantly improves performance by providing relevant, context-specific examples. This study marks an important initial advancement in using LLMs for the robust design of clinical trials and observational studies.

Are Large Language Models Effective in Clinical Trial Design? A Study on Baseline Feature Generation

In this paper, we present HALLUCANA, a canary lookahead to detect and correct factual hallucinations of Large Language Models (LLMs) in long-form generation. HALLUCANA detects and intervenes as soon as traces of hallucination emerge, during and even before generation. To support timely detection, we exploit the internal factuality representation in the LLM hidden space, where we investigate various proxies to the LLMs’ factuality self-assessment, and discuss its relation to the models’ context familiarity from their pre-training. On biography generation, our method improves generation quality by up to 2.5x, while consuming over 6 times less compute.

HALLUCANA: Fixing LLM Hallucination with A Canary Lookahead

In this paper, we propose a system designed to process and interpret vague, open-ended, and multi-line complex natural language queries, transforming them into coherent, actionable data stories. Our system's modular architecture comprises five components (Question Generation, Answer Generation, NLG/Chart Generation, Chart2Text, and Story Representation), each utilizing LLMs to transform data into human-readable narratives and visualizations. Unlike prior art, our system uniquely addresses the ambiguity of vague, multi-line queries, setting a new benchmark in data storytelling by tackling complexities no existing system comprehensively handles. Our system is cost-effective, using open-source models without extra training, and emphasizes transparency by showcasing end-to-end processing and intermediate outputs. This enhances explainability, builds user trust, and clarifies the data story generation process.

Goal-Driven Data Story, Narrations and Explanations

Many endangered languages are at risk of extinction due to barriers in communication and generational gaps that hinder their preservation. One cause of language endangerment is the lack of language educational tools and artificial intelligence (AI) models for these low-resource languages. To address this, we propose the ATAIGI learning app, designed with AI-powered models leveraging multimodal generative techniques. Our app offers users a comprehensive learning experience by providing translated phrases and definitions, example sentences, illustrative images, romanized pronunciation, and audio speech to accelerate language learning. ATAIGI is built on five AI models that are rigorously benchmarked individually, with our Transliteration Model achieving state-of-the-art results for Taiwanese Hokkien transliteration. ATAIGI is available for all to learn Taiwanese Hokkien, an endangered language spoken in Taiwan. A human evaluation demonstrates the effectiveness of ATAIGI in improving language proficiency and cultural understanding, supporting its potential for the preservation and education of endangered languages like Taiwanese Hokkien.
