
In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark is designed to evaluate whether LLMs can emulate the knowledge and reasoning skills of OR experts when given diverse and complex optimization problems. The dataset, crafted by OR experts, presents real-world optimization problems, requiring multi-step reasoning for their solutions. Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, indicating a gap in their aptitude to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs' generalization capabilities, providing insights for future research in this area. The dataset and evaluation code will be made available.

AAAI 2025

Evaluating LLM Reasoning in the Operations Research Domain with ORQA

snlp

language models


technical paper

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.





In natural human-to-human conversations, participants often receive feedback signals from one another based on their follow-up reactions. These reactions can include verbal responses, facial expressions, changes in emotional state, and other non-verbal cues. Similarly, in human-machine interactions, the machine can leverage the user's follow-up utterances as feedback signals to assess whether it has appropriately addressed the user's request. Therefore, we propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones, without relying on human or commercial LLM-based preference annotations. Our proposed reward mechanism, "Follow-up Likelihood as Reward" (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data on 8 pairwise-preference and 4 rating-based benchmarks. Building upon the FLR mechanism, we propose to automatically mine preference data from the online generations of a base policy model. The preference data are subsequently used to boost the helpfulness of the base model through direct alignment from preference (DAP) methods, such as direct preference optimization (DPO). Lastly, we demonstrate that fine-tuning the language model that provides follow-up likelihood with natural language feedback significantly enhances FLR's performance on reward modeling benchmarks and effectiveness in aligning the base policy model's helpfulness.

Aligning Language Models Using Follow-up Likelihood as Reward Signal
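The preference-mining step described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical scorer that returns the log-likelihood of a follow-up utterance given the context and a candidate response; the `toy_scorer` below is a word-overlap stand-in for a real language model, and the function names and the averaging over follow-ups are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer type: returns log p(followup | context, response).
FollowupScorer = Callable[[str, str, str], float]


def flr_reward(context: str, response: str,
               followups: List[str], scorer: FollowupScorer) -> float:
    """Average follow-up log-likelihood, used here as the reward (an assumption)."""
    if not followups:
        raise ValueError("need at least one follow-up utterance")
    return sum(scorer(context, response, f) for f in followups) / len(followups)


def mine_preference_pair(context: str, candidates: List[str],
                         followups: List[str],
                         scorer: FollowupScorer) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-reward candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")
    ranked = sorted(candidates,
                    key=lambda r: flr_reward(context, r, followups, scorer))
    return ranked[-1], ranked[0]


def toy_scorer(context: str, response: str, followup: str) -> float:
    """Stand-in for a language model: more word overlap with the context
    yields a higher (less negative) pseudo log-likelihood. It ignores the
    follow-up text, which a real scorer would condition on."""
    overlap = len(set(context.lower().split()) & set(response.lower().split()))
    return -1.0 / (1 + overlap)


chosen, rejected = mine_preference_pair(
    "how do I reverse a list in python",
    ["use list reverse or slicing in python", "I like turtles"],
    ["Thanks, that works."],
    toy_scorer,
)
print("chosen:", chosen)
print("rejected:", rejected)
```

The resulting (chosen, rejected) pairs are the kind of data a DAP method such as DPO consumes; in practice the scorer would be a language model's conditional log-probability instead of the toy function.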

Trajectory prediction models that can infer both the future trajectories of target vehicles and their associated uncertainties are crucial for safe and robust navigation and path planning of autonomous vehicles. However, the majority of existing trajectory prediction models have neither considered reducing the uncertainty as an objective during the training stage nor provided reliable uncertainty quantification during the inference stage, especially under potential distribution shift. Therefore, in this paper, we propose the Conformal Uncertainty Quantification under Distribution Shift framework, CUQDS, to quantify the uncertainty of the predicted trajectories of existing trajectory prediction models under potential data distribution shift, while improving the prediction accuracy of the models and reducing the estimated uncertainty during the training stage. Specifically, CUQDS includes 1) a learning-based Gaussian process regression module that models the output distribution of the base model (any existing trajectory prediction neural network) and reduces the estimated uncertainty through an additional loss term, and 2) a statistics-based Conformal P control module that calibrates the estimated uncertainty from the Gaussian process regression module in an online setting under potential distribution shift between training and testing data. Experimental results on various state-of-the-art methods using benchmark motion forecasting datasets demonstrate the effectiveness of our proposed design.

CUQDS: Conformal Uncertainty Quantification under Distribution Shift for Trajectory Prediction

Vision-Language models (VLMs) have shown great potential in enhancing open-world visual concept comprehension. Recent research focuses on finding an optimal multimodal collaboration strategy that significantly advances CLIP-based few-shot tasks. However, existing prompt-based solutions suffer from unidirectional information flow and increased parameter counts, since they explicitly condition the vision prompts on textual prompts across different transformer layers using non-shareable coupling functions. To address this issue, we propose a Dual-shared mechanism based on LoRA (DsRA) for VLM adaptation in low-data regimes. The proposed DsRA has several merits. First, we design an inter-modal shared coefficient that focuses on capturing patterns shared between the visual and textual modalities, ensuring effective mutual synergy between image and text features. Second, we propose an intra-modal shared matrix that achieves efficient parameter fine-tuning by combining the different coefficients to generate layer-wise adapters placed in the encoder layers. Our extensive experiments demonstrate that DsRA improves generalizability under few-shot classification, base-to-new generalization, and domain generalization settings. Our code will be released soon.

Exploring the Better Multimodal Synergy Strategy for Vision-Language Models

Gait, as a kind of soft biometric characteristic, can reflect the distinct walking patterns of individuals at a distance, offering a promising technique for unrestrained human identification. By largely excluding gait-unrelated cues hidden in RGB videos, the silhouette and skeleton, though visually compact, have long been two of the most prevalent gait modalities. Recently, several attempts have been made to introduce more informative data forms, such as human parsing and optical flow images, to capture gait characteristics, along with multi-branch architectures. However, due to inconsistencies in model designs and experimental settings, we argue that a comprehensive and fair comparative study among these popular gait modalities, covering both representational capacity and fusion strategy, is still lacking. From the perspectives of fine- vs. coarse-grained shape and whole vs. pixel-wise motion modeling, this work presents an in-depth investigation of three popular gait representations, i.e., silhouette, human parsing, and optical flow, with various fusion evaluations, and experimentally exposes their similarities and differences. Based on the obtained insights, we further develop a Reaggregation-and-Distinguish Fusion (ReDiFusion) strategy and use it to build our new framework, MultiGait++. ReDiFusion preserves commonalities while highlighting differences to enrich the gait features. To verify our findings and conclusions, we conduct extensive experiments on CCPG and SUSTech1K. The code will be available.

Exploring More from Multiple Gait Modalities for Human Identification

Keyword: PRS: Planning with Language Models

Large Language Models (LLMs) have shown promise in solving natural language-described planning tasks, but their direct use often leads to inconsistent reasoning and hallucination. While hybrid LLM-symbolic planning pipelines have emerged as a more robust alternative, they typically require extensive expert intervention to refine and validate generated action schemas. This not only limits scalability but also introduces a potential for biased interpretation, as a single expert's interpretation of ambiguous natural language descriptions might not align with the user's actual intent. To address this, we propose a novel approach that constructs an action schema library to generate multiple candidates, accounting for the diverse possible interpretations of natural language descriptions. We further introduce a semantic validation and ranking module that automatically filters and ranks these candidates without an expert in the loop. Experiments show that our pipeline outperforms the direct LLM planning approach. These findings demonstrate the feasibility of a fully automated end-to-end LLM-symbolic planner that requires no expert intervention, opening up the possibility for a broader audience to engage with AI planning with less prerequisite domain expertise.

Planning in the Dark: LLM-Symbolic Planning Pipeline Without Experts

Translation elongation is essential for cellular proteostasis and is implicated in cancer and neurodegeneration. Accurately predicting the rate of ribosome elongation at each codon (the position at the ribosomal A site) on mRNA is important for understanding and modulating protein synthesis, which can lead to innovative approaches in treating various diseases. However, predicting elongation rates is challenging due to the trade-off between capturing distal codon interactions and focusing on proximal codon effects at the A site. Approaches that capture distal codon interactions in the coding sequences (CDS) of mRNA fail to differentiate critical regions (codons near the A site) because they lack effective mechanisms for focusing on these regions. Conversely, because of model limitations in handling long mRNA sequences, some methods simplify inputs by conditioning solely on proximal codons surrounding the A site, losing important information from distal codons. To address this issue, we leverage Mamba's success in capturing long-range dependencies to account for the impact of distant codons on the A site. Additionally, we introduce a sliding window attention mechanism to emphasize the proximal codons around the A site during ribosome elongation. Building on these advancements, we present Sliding Window Attention Mamba (SWAMamba), a novel framework that simultaneously leverages both proximal and distal codon effects on the A site. We conduct comprehensive evaluations on ribosome data across four species and find that SWAMamba significantly outperforms current state-of-the-art methods in predicting translation elongation rates.

SWAMamba: A Sliding Window Attention Mamba Framework for Predicting Translation Elongation Rates
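To make the idea of emphasizing proximal codons concrete, here is a minimal sketch of a generic sliding-window (banded) attention mask. The abstract does not specify how SWAMamba's attention is built, so the window half-width, the function names, and the plain masked softmax are all assumptions chosen for illustration.

```python
import math
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when position j lies within `window` positions of i."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    """Softmax over the unmasked scores; masked positions get weight 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask must keep at least one position")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Five codon positions, window half-width 1: position 2 (standing in for the
# A site) attends only to positions 1, 2, and 3.
mask = sliding_window_mask(seq_len=5, window=1)
weights = masked_softmax([0.0, 0.0, 0.0, 0.0, 0.0], mask[2])
print(mask[2])
print(weights)
```

With uniform scores, the three unmasked positions each receive one third of the attention weight and the distal positions receive none; in a hybrid design, information from distal codons would have to come from the long-range component rather than from this local attention.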

The key to semi-supervised semantic segmentation lies in how to fully exploit a large amount of unlabeled data to improve the model’s generalization performance. Most methods fall into the trap of treating each class independently (i.e., class-independent consistency), neglecting the fact that semantic dependencies exist among classes. In this paper, we analyze the bottlenecks of class-independent consistency inherent in previous methods and offer a fresh perspective from cooperative game theory to explicitly encourage class-consensus alignment (i.e., class-consensus consistency) between the teacher network (weakly augmented view) and the student network (strongly augmented view). We formulate classes as players in an interactive game to model their interpretable consensus and shed light on the possibility of closer collaboration between the consensus itself and consistency regularization, yielding more comprehensive and effective supervision signals. To this end, we carefully design the class-consensus consistency without introducing any external knowledge to model class structure information, which renders better interpretability. We further present relaxed class-consensus consistency (RCC), which unlocks the potential of modeling class consensus by relaxing the strict alignment of direct class consensus values to ranking alignment. Extensive experimental results on multiple benchmarks, including mitochondria segmentation, demonstrate that RCC performs favorably against state-of-the-art methods. RCC achieves particularly significant improvements in low-data regimes. Code and models will be made available to facilitate future research.

Relaxed Class-consensus Consistency for Semi-supervised Semantic Segmentation

Emotion recognition based on body movements is vital in human-computer interaction. However, existing emotion recognition methods predominantly focus on enhancing classification accuracy, often neglecting the provision of textual explanations to justify their classifications. In this paper, we propose an Emotion-Action Interpreter powered by Large Language Model (EAI-LLM), which not only recognizes emotions but also generates textual explanations by treating 3D body movement data as unique input tokens within large language models (LLMs). Specifically, we propose a multi-granularity skeleton tokenizer designed for LLMs, which separately extracts spatio-temporal tokens and semantic tokens from the skeleton data. This approach allows LLMs to generate more nuanced classification descriptions while maintaining robust classification performance. Furthermore, we treat the skeleton sequence as a specific language and propose a unified skeleton token module. This module leverages the extensive background knowledge and language processing capabilities of LLMs to address the challenges of joint training on heterogeneous datasets, thereby significantly enhancing recognition accuracy on individual datasets. Experimental results demonstrate that our model achieves recognition accuracy comparable to existing methods. More importantly, with the support of background knowledge from LLMs, our model can generate detailed emotion descriptions based on classification results, even when trained on a limited amount of labeled skeleton data.

Understanding Emotional Body Expressions via Large Language Models

Abstraction is an important and useful concept in the field of artificial intelligence. To the best of our knowledge, there is no syntactic method for computing a sound and complete abstraction from a given low-level action theory and a refinement mapping. This paper aims to address this issue. To this end, we first present a variant of situation calculus, namely linear integer situation calculus, which serves as the framework for the high-level basic action theory. We then migrate Banihashemi, De Giacomo, and Lesperance’s abstraction framework from extended situation calculus to linear integer situation calculus. Finally, we impose some restrictions on refinement mappings and design a syntactic approach to computing sound and complete abstractions from low-level basic action theories and restricted refinement mappings.

A Syntactic Approach to Computing Complete and Sound Abstraction in the Situation Calculus

Existing local model-agnostic explanation techniques are ineffective for machine learning models that consider inputs of variable lengths, as they do not consider temporal information embedded in these models. To address this limitation, we propose ReX, a general framework for incorporating temporal information in these techniques. Our key insight is that these techniques typically learn a model surrogate by sampling model inputs and outputs, and we can incorporate temporal information in a uniform way by only changing the sampling process and the surrogate features. We instantiate our approach on three popular explanation techniques: Anchors, LIME, and Kernel SHAP. To evaluate the effectiveness of ReX, we apply our approach to six models in three different tasks. Our evaluation results demonstrate that our approach 1) significantly improves the fidelity of explanations, making model-agnostic techniques outperform a state-of-the-art model-specific technique on its target model, and 2) helps end users better understand the models' behaviors.
