Singapore

Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks; however, their knowledge and abilities in cross-view geo-localization and pose estimation remain unexplored, despite potential benefits for navigation, autonomous driving, and outdoor robotics. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with 755,976 corresponding question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining 713,076 are intended as training data to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore how instruction-tuning improves these capabilities. Our results show that, while current LMMs achieve impressive performance on geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement. Instruction-tuning LMMs on the GeoX-Bench training data significantly improves their cross-view geo-sense abilities. GeoX-Bench will be released upon publication.

AAAI 2026

GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

geo-localization

large multimodal models

pose estimation

benchmark

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Improving large language models (LLMs) for electronic health record (EHR) reasoning is essential for enabling accurate and generalizable clinical predictions. While LLMs excel at medical text understanding, they underperform on EHR-based prediction tasks due to challenges in modeling temporally structured, high-dimensional data. Existing approaches often rely on hybrid paradigms, where LLMs serve merely as frozen prior retrievers while downstream deep learning (DL) models handle prediction, failing to improve the LLM’s intrinsic reasoning capacity and inheriting the generalization limitations of DL models. To this end, we propose EAG-RL, a novel two-stage training framework designed to intrinsically enhance LLMs’ EHR reasoning ability through expert attention guidance, where expert EHR models refer to task-specific DL models trained on EHR data. Concretely, EAG-RL first constructs high-quality, stepwise reasoning trajectories using expert-guided Monte Carlo Tree Search to effectively initialize the LLM’s policy. Then, EAG-RL further optimizes the policy via reinforcement learning by aligning the LLM’s attention with clinically salient features identified by expert EHR models. Extensive experiments on two real-world EHR datasets show that EAG-RL improves the intrinsic EHR reasoning ability of LLMs by an average of 14.62%, while also enhancing robustness to feature perturbations and generalization to unseen clinical domains. These results demonstrate the practical potential of EAG-RL for real-world deployment in clinical prediction tasks.

Toward Better EHR Reasoning in LLMs: Reinforcement Learning with Expert Attention Guidance

Speech translation (ST) aims to translate speech from a source language into text in the target language. Naturally, speech signals contain paralinguistic cues beyond linguistic content, which could influence or even alter the interpretation of a lexically identical sentence, thereby yielding distinct translations. However, existing ST models lack direct and sufficient modeling of paralinguistic information, which limits their ability to perceive paralinguistic cues and understand speech comprehensively, leading to degraded translation performance. In response, we propose ParaLinguistic-aware Speech Translation (PLaST), a novel dual-branch framework that directly leverages paralinguistic cues beyond the linguistic content. Specifically, PLaST employs a speech encoder and a style extractor to independently generate linguistic and paralinguistic representations, respectively. To obtain a purified linguistic representation aligned with the text representation, a hierarchical Optimal Transport (OT) is applied on the layer-wise outputs from an LLM decoder. Then, the paralinguistic information is retrieved and refined with an Attention-based Retrieval (AR) module, with the linguistic representation serving as queries to enable joint guidance for semantic understanding and translation generation. PLaST outperforms the strong baseline with an average of 5.0 directional and 4.5 global contrastive likelihood scores on the paralinguistic-sensitive benchmark ContraProST, demonstrating its superior capability in paralinguistic perception. Further experiments on the standard speech translation benchmark CoVoST-2 show that PLaST generalizes well to typical ST scenarios.
Code and models will be made publicly available after the peer review process.

PLaST: Towards Paralinguistic-aware Speech Translation

In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. 
To further scale up the study and address the limitations of manual data collection and labeling, such as fallacy-type imbalance and labor-intensive annotation, we introduce SmartyPat, an automated framework powered by logic programming-based oracles. 
SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. 
Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. 
Finally, experiments reveal insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.

Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-Based Test Oracles

Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, the effectiveness of these methods is critically dependent on ranking accuracy, a metric where further gains are highly impactful. This limitation arises from a fundamental problem that we identify and formalize as the Overfitting-Underfitting Dilemma: current margin designs cause models to apply excessive, wasteful gradients to correctly ranked samples (overfitting) while providing insufficient corrective signals for misranked ones (underfitting). To resolve this dilemma, we propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones. Extensive experiments on widely used benchmarks demonstrate that AMaPO achieves better ranking accuracy and superior downstream alignment performance, and targeted analysis confirms that it successfully mitigates the core overfitting and underfitting issues.

AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment

Developing open-set classification methods capable of classifying in-distribution (ID) data while detecting out-of-distribution (OOD) samples is essential for deploying graph neural networks (GNNs) in open-world scenarios. Existing methods typically treat all OOD samples as a single class, even though real-world applications, especially high-stakes settings like fraud detection and medical diagnosis, demand deeper insights into OOD samples, including their probable labels. This raises a critical question: Can OOD detection be extended to OOD classification without true label information? To answer this question, we introduce a Coarse-to-Fine open-set Classification (CFC) method that leverages large language models (LLMs) for text-attributed graphs. CFC consists of three key components: (1) a coarse classifier that utilizes LLM prompts for OOD detection and outlier label generation; (2) a GNN-based fine classifier trained with OOD samples from (1) for enhanced OOD detection and ID classification; and (3) refined OOD classification achieved through LLM prompts and post-processed OOD labels. Unlike methods relying on synthetic or auxiliary OOD samples, CFC employs semantic OOD data, that is, instances that are genuinely out-of-distribution based on their inherent meaning, thus improving interpretability and practical utility. CFC enhances OOD detection by 10% compared to state-of-the-art approaches on text-attributed graphs and in the text domain, while achieving up to 70% accuracy in OOD classification on graph datasets.

Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models

Color harmonization adjusts the colors of an inserted object so that it perceptually matches the surrounding image, resulting in a seamless composite. The harmonization problem naturally arises in augmented reality (AR), yet harmonization algorithms are not currently integrated into AR pipelines because real-time solutions are scarce. In this work, we address color harmonization for AR by proposing a lightweight approach that supports on-device inference. For this, we leverage classical optimal transport theory by training a compact encoder to predict the Monge-Kantorovich transport map. We benchmark our algorithm against state-of-the-art methods and demonstrate that, for real composite AR images, our method achieves the best aggregated score. We release our dedicated AR dataset of composite images with pixel-accurate masks, together with a data-gathering toolkit to support further data acquisition by researchers.

Lightweight Optimal-Transport Harmonization on Edge Devices
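For readers unfamiliar with the Monge-Kantorovich transport map named in the abstract above, the following is a minimal sketch of its classical closed form between two Gaussian color distributions. It is not the paper's method: the paper trains a compact encoder to predict the map, whereas this sketch computes it directly from pixel statistics. The function names `sqrtm_psd`, `mk_linear_map`, and `harmonize`, the `eps` regularizer, and the assumption that pixels arrive as `(N, 3)` RGB arrays are all illustrative choices.

```python
import numpy as np

def sqrtm_psd(m):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def mk_linear_map(src, tgt, eps=1e-6):
    """Closed-form Monge-Kantorovich map between Gaussian fits of src and tgt.

    src, tgt: (N, 3) arrays of colors (hypothetical layout).
    Returns (A, mu_s, mu_t) such that x -> A (x - mu_s) + mu_t.
    """
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    dim = src.shape[1]
    cov_s = np.cov(src, rowvar=False) + eps * np.eye(dim)  # eps keeps the inverse stable
    cov_t = np.cov(tgt, rowvar=False) + eps * np.eye(dim)
    s_half = sqrtm_psd(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    # A = S^{-1/2} (S^{1/2} T S^{1/2})^{1/2} S^{-1/2}, a symmetric matrix
    A = s_half_inv @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_half_inv
    return A, mu_s, mu_t

def harmonize(pixels, A, mu_s, mu_t):
    """Apply the affine map to every pixel row."""
    return (pixels - mu_s) @ A.T + mu_t
```

In a compositing setting, `src` would hold the inserted object's pixels and `tgt` the background pixels; after the map is applied, the object's color mean and covariance match those of the background up to the small `eps` regularization.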

Core stability is a natural and well-studied notion for group fairness in multi-winner voting, where the task is to select a committee from a pool of candidates. We study the setting where voters either approve or disapprove of each candidate; here, it remains a major open problem whether a core-stable committee always exists. In this work, we develop an approach based on mixed-integer linear programming for deciding whether and when core-stable committees are guaranteed to exist. In contrast to SAT-based approaches popular in computational social choice, our method can produce proofs for a specific number of candidates independent of the number of voters. In addition to these computational gains, our program lends itself to a novel duality-based reformulation of the core stability problem, from which we obtain new existence results in special cases. Further, we use our framework to reveal previously unknown relationships between core stability and other desirable properties, such as notions of priceability.

On the Edge of Core (Non-)Emptiness: An Automated Reasoning Approach to Approval-Based Multi-Winner Voting
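To make the notion of core stability in the abstract above concrete, here is a small brute-force checker using the standard definition for approval ballots: a committee W of size k is blocked if some candidate set T and voter group S satisfy |T|/k <= |S|/n and every voter in S approves strictly more candidates in T than in W. This is only an illustrative sketch and not the paper's mixed-integer linear programming approach; the function name `find_core_deviation` and the data layout (a list of approval sets) are assumptions.

```python
from itertools import combinations

def find_core_deviation(approvals, candidates, committee, k):
    """Return (voters, T) blocking `committee`, or None if it is core-stable.

    approvals: list of sets, one approval set per voter (hypothetical layout).
    candidates: list of all candidates; committee: set of k candidates.
    """
    n = len(approvals)
    committee = set(committee)
    satisfaction = [len(a & committee) for a in approvals]
    for size in range(1, k + 1):
        for combo in combinations(candidates, size):
            T = set(combo)
            # The largest possible blocking group: all voters strictly better off with T.
            gainers = [i for i in range(n) if len(approvals[i] & T) > satisfaction[i]]
            # The group can "afford" T when |T| / k <= |S| / n.
            if len(gainers) * k >= size * n:
                return gainers, T
    return None

approvals = [{"a", "b"}, {"a", "b"}, {"c", "d"}, {"c", "d"}]
candidates = ["a", "b", "c", "d"]
print(find_core_deviation(approvals, candidates, {"a", "b"}, k=2))
print(find_core_deviation(approvals, candidates, {"a", "c"}, k=2))
```

The search enumerates every candidate subset, so it is exponential in the number of candidates and only suitable for tiny instances.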

Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, dramatically improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code and model will be publicly available.

URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

Reinforcement Learning (RL) has shown significant promise in developing autonomous navigation algorithms for complex environments. However, the direct application of RL policies trained in simulation to real-world scenarios often faces challenges due to the reality gap. This paper proposes a two-stage system incorporating a segmentation strategy and a bird’s-eye-view (BEV) representation to mitigate the domain gap between simulation and reality. In the first stage, the segmentation transforms sensor data into a simplified and interpretable representation of the surrounding area, facilitating transferability across different deployments. In the second stage, the agent navigates using the BEV map and can be trained in a vectorized simulation environment, a setup that runs multiple parallel instances of the environment to provide a wide range of training scenarios. This vectorization enables rapid exposure to varied environmental conditions, thereby accelerating and diversifying the training of a deep RL agent to achieve optimal navigation behaviors while maintaining high-speed, in-bound trajectories. The segmentation is crucial because it supports generalization of the learned policy across different robotic platforms. The contribution of this paper lies in combining real-time semantic segmentation with a bird’s-eye-view navigation policy, resulting in a transferable and scalable framework for real-world deployment of RL-based navigation agents. Experimental results demonstrate that agents trained with this methodology exhibit robust navigation performance and adaptability in both simulated and real-world environments, validating the efficacy of combining vectorized simulation with real-world segmentation for practical robotic navigation.

Transferable RL for Real-World Navigation Using Semantic Segmentation and Bird’s-Eye View Abstraction
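The idea of a vectorized simulation environment mentioned in the abstract above can be illustrated with a toy example in which many independent environment instances are stepped in a single batched NumPy call. Everything here is hypothetical and unrelated to the authors' simulator: the class name `VectorizedPointEnv`, the 2-D point dynamics, the square track bounds, and the reward (speed while in bounds, -1 on leaving) are assumptions chosen only to show the batching pattern.

```python
import numpy as np

class VectorizedPointEnv:
    """N independent 2-D point robots advanced together (toy illustration)."""

    def __init__(self, num_envs, half_width=1.0, seed=0):
        self.num_envs = num_envs
        self.half_width = half_width  # the track is the square [-half_width, half_width]^2
        self.rng = np.random.default_rng(seed)
        self.pos = np.zeros((num_envs, 2))

    def _sample_starts(self, count):
        # Start positions are drawn from the inner half of the track.
        return self.rng.uniform(-0.5, 0.5, size=(count, 2)) * self.half_width

    def reset(self):
        self.pos = self._sample_starts(self.num_envs)
        return self.pos.copy()

    def step(self, actions):
        """actions: (num_envs, 2) displacements, applied to all instances at once."""
        self.pos = self.pos + actions
        out_of_bounds = np.any(np.abs(self.pos) > self.half_width, axis=1)
        # Reward speed while in bounds; penalize leaving the track.
        reward = np.where(out_of_bounds, -1.0, np.linalg.norm(actions, axis=1))
        # Auto-reset only the instances that left the track.
        self.pos[out_of_bounds] = self._sample_starts(int(out_of_bounds.sum()))
        return self.pos.copy(), reward, out_of_bounds
```

Because every instance lives in one array, a single policy forward pass can produce actions for all of them, which is what makes this style of simulation fast for collecting varied training experience.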

Stochastic sequential decision-making systems, such as Markov decision processes and their variants, are increasingly used in areas such as transportation, healthcare, and communication. However, the ability to explain these systems’ outputs to non-technical end users has not kept pace with their widespread adoption. This paper addresses that gap by extending prior work and presenting a unified framework for generating causal explanations of agent behavior in sequential decision-making settings, grounded in the structural causal model (SCM) paradigm. Our framework supports the generation of multiple, semantically distinct explanations for agent actions, capabilities that were previously unattainable. In addition to introducing a novel taxonomy of explanations for MDPs to guide empirical investigation, we develop both exact and approximate causal inference methods within the SCM framework. We analyze their applicability and derive run-time bounds for each. This leads to the proposed algorithm, MeanRESP, which operates flexibly across a spectrum of approximations tailored to external constraints. We further analyze the sample complexity and error rates of approximate MeanRESP, and provide a detailed comparison of its outputs, under varying definitions of responsibility, with popular Shapley-value-based methods. Empirically, we performed a series of experiments to evaluate the practicality and effectiveness of the proposed system, focusing on real-world computational demands and the validity and reliability of metrics for comparing approximate and exact causal methods. Finally, we present two user studies that reveal user preferences for certain types of explanations and demonstrate a strong preference for explanations generated by our framework compared to those from other state-of-the-art systems.
