Learning multimodal representations is a fundamental task that supports a wide range of applications such as vision-text retrieval. While pioneering approaches, e.g., CLIP, paved the way by learning separate encoders for each modality, they struggle to model complex cross-modal interactions, resulting in inferior vision and language representations. Recently, researchers have begun to leverage powerful Large Vision-Language Models (LVLMs) for unimodal or multimodal encoding, showing substantial improvements over separate-encoder methods. However, we find that directly adapting LVLMs into embedding models suffers from insufficient visual representation and coarse multimodal alignment. To address these issues, we propose a simple yet effective Fine-grained Alignment Matters (FAM) method for fine-grained vision-language embedding learning with LVLMs. First, to close the gap between pure generation and multimodal embedding in LVLMs, we propose Multi-granularity Aligned Contrastive (MAC) learning, which explicitly learns and aligns fine-grained modality representations at multiple granularity levels using image-text pairs. Second, to mitigate the insufficiency of visual representation when adapting LVLMs to downstream embedding tasks, we propose a Vision Embedding Inversion Training (VEIN) strategy that encourages the extracted embeddings to preserve fine-grained visual features. Extensive experiments demonstrate the effectiveness of our method, which achieves superior performance on various downstream multimodal datasets.
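As a rough illustration of what a multi-granularity aligned contrastive objective could look like in practice, the PyTorch sketch below pairs a coarse InfoNCE term over pooled global embeddings with a fine, token-level late-interaction term over image patches and text tokens. This is only a minimal sketch under our own assumptions: the abstract does not specify the MAC loss, and every name here (`info_nce`, `mac_loss`, the mean/max pooling choices, and the `alpha` and `temperature` hyperparameters) is illustrative rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature  # (B, B); matched pairs on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def mac_loss(img_tokens, txt_tokens, alpha=0.5, temperature=0.07):
    """Hypothetical two-granularity contrastive loss.

    img_tokens: (B, Nv, D) patch-level features from the vision side
    txt_tokens: (B, Nt, D) token-level features from the language side
    """
    # Coarse granularity: contrast mean-pooled global embeddings.
    coarse = info_nce(img_tokens.mean(dim=1), txt_tokens.mean(dim=1),
                      temperature)
    # Fine granularity: late interaction -- each text token is matched
    # to its most similar image patch, then scores are pooled per pair.
    iv = F.normalize(img_tokens, dim=-1)
    tv = F.normalize(txt_tokens, dim=-1)
    sim = torch.einsum('bmd,cnd->bcmn', tv, iv)       # (B, B, Nt, Nv)
    pair_scores = sim.max(dim=-1).values.mean(dim=-1)  # (B, B)
    logits = pair_scores / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    fine = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
    return alpha * coarse + (1 - alpha) * fine
```

Calling `mac_loss(img_tokens, txt_tokens)` on batched (B, N, D) features from the two modalities yields a single scalar that trades off global against token-level alignment via `alpha`.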

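The VEIN strategy is described only at a high level, so the sketch below shows one hypothetical way to "invert" an embedding: a small linear head reconstructs the vision tower's patch features from the pooled embedding, and an MSE penalty pushes the embedding to retain fine-grained visual detail. The `InversionHead` module, `vein_loss`, and the choice of reconstruction target are all our own assumptions for illustration, not the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InversionHead(nn.Module):
    """Maps a pooled embedding (B, D) back to patch features (B, Nv, D)."""
    def __init__(self, dim, num_patches):
        super().__init__()
        self.num_patches = num_patches
        self.proj = nn.Linear(dim, num_patches * dim)

    def forward(self, pooled):
        b, d = pooled.shape
        return self.proj(pooled).view(b, self.num_patches, d)

def vein_loss(pooled_emb, patch_feats, head):
    """Penalize embeddings from which the original visual features
    cannot be reconstructed, so fine-grained detail is preserved."""
    recon = head(pooled_emb)                    # (B, Nv, D)
    return F.mse_loss(recon, patch_feats.detach())  # target is held fixed
```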