Cross-lingual, cross-task transfer is challenged by task-specific data scarcity, which becomes more severe as language support grows; both challenges are amplified in vision-language models (VLMs). We investigate multilingual generalization in encoder-decoder transformer VLMs to enable zero-shot image captioning in a language that was paired only with machine translations during training. In this setting, the encoder must learn to generate generalizable, latent task-aware vision representations that instruct the decoder via inserted cross-attention layers. We study scaling laws by training models based on Florence-2 and Gemma-2, ranging from 0.4B to 11.2B parameters, on a synthetic dataset under varying compute budgets. While all languages in the dataset have image-aligned translations, only a subset of them includes image captions. Notably, we show that captioning can emerge in a language after training on only translation data. We find that this indirect learning of unseen task-language pairs adheres to scaling laws governed by the model's multilinguality, its size, and the number of training samples seen. Finally, we demonstrate that these scaling laws extend to a variety of downstream tasks, achieving competitive performance after finetuning on multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).
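To make the architectural idea concrete, below is a minimal sketch of a decoder block with an inserted cross-attention layer, where text queries attend to vision-encoder states. All names (DecoderBlockWithXAttn, d_model, vision_states) are illustrative assumptions, not the paper's actual Florence-2/Gemma-2 wiring, which this sketch simplifies considerably.

```python
import torch
import torch.nn as nn

class DecoderBlockWithXAttn(nn.Module):
    """Simplified decoder block: self-attention, inserted cross-attention, MLP."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inserted cross-attention layer: this is where the vision encoder's
        # latent, task-aware representations condition text generation.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(
        self,
        x: torch.Tensor,                    # (batch, text_tokens, d_model)
        vision_states: torch.Tensor,        # (batch, vision_tokens, d_model)
        causal_mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        # Causal self-attention over the text sequence.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        # Cross-attention: queries come from text, keys/values from the encoder.
        h = self.norm2(x)
        x = x + self.cross_attn(h, vision_states, vision_states, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

# Usage: condition 16 text tokens on 64 vision tokens.
block = DecoderBlockWithXAttn()
text = torch.randn(2, 16, 512)
vision = torch.randn(2, 64, 512)
out = block(text, vision)  # -> shape (2, 16, 512)
```

The design choice of routing vision information exclusively through these inserted layers means the decoder never sees images directly; any zero-shot transfer to an unseen task-language pair must flow through the encoder's shared latent representations.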
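The abstract does not state the functional form of the fitted scaling laws. Purely as an illustration, a common choice in the scaling-law literature is a Chinchilla-style power law, here with the model's multilinguality entering through the fitted constants; the symbols below are assumptions for exposition, not the paper's notation:

$$\hat{L}(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}},$$

where $N$ is the parameter count, $D$ is the number of training samples seen, and $E, A, B, \alpha, \beta$ would be fitted separately per multilinguality setting (e.g., per number of caption-supervised languages).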