Singapore

The growing exploration of Large Language Models (LLM) and Vision-Language Models (VLM) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on the guidance of control policy and encounter the challenge of limited representations of the backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL, which can simultaneously extract semantic and motion representations through a dual-path backbone from the RGB flows. Semore utilizes VLM with common-sense knowledge to retrieve key information from observations, while using the pre-trained clip to achieve the text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our method adopts a separately supervised approach to guide the extraction of semantics and motion respectively, while allowing them to interact spontaneously to enhance each other. A series of experiments demonstrate that, under the guidance of VLM at the feature level, our method exhibits efficient and adaptive ability compared to state-of-art approaches. All source code will be made publicly available.

AAAI 2026

Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive generalization on vision-language tasks by aligning images and short texts. However, its inherent 77-token length constraint limits the capacity to capture complex semantics in long captions. Existing long-text adaptations for CLIP typically rely on either multi-stage training or truncation-based alignment, both inevitably resulting in semantic degradation and cumbersome tuning. Therefore, we propose One-Stage Long-Text Adaptation for CLIP (OneLIP), a unified framework that extends CLIP to understand long captions within a single training stage, eliminating the need for brittle truncation or multi-stage pipelines. OneLIP addresses semantic degradation by introducing two key innovations: (1) Token Refinement and Importance-guided Modeling (TRIM) module, which selects and refines informative tokens via SVD-based contribution scoring and cross-modal relevance modeling; (2) Per-sample Online Hard Negative Mining (PO-HNM) strategy dynamically maintains sample-specific negatives based on dual-consistency difficulty tracking, which is superior to other strategies in long-text scenarios where key semantics are distributed in more scattered positions. Extensive experiments on long-text image retrieval, short-text image retrieval, zero-shot classification, and text-to-image generation demonstrate OneLIP's robustness and versatility across diverse input lengths, offering a semantically faithful solution for long-text adaptation for CLIP.

OneLIP: Unlocking and Improving Long-Text Representations of CLIP via One-Stage Adaptation

Assessing enterprise resilience under uncertainty necessitates capturing both intrinsic attributes and evolving inter-enterprise dependencies. However, real-world enterprise systems pose substantial structural challenges: redundant or loosely correlated links can trigger spurious relational inferences, while missing or latent dependencies often hinder the propagation of informative signals. Moreover, most existing approaches adopt static graph priors or decouple structural refinement from semantic learning, lacking a co-evolutionary paradigm that allows structure and representation to inform one another. We propose **CFU**, a novel **C**o-evolving **F**ramework under **U**ncertainty, which reconceptualizes graph structure as a dynamic and learnable component evolving alongside node semantics. CFU begins with a structure-aware contrastive pretraining phase to distill latent relational semantics without supervision. It then performs bidirectional structural refinement, filtering structurally redundant edges through semantic agreement scoring, and uncovering temporally contingent, task-relevant dependencies via similarity-guided inference. These operations are integrated through a dynamic fusion process that continuously aligns the evolving topology with the resilience objective. By embedding structural adaptation within the learning loop, CFU enables context-aware resilience assessment across incomplete, ambiguous, and structurally volatile enterprise environments. Extensive experiments on real-world datasets demonstrate its superior performance across diverse evaluation scenarios.

Beyond Graph Priors: A Co-Evolving Framework Under Uncertainty for Enterprise Resilience Assessment

World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed of multiple objects with rich interactions since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision-making. Motivated by this insight, we propose *Slot Transformer Imagination with CAusality-aware reinforcement learning* (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. STICA represents each observation as a set of object-centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token-level dynamics and interactions. The policy and value networks then estimate token-level cause--effect relations and use them in the attention layers, yielding causality-guided decision-making. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.

Object-Centric World Models for Causality-Aware Reinforcement Learning

Large language models (LLMs) exhibit logically inconsistent hallucinations that appear coherent yet violate reasoning principles, with recent research suggesting an inverse relationship between causal reasoning capabilities and such hallucinations. However, existing reasoning approaches in LLMs, such as Chain-of-Thought (CoT) and its graph-based variants, operate at the linguistic token level rather than modeling the underlying causal relationships between variables, lacking the ability to represent conditional independencies or satisfy causal identification assumptions. To bridge this gap, we introduce causal-DAG construction and reasoning (CDCR-SFT), a supervised fine-tuning framework that trains LLMs to explicitly construct variable-level directed acyclic graph (DAG) and then perform reasoning over it. Moreover, we present a dataset comprising 25,368 samples (CausalDR), where each sample includes an input question, explicit causal DAG, graph-based reasoning trace, and validated answer. Experiments on four LLMs across eight tasks show that CDCR-SFT improves the causal reasoning capability with the state-of-the-art 95.33% accuracy on CLADDER (surpassing human performance of 94.8\% for the first time) and reduces the hallucination on HaluEval with 10% improvements. It demonstrates that explicit causal structure modeling in LLMs can effectively mitigate logical inconsistencies in LLM outputs. Code at https://github.com/MrLYG/CDCR-SFT.

Mitigating Hallucinations in Large Language Models via Causal Reasoning

Although Generative Flow Networks (GFlowNets) are designed to capture multiple modes of a reward function, they often suffer from mode collapse in practice, getting trapped in early-discovered modes and requiring prolonged training to find diverse solutions. Existing exploration techniques often rely on heuristic novelty signals. We propose Loss-Guided GFlowNets (LGGFN), a novel approach where an auxiliary GFlowNet's exploration is directly driven by the main GFlowNet's training loss. By prioritizing trajectories where the main model exhibits high loss, LGGFN focuses sampling on poorly understood regions of the state space. This targeted exploration significantly accelerates the discovery of diverse, high-reward samples. Empirically, across diverse benchmarks including grid environments, structured sequence generation, Bayesian structure learning, and biological sequence design, LGGFN consistently outperforms baselines in exploration efficiency and sample diversity..For instance, on a challenging sequence generation task, it discovered over 40 times more unique valid modes while simultaneously reducing the exploration error metric by approximately 99\%.

Loss-Guided Auxiliary Agents for Overcoming Mode Collapse in GFlowNets

With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal contents. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes’ implicit meanings and background knowledge. To address these challenges, we propose $\textit{MemoDetector}$, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme contents and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, $\textit{MemoDetector}$ improves F1 scores by 4.3\% on MET-MEME and 3.4\% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.

Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion

Text-to-Image (T2I) generation models have achieved significant advancements. Correspondingly, many automated methods emerge to evaluate the image-text alignment capabilities of generative models. However, the performance comparison among these automated methods is constrained by the limited scale of existing datasets. Additionally, existing datasets lack the capacity to assess the performance of automated methods at a fine-grained level. In this study, we contribute an EvalMuse-40K dataset, gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. In the construction process, we employ various strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of our dataset. This allows us to comprehensively evaluate the performance of image-text alignment methods for T2I models. Based on this dataset, we introduce an efficient automated evaluation method termed FGA-BLIP2, which enables Fine-Grained Alignment evaluation solely by inputting images and text leveraging BLIP2, without visual question answering for each fine-grained element. Experimental results show the proposed FGA-BLIP2 efficiently achieves good performance on multiple image-text alignment datasets. Meanwhile, benefiting from the high efficiency and fine-grained evaluation capability of FGA-BLIP2, we apply it as a reward model to improve text-to-image models, which effectively enhances the image-text alignment ability of text-to-image models.

EvalMuse-40K: A Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Alignment Evaluation

Log parsing converts semi-structured logs into structured templates, forming a critical foundation for downstream analysis. Traditional syntax and semantic-based parsers often struggle with semantic variations in evolving logs and data scarcity stemming from their limited domain coverage. Recent large language model (LLM)-based parsers leverage in-context learning (ICL) to extract semantics from examples, demonstrating superior accuracy. However, LLM-based parsers face two main challenges: 1) underutilization of ICL capabilities, particularly in dynamic example selection and cross-domain generalization, leading to inconsistent performance; 2) time-consuming and costly LLM querying. To address these challenges, we present MicLog, the first progressive meta in-context learning (ProgMeta-ICL) log parsing framework that combines meta-learning with ICL on small open-source LLMs (i.e., Qwen-2.5-3B). Specifically, MicLog: i) enhances LLMs' ICL capability through a zero-shot to k-shot ProgMeta-ICL paradigm, employing weighted DBSCAN candidate sampling and enhanced BM25 demonstration selection; ii) accelerates parsing via a multi-level pre-query cache that dynamically matches and refines recently parsed templates. Evaluated on Loghub-2.0, MicLog achieves 10.3\% higher parsing accuracy than the state-of-the-art parser while reducing parsing time by 42.4\%.

MicLog: Towards Accurate and Efficient LLM-based Log Parsing via Progressive Meta In-Context Learning

Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters.
By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.

Premium content

Next from AAAI 2026

OneLIP: Unlocking and Improving Long-Text Representations of CLIP via One-Stage Adaptation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES