Diffusion-based Vision-Language-Action (VLA) models offer faster inference and can handle the action multi-modality problem in robot manipulation tasks, compared with traditional autoregressive models, after large-scale pre-training and post-training. However, diffusion-based VLA models have been found to follow instructions poorly, and after fine-tuning on multiple tasks they often suffer from "skill forgetting" caused by conflicting model weights across tasks. To address this problem, we propose DiTEA, a Diffusion Transformer-based Mixture-of-Experts (MoE) VLA model. Specifically, DiTEA fuses an MoE module into the action head of the VLA to form an Action MoE. In addition, we design a Task-Instruction Gate, which uses the language instruction to select the experts that specialize in the given task, improving the VLA's instruction-following ability. We conduct comprehensive experiments and ablation studies to evaluate the efficacy of our model under different designs. Experimental results in simulation and the real world show that DiTEA achieves clear multi-task improvements over the baseline and other VLAs.
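The abstract describes the architecture only at a high level. As a minimal illustrative sketch, the PyTorch snippet below shows one plausible way an instruction-conditioned gate and Action MoE block could be wired together; all class names, shapes, and hyperparameters (TaskInstructionGate, ActionMoE, top_k, and so on) are our assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskInstructionGate(nn.Module):
    """Hypothetical gate: routes to action experts from the language instruction.

    Assumption: the gate sees a pooled instruction embedding (e.g. from a
    frozen text encoder) and emits top-k expert weights per batch element.
    """

    def __init__(self, instr_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Maps the pooled instruction embedding to per-expert logits.
        self.router = nn.Linear(instr_dim, num_experts)

    def forward(self, instr_emb: torch.Tensor):
        # instr_emb: (batch, instr_dim)
        logits = self.router(instr_emb)                       # (batch, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # renormalize over top-k
        return weights, indices


class ActionMoE(nn.Module):
    """Hypothetical MoE feed-forward block for a diffusion-transformer action head."""

    def __init__(self, hidden_dim: int, instr_dim: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.GELU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])
        self.gate = TaskInstructionGate(instr_dim, num_experts, top_k)

    def forward(self, x: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, hidden_dim) action-token features.
        weights, indices = self.gate(instr_emb)               # each (batch, top_k)
        out = torch.zeros_like(x)
        # Combine the top-k experts selected by the instruction for each sample.
        for b in range(x.shape[0]):
            for k in range(weights.shape[-1]):
                expert = self.experts[indices[b, k].item()]
                out[b] += weights[b, k] * expert(x[b])
        return out


# Usage sketch with made-up dimensions:
moe = ActionMoE(hidden_dim=256, instr_dim=512)
x = torch.randn(2, 16, 256)    # action tokens from the diffusion transformer
instr = torch.randn(2, 512)    # pooled language-instruction embedding
y = moe(x, instr)              # (2, 16, 256)
```

Routing on the pooled instruction embedding, rather than per token, matches the abstract's stated goal of binding experts to the task named by the instruction; a per-token router would be the more conventional MoE alternative.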
