Singapore

Multimodal learning often relies on aligning representations across modalities to enable effective information integration, an approach traditionally assumed to be universally beneficial. However, prior research has primarily taken an observational approach, examining naturally occurring alignment in multimodal data and exploring its correlation with model performance, without systematically studying the direct effects of explicitly enforced alignment between representations of different modalities. In this work, we investigate how explicit alignment influences both model performance and representation alignment under different modality-specific information structures. Specifically, we introduce a controllable contrastive learning module that enables precise manipulation of alignment strength during training, allowing us to explore when explicit alignment improves or hinders performance. Our results on synthetic and real datasets with different data characteristics show that the impact of explicit alignment on the performance of unimodal models depends on the data: the optimal level of alignment depends on the amount of redundancy between the modalities. For data with mixed information distributions, we can find an optimal alignment strength that balances modality-specific signals and shared redundancy. This work can guide practitioners on when and how to enforce alignment for optimal unimodal encoder performance.
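The abstract does not spell out how the controllable contrastive module is built, so the following is only a minimal sketch of one common way to expose an alignment-strength knob: a symmetric InfoNCE term between paired embeddings, scaled by a scalar weight and added to the task loss. The function names, the `temperature` value, and the choice of InfoNCE are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Row i of z_a and row i of z_b are treated as a positive pair;
    all other rows in the batch act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # shape: (batch, batch)

    def cross_entropy_diag(l):
        # Numerically stable row-wise log-softmax; positives sit on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def total_loss(task_loss, z_a, z_b, alignment_strength):
    """Task loss plus an alignment term scaled by a hypothetical strength knob."""
    return task_loss + alignment_strength * info_nce(z_a, z_b)

# Toy check: perfectly paired embeddings versus deliberately mismatched ones.
eye = np.eye(4)
aligned = info_nce(eye, eye)
misaligned = info_nce(eye, np.roll(eye, 1, axis=0))
```

Setting `alignment_strength` to zero recovers the unaligned baseline, and sweeping it upward is one way to probe where enforced alignment starts to help or hurt.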

AAAI 2026

To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance

multimodal learning

representation learning

optimization

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Text-to-video generation models have shown significant progress in recent years. However, they still struggle with compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Inspired by effective human creative workflow, we propose GENMAC, a multi-agent collaboration framework that enables compositional text-to-video generation. The framework incorporates a three-stage collaborative workflow: DESIGN, GENERATION, and REDESIGN, with an iterative loop between the latter two stages to progressively verify and refine the generated videos. In the DESIGN stage, a large language model (Design Agent) plans objects with layouts, and then a video generation model synthesizes videos in the GENERATION stage. The REDESIGN stage is the most challenging stage that aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of single-agent and naive multi-agent frameworks, we apply a division-of-labor strategy in this stage by introducing a sequence of specialized agents, executed by MLLMs (multimodal large language models): Verification Agent, Suggestion Agent, Correction Agent, and Output Structuring Agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a suite of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GENMAC by generating videos based on long compositional text prompts and achieving state-of-the-art in the compositional text-to-video generation benchmark.

GENMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
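The control flow described in the GENMAC abstract (one DESIGN step, then an iterative loop between GENERATION and REDESIGN) can be sketched roughly as below. Everything here is hypothetical scaffolding: the function name, the callable signatures, the `max_rounds` cap, and the toy stand-in agents are assumptions, and the real agents would be an LLM, a video generator, and a chain of MLLM-driven verification and correction agents.

```python
from typing import Any, Callable, Tuple

def genmac_style_loop(
    prompt: str,
    design: Callable[[str], Any],
    generate: Callable[[Any], Any],
    verify: Callable[[str, Any], bool],
    redesign: Callable[[str, Any, Any], Any],
    max_rounds: int = 3,
) -> Tuple[Any, int]:
    """Design once, then alternate generation and redesign until verified."""
    plan = design(prompt)                      # DESIGN stage
    video = None
    for round_idx in range(1, max_rounds + 1):
        video = generate(plan)                 # GENERATION stage
        if verify(prompt, video):              # REDESIGN begins with verification
            return video, round_idx
        plan = redesign(prompt, plan, video)   # corrected plan for the next round
    return video, max_rounds                   # best effort after the final round

# Toy stand-ins (purely illustrative) so the loop can be exercised.
def toy_design(prompt):
    return {"prompt": prompt, "revision": 0}

def toy_generate(plan):
    return f"video-r{plan['revision']}"

def toy_verify(prompt, video):
    return video == "video-r2"

def toy_redesign(prompt, plan, video):
    return {**plan, "revision": plan["revision"] + 1}

result = genmac_style_loop(
    "a red ball rolls past a blue cube",
    toy_design, toy_generate, toy_verify, toy_redesign,
)
```

The abstract's self-routing mechanism would live inside `redesign`, selecting one of several specialized correction agents; it is omitted here because the abstract gives no detail on how routing is decided.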

Gaze redirection methods aim to generate realistic human face images with controllable eye movement. However, recent methods often struggle with 3D consistency, efficiency, or quality, limiting their practical applications. In this work, we propose RTGaze, a real-time and high-quality gaze redirection method. Our approach learns a gaze-controllable facial representation from face images and gaze prompts, then decodes this representation via neural rendering for gaze redirection. Additionally, we distill face geometric priors from a pretrained 3D portrait generator to enhance generation quality. We evaluate RTGaze both qualitatively and quantitatively, demonstrating state-of-the-art performance in efficiency, redirection accuracy, and image quality across multiple datasets. Our system achieves real-time, 3D-aware gaze redirection with a feedforward network (~0.06 sec/image), making it 800× faster than the previous state-of-the-art 3D-aware methods. We will release the code to facilitate future research.

RTGaze: Real-Time 3D-Aware Gaze Redirection from a Single Image

Long-tail motion forecasting is a core challenge for autonomous driving, where rare yet safety-critical events, such as abrupt maneuvers and dense multi-agent interactions, dominate real-world risk. Existing approaches struggle in these scenarios because they rely on either non-interpretable clustering or model-dependent error heuristics, providing neither a differentiable notion of “tailness” nor a mechanism for rapid adaptation. We propose SAML, a Semantic-Aware Meta-Learning framework that introduces the first differentiable definition of tailness for motion forecasting. SAML quantifies motion rarity via semantically meaningful intrinsic (kinematic, geometric, temporal) and interactive (local and global risk) properties, which are fused by a Bayesian Tail Perceiver into a continuous, uncertainty-aware Tail Index. This Tail Index drives a meta-memory adaptation module that couples a dynamic prototype memory with an MAML-based cognitive set mechanism, enabling fast adaptation to rare or evolving patterns. Experiments on nuScenes, NGSIM, and HighD show that SAML achieves state-of-the-art overall accuracy and substantial gains on top 1-5% worst-case events, while maintaining high efficiency. Our findings highlight semantic meta-learning as a pathway toward robust and safety-critical motion forecasting.

Differentiable Semantic Meta-Learning Framework for Long-Tail Motion Forecasting in Autonomous Driving

Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre-training. The development of effective strategies is crucial in this context. In this paper, we propose a novel prototype library-driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P-SLCR). First, we propose a Consistent Structure Learning to establish structural feature learning between consistent points and the library of consistent prototypes by selecting high-quality features. Second, we propose a Semantic Relation Consistent Reasoning that constructs a prototype inter-relation matrix between consistent and ambiguous prototype libraries separately. This process ensures the preservation of semantic consistency by imposing constraints on consistent and ambiguous prototype libraries through the prototype inter-relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and ScanNet datasets, achieving the best performance compared to unsupervised methods. Specifically, an mIoU of 47.1% is achieved for Area-5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%. The source code will be available soon.

P-SLCR: Unsupervised Point Cloud Semantic Segmentation via Prototypes Structure Learning and Consistent Reasoning

Communication efficiency in federated learning (FL) remains a critical challenge in resource-constrained environments. While prototype-based FL reduces communication overhead by sharing class prototypes (mean activations in the penultimate layer) instead of model parameters, its efficiency degrades with larger feature dimensions and class counts. We propose TinyProto, which addresses these limitations through Class-wise Prototype Sparsification (CPS) and Adaptive Prototype Scaling (APS). CPS enables structured sparsity by allocating specific dimensions to class prototypes and transmitting only non-zero elements, thereby achieving higher communication efficiency. APS, in turn, scales prototypes based on class distributions, thereby improving performance. Our experiments demonstrate that TinyProto reduces communication costs by up to 10× compared to existing methods while improving performance. Beyond communication efficiency, TinyProto offers crucial advantages: it achieves compression without client-side computational overhead and supports heterogeneous architectures, making it particularly suitable for resource-constrained heterogeneous FL scenarios.

Communication-Efficient Heterogeneous Federated Learning with Sparse Prototypes in Resource-Constrained Environments
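To make the two ideas in the TinyProto abstract concrete (prototypes as per-class mean activations, and sending only the entries in each class's allocated dimensions), here is a small sketch. The contiguous equal-block allocation in `allocate_dims`, the post-hoc masking in `sparsify`, and all function names are assumptions chosen for illustration; the abstract does not say how dimensions are allocated, and Adaptive Prototype Scaling is not shown.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Mean feature vector per class (zeros for classes absent on this client)."""
    protos = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(axis=0)
    return protos

def allocate_dims(num_classes, feature_dim):
    """Hypothetical allocation: equal-sized, non-overlapping contiguous blocks."""
    block = feature_dim // num_classes
    return [np.arange(c * block, (c + 1) * block) for c in range(num_classes)]

def sparsify(protos, dims):
    """Keep only each class's allocated entries; this is the payload to transmit."""
    return [protos[c, idx] for c, idx in enumerate(dims)]

def densify(payload, dims, feature_dim):
    """Receiver side: rebuild dense prototypes, with zeros outside allocated dims."""
    out = np.zeros((len(payload), feature_dim))
    for c, (vals, idx) in enumerate(zip(payload, dims)):
        out[c, idx] = vals
    return out

# Toy client: 20 samples, 8-dimensional penultimate features, 4 classes.
rng = np.random.default_rng(0)
features = rng.standard_normal((20, 8))
labels = np.arange(20) % 4
protos = class_prototypes(features, labels, num_classes=4)
dims = allocate_dims(num_classes=4, feature_dim=8)
payload = sparsify(protos, dims)
restored = densify(payload, dims, feature_dim=8)
```

In this toy setup the payload holds 8 values instead of the 32 in the dense prototype matrix. That ratio is an artifact of the chosen sizes and says nothing about the "up to 10×" figure reported in the abstract.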

Real-time stereo matching methods primarily focus on enhancing in-domain performance but often overlook the critical importance of generalization in real-world applications. In contrast, recent stereo foundation models leverage monocular foundation models (MFMs) to improve generalization, but typically suffer from substantial inference latency. To address this trade-off, we propose GGEV, a novel real-time stereo matching network that achieves strong generalization. We first extract depth-aware features that encode domain-invariant structural priors as guidance for cost aggregation. Subsequently, we introduce a Depth-aware Dynamic Cost Aggregation (DDCA) module that adaptively incorporates these priors into each disparity hypothesis, effectively enhancing fragile matching relationships in unseen scenes. Both steps are lightweight and complementary, leading to the construction of a generalized geometry encoding volume with strong generalization capability. Experimental results demonstrate that our GGEV surpasses all existing real-time methods in zero-shot generalization capability, and achieves state-of-the-art performance on the KITTI 2012, KITTI 2015, and ETH3D benchmarks.

Generalized Geometry Encoding Volume for Real-time Stereo Matching

Private data holds promise for improving LLMs due to its high quality, but its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, transformer-based federated split models have been proposed, which offload most model parameters to the server (or distributed clients) while retaining only a small portion on the client to ensure data privacy. Despite this design, they still face three challenges: 1) Peer-to-peer key encryption struggles to secure transmitted vectors effectively; 2) The auto-regressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) Fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FedSEA-LLaMA, a Secure, Efficient, and Adaptive Federated splitting framework based on LLaMA2. First, we inject Gaussian noise into forward-pass hidden states to enable secure end-to-end vector transmission. Second, we employ attention-mask compression and KV cache collaboration to reduce communication costs, accelerating training and inference. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements. Experiments on natural language understanding, summarization, and conversational QA tasks show that FedSEA-LLaMA maintains performance comparable to centralized LLaMA2 and achieves up to 8× speedups in training and inference. Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FedSEA-LLaMA in security and adaptability.

FedSEA-LLaMA: A Secure, Efficient and Adaptive Federated Splitting Framework for Large Language Models
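The first mechanism in the FedSEA-LLaMA abstract, injecting Gaussian noise into forward-pass hidden states before they leave the client, can be illustrated with a few lines of NumPy. This is only a sketch: the function name, the noise scale `sigma`, and the tensor shape are assumptions, and the abstract does not state where in the network the noise is applied or how its magnitude is chosen.

```python
import numpy as np

def add_gaussian_noise(hidden_states, sigma, rng):
    """Return hidden states perturbed by zero-mean Gaussian noise with std sigma."""
    noise = rng.normal(loc=0.0, scale=sigma, size=hidden_states.shape)
    return hidden_states + noise

rng = np.random.default_rng(0)
# Hypothetical client-side activations: (batch, tokens, hidden_dim).
hidden = rng.standard_normal((2, 4, 8))
# The perturbed tensor is what would be transmitted to the server-side blocks.
noisy = add_gaussian_noise(hidden, sigma=0.1, rng=rng)
```

A larger `sigma` hides more of the original activations but also perturbs what the downstream blocks see, so the scale is a privacy versus utility trade-off that would need to be tuned.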

With the rapid integration of large language models (LLMs) into medical decision-support aids, ensuring reliability in reasoning steps—not just final answers—is increasingly critical. Two key safety dimensions are Chain-of-Thought (CoT) faithfulness, which assesses alignment of the model’s reasoning process with both its response and medical facts, and sycophancy, an emergent misalignment where models follow misleading cues instead of factual correctness. Yet existing benchmarks tend to prioritize performance evaluation, frequently collapsing nuanced safety vulnerabilities into a single accuracy score. To fill this gap, we introduce MedOmni-45°, a benchmark and evaluation workflow explicitly designed to quantify the safety–performance trade-off in LLMs under manipulative hint conditions. The benchmark contains 1,804 reasoning-focused medical questions across six clinical specialties and three task types, including 500 publicly comparable items from MedMCQA. Each question is systematically augmented with seven manipulative hint types, each embedding two distinct misleading cue variants, along with a No-Hint baseline, resulting in approximately 27,000 unique inputs. These inputs are then evaluated across seven LLMs spanning open- and closed-source, general-purpose and medical-specific, and base versus reasoning-enhanced variants, amounting to over 189K total inference instances. Three orthogonal metrics (Accuracy, CoT-Faithfulness, Anti-Sycophancy) are combined into a composite score visualized via a 45° safety–performance plot. Results reveal a universal trade-off, with no model surpassing the ideal diagonal. Open-source QwQ-32B approaches closest at 43.81°, demonstrating notable safety while not surpassing others in performance. MedOmni-45° thus highlights critical vulnerabilities of LLMs in reasoning-oriented medical tasks, offering a robust benchmark for future alignment research.

MedOmni-45°: A Safety–Performance Benchmark for Reasoning-Oriented LLMs in Medicine
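The dataset sizes quoted in the MedOmni-45° abstract can be checked with simple arithmetic. The sketch below assumes that every hint variant yields exactly one input per question and that every model is run on every input; the variable names are illustrative.

```python
questions = 1804
hint_types = 7
variants_per_hint = 2
models = 7

# 14 hinted conditions plus the single No-Hint baseline.
conditions_per_question = hint_types * variants_per_hint + 1
unique_inputs = questions * conditions_per_question
inference_instances = unique_inputs * models

print(unique_inputs)        # 27060
print(inference_instances)  # 189420
```

Under these assumptions the totals agree with the abstract's "approximately 27,000 unique inputs" and "over 189K total inference instances".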

Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task.
Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma:
The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion.
This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects.
To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 
1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content;
2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements.
By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity.
Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency. Code will be publicly available at https://github.com/Mem4D/Mem4D.

Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction

Continual learning for action recognition is a critical capability for next-generation Extended Reality (XR) systems. Yet it faces a severe real-world challenge: strict user privacy that prohibits data rehearsal. While recent prompt-based continual learning methods show promise, we argue their flat, single-granularity design is structurally mismatched to the complexity of human actions. This monolithic architecture fails to model the inherent hierarchical structure of individual actions and overlooks standard action primitives shared across tasks, resulting in suboptimal performance and hindered knowledge transfer. To overcome this limitation, we propose DPCA, a novel spatio-temporal continual learning framework with multi-granularity adaptive prompting. DPCA learns three synergistic components to resolve this mismatch. First, the task-specific prompter employs a multi-granularity query system to capture the unique, compositional semantics of each action. Second, the task-agnostic prompter learns a globally shared vocabulary of "action primitives," providing a stable and generalizable knowledge base to mitigate catastrophic forgetting. Furthermore, we introduce a Dissimilarity Attention Rectification at each granularity level, which leverages a reverse attention mechanism to model class-agnostic background information, effectively alleviating overfitting. The synergy between these components enables robust model adaptation without requiring access to past data. Rigorous experiments on the NTU RGB+D benchmark, under a strict rehearsal-free, few-shot protocol, confirm that DPCA establishes a new state-of-the-art, advancing the realization of brilliant and privacy-respecting XR systems.
