CLIP is a seminal multimodal model that maps images and text into a shared representation space by contrastive learning on billions of image–caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the superior linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly in handling long, complex captions. We introduce an efficient fine-tuning framework that embeds an LLM into a pretrained CLIP while incurring almost the same training cost as regular CLIP fine-tuning. Our method first "embedding-izes" the LLM for the CLIP setting, then couples it to the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image–caption pairs. With this strategy we achieve large performance gains over state-of-the-art CLIP variants such as EVA02 and SigLIP-2, without large-scale retraining. The LLM-enhanced CLIP delivers consistent improvements across a wide spectrum of downstream tasks, including linear-probe classification, zero-shot image–text retrieval with both short and long captions (in English and other languages), zero-shot and supervised image segmentation, object detection, and use as a tokenizer in multimodal large-model benchmarks.
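To make the training setup concrete, the sketch below illustrates one way the described coupling could look: frozen caption embeddings from an "embedding-ized" LLM are passed through a lightweight adaptor into the CLIP embedding space and trained against the vision encoder with the standard symmetric contrastive loss. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of LLM-enhanced CLIP fine-tuning (assumed architecture, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionAdaptor(nn.Module):
    """Lightweight MLP mapping frozen LLM caption embeddings into CLIP's space (hypothetical design)."""

    def __init__(self, llm_dim: int, clip_dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(llm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, clip_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Toy stand-ins: in practice the CLIP vision tower would be loaded pretrained,
# and the LLM caption embeddings would come from a frozen, embedding-ized LLM.
batch, llm_dim, clip_dim = 8, 4096, 768
vision_encoder = nn.Linear(3 * 224 * 224, clip_dim)   # placeholder for the CLIP vision encoder
adaptor = CaptionAdaptor(llm_dim, clip_dim)            # the only newly trained text-side module

images = torch.randn(batch, 3 * 224 * 224)             # dummy image batch (flattened)
llm_caption_embeddings = torch.randn(batch, llm_dim)   # frozen LLM outputs for the paired captions

img_emb = vision_encoder(images)
txt_emb = adaptor(llm_caption_embeddings)
loss = clip_contrastive_loss(img_emb, txt_emb)
loss.backward()
print(f"contrastive loss: {loss.item():.4f}")
```

Because gradients flow only through the adaptor (and, optionally, the vision encoder) while the LLM stays frozen, such a setup keeps the fine-tuning cost close to that of a regular CLIP fine-tune, which is the efficiency property the abstract emphasizes.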
