Singapore

Structured sparsity has emerged as a popular model pruning technique, widely adopted in various architectures, including CNNs, Transformer models, and especially large language models (LLMs) in recent years. A promising direction to further improve post-pruning performance is weight permutation, which reorders model weights into patterns more amenable to pruning. However, the exponential growth of the permutation search space with the scale of Transformer architectures forces most methods to rely on greedy or heuristic algorithms, limiting the effectiveness of reordering.


In this work, we propose a novel end-to-end learnable permutation framework. Our method introduces a learnable permutation cost matrix to quantify the cost of swapping any two input channels of a given weight matrix, a differentiable bipartite matching solver to obtain the optimal binary permutation matrix given a cost matrix, and a sparsity optimization loss function to directly optimize the permutation operator.
We extensively validate our approach on vision and language Transformers, demonstrating that our method achieves state-of-the-art permutation results for structured sparsity.
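
To make the role of permutation concrete, here is a minimal sketch of why reordering input channels can help structured pruning. It is not the proposed framework: the learnable cost matrix and differentiable matching solver are not reproduced. It assumes a 2:4 sparsity pattern (keep the two largest-magnitude weights in every group of four), a toy weight matrix `W`, and exhaustive search in place of any learned solver. The helper names `prune_2_of_4` and `retained_magnitude` are hypothetical.

```python
from itertools import permutations

def prune_2_of_4(row):
    """Keep the 2 largest-magnitude weights in each group of 4; zero the rest."""
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes within this group.
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

def retained_magnitude(weights, perm):
    """Total |w| that survives 2:4 pruning after reordering input channels by perm."""
    total = 0.0
    for row in weights:
        permuted = [row[j] for j in perm]
        total += sum(abs(w) for w in prune_2_of_4(permuted))
    return total

# Toy 2x8 weight matrix (rows = output channels, columns = input channels).
# The four large channels sit in the same group, so 2:4 pruning must drop two of them.
W = [
    [0.9, 0.8, 0.7, 0.6, 0.1, 0.1, 0.1, 0.1],
    [0.9, 0.8, 0.7, 0.6, 0.1, 0.1, 0.1, 0.1],
]

identity = tuple(range(8))
baseline = retained_magnitude(W, identity)

# Exhaustive search over all 8! = 40320 channel orders; feasible only at toy sizes.
best_perm = max(permutations(range(8)), key=lambda p: retained_magnitude(W, p))
best = retained_magnitude(W, best_perm)

print(f"retained magnitude without permutation: {baseline:.2f}")
print(f"retained magnitude with best permutation: {best:.2f}")
```

On this toy matrix the identity order retains a total magnitude of about 3.8, while the best order spreads the large channels across both groups and retains about 6.0. The factorial search used here is exactly what becomes infeasible at Transformer scale, which is the motivation the abstract gives for a learnable alternative.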

AAAI 2026

Learnable Permutation for Structured Sparsity on Transformer Models

learnable permutation

structured sparsity

transformers

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Recent advances in model compression have highlighted the potential of low-bit precision techniques, with Binary Neural Networks (BNNs) attracting attention for their extreme efficiency. However, extreme quantization in BNNs limits representational capacity and destabilizes training, posing significant challenges for lightweight architectures with depth-wise convolutions.
To address this, we propose a 1.58-bit convolution to enhance expressiveness and a pre-BN residual connection to stabilize optimization by improving the Hessian condition number. These innovations enable the first successful binarization of depth-wise convolutions in BNNs.
Our method achieves 32M OPs on ImageNet with MobileNet V1, establishing a new state-of-the-art in BNNs by outperforming prior methods with comparable OPs. Moreover, it consistently outperforms existing methods on various datasets, including CIFAR-10, CIFAR-100, STL-10, Tiny ImageNet, and Oxford Flowers 102, with accuracy improvements of up to 9.3 percentage points.
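
As a rough illustration of the "1.58-bit" figure in the abstract above, the sketch below assumes it refers to ternary values in {-1, 0, +1}, since three states carry log2(3) ≈ 1.585 bits. The abstract does not describe the quantizer, so the magnitude-threshold rule, the `ternarize` helper, and the `threshold_ratio` parameter are all assumptions made for this example, not the BD-Net method.

```python
import math

def ternarize(weights, threshold_ratio=0.7):
    """Map each weight to -1, 0, or +1 using a magnitude threshold.

    threshold_ratio is a hypothetical hyperparameter for this sketch:
    weights with magnitude below threshold_ratio * mean(|w|) become 0.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    return [0 if abs(w) < delta else (1 if w > 0 else -1) for w in weights]

# Three possible states per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

# Toy 3x3 depth-wise kernel, flattened row by row.
kernel = [0.42, -0.03, -0.57, 0.08, 0.91, -0.12, 0.02, -0.66, 0.30]
ternary_kernel = ternarize(kernel)

print(f"bits per ternary weight: {bits_per_weight:.3f}")
print(ternary_kernel)
```

Compared with a strictly binary kernel, the extra zero state lets small weights be dropped rather than forced to ±1, which is one plausible reading of the expressiveness gain the abstract mentions.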

BD-Net: Has Depth-Wise Convolution Ever Been Applied in Binary Neural Networks?

Economic decision‑making depends not only on structured signals—such as prices and taxes—but also on unstructured language, including peer dialogue and media narratives. While multi‑agent reinforcement learning (MARL) has shown promise in optimizing economic decisions, it struggles with the semantic ambiguity and contextual richness of language. We propose LAMP (Language‑Augmented Multi‑Agent Policy), the first framework to integrate language into economic decision‑making, narrowing the gap to real‑world settings.
LAMP follows a Think–Speak–Decide pipeline:
(1) Think interprets numerical observations to extract short‑term shocks and long‑term trends, caching high‑value reasoning trajectories.
(2) Speak crafts and exchanges strategic messages based on the reasoning, updating beliefs by parsing peer communications.
(3) Decide fuses numerical data, reasoning, and reflections into a MARL policy to optimize language‑augmented decision‑making.
Experiments in economic simulation show that LAMP outperforms both MARL and LLM‑only baselines in cumulative return (+63.5%, +34.0%), robustness (+18.8%, +59.4%), and interpretability. These results demonstrate the potential of language‑augmented policies to deliver more effective and robust economic strategies.

Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making

Reconstructing high dynamic range (HDR) images from low dynamic range (LDR) bursts plays an essential role in computational photography. Learning-based algorithms have made impressive progress, but they require paired LDR-HDR images. Such pairs are hard to obtain, which motivates the problem of annotation-efficient HDR image reconstruction: how to achieve comparable performance with limited HDR ground truths (GTs). This work addresses the problem from the perspective of semi-supervised learning, where a teacher model generates pseudo HDR GTs for the LDR samples without GTs and a student model learns from those pseudo GTs. However, confirmation bias, i.e., the student learning from artifacts in the pseudo HDR GTs, presents an impediment. To remove it, we propose an uncertainty-based masking process that discards unreliable parts of the pseudo GTs at both the pixel and patch levels, so that the student learns only from trusted areas. With this masking process, our semi-supervised HDR reconstruction method not only outperforms previous annotation-efficient algorithms but also achieves performance comparable to recent fully supervised methods while using only 6.7% of the HDR GTs.
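
The two-level masking idea in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the abstract does not say how uncertainty is estimated or thresholded, so the per-pixel uncertainty map, the patch size, both thresholds, and the helper names `bilevel_mask` and `masked_l1` are hypothetical.

```python
def bilevel_mask(uncertainty, patch=2, pixel_thr=0.5, patch_thr=0.4):
    """Return a 0/1 mask: 1 = trusted pseudo-GT pixel, 0 = discarded.

    uncertainty: 2D list of per-pixel uncertainty values in [0, 1].
    patch, pixel_thr, patch_thr: hypothetical settings for this sketch.
    """
    h, w = len(uncertainty), len(uncertainty[0])
    # Pixel level: drop individual pixels that are too uncertain.
    mask = [[1 if uncertainty[y][x] <= pixel_thr else 0 for x in range(w)]
            for y in range(h)]
    # Patch level: drop whole patches whose mean uncertainty is too high.
    for y0 in range(0, h, patch):
        for x0 in range(0, w, patch):
            cells = [(y, x)
                     for y in range(y0, min(y0 + patch, h))
                     for x in range(x0, min(x0 + patch, w))]
            mean_u = sum(uncertainty[y][x] for y, x in cells) / len(cells)
            if mean_u > patch_thr:
                for y, x in cells:
                    mask[y][x] = 0
    return mask

def masked_l1(student, pseudo_gt, mask):
    """Mean absolute error computed over trusted pixels only."""
    num = sum(abs(s - g) * m
              for rs, rg, rm in zip(student, pseudo_gt, mask)
              for s, g, m in zip(rs, rg, rm))
    den = sum(m for rm in mask for m in rm)
    return num / den if den else 0.0

# Toy 4x4 uncertainty map for one pseudo HDR ground truth.
uncertainty = [
    [0.1, 0.2, 0.9, 0.8],
    [0.1, 0.6, 0.7, 0.9],
    [0.0, 0.1, 0.45, 0.5],
    [0.1, 0.0, 0.4, 0.45],
]
mask = bilevel_mask(uncertainty)

student = [[1.0] * 4 for _ in range(4)]
pseudo_gt = [[0.5] * 4 for _ in range(4)]
loss = masked_l1(student, pseudo_gt, mask)
```

In the toy map, the bottom-right patch passes the pixel-level test everywhere but is still discarded because its average uncertainty is high, which is the kind of case a patch-level check adds on top of a pixel-level one.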

Semi-Supervised High Dynamic Range Image Reconstructing via Bi-Level Uncertain Area Masking

Hypergraph neural networks (HNNs) have emerged as powerful tools for modeling high-order relationships in complex systems. However, most existing HNNs are designed under the assumption of homophily, which does not hold in many real-world scenarios where connected nodes often exhibit diverse semantics, i.e., heterophily. This inconsistency leads to suboptimal aggregation and degraded performance, especially in low-label regimes. While a few recent methods have attempted to enhance heterophilic hypergraph learning, they often rely heavily on label supervision and overlook the potential of self-supervised techniques. In this paper, we propose HeroCL, a heterophily-aware contrastive learning framework that improves hypergraph representation under both structural heterogeneity and label scarcity. Specifically, HeroCL integrates a multi-hop neighbor encoding module to capture informative higher-order context and incorporates two complementary contrastive objectives, label-aware and structure-aware, to guide representation learning from both semantic and relational perspectives. A multi-granularity contrastive strategy is introduced to exploit latent signals across multiple neighborhood levels. Extensive experiments on several benchmark datasets against 11 existing baselines demonstrate that HeroCL achieves consistent and significant performance gains, particularly under strong heterophily and limited supervision, validating its robustness and effectiveness.

Heterophily-aware Contrastive Learning for Heterophilic Hypergraphs

The rapid development of large language models (LLMs) has highlighted the need for efficient and reliable methods to evaluate their performance. Traditional evaluation methods often face challenges like high costs, limited task formats, dependence on human references, and systematic biases. To address these limitations, we propose Auto-PRE, an automatic LLM evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluator LLMs based on three core traits: consistency, pertinence, and self-confidence, which correspond to the instruction, content, and response stages, respectively, and collectively cover the entire evaluation process. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance while significantly reducing evaluation costs. Furthermore, the structured and scalable design of our automatic qualification exam framework provides valuable insights into automating the evaluation of LLMs-as-judges, paving the way for more advanced LLM-based evaluation frameworks.

Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

Most existing RGB-Event trackers rely on strictly aligned datasets, overlooking the asynchronous spatio-temporal resolutions common in real-world scenarios. 
This methodological limitation impedes effective RGB-Event feature alignment and ultimately degrades tracking performance.
To overcome this limitation, we propose AlignTrack, a novel tracking framework built upon a Top-Down Alignment (TDA) strategy inspired by the human visual system. 
Our TDA framework follows an encode-decode-align paradigm: it first encodes multimodal features to generate target-related priors, which are then progressively decoded to guide a subsequent feature alignment pass. 
Within this framework, we introduce two key innovations: (1) a Cross-Prior Attention (CPA) module that effectively generates and integrates cross-modal priors, and (2) a Cross-Modal Semantic Alignment (CSA) loss that maximizes mutual information to enforce semantic consistency between modalities. 
Extensive experiments show that AlignTrack achieves state-of-the-art performance on four challenging RGB-Event tracking benchmarks, demonstrating its robustness in both aligned and unaligned scenarios. 
Ablation studies further validate the significant contribution of each proposed component.

AlignTrack: Top-Down Spatiotemporal Resolution Alignment for RGB-Event Visual Tracking

Large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. 
However, existing studies predominantly resort to prompt-based simulation using frozen LLMs, which frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance.
To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling.
Firstly, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher-quality candidate items.
Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning.
Experimental results on the large-scale video recommendation benchmark MicroLens-100k demonstrate the effectiveness of our proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.
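
For readers unfamiliar with the NDCG@10 metric cited above, the following sketch computes it using the standard definition with linear gain and a log2 position discount. The toy relevance list is made up for illustration and is unrelated to MicroLens-100k; the ideal ranking is taken from the same list, which assumes every relevant item appears in it.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance of a toy ranked list: 1 = the user interacted with the item.
ranked = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
score = ndcg_at_k(ranked, k=10)

print(f"NDCG@10 = {score:.4f}")
```

A score of 1.0 means the relevant items are ranked first; pushing them lower in the list reduces the score through the logarithmic discount.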

VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks.
Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer.
Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.

Visual Bridge: Universal Visual Perception Representations Generating

Multi-agent systems rely on accurate poses to share and align observations, enabling collaborative perception of the environment. However, traditional GNSS-based localization often fails in GNSS-denied environments, making consistent feature alignment difficult in collaboration. To tackle this challenge, we propose a robust GNSS-free collaborative perception framework based on LiDAR localization. Specifically, we propose a lightweight Pose Generator with Confidence (PGC) to estimate compact pose and confidence representations. To alleviate the effects of localization errors, we further develop the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT), which performs confidence-aware spatial alignment while capturing essential temporal context. Additionally, we present a new simulation dataset, V2VLoc, which can be adapted for both LiDAR localization and collaborative detection tasks. V2VLoc comprises three subsets: Town1Loc, Town4Loc, and V2VDet. Town1Loc and Town4Loc offer multi-traversal sequences for training in localization tasks, whereas V2VDet is specifically intended for the collaborative detection task. Extensive experiments conducted on the V2VLoc dataset demonstrate that our approach achieves state-of-the-art performance under GNSS-denied conditions. We further conduct extended experiments on the real-world V2V4Real dataset to validate the effectiveness and generalizability of PASTAT.

V2VLoc: Robust GNSS-Free Collaborative Perception via LiDAR Localization

Spatio-temporal data generation aims to synthesize realistic urban data across graph nodes by learning spatial and temporal dependencies. This task plays a crucial role in urban planning by enabling the simulation of unobserved nodes. However, existing approaches face critical limitations: time series generation methods fail to generalize to unseen nodes, while spatio-temporal generative models are either restricted to the trajectory generation task or dependent on auxiliary data inputs. To bridge these gaps, we propose a Knowledge Graph Guided Heterogeneity-Informed Diffusion Model (KGDiff) with the following key innovations. First, we design a geometry-aware mixture of experts integrating Euclidean, hyperbolic, and hyperspherical representations to comprehensively encode urban structural knowledge. Next, we present a learnable meta spatio-temporal pattern module that normalizes node-specific heterogeneity before the generation process, and a conditional denoising process that progressively transforms random noise into realistic samples under structural guidance. Finally, extensive experiments across real-world urban datasets demonstrate that KGDiff achieves state-of-the-art performance in generating realistic urban spatio-temporal data.

