Singapore

The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks.
Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding.
In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers.
To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs.
We prove that, if a randomly initialized MHA with $H$ heads and input dimension $d$ has a hidden dimension of $O(d\log(Hd^{3/2}))$ for the key and value, then with high probability it contains an SLT that approximates an arbitrary MHA with the same input dimension.
Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers.
We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (MHA or transformer) and the target counterpart it approximates decreases exponentially as the hidden dimension of the source model increases.
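To make the notion of a subnetwork concrete, the following is a minimal sketch of a single attention head whose randomly initialized weights are pruned by binary masks. It is an illustration only, not the paper's construction: the names `masked_attention_head`, `M_q`, `M_k`, `M_v`, the uniform initialization, the softmax scaling, and the dimensions are all assumptions. The masks here are drawn at random, so the result is an arbitrary subnetwork rather than an SLT; the theorem concerns the existence of a good mask, not how to find it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_head(X, W_q, W_k, W_v, M_q, M_k, M_v):
    """Single attention head with weights pruned by binary masks (hypothetical helper)."""
    Q = X @ (W_q * M_q)  # pruned query projection
    K = X @ (W_k * M_k)  # pruned key projection
    V = X @ (W_v * M_v)  # pruned value projection
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # assumed standard scaling
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d, n, T = 4, 16, 5  # input dimension, hidden dimension, sequence length (illustrative)
X = rng.normal(size=(T, d))
# Random weights stay fixed; only the masks decide which entries survive.
W_q, W_k, W_v = (rng.uniform(-1, 1, size=(d, n)) for _ in range(3))
M_q, M_k, M_v = (rng.integers(0, 2, size=(d, n)) for _ in range(3))
out = masked_attention_head(X, W_q, W_k, W_v, M_q, M_k, M_v)
```

With all-ones masks the function reduces to the unpruned random head, so the masks are the only free parameters when searching for a subnetwork.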

AAAI 2026

The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

ml: deep neural architectures and foundation models

ml: deep learning theory

ml: learning on the edge & model compression

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Structured light (SL) 3D reconstruction captures the precise surface shape of objects, providing high-accuracy 3D data essential for industrial inspection and cultural heritage digitization. However, existing methods suffer from two key limitations: reliance on scene-specific calibration with manual parameter tuning, and optimization frameworks tailored to specific SL patterns, limiting their generalizability across varied scenarios. We propose General and Unified Structured Light Optimization (GUSLO), a novel framework addressing these issues through two coordinated innovations: (1) single-shot calibration via 2D triangulation-based interpolation that converts sparse matches into dense correspondence fields, and (2) artifact-aware photometric adaptation via explicit transfer functions, balancing generalization and color fidelity. We conduct diverse experiments covering binary, speckle, and color-coded settings. Results show that GUSLO consistently improves accuracy and cross-encoding robustness over conventional methods in challenging industrial and cultural scenarios.

GUSLO: General and Unified Structured Light Optimization

Multimodal large language models (MLLMs) frequently hallucinate by over-committing to spurious visual cues. Prior remedies—Visual and Instruction Contrastive Decoding (VCD, ICD)—mitigate this issue, yet the mechanism remains opaque. We first empirically show that their improvements systematically coincide with redistributions of cross-modal attention. Building on this insight, we propose Attention-Steerable Contrastive Decoding (ASCD), which directly steers the attention scores during decoding. ASCD combines (i) positive steering, which amplifies automatically mined text-centric heads—stable within a model and robust across domains—with (ii) negative steering, which dampens on-the-fly identified critical visual tokens. The method incurs negligible runtime/memory overhead and requires no additional training. Across five MLLM backbones and three decoding schemes, ASCD reduces hallucination on POPE, CHAIR, and MMHal-Bench by up to 38.2% while improving accuracy on standard VQA benchmarks, including MMMU, MM-VET, ScienceQA, TextVQA, and GQA. These results position attention steering as a simple, model-agnostic, and principled route to safer, more faithful multimodal generation.

ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM

Few-shot Video Object Detection addresses the challenge of detecting novel objects in videos with limited labeled examples, overcoming the constraints of traditional detection methods that require extensive training data. This task presents key challenges, including maintaining temporal consistency across frames affected by occlusion and appearance variations, and achieving novel object generalization without relying on complex region proposals. Our novel object-aware temporal modeling approach addresses these challenges by incorporating a filtering mechanism that selectively propagates high-confidence object features across frames. This enables efficient feature progression, reduces noise accumulation, and enhances detection accuracy in few-shot scenarios. By utilizing few-shot trained detection and classification heads with focused feature propagation, we achieve robust temporal consistency without depending on explicit object tube proposals. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, with significant improvements of 4.3%, 5.9%, 4.0%, and 5.9% in AP on FSVOD-500, FSYTV-40, VidOR, and VidVRD datasets, respectively, in the 5-shot setup. Our approach maintains consistent performance gains across 1-shot, 3-shot, and 10-shot configurations, validating its effectiveness across diverse evaluation scenarios. We will make our code base public upon acceptance of the work.

Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection

Endpoint Detection and Response (EDR) systems are a cornerstone of modern threat detection and endpoint protection. However, conventional heuristic- and learning-based approaches often fail to address sophisticated and continuously evolving attack patterns. Recent advances in large language models (LLMs) offer promising capabilities for behavioral analysis in EDR logs, yet their effectiveness is hindered by the high volume of events and the interleaved nature of behavior sequences, which pose significant challenges for long-context modeling and stealthy threat detection. To address these issues, we propose HyperGLLM, a novel detection framework that introduces hypergraph reasoning into LLMs. It first constructs an attribute-value level relation-aware graph to model low-order structural semantics while reducing textual redundancy. Then, it introduces a differential hypergraph module with multi-granularity clustering to capture high-order behavioral dependencies embedded in interleaved events and reinforce threat semantics. Finally, the hypergraph representations are aligned with an LLM for efficient contextual reasoning over potential malicious behaviors. To facilitate empirical evaluation, we curate EDR3.6B-63F, a large-scale EDR dataset containing 3.6 billion events across 63 distinct behavior families. Extensive experiments demonstrate that HyperGLLM significantly outperforms state-of-the-art methods by reducing the false alarm rate to 1.67%, achieving 94.65% accuracy across 63 behavior families, and improving the modeling efficiency of LLMs on long EDR logs. Our framework and dataset provide a solid foundation for future research and support the development of advanced detection solutions in endpoint security.

HyperGLLM: An Efficient Framework for Endpoint Threat Detection via Hypergraph-Enhanced Large Language Models

Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces ($\text{LTL}_f[\mathcal{F}]$) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.

Expressive Temporal Specifications for Reward Monitoring

Tensor Compressive Sensing (TCS) has gained significant attention recently due to its strong ability to preserve the multidimensional structure of data. However, existing TCS methods face three critical challenges: 1) Biased approximation of tensor rank imposed by the convex surrogate Tensor Nuclear Norm (TNN) may interfere with the original low-rank structure of tensor data. 2) Vulnerability to non-Gaussian noise and outliers makes TCS methods highly susceptible to complex noise environments ubiquitous in real-world applications. 3) Most of them are confined to third-order tensors and cannot handle high-order tensor data effectively. To address these challenges, we propose Robust Tensor Compressive Sensing (RTCS) based on M-estimators with three key innovations: 1) We design a novel M-estimator-based low-rank regularizer for order-$d$ ($d \geq 3$) tensors, which provides a superior approximation of tensor rank and better preserves the original data structure. 2) RTCS incorporates a robust Welsch estimator that adaptively mitigates the influence of complex noise and outliers in tensor recovery. 3) RTCS is developed to handle high-order tensors, thereby allowing for broader applicability beyond conventional third-order tensors. We further design an efficient algorithm based on the Alternating Direction Method of Multipliers (ADMM) to handle the complex optimization problem. Experiments show that RTCS consistently outperforms existing approaches across various noise types.

Robust High-Order Tensor Compressive Sensing Based on M-Estimators

Despite the remarkable progress of Auto-Regressive (AR) image generation, its inference latency remains high due to the AR nature and the ambiguity of image tokens—even when employing Speculative Decoding (SD). Recent works have empirically addressed this issue using relaxed SD, but without theoretical grounding. In this paper, we establish the theoretical foundations of relaxed SD and propose Annealed Relaxation of Speculative Decoding (AnnealRSD), grounded in two key insights. First, by analyzing the total variation (TV) distance between the target model and relaxed SD, we derive the optimal resampling distribution that minimizes an upper bound of the TV distance. Second, perturbation analysis reveals an inherent annealing property of relaxed SD, motivating our annealed design. Together, these components enable AnnealRSD to achieve faster generation with comparable quality, or improved quality at the same latency, compared to existing methods. Extensive experiments on image generation validate the effectiveness of AnnealRSD, showing consistent improvements over prior approaches in speed and quality trade-offs.

Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation

The rapid advancement of generative models has increased the demand for detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. Extensive experiments across various generation models and datasets demonstrate that CausalCLIP significantly improves generalization, achieving gains of 4.06% in average precision and 6.82% in accuracy compared to existing state-of-the-art methods. The source code will be publicly available upon publication.

CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

Multimodal Large Language Models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "Visual Prompts" (VPs) like bounding boxes to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap raises uncertainty about whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and utilize them to solve problems. To address this limitation, we introduce VP-Bench, aiming to assess MLLMs’ capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models’ ability to perceive VPs in natural scenes, utilizing 100K visualized prompts spanning 8 shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 21 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL-2.5 and Qwen2.5-VL). In addition, we conduct a comprehensive analysis of the factors influencing VP understanding, such as attribute variations and model scale. VP-Bench establishes a new reference framework for studying MLLMs’ ability to comprehend and resolve grounded referring questions.

VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

Long-tail recognition remains challenging for pre-trained foundation models like CLIP, which often suffer from performance degradation under imbalanced data. This stems not only from the overfitting/underfitting issues during fine-tuning but, more fundamentally, from the inherent bias inherited from the long-tail distribution of their massive pre-training datasets. To address this, we propose HGLTR (Hierarchy-Guided Long-Tail Recognition), a novel framework that calibrates pre-trained models by injecting objective class hierarchy knowledge. We argue that the semantic proximity defined by a hierarchy provides a robust, data-independent prior to counteract model bias. Our method is specifically designed for vision-language models' dual-modality architecture. At the feature level, we align image embeddings with a hierarchy-guided text similarity structure. At the classifier level, we employ a distillation loss to regularize predictions using soft labels derived from the hierarchy. This dual-level injection effectively transfers knowledge from head to tail classes. Experiments on ImageNet-LT, Places-LT, and iNaturalist 2018 demonstrate that HGLTR achieves state-of-the-art performance, particularly in tail-class accuracy, highlighting the importance of leveraging structural priors to calibrate foundation models for real-world data.
