Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and reduced inference efficiency. In this paper, we study the hierarchical attention pattern in vision encoders and propose HiPrune, a training-free and model-agnostic token pruning framework for VLMs. We observe that middle layers of the vision encoder attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects tokens according to their attention scores in the middle and deep layers. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Experiments demonstrate that HiPrune achieves strong pruning performance while maintaining a balance between efficiency and effectiveness.
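To make the selection idea concrete, the sketch below illustrates one way attention-guided token pruning of this kind could look in PyTorch: CLS-to-patch attention from a middle layer scores object-centric tokens, attention from a deep layer scores globally informative tokens, and the two top-k sets are merged. The layer indices, keep ratio, and function name are illustrative assumptions, not the authors' exact procedure.

```python
import torch

def attention_guided_select(attn_maps, keep_ratio=0.25, mid_layer=12, deep_layer=-1):
    """Select visual tokens using CLS->patch attention from a middle and a deep layer.

    attn_maps: list of per-layer attention tensors of shape [batch, heads, seq, seq],
               with the CLS token assumed at position 0.
    Returns indices of the patch tokens to keep.
    """
    def cls_to_patch(layer_attn):
        # Average over heads, take the CLS row, drop the CLS column -> [batch, num_patches]
        return layer_attn.mean(dim=1)[:, 0, 1:]

    mid_score = cls_to_patch(attn_maps[mid_layer])    # object-centric signal (middle layer)
    deep_score = cls_to_patch(attn_maps[deep_layer])  # global-context signal (deep layer)

    num_patches = mid_score.shape[-1]
    budget = max(1, int(keep_ratio * num_patches))
    half = budget // 2

    # Take the top-scoring patches from each layer and merge the two index sets.
    mid_idx = mid_score.topk(half, dim=-1).indices
    deep_idx = deep_score.topk(budget - half, dim=-1).indices
    keep = torch.cat([mid_idx, deep_idx], dim=-1)

    # Deduplicate indices; shown for batch size 1 for simplicity.
    keep = torch.unique(keep[0], sorted=True).unsqueeze(0)
    return keep
```

In use, the returned indices would gather the surviving visual tokens before they are handed to the language model, so the LLM decodes over a much shorter visual sequence without any retraining of the encoder or the VLM.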
