Singapore

Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that a vanilla MDM still lags behind ARMs in accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model with a self-reflection mechanism for Scene Text Recognition. Specifically, we identify two key challenges in applying MDMs to STR: a noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a novel self-reflection mechanism that compels MDiff4STR to reflect on and revise overly confident yet incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as settings with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy. Meanwhile, it maintains fast inference because it requires only three denoising steps. Code will be released.

AAAI 2026

MDiff4STR: Mask Diffusion Model for Scene Text Recognition

cv: scene analysis & understanding

cv: applications

technical paper
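
The abstract describes decoding in a small number of parallel denoising steps. The following is a minimal sketch of generic confidence-based mask decoding, not the MDiff4STR implementation: the `MASK` symbol, the `decode` function, the linear unmasking schedule, and the `toy_predict` recognizer are all hypothetical, and the paper's six noising strategies and self-reflection mechanism are not reproduced. Note that in this sketch a committed token is never revised, which is the kind of overconfidence problem the abstract says self-reflection is meant to address.

```python
from typing import Callable, List, Tuple

MASK = "_"  # placeholder symbol for a masked position (assumption)
Prediction = Tuple[str, float]  # (predicted character, confidence in [0, 1])


def decode(predict: Callable[[List[str]], List[Prediction]],
           length: int, steps: int = 3) -> str:
    """Fill a fully masked sequence in `steps` parallel passes."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        preds = predict(tokens)
        # Number of positions allowed to stay unmasked after this pass
        # (linear schedule; the last pass keeps everything).
        keep = (length * step) // steps
        # Rank positions by confidence, most confident first.
        ranked = sorted(range(length), key=lambda i: preds[i][1], reverse=True)
        kept = set(ranked[:keep])
        # Low-confidence positions are masked again and re-predicted next pass.
        tokens = [preds[i][0] if i in kept else MASK for i in range(length)]
    return "".join(tokens)


def toy_predict(tokens: List[str]) -> List[Prediction]:
    """Hypothetical recognizer that misreads 'e' as 'c' without enough context."""
    target = "hello"
    base_conf = [0.9, 0.3, 0.8, 0.7, 0.6]
    visible = sum(t != MASK for t in tokens)
    preds = []
    for i, tok in enumerate(tokens):
        if tok != MASK:
            preds.append((tok, 1.0))  # already committed in an earlier pass
        elif i == 1 and visible < 2:
            preds.append(("c", base_conf[i]))  # wrong guess, low confidence
        else:
            preds.append((target[i], base_conf[i]))
    return preds


if __name__ == "__main__":
    print(decode(toy_predict, length=5, steps=1))  # single pass
    print(decode(toy_predict, length=5, steps=3))  # three passes
```

With this toy recognizer, a single pass commits the low-confidence misreading, while three passes re-predict that position once more context is visible.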

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Detecting Alzheimer’s disease (AD) from narrative transcripts challenges large language models (LLMs): pre-training rarely covers this out-of-distribution task, and all transcript demos describe the same scene, producing highly homogeneous contexts. These factors cripple both the model’s built-in task knowledge ($\textbf{task cognition}$) and its ability to surface subtle, class-discriminative cues ($\textbf{contextual perception}$). Because cognition is fixed after pre-training, improving in-context learning (ICL) for AD detection hinges on enriching perception through better demonstration (demo) sets. We demonstrate that standard ICL quickly saturates: its demos lack diversity (context width) and fail to convey fine-grained signals (context depth). We also show that, although recent task vector (TV) approaches improve broad task adaptation by injecting TVs into the LLM's hidden states, they are ill-suited for AD detection due to mismatches in injection granularity, strength, and position. To address these bottlenecks, we introduce $\textbf{DA4ICL}$, a demo-centric anchoring framework that jointly expands context width via $\textbf{\emph{Diverse and Contrastive Retrieval}}$ (DCR) and deepens each demo's signal via $\textbf{\emph{Projected Vector Anchoring}}$ (PVA) at every Transformer layer. Across three AD benchmarks, DA4ICL achieves large, stable gains over both ICL and TV baselines, charting a new paradigm for fine-grained, OOD and low-resource LLM adaptation.

Beyond Plain Demos: A Demo-Centric Anchoring Paradigm for In-Context Learning in Alzheimer’s Disease Detection

In recent years, human-AI cognitive consistency has emerged as a crucial perspective for evaluating the perceptual quality and interpretability of AIGC (Artificial Intelligence Generated Content). This paper proposes a biologically inspired saliency prediction framework that models six core regions of the human visual system—namely V1, V2, V4, MT, LIP, and FEF—using liquid neurons to capture the dynamic saliency features aligned with human gaze behavior. To enable effective alignment between AIGC models and human cognitive mechanisms, we introduce a cross-domain dual-teacher distillation strategy and construct a large-scale multimodal dataset comprising natural images, eye-tracking data, AIGC-generated images, and their corresponding cross-attention maps. Furthermore, we propose HAMCI (Human-AI Mutual Cognitive Index), a novel metric designed to quantitatively assess the spatial and semantic alignment between predicted saliency maps and model attention distributions. The proposed method demonstrates promising performance across various saliency prediction and cognitive alignment tasks, with results comparable to or surpassing recent state-of-the-art methods in several benchmarks. The code and dataset will be released upon acceptance to facilitate future research on cognitively aligned AIGC evaluation.

A Brain-Inspired Saliency Prediction Framework for Human-AI Cognitive Consistency in AIGC Content via Multi-Region Liquid Neurons

Accurate medical time series (MedTS) classification is essential for effective clinical diagnosis, yet remains challenging due to complex multi-channel temporal dependencies, information redundancy, and label scarcity.
While transformer-based models have shown promise in time series analysis, most are designed for forecasting tasks and fail to fully exploit the unique characteristics of MedTS.
In this paper, we introduce MedSpaformer, a transformer-based framework tailored for MedTS classification. It incorporates a sparse token-based dual-attention mechanism that enables global context modeling and token sparsification, allowing dynamic feature refinement by focusing on informative tokens while reducing redundancy.
This mechanism is integrated into a multi-granularity cross-channel encoding scheme to capture intra- and inter-granularity temporal dependencies and inter-channel correlations, enabling progressive refinement of task-relevant patterns in medical signals.
The sparsification design allows our model to flexibly accommodate inputs with variable lengths and channel dimensions. We also introduce an adaptive label encoder to extract label semantics and address cross-dataset label space misalignment. Together, these components enhance the model’s transferability across heterogeneous medical datasets, which helps alleviate the challenge of label scarcity.
Our model outperforms 13 baselines across 7 medical datasets under supervised learning. It also excels in few-shot learning and demonstrates zero-shot capability in both in-domain and cross-domain diagnostics.
These results highlight MedSpaformer's robustness and its potential as a unified solution for MedTS classification across diverse settings.

MedSpaformer: A Transferable Transformer with Multi-Granularity Token Sparsification for Medical Time Series Classification

Fine-tuning plays an essential role in improving the performance of large language models (LLMs) on specific tasks. A central challenge lies in designing a data-efficient strategy to achieve better fine-tuning performance. Curriculum learning, which organizes data from easy to hard, has become a widely adopted technique in LLM training. However, existing methods for curriculum learning focus only on the difficulty of samples, while neglecting their contribution to improving model performance, making them vulnerable when applied to fine-tuning LLMs. To address this, we propose Difficulty-Utility Curriculum Learning (DUCL), a curriculum learning framework that jointly considers difficulty and utility. DUCL introduces a novel scoring method, Difficulty-Utility Evaluation (DUE), and a soft scheduling strategy called Window Ordering, which together promote efficient and effective fine-tuning. Our method not only improves convergence and final performance with negligible computational overhead, but is also broadly applicable across a wide range of tasks, making it a practical and scalable solution for LLM fine-tuning.

Difficulty Is Not Enough: Curriculum Learning for LLMs Fine-tuning Must Consider Utility

Parametric multi-objective optimization (PMO) addresses the challenge of solving an infinite family of multi-objective optimization problems, where optimal solutions must adapt to varying parameters. Traditional methods require re-execution for each parameter configuration, leading to prohibitive costs when objective evaluations are computationally expensive. To address this issue, we propose Parametric Pareto Set Learning with multi-objective Bayesian Optimization (PPSL-MOBO), a novel framework that learns a unified mapping from both preferences and parameters to Pareto-optimal solutions. PPSL-MOBO leverages a hypernetwork with Low-Rank Adaptation (LoRA) to efficiently capture parametric variations, while integrating Gaussian process surrogates and hypervolume-based acquisition to minimize expensive function evaluations. We demonstrate PPSL-MOBO's effectiveness on two challenging applications: multi-objective optimization with shared components, where certain design variables must be identical across solution families due to modular constraints, and dynamic multi-objective optimization, where objectives evolve over time. Unlike existing methods that cannot directly solve PMO problems in a unified manner, PPSL-MOBO learns a single model that generalizes across the entire parameter space. By enabling instant inference of Pareto sets for new parameter values without retraining, PPSL-MOBO provides an efficient solution for expensive PMO problems.

Parametric Pareto Set Learning for Expensive Multi-Objective Optimization

The widespread application of Large Language Models (LLMs) has motivated a growing interest in their capacity for processing dynamic graphs. Temporal motifs, as an elementary unit and an important local property of dynamic graphs which can directly reflect anomalies and unique phenomena, are essential for understanding their evolutionary dynamics and structural features. However, leveraging LLMs for temporal motif analysis on dynamic graphs remains relatively unexplored. In this paper, we systematically study LLM performance on temporal motif-related tasks. Specifically, we propose a comprehensive benchmark, LLMTM (Large Language Models in Temporal Motifs), which includes six tailored tasks across nine temporal motif types. We then conduct extensive experiments to analyze the impacts of different prompting techniques and LLMs on model performance. Informed by our benchmark findings, we develop a tool-augmented LLM agent that leverages meticulously engineered prompts to solve these tasks with high accuracy. Nevertheless, the high accuracy of the agent incurs a substantial cost. To address this trade-off, we propose a simple yet effective structure-aware dispatcher that considers both the dynamic graph's structural properties and the LLM's cognitive load to intelligently dispatch queries between the standard LLM prompting and the more powerful agent. Our experiments demonstrate that the structure-aware dispatcher effectively maintains high accuracy while reducing cost. Our data and code are publicly available on https://anonymous.4open.science/r/ghjfvhghjvk654165.

LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs
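
To make the notion of a temporal motif concrete, here is a minimal sketch that counts one common motif, a directed triangle whose three edges appear in increasing time order within a window `delta`. The abstract does not list LLMTM's nine motif types or its six tasks, so the choice of motif, the `(source, target, timestamp)` edge format, and the function name `count_temporal_triangles` are assumptions for illustration only. The brute-force enumeration is cubic in the number of edges and is meant for small examples.

```python
from itertools import combinations
from typing import List, Tuple

Edge = Tuple[str, str, int]  # (source, target, timestamp); assumed format


def count_temporal_triangles(edges: List[Edge], delta: int) -> int:
    """Count cycles a->b, b->c, c->a with t1 < t2 < t3 and t3 - t1 <= delta."""
    ordered = sorted(edges, key=lambda e: e[2])
    count = 0
    # combinations() preserves the time-sorted order of the input.
    for (a, b, t1), (c, d, t2), (e, f, t3) in combinations(ordered, 3):
        in_order = t1 < t2 < t3
        in_window = t3 - t1 <= delta
        # Edges must chain head-to-tail and visit three distinct nodes.
        is_cycle = b == c and d == e and f == a and len({a, b, d}) == 3
        if in_order and in_window and is_cycle:
            count += 1
    return count


if __name__ == "__main__":
    graph = [("u", "v", 1), ("v", "w", 2), ("w", "u", 3), ("w", "u", 10)]
    print(count_temporal_triangles(graph, delta=5))
    print(count_temporal_triangles(graph, delta=10))
```

Widening the window admits the late closing edge at time 10, so the same graph contains more motif instances under a larger `delta`.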

Transmitting electromagnetic waves into the ground and receiving the reflected signals can reveal the structure of subsurface defects. However, the imaging process of ground-penetrating radar (GPR) is highly susceptible to interference from complex underground environments, leading to nonlinear attenuation and noise. This makes it challenging to directly localize and identify defect types from raw reflected radar waveform images. Currently, mainstream methods of manual radar signal gain and filtering rely heavily on expert experience, while common end-to-end generative models are typically designed for optical images. This paper proposes a defect-guided Multi-window Gabor Transform Network (MGT-Net) for GPR B-Scan image reconstruction, which achieves automatic gain and defect enhancement of raw GPR signals. Firstly, a Multi-window Gabor Transform Module (MGTM) is designed to effectively represent and extract spatial-frequency features of defects at different locations and of various types. Secondly, a defect guidance network (DG-Net) is constructed to accurately direct the reconstruction of defect areas and enhance the saliency and discriminability of defect features. Additionally, we construct a large-scale GPR B-Scan image dataset (GRD) containing 41,613 images across 7 defect categories. Experimental results show the superior performance of MGT-Net, achieving state-of-the-art (SOTA) SSIM of 81.72% ± 3.5% and PSNR of 30.50 ± 0.442.

Multi-Window Gabor Transform Network for Ground Penetrating Radar B-Scan Image Reconstruction
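
The abstract above reports reconstruction quality as PSNR. As a reminder of what that number measures, here is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE). The abstract does not state the images' bit depth or evaluation protocol, so the 8-bit peak value of 255, the flattened pixel-list input, and the function name `psnr` are assumptions.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstructed: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(reconstructed) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Every pixel is off by 10 gray levels, so MSE = 100.
    print(round(psnr([100, 100], [110, 90]), 2))
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB.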

We introduce the Block Rearrangement Problem (BRaP), a challenging component of large warehouse management which involves rearranging storage blocks within dense grids to achieve a target state. We formally define the BRaP as a graph search problem. Building on intuitions from sliding puzzle problems, we propose five search-based solution algorithms, leveraging joint configuration space search, classical planning, multi-agent pathfinding, and expert heuristics. We evaluate the five approaches empirically for plan quality and scalability. Despite the exponential relation between search space size and grid density, our methods demonstrate efficiency in creating rearrangement plans for deeply buried blocks in up to 80x80 grids.

Symbolic Planning and Multi-Agent Path Finding in Extremely Dense Environments with Unassigned Agents
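
The abstract frames block rearrangement as a graph search problem and mentions joint configuration space search. The sketch below shows that idea on a toy scale with breadth-first search. It is not one of the paper's five algorithms, and the formal BRaP definition is not given in the abstract, so the move model (one block slides to an orthogonally adjacent empty cell per step), the single target block, and the function name `min_moves` are assumptions borrowed from the sliding-puzzle analogy.

```python
from collections import deque
from typing import Iterable, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)


def min_moves(rows: int, cols: int, target: Cell,
              others: Iterable[Cell], goal: Cell) -> Optional[int]:
    """Fewest single-cell slides needed to bring the target block to `goal`."""
    start = (target, frozenset(others))
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (tgt, obs), dist = queue.popleft()
        if tgt == goal:
            return dist
        occupied = obs | {tgt}
        for block in occupied:
            r, c = block
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                if not inside or nxt in occupied:
                    continue  # off the grid or blocked by another block
                if block == tgt:
                    state = (nxt, obs)
                else:
                    state = (tgt, (obs - {block}) | {nxt})
                if state not in seen:
                    seen.add(state)
                    queue.append((state, dist + 1))
    return None  # goal unreachable, e.g. no empty cell to move into


if __name__ == "__main__":
    # 2x2 grid: the goal cell is occupied and must be cleared first.
    print(min_moves(2, 2, target=(0, 0), others=[(0, 1)], goal=(0, 1)))
```

The number of joint configurations grows combinatorially with grid size and density, which is why exhaustive search like this is only practical for tiny grids and why the abstract turns to planning, pathfinding, and heuristics for large ones.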

The \textsc{Ride-Sharing Assignment Problem} (AAAI 2018) is a fundamental problem in intelligent transportation systems, urban mobility, and algorithmic decision-making. Given a set of $m$ vehicles with initial locations and $n$ requests ($n \leq mk$), each with a specified origin and destination, the goal is to assign at most $k$ requests to each vehicle and compute corresponding routes that minimize the total travel distance. The algorithmic approach depends on whether $n = mk$ or $n < mk$. In this paper, we present algorithms with provable approximation guarantees for both cases. When $n = mk$, we design a $\min\{\mathcal{O}(\sqrt{k}), \mathcal{O}(\sqrt{\frac{n}{k}})\}$-approximation algorithm, whereas previously the ratio $\mathcal{O}(\sqrt{k})$ was only proved for $k$ being a power of 2. When $n < mk$, we achieve an approximation ratio of $\mathcal{O}(\sqrt{k} \log \max\{n, m\})$, breaking the natural $\mathcal{O}(k)$ barrier. We also conduct experiments to evaluate the empirical performance of our algorithms. The results show that our solutions consistently outperform those produced by the previously existing algorithm.

Improved Algorithms for Trip-Vehicle Assignment in Ride-Sharing

The ability to self-revise is critical for AI agents. To maintain trust and foster positive perceptions, AI systems must correct their mistakes and adapt to users’ changing needs. We present a metacognitive architecture for self-revision in SAMI, an AI social agent deployed in Georgia Tech’s OMSCS program. Over the past ten semesters, SAMI has facilitated social connections for more than 11,000 students. Real-world deployments revealed frequent requests from students to revise the knowledge database, either to correct errors or to update their information. To address this need, we present a self-revision architecture that integrates Knowledge-Based AI (KBAI) and Generative AI (GenAI). The architecture (1) localizes the task requiring revision by introspecting on its self-model, (2) updates the knowledge database, and (3) communicates the revision process back to the user. We evaluate the framework using feedback cases derived from real student data and observed revision needs. This work introduces a novel metacognitive approach to improving explainability through the integration of KBAI and GenAI, with a clear path toward real-world deployment.
