Singapore

Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MedAlign, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MedAlign introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MedAlign consistently improves both performance and interpretability, yielding more visually grounded outputs.

AAAI 2026

Enhancing Medical Large Vision-Language Models via Alignment Distillation

medical large vision-language models

multimodal alignment

hallucinations

knowledge distillation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Qualitative spatial representation approaches that rely on Goodman-style predicative mereological theories and on a pseudo-topology often cause problems when used as meta-information for knowledge conceptualization in advanced geometric reasoning, since they lack Euclidean geometry and fully-fledged topological spaces in the classical sense. Therefore, this paper seeks to extend an existing formalization, grounded in an underlying type theory using the Coq language, together with the Whitehead-like point-free Tarski’s geometry. More precisely, we leverage an available library called λ-MM to formalize Tarski’s geometry of solids by investigating an algebraic formulation of topological relations on top of the library. Given that Tarski’s work is grounded in Leśniewski’s mereology, and despite the fact that λ-MM barely implements Tarski’s geometry, the first part of the paper supplements their work by proving that mereological classes correspond to regular open sets. These classes form a topology of individual names extensible with Tarski’s geometric primitives. Unlike classical approaches used in qualitative logical theories, we adopt a solution that enables the specification of a topological space from mereology and a geometric subspace, thereby enhancing the theory’s expressiveness. Then, in the second part, we prove that Tarski’s geometry forms a subspace of the previous topology in which regions are restricted classes. We prove three postulates of Tarski’s work, thereby reducing his axiomatic system, and extend the theory with the T2 (Hausdorff) property and additional definitions.

A Topological Rewriting of Tarski’s Mereogeometry

Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g., low-rank approximation or attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy, a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in convolutional networks, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices (Q, K, V, O) into shared dictionary atoms, reducing the attention module’s parameters by 66.7% (e.g., 226.5M → 75M in a 700M-parameter model) while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement, trained with standard optimizers, and represents each layer’s weights as linear combinations of shared matrix atoms. Experiments across scales (100M–700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on large pretrained models to reduce their number of parameters without experiencing any significant drop in their performance.

Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
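As a rough illustration of the idea of representing each layer's weights as linear combinations of shared matrix atoms, the sketch below builds one projection matrix from a small shared dictionary and compares parameter counts. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the square `d_model x d_model` shape, and the choice of one dictionary per projection type are all hypothetical.

```python
def combine(atoms, coeffs):
    """Return sum_k coeffs[k] * atoms[k] for equally shaped matrices (lists of rows)."""
    rows, cols = len(atoms[0]), len(atoms[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for c, atom in zip(coeffs, atoms):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += c * atom[i][j]
    return out


def param_counts(num_layers, d_model, num_atoms):
    """Parameters for ONE projection type (e.g. Q) across all layers.

    dense:  every layer owns its own d_model x d_model matrix.
    shared: num_atoms shared matrices plus num_atoms mixing
            coefficients per layer (assumed layout, for illustration).
    """
    dense = num_layers * d_model * d_model
    shared = num_atoms * d_model * d_model + num_layers * num_atoms
    return dense, shared


# Two shared atoms; a layer's weight is a weighted sum of them.
atoms = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
layer_weight = combine(atoms, [3.0, 0.5])

dense, shared = param_counts(num_layers=12, d_model=64, num_atoms=4)
reduction = 1.0 - shared / dense
```

In this toy configuration the dictionary holds one third as many matrices as there are layers, so the reduction comes out close to two thirds; the actual dictionary size and layout used in the paper are not stated in the abstract.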

Local search is a fundamental method in operations research and combinatorial optimisation. It has been widely applied to a variety of challenging problems, including multi-objective optimisation where multiple, often conflicting, objectives need to be simultaneously considered. In multi-objective local search algorithms, a common practice is to maintain an archive of all non-dominated solutions found so far, from which the algorithm iteratively samples a solution to explore its neighbourhood. A central issue in this process is how to explore the neighbourhood of a selected solution. In general, there are two main approaches: 1) systematic exploration and 2) random sampling. The former systematically explores the solution's neighbours until a stopping condition is met, for example when the neighbourhood is exhausted (i.e., the best improvement strategy) or once a better solution is found (i.e., first improvement). In contrast, the latter randomly selects and evaluates only one neighbour of the solution. One might expect systematic exploration to be more efficient, as it avoids revisiting the same neighbours multiple times. In this paper, however, we show that this may not be the case. We first empirically demonstrate that the random sampling method consistently outperforms the systematic exploration method across a range of multi-objective problems, including 0-1 Knapsack, NK-Landscape, TSP and QAP. We then give an intuitive explanation for this phenomenon using toy examples, showing that the superior performance of the random sampling method relies on the distribution of “good neighbours”. Next, we show that the number of such neighbours follows a certain probability distribution during the search. Lastly, building on this distribution, we provide theoretical insight into why random sampling outperforms systematic exploration, regardless of whether the best improvement or first improvement strategy is used.

Random is Faster than Systematic in Multi-Objective Local Search
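The random sampling scheme described in the abstract above (pick a solution from the archive of non-dominated solutions, evaluate one random neighbour, update the archive) can be sketched as follows. This is only an illustrative sketch: the bit-flip neighbourhood, the toy bi-objective function `toy_objectives`, and all function names are assumptions, not the benchmark problems or code from the paper.

```python
import random


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximisation)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive, cand, f):
    """Insert cand unless it is dominated by (or ties with) an archived solution."""
    fc = f(cand)
    if any(dominates(f(s), fc) or f(s) == fc for s in archive):
        return archive
    # Drop archived solutions that the candidate dominates, then add it.
    return [s for s in archive if not dominates(fc, f(s))] + [cand]


def random_sampling_step(archive, f, rng):
    """Sample one archived solution and evaluate a single random bit-flip neighbour."""
    sol = rng.choice(archive)
    i = rng.randrange(len(sol))
    neighbour = sol[:i] + (1 - sol[i],) + sol[i + 1:]
    return update_archive(archive, neighbour, f)


def toy_objectives(s):
    """Hypothetical conflicting objectives: many ones vs. small index-weighted sum."""
    return (sum(s), -sum(i * b for i, b in enumerate(s)))


rng = random.Random(0)
archive = [(0, 0, 0, 0, 0, 0)]
for _ in range(200):
    archive = random_sampling_step(archive, toy_objectives, rng)
```

A systematic variant would instead loop over every index `i` of the selected solution until its stopping condition is met; the archive update would stay the same.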

Brain decoding currently faces significant challenges in individual differences, modality alignment, and high-dimensional embeddings. To address individual differences, researchers often use source subject data, which leads to issues such as privacy leakage and heavy data storage burdens. In modality alignment, current works focus on aligning the softmax probability distribution but neglect the alignment of marginal probability distributions, resulting in modality misalignment. Additionally, images and text are aligned separately with fMRI without considering the complex interplay between images and text, leading to poor image reconstruction. Finally, the enormous dimensionality of CLIP embeddings causes significant computational costs. Although the dimensionality of CLIP embeddings can be reduced by ignoring the number of patches obtained from images and the number of tokens acquired from text, this comes at the cost of a significant drop in model performance, creating a dilemma. To overcome these limitations, we propose a source-free domain adaptation-based brain decoding framework. Firstly, we apply source-free domain adaptation, which requires only the source model and no access to source data during target model adaptation, to brain decoding to address cross-subject variations, privacy concerns, and the heavy burden of data storage. Secondly, we employ maximum mean discrepancy (MMD) to align the marginal probability distributions between embeddings of different modalities. Moreover, to accommodate the complex interplay between image and text, we concatenate the embeddings of image and text and then use singular value decomposition (SVD) to obtain a new embedding. Furthermore, to achieve better image generation quality, we employ the Wasserstein distance (WD) to align the probability distributions of new embeddings. Finally, in the target model adaptation phase of source-free domain adaptation, we employ low-rank adaptation (LoRA) to reduce the high expense of tuning the target model. Extensive experiments demonstrate that our work outperforms state-of-the-art methods for brain decoding tasks.

Probability Distribution Alignment and Low-Rank Weight Decomposition for Source-Free Domain Adaptive Brain Decoding
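For readers unfamiliar with maximum mean discrepancy (MMD), the sketch below computes the standard biased estimate of squared MMD between two sets of embedding vectors with a Gaussian (RBF) kernel. It illustrates the general technique only; the kernel choice, the `gamma` bandwidth, and the function names are assumptions, and nothing here reflects how the paper applies MMD to fMRI, image, or text embeddings.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_k(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


# Two toy "embedding" sets: ys is xs shifted far away, so the estimate is large.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ys = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

same = mmd2(xs, xs)
different = mmd2(xs, ys)
```

Used as a training loss, a quantity like `mmd2` would be minimised so that the two embedding distributions move closer together; identical sample sets give a value of zero.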

Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.

HalluClean: A Unified Framework to Combat Hallucinations in LLMs

Graph neural networks (GNNs) have demonstrated strong performance in various data mining tasks but rely heavily on extensively labeled nodes. To improve training efficiency, graph active learning (GAL) has emerged as a solution for selecting the most informative nodes for labeling. However, existing GAL methods are primarily designed for homophilic graphs, and their performance on heterophilic graphs remains underexplored. In this work, we systematically study active learning on heterophilic graphs, a setting that has received limited attention. Surprisingly, we observe that existing GAL methods often fail to outperform naive random sampling on heterophilic graphs. Through an in-depth investigation, we reveal that these methods implicitly assume homophily even on heterophilic graphs, leading to suboptimal performance. To address this issue, we introduce the principle of “Know Your Neighbors” and propose an active learning algorithm, KyN, specifically for heterophilic graphs. The core idea of KyN is to provide GNNs with correct estimations of homophily distribution by labeling nodes together with their neighbors. We implement KyN based on subgraph sampling with probabilities proportional to ℓ₁ Lewis weights, which is supported by solid theoretical guarantees. Extensive experiments on diverse real-world datasets, including a large heterophilic graph with over 2 million nodes, demonstrate the effectiveness and scalability of KyN.

Know Your Neighbors: Subgraph Importance Sampling for Heterophilic Graph Active Learning

Recent advances in naturalistic physical adversarial patch generation show great promise in protecting personal privacy against detector-based malicious surveillance while remaining inconspicuous to human observers. In this work, we present the first systematic categorization and in-depth re-examination of existing methods into three representative paradigms, revealing a pervasive imbalance: enforcing naturalness constraints inherently restricts the adversarial search space, thus limiting attack performance. To address this challenge, we propose a novel paradigm based on class-optimized diffusion, termed Diff-NAT. Diff-NAT leverages pretrained diffusion models as powerful natural image priors and introduces a unified iterative framework that jointly optimizes two complementary components: semantic-level textual prompts and instance-level latent codes. Specifically, prompt optimization enables broad traversal across inter-class semantic regions, while latent refinement allows for fine-grained manipulation within class objectives. This dual-level optimization facilitates progressive navigation toward adversarial distributions embedded within the natural semantic manifold. Extensive experiments in both digital and physical settings demonstrate that Diff-NAT outperforms existing SOTA approaches in terms of both visual realism and aggressiveness.

Diff-NAT: Better Naturalistic and Aggressive Adversarial Attacks via Class-Optimized Diffusion for Object Detection

Hate speech detection on Chinese social media platforms poses distinct challenges, particularly due to the widespread use of cloaking techniques designed to evade conventional text-based detection systems. Although large language models (LLMs) have recently improved hate speech detection capabilities, the majority of existing work has concentrated on English datasets, with limited attention given to multimodal strategies in the Chinese context. In this study, we propose MMBERT, a novel BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture. To address the instability associated with directly integrating MoE into BERT-based models, we develop a progressive three-stage training paradigm. MMBERT incorporates modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy to enhance robustness against adversarial perturbations. Empirical results on several Chinese hate speech datasets show that MMBERT significantly surpasses fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs utilizing in-context learning approaches.

MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection Under Cloaking Perturbations

Multi-view indoor radar perception has drawn attention due to its cost-effectiveness and low privacy risks. Existing methods often rely on implicit cross-view radar feature association, such as proposal pairing in RFMask or query-to-feature cross-attention in RETR, which can lead to ambiguous feature matches and degraded detection in complex indoor scenes. To address these limitations, we propose REXO (multi-view Radar object dEtection with 3D bounding boX diffusiOn), which lifts the 2D bounding box (BBox) diffusion process of DiffusionDet into the 3D radar space. REXO utilizes these noisy 3D BBoxes to guide an explicit cross-view radar feature association, enhancing the cross-view radar-conditioned denoising process. By accounting for prior knowledge that the person is in contact with the ground, REXO reduces the number of diffusion parameters by determining them from this prior. Evaluated on two open indoor radar datasets, our approach surpasses state-of-the-art methods by a margin of +4.22 AP on the HIBER dataset and +11.02 AP on the MMVR dataset.

Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion

The multi-path commodity flow problem (MPCFP) is crucial for ensuring reliable and high-speed data transmission in communication networks. However, existing studies that employ pre-generated routing paths neglect real-time load state and the coupling among decisions, thus hindering the achievement of high-quality solutions. To overcome this, we propose Hierarchical Reinforcement Learning with Topology-Aware Exploration (HRL-TAE), which is the first fully end-to-end framework that dynamically produces high-quality solutions based on real-time network states. HRL-TAE integrates an exploration mechanism and utilizes the State Transition Guiding List (STGL) to guide state transitions, thereby transforming topology exploration into a Markov decision process. Guided by STGL, two closely coupled layers in HRL-TAE, that is, the path construct layer and the ratio allocate layer, construct multiple subpaths for each flow and allocate traffic ratios among them. Subsequently, adaptive constraint-driven masks exclude infeasible actions during decision making, thereby guaranteeing that all constraints are satisfied. We also adopt a tailored training approach to obtain accurate gradient estimates and improve training efficiency. Simulations and real-world experiments demonstrate that HRL-TAE achieves superior performance.
