Cross-tokenizer knowledge distillation, where the teacher and student employ different tokenizers, is becoming increasingly prevalent, yet it poses underexplored challenges: existing methods fail to capture the rich knowledge encoded in teacher logits, because they neglect semantic information, align logits inaccurately and with bias, and discard distributional structure, which ultimately degrades distillation quality. To address these issues, we propose SeDi, a semantics- and distribution-aware knowledge transfer framework tailored for cross-tokenizer distillation. To preserve factual knowledge, SeDi employs bipartite graph-based alignment at the tokenization level and a sliding-window re-encoding strategy at the vocabulary level, enabling unbiased transfer of the teacher's next-token predictions into the student's vocabulary space. To further retain distributional information, we align the student's entropy with the teacher's by incorporating the student's own logits during training, which also helps mitigate exposure bias. Experiments on ten datasets across three task domains and five teacher-student model pairs with varying vocabulary sizes show that SeDi delivers substantial improvements, with gains of up to 19.8%.
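
As a rough illustration of the distribution-aware component, the sketch below shows one way an entropy-alignment term could be computed in PyTorch once teacher logits have already been mapped into the student's vocabulary space. The function names and the absolute-difference penalty are illustrative assumptions for this page, not the paper's actual loss formulation.

```python
import torch
import torch.nn.functional as F


def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the distribution defined by logits, per position."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)


def entropy_alignment_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Penalize the gap between student and teacher predictive entropy.

    The teacher distribution is treated as a fixed target (no gradient),
    so only the student's entropy is pushed toward the teacher's. The
    absolute-difference penalty here is an illustrative choice.
    """
    h_student = entropy(student_logits)
    h_teacher = entropy(teacher_logits).detach()
    return (h_student - h_teacher).abs().mean()


# Toy usage: batch of 2 sequences, 5 positions, shared 32k vocabulary.
# In the cross-tokenizer setting, teacher logits would first need to be
# projected into the student's vocabulary space before this step.
student_logits = torch.randn(2, 5, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 5, 32000)
loss = entropy_alignment_loss(student_logits, teacher_logits)
loss.backward()
```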
