Cooperative perception is critical for autonomous driving, as it overcomes the inherent limitations of a single vehicle such as occlusions and a constrained field of view. However, current approaches that share dense Bird's-Eye-View (BEV) features are constrained by quadratically scaling communication costs and lack the flexibility and interpretability needed for precise alignment across asynchronous or disparate viewpoints. While emerging sparse query-based methods offer an alternative, they often suffer from inadequate geometric representations, suboptimal fusion strategies, and training instability. In this paper, we propose SparseCoop, a fully sparse cooperative perception framework for 3D detection and tracking that completely discards intermediate BEV representations. Our framework is built on a trio of innovations designed for robust and efficient fusion: a kinematic-grounded instance query that uses an explicit state vector with 3D geometry and velocity for precise spatio-temporal alignment; a coarse-to-fine aggregation module that effectively integrates information from both matched and unmatched instances; and a cooperative instance denoising task that provides stable, abundant supervision to accelerate and stabilize training. Experiments on the V2X-Seq and Griffin datasets demonstrate that SparseCoop achieves new state-of-the-art performance in both 3D detection and tracking. Notably, it delivers this performance with superior computational efficiency and a highly competitive transmission cost, while remaining remarkably robust to real-world challenges such as communication latency.
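To make the kinematic-grounded instance query more concrete, the sketch below illustrates one way an explicit state vector with 3D geometry and velocity could support latency compensation under a constant-velocity assumption. This is a minimal illustration, not the paper's implementation: the state layout, the function name, and the numbers are assumptions chosen for clarity.

```python
# Minimal sketch (illustrative only): aligning an asynchronous cooperative
# instance query by propagating its explicit kinematic state. Assumed state
# layout: [x, y, z, w, l, h, yaw, vx, vy].
import numpy as np

def propagate_query_state(state: np.ndarray, dt: float) -> np.ndarray:
    """Shift the query's 3D center by its velocity to compensate for a
    communication latency of dt seconds (constant-velocity model)."""
    aligned = state.copy()
    aligned[0] += state[7] * dt  # x advanced by vx * dt
    aligned[1] += state[8] * dt  # y advanced by vy * dt
    return aligned

# Example: an infrastructure-side query arriving 100 ms late.
remote_query = np.array([12.0, -3.5, 0.8,   # box center x, y, z (m)
                         4.6, 1.9, 1.6,     # box size w, l, h (m)
                         0.3,               # yaw (rad)
                         8.0, -1.0])        # velocity vx, vy (m/s)
print(propagate_query_state(remote_query, dt=0.1))
```

Because the state is explicit rather than embedded in dense features, this kind of alignment reduces to an interpretable geometric operation, which is one way to read the abstract's claim of robustness to communication latency.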