United States

Although multiview fusion has demonstrated potential in LiDAR segmentation, its dependence on computationally intensive point-based interactions, arising from the lack of fixed correspondences between views such as range view and Bird's-Eye View (BEV), hinders its practical deployment. This paper challenges the prevailing notion that multiview fusion is essential for achieving high performance. We demonstrate that significant gains can be realized by directly fusing Polar and Cartesian partitioning strategies within the BEV space. Our proposed BEV-only segmentation model leverages the inherent fixed grid correspondences between these partitioning schemes, enabling a fusion process that is orders of magnitude faster (160x speedup) than conventional point-based methods. Furthermore, our approach facilitates dense feature fusion, preserving richer contextual information compared to sparse point-based alternatives. To enhance scene understanding without compromising efficiency, we introduce a hybrid Transformer-CNN architecture. Extensive evaluation on the SemanticKITTI and nuScenes datasets provides compelling evidence that our method outperforms previous multiview fusion approaches in terms of both performance and inference speed, highlighting the potential of BEV-based fusion for LiDAR segmentation. The code will be made publicly available to foster further research and development in this domain.
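The "fixed grid correspondences" mentioned in the abstract can be illustrated with a small sketch. Because both partitions live in the same BEV plane, the polar cell that contains each Cartesian cell centre depends only on the grid geometry, so it can be computed once and reused for every scan. The grid sizes, the nearest-cell lookup, the additive fusion, and the function names `cartesian_to_polar_index` and `fuse` are all assumptions made for illustration; the abstract does not specify how the paper performs the fusion.

```python
import numpy as np

def cartesian_to_polar_index(n_xy=8, n_r=4, n_theta=8, max_range=10.0):
    """Precompute, for every Cartesian BEV cell, the polar cell holding its centre.

    All sizes are hypothetical. Returns (r_idx, t_idx, valid), each of shape
    (n_xy, n_xy); `valid` is False where the centre lies beyond max_range.
    """
    # Centres of an n_xy x n_xy grid covering [-max_range, max_range]^2.
    edges = np.linspace(-max_range, max_range, n_xy + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    x, y = np.meshgrid(centres, centres, indexing="ij")

    r = np.hypot(x, y)
    theta = np.arctan2(y, x)  # angle in [-pi, pi]

    r_idx = np.floor(r / max_range * n_r).astype(np.int64)
    t_idx = np.floor((theta + np.pi) / (2 * np.pi) * n_theta).astype(np.int64)
    t_idx = np.clip(t_idx, 0, n_theta - 1)  # theta == pi falls in the last bin

    valid = r_idx < n_r
    r_idx = np.where(valid, r_idx, 0)  # safe placeholder index for invalid cells
    return r_idx, t_idx, valid

def fuse(cart_feat, polar_feat, r_idx, t_idx, valid):
    """Dense fusion by gathering polar features onto the Cartesian grid.

    cart_feat: (C, n_xy, n_xy); polar_feat: (C, n_r, n_theta).
    Additive fusion is an assumption, not the paper's stated operator.
    """
    gathered = polar_feat[:, r_idx, t_idx]  # (C, n_xy, n_xy)
    gathered = gathered * valid             # zero out cells beyond the polar range
    return cart_feat + gathered
```

The index arrays are independent of the point cloud, which is the property that distinguishes this kind of grid-to-grid fusion from point-based fusion, where correspondences must be recomputed for every scan.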

AAAI 2025

PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation

vision for robotics

autonomous driving


technical paper

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Learning a perception and reasoning module for robotic assistants to plan steps to perform complex tasks based on natural language instructions often requires large free-form language annotations, especially for short high-level instructions. To reduce the cost of annotation, large language models (LLMs) are used as planners with little data. However, when elaborating the steps, even state-of-the-art LLM-based planners rely mostly on linguistic common sense, often neglecting the status of the environment at command reception, resulting in inappropriate plans. To generate plans grounded in the environment, we propose FLARE (Few-shot Language with environmental Adaptive Replanning Embodied agent), which improves task planning using both language command and environmental perception. As language instructions often contain ambiguities or incorrect expressions, we additionally propose to correct the mistakes using visual cues from the agent. Thanks to the visual cues, the proposed scheme needs only a few language pairs and significantly outperforms state-of-the-art approaches, more than doubling the success rate in unseen environments of the ALFRED benchmark (from 16.42% to 40.88%).

Multi-Modal Grounded Planning and Efficient Replanning for Learning Embodied Agents with a Few Examples

We study the problem of modeling a non-linear dynamical system when given a time series by deriving equations directly from the data. Despite the fact that time series data are given as input, models for dynamics and estimation algorithms that incorporate long-term temporal dependencies are largely absent from existing studies. In this paper, we introduce a latent state to allow time-dependent modeling and formulate this problem as a dynamics estimation problem in latent states. We face multiple technical challenges, including (1) modeling latent non-linear dynamics and (2) solving circular dependencies caused by the presence of latent states. To tackle these challenging problems, we propose a new method, Latent Non-Linear equation modeling (LaNoLem), that can model a latent non-linear dynamical system and a novel alternating minimization algorithm for effectively estimating latent states and model parameters. In addition, we introduce criteria to control model complexity without human intervention. Compared with the state-of-the-art model, LaNoLem achieves competitive performance for estimating dynamics while outperforming other methods in prediction.

Modeling Latent Non-Linear Dynamical System over Time Series

Visual Language Models such as CLIP excel in image recognition due to extensive image-text pre-training. However, applying CLIP inference to zero-shot classification, particularly for medical image diagnosis, faces two challenges: 1) the inadequacy of representing image classes solely with single category names; 2) the modal gap between the visual and text spaces generated by CLIP encoders. Despite attempts to enrich disease descriptions with large language models, the lack of class-specific knowledge often leads to poor performance. In addition, empirical evidence suggests that existing proxy learning methods for zero-shot image classification on natural image datasets exhibit instability when applied to medical datasets. To tackle these challenges, we introduce Knowledge Proxy Learning (KPL) to mine knowledge from CLIP. KPL is designed to leverage CLIP's multimodal understanding for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning. Specifically, KPL retrieves image-relevant knowledge descriptions from the constructed knowledge-enhanced base to enrich semantic text proxies. It then harnesses input images and these descriptions, encoded via CLIP, to stably generate multimodal proxies that boost the zero-shot classification performance. Extensive experiments conducted on both medical and natural image datasets demonstrate that KPL enables effective zero-shot image classification, outperforming all baselines. These findings highlight the great potential of this paradigm of mining knowledge from CLIP for medical image classification and broader areas. Code will be publicly available.

KPL: Training-Free Medical Knowledge Mining of Vision-Language Models

We investigate the problem of checking the consistency of qualitative preferences expressed in CP-theory based languages. The problem is PSPACE-complete even when the preferences are locally consistent or the preference variables have binary domains. We present a new sufficient condition for consistency of preferences and show that the condition can be verified in polynomial time in settings of practical relevance (locally consistent or binary domain preference variables). We show how the resulting sufficient condition can be used to efficiently identify a subset of outcomes that are non-dominated with respect to a set of qualitative preferences.

Checking Consistency of CP-Theory Preferences in Polynomial Time

Functional decomposition is the process of breaking down a function $f$ into a composition $f=g(f_1,\dots,f_k)$ of simpler functions $f_1,\dots,f_k$ belonging to some class $\mathcal{F}$. This fundamental notion can be used to model applications arising in a wide variety of contexts, ranging from machine learning to formal language theory. 
In this work, we study functional decomposition by leveraging the notion of functional reconfiguration. In this setting, constraints are imposed not only on the factor functions $f_1,\dots,f_k$ but also on the intermediate functions arising during the composition process.

We introduce a symbolic framework to address functional reconfiguration and decomposition problems. In our framework, functions arising during the reconfiguration process are represented symbolically, using ordered binary decision diagrams (OBDDs). The function $g$ used to specify the reconfiguration process is represented by a Boolean circuit $C$. Finally, the function class $\mathcal{F}$ is represented by a second-order finite automaton $\mathcal{A}$. Our main result states that functional reconfiguration, and hence functional decomposition, can be solved in fixed-parameter linear time when parameterized by the width of the input OBDD, by structural parameters associated with the reconfiguration circuit $C$, and by the size of the second-order finite automaton $\mathcal{A}$.

Symbolic Functional Decomposition: A Reconfiguration Approach

Today, deep neural networks are widely used because they can handle a variety of complex tasks. Their generality makes them very powerful tools in modern technology. However, deep neural networks are often overparameterized, and using these large models consumes substantial computational resources.

In this paper, we introduce a method called **T**ill the **L**ayers **C**ollapse (TLC), which compresses deep neural networks through the lenses of batch normalization layers. By reducing the depth of these networks, our method decreases deep neural networks' computational requirements and overall latency. We validate our method on popular models such as Swin-T, MobileNet-V2, and RoBERTa, across both image classification and natural language processing (NLP) tasks. The source code will be made publicly available upon the acceptance of this paper.

Till the Layers Collapse: Compressing a Deep Neural Network Through the Lenses of Batch Normalization Layers

We study Consistent Query Answering (CQA) over knowledge bases with existential rules. Specifically, we propose a novel framework for CQA that combines previous approaches, allowing for the simultaneous presence of both open and closed predicates, i.e. predicates interpreted under open- and closed-world assumption, respectively. We establish the data complexity of answering unions of conjunctive queries in such a new framework under the so-called AR semantics and for different classes of existential rules. We also provide new complexity results for the standard (i.e. non-inconsistency tolerant) query answering in the presence of both open and closed predicates. Our results show that, for certain classes of rules, the complexity of CQA matches that of non-inconsistency-tolerant query answering.

Consistent Query Answering over Existential Rules with Open and Closed Predicates

Language-conditioned robotic manipulation in unstructured environments presents significant challenges for intelligent robotic systems. Because of partial observation or imprecise action prediction, failures may be unavoidable for learned policies. Moreover, operational failures can lead to the robotic arm entering an untrained state, potentially causing destructive results. Consequently, the ability to detect and self-correct failures is crucial for the development of practical robotic systems. To address this challenge, we propose a foresight-driven failure detection and self-correction module for robot manipulation. By leveraging 3D Gaussian Splatting, we represent the current scene with multiple Gaussians. Subsequently, we train a prediction network to forecast the Gaussian representation of future scenes conditioned on planned actions. Failure is detected when the predicted future significantly deviates from the real observation after action execution. In such cases, the end-effector rolls back to the previous action to avoid an untrained state. Integrating this approach with the PerACT framework, we develop a self-correcting robot manipulation policy. Evaluations on ten RLBench tasks with 166 variations demonstrate the superior performance of the proposed method, which outperforms state-of-the-art methods by 12.0% success rate on average.
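The detect-then-roll-back loop described in this abstract can be sketched in a few lines. This is only a control-flow illustration under stated assumptions: scene representations are plain NumPy arrays rather than Gaussian-splatted scenes, the deviation metric is a mean squared error, and `predict`, `observe`, the threshold value, and both function names are hypothetical stand-ins that do not come from the paper.

```python
import numpy as np

def detect_failure(predicted, observed, threshold):
    """Flag a failure when the mean squared deviation exceeds the threshold.

    The metric and threshold are illustrative assumptions.
    """
    deviation = float(np.mean((np.asarray(predicted) - np.asarray(observed)) ** 2))
    return deviation > threshold

def execute_with_rollback(poses, predict, observe, threshold=0.05):
    """Step through planned poses, stopping at the last good pose on failure.

    `predict(prev, target)` stands in for a learned foresight model and
    `observe(target)` for the real observation after executing the action.
    Returns (executed_poses, failure_detected).
    """
    executed = [poses[0]]
    for target in poses[1:]:
        predicted = predict(executed[-1], target)
        observed = observe(target)
        if detect_failure(predicted, observed, threshold):
            # Roll back: remain at the previous pose instead of continuing.
            return executed, True
        executed.append(target)
    return executed, False
```

For example, if `observe` returns a scene that differs strongly from the prediction at the third pose, the function returns the first two poses together with a failure flag, which a higher-level policy could use to replan.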

Self-Correcting Robot Manipulation via Gaussian-Splatted Foresight

Recent advancements in question generation (QG) have been significantly propelled by reinforcement learning (RL). Although extensive reward functions have been designed to capture the attributes of ideal questions, their associated learning challenges, particularly in sample efficiency and diversity, remain underexplored. This paper introduces a bilevel policy decomposition (BPD) framework and a diversity-seeking RL (DSRL) objective to address these issues. The BPD framework employs two cascading policies that divide QG into more manageable sub-tasks: answer-centric summary generation and summary-augmented QG, facilitating exploration and accelerating policy learning. Concurrently, the DSRL objective preserves the inherent diversity property of QG by ensuring the bilevel policies align probabilistically with their reward functions, rather than merely maximizing returns. Our integrated approach, termed BPD-DSRL, demonstrates superior performance over existing baselines on multiple question quality and diversity metrics across various QG benchmarks.

Enhancing Question Generation through Diversity-Seeking Reinforcement Learning with Bilevel Policy Decomposition

Dynamic interacting system modeling is important for understanding and simulating real-world systems, e.g., meteorology and the spread of COVID. The system is typically described as a graph, where multiple objects dynamically interact with each other and evolve over time. In recent years, graph Ordinary Differential Equations (ODEs) have received increasing research attention. While achieving encouraging results, existing solutions prioritize the traditional Euclidean space and neglect the intrinsic geometry of the system and physical laws, e.g., the principle of entropy increase. These limitations motivate us to rethink the system dynamics from the fresh perspective of Riemannian geometry, and to pose a more realistic problem of physics-informed dynamic system modeling that considers the underlying geometry and physical laws for the first time. In this paper, we present a novel physics-informed Riemannian graph ODE for a wide range of entropy-increasing dynamic systems (termed Pioneer). In particular, we formulate a differential system on the Riemannian manifold, where a manifold-valued graph ODE is governed by the proposed constrained Ricci flow and a manifold-preserving Gyro-transform that is aware of the system geometry. Theoretically, we prove that entropy is non-decreasing under our formulation, in accordance with physical laws. Empirical results show the superiority of Pioneer on real datasets.
