United States

Compared with previous 3D reconstruction methods like NeRF, recent Generalizable 3D Gaussian Splatting (G-3DGS) methods demonstrate impressive efficiency even in the sparse-view setting. However, the promising reconstruction performance of existing G-3DGS methods relies heavily on accurate multi-view feature matching, which is quite challenging. Especially for scenes that have many non-overlapping areas between views and contain numerous similar regions, the matching performance of existing methods is poor and the reconstruction precision is limited. To address this problem, we develop a strategy that utilizes a predicted depth confidence map to guide accurate local feature matching. In addition, we propose to utilize the knowledge of existing monocular depth estimation models as a prior to boost the depth estimation precision in non-overlapping areas between views. Combining the proposed strategies, we present a novel G-3DGS method named TranSplat, which obtains the best performance on both the RealEstate10K and ACID benchmarks while maintaining competitive speed and presenting strong cross-dataset generalization ability.

AAAI 2025

TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

3d computer vision

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Accurate prediction of 3D semantic occupancy from 2D visual images is vital in enabling autonomous agents to comprehend their surroundings for planning and navigation. State-of-the-art methods typically employ fully supervised approaches, necessitating a huge labeled dataset acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotating process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, facilitating a more efficient training process. Our framework exhibits notable properties: (1) Generalizability, applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated through experiments on SemanticKITTI and NYUv2, wherein our method achieves up to 85% of the fully-supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption in camera-based systems for 3D semantic occupancy prediction.

Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are one of the most prominent approaches in this direction, but they risk losing the semantics of the context through intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique whose key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions paired with positives and negatives, where positives are sentences relevant to the question and negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93× faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: https://anonymous.4open.science/r/CPC-Compression.

Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
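
As a rough illustration of the sentence-level compression described in the abstract above, the sketch below scores each context sentence against the question and keeps the highest-scoring sentences that fit a word budget, then restores their original order. The scorer is a simple word-overlap stand-in, not the paper's learned context-aware encoder, and `relevance`, `compress`, and `max_words` are hypothetical names chosen for this example.

```python
import re


def relevance(question: str, sentence: str) -> float:
    """Stand-in scorer: fraction of question words found in the sentence.

    Assumption: CPC would use a trained context-aware sentence encoder here;
    word overlap is used only to keep this sketch self-contained.
    """
    q_words = set(re.findall(r"\w+", question.lower()))
    s_words = set(re.findall(r"\w+", sentence.lower()))
    if not q_words:
        return 0.0
    return len(q_words & s_words) / len(q_words)


def compress(question: str, sentences: list[str], max_words: int) -> list[str]:
    """Keep the most relevant sentences that fit within max_words."""
    # Rank sentence indices by descending relevance (ties keep input order).
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: relevance(question, sentences[i]),
        reverse=True,
    )
    kept, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            kept.append(i)
            used += n
    # Restore document order so the compressed context stays readable.
    return [sentences[i] for i in sorted(kept)]


context = [
    "The museum opened in 1902.",
    "Tickets cost twelve dollars for adults.",
    "The cafe serves coffee and pastries.",
]
compressed = compress("How much do tickets cost?", context, max_words=8)
print(compressed)
```

Operating on whole sentences rather than individual tokens means every retained sentence stays grammatically intact, which is the motivation the abstract gives for moving away from token-level removal.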

Occupancy prediction plays a pivotal role in autonomous driving (AD) due to its capabilities of fine-grained 3D perception and general object recognition. However, existing methods often incur high computational costs, which conflict with AD's real-time demand. To this end, we redirect the focus from accuracy only to both accuracy and efficiency. By conducting a head-to-head comparison of existing methods, we find it challenging to balance accuracy and efficiency. We identify a core issue for this challenge: the strong coupling between geometry and semantics. Specifically, the predicted geometric structure (e.g., depth) guides the projection of 2D image features into 3D voxel space, which significantly affects feature discriminability and subsequent semantic learning. To address this issue, we focus on two key aspects: model design and learning strategies. 1) For model design, we propose a dual-branch network that decouples the representation of geometry and semantics. The voxel branch utilizes a novel re-parameterized large-kernel 3D convolution to refine geometric structure efficiently, while the BEV branch employs temporal fusion and BEV encoding for efficient semantic learning. 2) For learning strategies, we propose to separate geometric learning from semantic learning by the mixup of ground-truth and predicted depths. Our method achieves 39.4% mIoU at 20 FPS on the Occ3D-nuScenes benchmark, showcasing a state-of-the-art balance between accuracy and efficiency. Codes are available in supplementary materials.

Achieving Speed-Accuracy Balance in Vision-based 3D Occupancy Prediction via Geometric-Semantic Disentanglement
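
The mixup of ground-truth and predicted depths mentioned in the abstract above can be pictured as an element-wise convex combination. This is only one illustrative reading: the linear mixing rule, the coefficient `lam`, and the function name `mix_depths` are assumptions for this sketch, and the paper's actual formulation or schedule may differ.

```python
def mix_depths(gt_depth, pred_depth, lam):
    """Blend ground-truth and predicted depth values element-wise.

    lam = 1.0 returns the ground truth; lam = 0.0 returns the prediction.
    Assumption: a linear blend; the abstract does not specify the rule.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if len(gt_depth) != len(pred_depth):
        raise ValueError("depth lists must have the same length")
    return [lam * g + (1.0 - lam) * p for g, p in zip(gt_depth, pred_depth)]


gt = [10.0, 20.0, 30.0]
pred = [12.0, 18.0, 33.0]
mixed = mix_depths(gt, pred, lam=0.5)
print(mixed)
```

Under this reading, a larger `lam` gives the semantic features more reliable geometry to project with, while a smaller `lam` exposes them to the prediction errors they would encounter at inference time.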

Sequential Recommenders (SRs) are trained to predict the next item as the target given its preceding items as the input, assuming that every input-target pair is matched and is reliable for training. However, users can be induced by external distractions (e.g. friends’ recommendations) to click on items inconsistent with their true preferences, resulting in unreliable training instances with mismatched input-target pairs. To resist unreliable data, researchers attempt to develop Robust SRs (RSRs). However, our data analysis unveils that existing RSRs are data-driven, which tend to classify instances involving frequently co-occurred items as reliable. Yet, for most instances formed by infrequently co-occurred items, existing RSRs are uncertain about their reliability. To fill this gap, we propose a generic framework – LLM4RSR (Large Language Models for Robust Sequential Recommendation) to semantically complement data-driven RSRs by correcting uncertain instances into reliable ones based on LLMs’ semantic comprehension of items beyond co-occurrence. In this way, RSRs can be re-trained with the corrected data for better accuracy. This is a selective knowledge distillation procedure, where the LLM acts as a teacher guiding student RSRs via uncertain instances. To align LLMs with the data correction task and mitigate inherent hallucinations, we equip the LLM with profile, plan, and memory modules, which are automatically optimized via textual gradient descent, eliminating the need for human effort and expertise. Experiments on four real-world datasets spanning eight backbones verify the generality, effectiveness, and efficiency of LLM4RSR.

LLM4RSR: Large Language Models as Data Correctors for Robust Sequential Recommendation

Data-free quantization (DFQ) is a technique that creates a lightweight network from its full-precision counterpart without the original training data, often through a synthetic dataset. Although several DFQ methods have been proposed for vision transformer (ViT) architectures, they fail to achieve efficacy in low-bit settings. Examining the existing methods, we observe that their synthetic data produce misaligned attention maps, while those of the real samples are highly aligned. From this observation, we find that aligning attention maps of synthetic data helps improve the overall performance of quantized ViTs. Motivated by this finding, we devise MimiQ, a novel DFQ method designed for ViTs that enhances inter-head attention similarity. First, we generate synthetic data by aligning head-wise attention outputs from each spatial query patch. Then, we align the attention maps of the quantized network to those of the full-precision teacher by applying head-wise structural attention distillation. The experimental results show that the proposed method significantly outperforms baselines, setting a new state-of-the-art performance for data-free ViT quantization.

MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity

Significant advancements have been achieved in understanding the poses and interactions of two hands manipulating an object. The emergence of augmented reality (AR) and virtual reality (VR) technologies has heightened the demand for real-time performance in these applications. However, current state-of-the-art models often achieve promising results at the expense of substantial computational overhead. In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. We first limit the number of queries and decoders to meet the efficiency requirement. Given a limited number of queries and decoders, we propose to optimize the queries taken as input to the Transformer decoder to secure good accuracy: (1) we divide queries into three types (a left hand query, a right hand query and an object query) and enhance query features (2) by using the contact information between hands and an object and (3) by using a two-stage update of enhanced image and query features with respect to one another. With the proposed methods, we achieve real-time pose estimation performance using just 108 queries and 1 decoder (53.5 FPS on an RTX 3090 Ti GPU). Our method excels in accuracy, surpassing state-of-the-art results on the H2O dataset by 17.6% (left hand), 22.8% (right hand), and 27.2% (object), as well as on the FPHA dataset by 5.3% (right hand) and 10.4% (object). Additionally, it sets the state-of-the-art in interaction recognition, maintaining real-time efficiency with an off-the-shelf action recognition module.

QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects

3D Gaussian Splatting (3DGS) has shown promising performance in novel view synthesis. Previous methods adapt it to obtain surfaces of either individual 3D objects or limited scenes. In this paper, we make the first attempt to tackle the challenging task of large-scale scene surface reconstruction. This task is particularly difficult due to the high GPU memory consumption, different levels of detail for geometric representation, and noticeable inconsistencies in appearance. To this end, we propose GigaGS, the first work for high-quality surface reconstruction for large-scale scenes using 3DGS. GigaGS first applies a partitioning strategy based on the mutual visibility of spatial regions, which effectively groups cameras for parallel processing. To enhance the quality of the surface, we also propose novel multi-view photometric and geometric consistency constraints based on Level-of-Detail representation. In doing so, our method can reconstruct detailed surface structures. Comprehensive experiments are conducted on various datasets. The consistent improvement demonstrates the superiority of GigaGS. All code and data will be made public upon acceptance.

GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction

Editing videos with textual guidance has garnered popularity due to its streamlined process, which requires users only to edit the text prompt corresponding to the source video. Recent studies have explored and exploited large-scale text-to-image diffusion models for text-guided video editing, resulting in remarkable video editing capabilities. However, they may still suffer from limitations such as mislocated objects and incorrect numbers of objects. Therefore, the controllability of video editing remains a formidable challenge. In this paper, we aim to address the above limitations by proposing a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. Specifically, to align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose a Re-Attentional Diffusion (RAD) to refocus the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, resulting in a spatially location-aligned and semantically high-fidelity manipulated video. In particular, to faithfully preserve the invariant region content with fewer border artifacts, we propose an Invariant Region-guided Joint Sampling (IRJS) strategy to mitigate the intrinsic sampling errors w.r.t. the invariant regions at each denoising timestep and constrain the generated content to be harmonized with the invariant region content. Experimental results verify that ReAtCo consistently improves the controllability of video diffusion editing and achieves superior video editing performance.

Re-Attentional Controllable Video Diffusion Editing

In recent years, multi-hop reasoning has been widely studied for knowledge graph (KG) reasoning due to its efficacy and interpretability. However, previous multi-hop reasoning approaches are subject to two primary shortcomings. First, agents struggle to learn effective and robust policies in the early phase due to sparse rewards. Second, these approaches often falter on specific datasets like sparse knowledge graphs, where agents are required to traverse lengthy reasoning paths. To address these problems, we propose a multi-hop reasoning model with dual agents based on hierarchical reinforcement learning (HRL), which is named FULORA. FULORA tackles the above reasoning challenges by eFficient GUidance-ExpLORAtion between dual agents. The high-level agent walks on the simplified knowledge graph to provide stage-wise hints for the low-level agent walking on the original knowledge graph. In this framework, the low-level agent optimizes a value function that balances two objectives: (1) maximizing return, and (2) integrating efficient guidance from the high-level agent. Experiments conducted on three real-world knowledge graph datasets demonstrate that FULORA outperforms RL-based baselines, especially in the case of long-distance reasoning.

Walk Wisely on Graph: Knowledge Graph Reasoning with Dual Agents via Efficient Guidance-Exploration

Camera-based Bird's Eye View (BEV) perception models receive increasing attention for their crucial role in autonomous driving systems, which also raises concerns about their robustness and reliability. While only a few works have investigated the effects of randomly generated semantic perturbations, aka natural corruptions, on the multi-view BEV detection task, we develop a black-box robustness evaluation framework that adversarially optimises three common semantic perturbations (geometric transformation, colour shifting, and motion blur) to deceive BEV models, which is the first approach of its kind in this emerging field. To address the challenge posed by optimising the semantic perturbation, we design a smoothed, distance-based surrogate function to replace the mAP metric and introduce SimpleDIRECT, a deterministic optimisation algorithm that utilises observed slopes to guide the optimisation process. By comparing with randomised perturbation and two optimisation baselines, we demonstrate the effectiveness of the proposed framework. Additionally, we provide a benchmark on the semantic robustness of ten recent BEV models. The results reveal that PolarFormer, which emphasises geometric information from multi-view images, exhibits the highest robustness, whereas BEVDet is fully compromised, with its precision reduced to zero.
