United States

Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making, without providing explicit guidance on traffic rules and driving skills, which are critical aspects directly related to driving safety. To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice. In particular, we conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provided extensive analysis. We also fine-tuned popular models, achieving notable performance improvements, which further validate the significance of our dataset.

AAAI 2025

Can LVLMs Obtain a Driver’s License? A Benchmark towards Reliable AGI for Autonomous Driving

language and vision


poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.





Long-tailed (LT) data distribution is common in multi-label image classification (MLC) and can significantly impact the performance of classification models. One reason is the challenge of learning unbiased instance representations (i.e. features) for imbalanced datasets. Additionally, the co-occurrence of head/tail classes within the same instance, along with complex label dependencies, introduces further challenges. In this work, we delve into this problem through the lens of neural collapse (NC). NC refers to a phenomenon where the last-layer features and classifier of a deep neural network model exhibit a simplex Equiangular Tight Frame (ETF) structure during its terminal training phase. This structure creates an optimal linearly separable state. However, this phenomenon typically occurs in balanced datasets but rarely applies to the typical imbalanced problem. To induce NC properties under Long-tailed multi-label classification (LT-MLC) conditions, we propose an approach named MLC-NC, which aims to learn high-quality data representations and improve the model’s generalization ability. Specifically, MLC-NC accounts for the fact that different labels correspond to different feature parts located in images. MLC-NC extracts class-wise features from each instance through a cross-attention mechanism. To guide the features toward the ETF structure, we introduce visual-semantic feature alignment with a fixed ETF structured label embedding, which helps to learn evenly distributed class centers. To reduce within-class feature variation, we introduce collapse calibration within a lower-dimensional feature space. To mitigate classification bias, we concatenate features and feed them into a binarized fixed ETF classifier. As an orthogonal approach to existing methods, MLC-NC can be seamlessly integrated into various frameworks. Extensive experiments on widely-used benchmarks demonstrate the effectiveness of our method.

MLC-NC: Long-Tailed Multi-Label Image Classification Through the Lens of Neural Collapse
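The simplex Equiangular Tight Frame (ETF) mentioned in the abstract above has a standard closed form: for K classes, the class vectors are the columns of sqrt(K/(K-1)) · (I − (1/K)·11ᵀ), optionally rotated into a higher-dimensional feature space. The following is a minimal sketch of that geometry only, assuming the feature dimension equals the number of classes and omitting the rotation. The function names `simplex_etf` and `cosine` are illustrative and are not taken from the paper, and this is not the MLC-NC implementation.

```python
import math


def simplex_etf(num_classes):
    """Return K vectors of length K that form a simplex ETF.

    Assumption for this sketch: the feature dimension equals the number
    of classes, so no extra rotation matrix is applied.
    """
    if num_classes < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Entry (i, j) of scale * (I - (1/K) * ones). The matrix is symmetric,
    # so its rows and columns give the same set of vectors.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    vectors = simplex_etf(4)
    # Every pair of distinct class vectors has the same cosine, -1/(K-1).
    print(cosine(vectors[0], vectors[1]))
```

Each vector has unit norm and every pair of distinct vectors has cosine similarity −1/(K−1), which is the "evenly distributed class centers" property that a fixed ETF classifier or label embedding relies on.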

Human Activity Recognition (HAR) aims to recognize activities by training models on massive sensor data. In real-world deployment, a crucial aspect of HAR that has been largely overlooked is that the test sets may have different distributions from training sets due to inter-subject variability including age, gender, behavioral habits, etc., which leads to poor generalization performance. One promising solution is to learn domain-invariant representations to enable a model to generalize on an unseen distribution. However, most existing methods only consider the feature-invariance of the penultimate layer for domain-invariant learning, which leads to suboptimal results. In this paper, we propose a Categorical Concept Invariant Learning (CCIL) framework for generalizable activity recognition, which introduces a concept matrix to regularize the model in the training stage by simultaneously concentrating on feature-invariance and logit-invariance. Our key idea is that the concept matrix for samples belonging to the same activity category should be similar. Extensive experiments on four public HAR benchmarks demonstrate that our CCIL substantially outperforms the state-of-the-art approaches under cross-person, cross-dataset, cross-position, and one-person-to-another settings. Code will be released.

Generalizable Sensor-Based Activity Recognition via Categorical Concept Invariant Learning

With the emergence of vision-language pre-trained models such as CLIP, textual prompts have recently been introduced into re-identification (Re-ID) tasks to obtain more robust multimodal information. However, most textual descriptions used in vehicle Re-ID tasks contain only identity index words, with no specific words describing vehicle view information, which makes them difficult to apply widely to vehicle Re-ID tasks with view variations. This motivates us to propose a CLIP-driven view-aware prompt learning framework for unsupervised vehicle Re-ID. We first design a learnable textual prompt template called view-aware context optimization (ViewCoOp) based on dynamic multi-view word embeddings, which can fully capture the proportion and position encoding of each view in the whole vehicle body region. Subsequently, a cross-modal mutual graph is constructed to explore inter-modal and intra-modal connections. Each sample is treated as a graph node, for which textual features are extracted based on ViewCoOp together with the visual features of images. Moreover, the inter-cluster and intra-cluster correlations in the bimodal clustering results are leveraged to determine the connectivity between graph node pairs. Lastly, the proposed cross-modal mutual graph method utilizes supervised information from the bimodal gap to directly fine-tune the image encoder of CLIP for downstream unsupervised vehicle Re-ID tasks. Extensive experiments verify that the proposed method is capable of effectively obtaining cross-modal description ability from multiple views.

CLIP-driven View-aware Prompt Learning for Unsupervised Vehicle Re-identification

The rise in sophisticated image forgery techniques, driven by advancements in image editing and generation, has posed new security challenges. Traditional methods, designed for specific tampering artifacts, struggle with out-of-distribution image forgery detection. In this paper, we propose a shift in paradigm, placing greater emphasis on the universal characteristics of authentic images, as opposed to solely focusing on specific forgery signals. We introduce an enhancement to the Masked Autoencoder (MAE), aptly termed the Forgery MAE (FMAE). This modification retains the inherent characteristics of natural images while integrating multi-source forgery information. Our implementation involves applying the lottery ticket hypothesis during pre-training to identify forgery-sensitive parameters, followed by their sparse fine-tuning to target the forgery detection and localization task. Concurrently, we develop a "mixture of experts" noise extractor to compile multi-source forgery data. Our FMAE effectively extracts forgery features and shows strong resilience against unseen forgeries. Extensive experiments across multiple datasets confirm our method's superior accuracy and generalization capability over existing techniques.

A Lottery Ticket Hypothesis Approach with Sparse Fine-tuning and MAE for Image Forgery Detection and Localization

Gigapixel image analysis, particularly for whole slide images (WSIs), often relies on multiple instance learning (MIL). Under the paradigm of MIL, patch image representations are extracted and then fixed during the training of the MIL classifiers for efficiency. However, the invariance of representations makes it difficult to perform data augmentation for WSI-level model training, which significantly limits the performance of the downstream WSI analysis. The current data augmentation methods for gigapixel images either introduce additional computational costs or result in a loss of semantic information, and therefore struggle to meet the efficiency and stability requirements of WSI model training. In this paper, we propose a Promptable Representation Distribution Learning framework (PRDL) for both patch-level representation learning and WSI-level data augmentation. Meanwhile, we explore the use of prompts to guide data augmentation in feature space, which achieves promptable data augmentation for training robust WSI-level models. The experimental results demonstrate that the proposed method stably outperforms state-of-the-art methods.

Promptable Representation Distribution Learning and Data Augmentation for Gigapixel Histopathology WSI Analysis

Explanation for deep learning models on time series classification (TSC) tasks is an important and challenging problem. Most existing approaches use attribution maps to explain outcomes. However, they have limitations in generating explanations that are well-aligned with human perception. Recently, LIME-based approaches have provided more meaningful explanations by segmenting the data. However, these approaches still suffer from the processes of segment generation and evaluation. In this paper, we propose a novel time series explanation approach called InteDisUX to overcome these problems. Our technique utilizes the segment-level integrated gradient (SIG) to calculate importance scores for an initial set of small, equal segments, and then iteratively merges two consecutive segments to create better explanations under a unique greedy strategy guided by two newly proposed metrics, discrimination gain and faithfulness gain. In this way, our method does not depend on predefined segments as other methods do, while being robust to the instability, poor local fidelity, and data imbalance that affect LIME-based methods. Furthermore, InteDisUX is the first work to use the model's information to improve the set of segments for time series explanation. Extensive experiments show that our method outperforms LIME-based ones on 12 datasets in terms of faithfulness and on 8 of 12 datasets in terms of robustness.

InteDisUX: Intepretation-Guided Discriminative User-Centric Explanation for Time Series
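To make the idea of segment-level attribution in the abstract above more concrete, here is a minimal sketch of standard integrated gradients approximated with a midpoint Riemann sum, followed by summing the pointwise attributions inside equal-length segments. Summation is only one plausible reading of "segment-level integrated gradient"; the paper's exact definition and its greedy merging criteria are not given in the abstract and are not reproduced here. The toy quadratic model and the names `toy_model`, `toy_grad`, `integrated_gradients`, and `segment_scores` are hypothetical.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn maps an input list to the list of partial derivatives of the
    model output with respect to each input element.
    """
    totals = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        for i, g in enumerate(grads):
            totals[i] += g
    # Scale the averaged gradient by the input difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def segment_scores(attributions, segment_length):
    """Sum pointwise attributions inside consecutive equal-length segments."""
    return [
        sum(attributions[i:i + segment_length])
        for i in range(0, len(attributions), segment_length)
    ]


# Hypothetical toy model: a weighted sum of squares over a short series.
WEIGHTS = [1.0, 2.0, 0.5, 3.0]


def toy_model(x):
    return sum(w * v * v for w, v in zip(WEIGHTS, x))


def toy_grad(x):
    return [2.0 * w * v for w, v in zip(WEIGHTS, x)]


if __name__ == "__main__":
    series = [1.0, -1.0, 2.0, 0.5]
    zeros = [0.0] * len(series)
    attributions = integrated_gradients(toy_grad, series, zeros)
    print(segment_scores(attributions, 2))
```

A useful sanity check is the completeness property of integrated gradients: the attributions should sum to the difference between the model output at the input and at the baseline, so the segment scores partition that difference.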

Federated Semi-Supervised Learning (FSSL) has emerged as a crucial topic in medical image analysis, allowing multiple medical institutions to collaboratively train a global model using limited labeled data. However, existing FSSL methods focus solely on an effective combination of federated learning and semi-supervised learning, ignoring the heterogeneity of client data and the inadaptability of semi-supervised methods in diverse environments, which leads to knowledge bias in local models and impedes stable convergence. To this end, we explore the application of personalization in FSSL and propose a novel dual-calibrated co-training framework. To adapt to the unique feature distribution of client data, we consider collaborative relationships among clients to aggregate a personalized model for each client. We further build a dual-student architecture with the personalized model and private local model on the client side, which encourages model disagreement for co-training while enhancing participant privacy. Most importantly, we design dual calibration strategies that adaptively optimize the model: Local calibration improves the boundary discrimination of the local model by dynamically replacing pseudo-label boundary patches; Global calibration corrects model direction based on the real-time perception of the biases between local dual-student models. Experimental results show the effectiveness of our method on a private medical dataset and two public medical datasets. The code will be available online.

Dual-calibrated Co-training Framework for Personalized Federated Semi-Supervised Medical Image Segmentation

Neural Radiance Field (NeRF)-based volumetric video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences that provide audiences with unprecedented immersion and interactivity. However, the substantial data volumes pose significant challenges for storage and transmission. Existing solutions typically optimize NeRF representation and compression independently or focus on a single fixed rate-distortion (RD) tradeoff. In this paper, we propose VRVVC, a novel end-to-end joint optimization variable-rate framework for volumetric video compression that achieves variable bitrates using a single model while maintaining superior RD performance. Specifically, VRVVC introduces a compact tri-plane implicit residual representation for inter-frame modeling of long-duration dynamic scenes, effectively reducing temporal redundancy. We further propose a variable-rate residual representation compression scheme that leverages a learnable quantization and a tiny MLP-based entropy model. This approach enables variable bitrates through the utilization of predefined Lagrange multipliers to manage the quantization error of all latent representations. Finally, we present an end-to-end progressive training strategy combined with a multi-rate-distortion loss function to optimize the entire framework. Extensive experiments demonstrate that VRVVC achieves a wide range of variable bitrates within a single model and surpasses the RD performance of existing methods across various datasets.

VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression

Textual Graphs (TGs) present a graph-based representation of textual data and find wide applications in real-world scenarios, such as citation networks, knowledge graphs, and social networks. While the traditional "pre-train, fine-tune" framework effectively addresses tasks with abundant labeled data, it falls short in scenarios with limited resources or that require zero-shot learning capabilities, particularly in low-resource textual graph node classification. Additionally, prevalent approaches that convert text nodes into shallow or manually engineered features fail to capture the rich semantic nuances within the text. Conventional methods also often neglect the fusion of semantic and topological information, resulting in suboptimal model learning. To overcome these challenges, we propose a novel method for low-resource textual graph node classification based on large language models, i.e., Textual graph learning with semantic and topological awareness (TGL$_{sta}$), which comprehensively explores the semantic information, near-neighborhood information, and topology information in textual graphs, these being the most important information sources that textual graphs contain. Graph prompt tuning for both zero- and few-shot textual graph node classification is further introduced. Through extensive experimentation across various benchmarks, our approach consistently outperforms existing methods, offering promising advancements in the field.

TGLsta: Low-resource Textual Graph Learning with Semantic and Topological Awareness via LLMs

We propose a 2D simulation system for multi-agent collective construction (MACC) based on simple line-following intelligent machines (SLIM), which are small differential-drive mobile robots. Our MACC-SLIM system alleviates the high upfront cost of implementing MACC on real hardware. Our system builds upon widely available resources, namely a standard LCD screen and commodity mobile robots, allowing researchers and schools easier access to MACC hardware implementation. We test the system on plans generated by an optimal state-of-the-art MACC algorithm, demonstrating that non-negligible synchronization delays remain. The MACC-SLIM system allows us to observe bottlenecks, parallelism, and possible execution failures of plans generated by the MACC algorithms.
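For readers unfamiliar with differential-drive robots such as those mentioned above, the following is a minimal sketch of the standard kinematic model, integrated with a simple forward Euler step. It is an illustration of the general robot type only and is not the MACC-SLIM simulator; the function name `differential_drive_step` and parameters such as `wheel_base` are assumptions introduced for this example.

```python
import math


def differential_drive_step(x, y, heading, v_left, v_right, wheel_base, dt):
    """Advance a differential-drive pose by one forward Euler step.

    x, y       : position in the plane
    heading    : orientation in radians
    v_left     : linear speed of the left wheel
    v_right    : linear speed of the right wheel
    wheel_base : distance between the two wheels (hypothetical parameter)
    dt         : time step
    """
    # Forward speed is the mean wheel speed; turn rate comes from the
    # difference between the wheel speeds divided by the wheel separation.
    v = 0.5 * (v_left + v_right)
    omega = (v_right - v_left) / wheel_base
    x_new = x + v * math.cos(heading) * dt
    y_new = y + v * math.sin(heading) * dt
    heading_new = heading + omega * dt
    return x_new, y_new, heading_new


if __name__ == "__main__":
    pose = (0.0, 0.0, 0.0)
    # Drive with a slightly faster right wheel, which curves to the left.
    for _ in range(10):
        pose = differential_drive_step(*pose, 0.9, 1.0, 0.2, 0.05)
    print(pose)
```

Equal wheel speeds move the robot in a straight line, and opposite wheel speeds rotate it in place, which is why small timing differences between wheels or between robots can accumulate into the kind of synchronization delays the abstract reports.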
