A major challenge in Reinforcement Learning (RL) is the difficulty of learning an optimal policy from sparse rewards. Prior works enhance online RL with conventional Imitation Learning (IL) via a handcrafted auxiliary objective, at the cost of restricting the RL policy to be sub-optimal when the offline data is generated by a non-expert policy. Instead, to better leverage valuable information in offline data, we develop Generalized Imitation Learning from Demonstration (GILD), which meta-learns an objective that distills knowledge from offline data and instills intrinsic motivation towards the optimal policy. Distinct from prior works that are exclusive to a specific RL algorithm, GILD is a flexible module intended for diverse vanilla off-policy RL algorithms. In addition, GILD introduces no domain-specific hyperparameter and minimal increase in computational cost. In four challenging MuJoCo tasks with sparse rewards, we show that three RL algorithms enhanced with GILD significantly outperform state-of-the-art methods. Our code and data are available at https://anonymous.4open.science/r/GILD-087E/.

AAAI 2025

Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data

reinforcement learning

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Location-Based Social Networks (LBSNs) offer a rich dataset of user activity at Points-of-Interest (POIs), making next POI recommendation a key task. Traditional algorithms search over broad scopes, which hurts recommendation accuracy. Users tend to visit nearby POIs and show temporal concentration in their activities, reflecting personalized spatio-temporal clustering. However, individual user data may be insufficient to capture these clustering effects for personalized recommendations. In this paper, we propose an integrated Personalized Spatio-Temporal Clustering Model (iPCM) for next POI recommendation. The model learns the personalized spatio-temporal clustering effect by using global historical trajectory data in conjunction with user feature embeddings. It integrates the personalized spatio-temporal clustering features with the user's trajectory and produces the POI recommendation through a Transformer encoder and an MLP decoder. To enhance prediction accuracy, we add a probability adjustment module. Experimental results on multiple datasets show that, with the help of personalized spatio-temporal clustering, the proposed iPCM outperforms existing methods on various evaluation metrics.

Integrating Personalized Spatio-Temporal Clustering for Next POI Recommendation

3D single object tracking (3D SOT) in LiDAR point clouds is essential for autonomous driving. Most existing 3D SOT methods focus on clear weather, where point clouds are better defined. However, adverse weather conditions lead to sparser and noisier point clouds, significantly degrading tracking performance and posing safety risks. In this study, we introduce UAWTrack, a universal 3D SOT model designed to perform effectively across diverse real-world weather conditions. UAWTrack comprises three key modules: 1) Voxel Feature Extraction, which mitigates the perturbations in point clouds caused by adverse weather; 2) Motion-centric Spatial-temporal Aggregation and Motion-guided Feature Fusion, which capture motion cues and sample dense BEV motion features to address sparsity; and 3) Weather-Specific Tracker, which efficiently handles tracking in various weather conditions. To address the lack of benchmarks for 3D SOT in adverse weather, we simulate physically valid adverse weather conditions on the KITTI and NuScenes datasets, creating two benchmarks: KITTI-A and NuScenes-A. Extensive experiments demonstrate that UAWTrack achieves state-of-the-art performance under all weather conditions. The code will be released.

UAWTrack: Universal 3D Single Object Tracking in Adverse Weather

Visual object tracking is crucial for unmanned aerial vehicles (UAVs). Despite substantial progress, most existing UAV trackers are designed for well-conditioned daytime data; in challenging weather conditions, e.g., foggy or nighttime environments, the large domain gap leads to significant performance degradation. To address this issue, we propose a novel robust UAV tracker termed LVPTrack, which conducts high-quality label-aligned visual prompt tuning to adapt to various challenging weather conditions. Specifically, we first synthesize sequential foggy and nighttime video frames to assist model training. A domain-adaptive teacher-student network is utilized to distill the hierarchical visual semantics of the target objects in cross-domain scenarios. Then we propose a target-aware pseudo-label voting (PLV) strategy to alleviate the target-level misalignment between the two domains. Furthermore, we propose a dynamic aggregated prompt (DAP) module to handle the significant appearance variation of the target object in challenging scenarios. Extensive experiments demonstrate that our tracker achieves superior performance over existing state-of-the-art UAV trackers.

LVPTrack: High Performance Domain Adaptive UAV Tracking with Label Aligned Visual Prompt Tuning

Few-shot classification (FSC) is a fundamental yet challenging task in computer vision that involves recognizing novel classes from limited data. While previous methods have focused on enhancing visual features or incorporating additional modalities, Large Vision Language Models (LVLMs) offer a promising alternative due to their rich knowledge and strong visual perception. However, LVLMs risk learning specific response formats rather than effectively extracting useful information from support data in FSC tasks.
In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and the presence of severe positional biases.
To tackle these challenges, we adopt a meta-learning strategy that teaches models to "learn to learn". By constructing a rich set of meta-tasks for instruction fine-tuning, we enhance the ability of LVLMs to extract information from few-shot support data for classification.
Additionally, we further boost LVLMs' few-shot learning capabilities through label augmentation and candidate selection in the fine-tuning and inference stages, respectively. Label augmentation is implemented via a character perturbation strategy to ensure the model focuses on support information. Candidate selection leverages attribute descriptions to filter out unreliable candidates and simplify the task.
Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy proves beneficial for training-free LVLMs.
Our code will be made available upon acceptance of the paper.

Making Large Vision Language Models to Be Good Few-Shot Learners
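
The label augmentation step mentioned in the abstract above can be illustrated with a small sketch. The abstract only says that a character perturbation strategy is used, so everything concrete here is an assumption: the function name `perturb_label`, the `rate` parameter, and the choice of replacing letters with random lowercase letters are all hypothetical. The idea is that a perturbed class name no longer matches anything the model memorized, so it has to read the label from the support examples.

```python
import random
import string


def perturb_label(label, rng, rate=0.3):
    """Replace roughly `rate` of the letters in `label` with a different
    random lowercase letter. Hypothetical stand-in for the paper's strategy."""
    out = []
    for ch in label:
        if ch.isalpha() and rng.random() < rate:
            # Exclude the current letter so a perturbed position always changes.
            choices = [c for c in string.ascii_lowercase if c != ch.lower()]
            out.append(rng.choice(choices))
        else:
            # Spaces, digits, and unselected letters are kept as-is.
            out.append(ch)
    return "".join(out)


original = "golden retriever"
# rate=1.0 perturbs every letter; rate=0.0 leaves the label untouched.
surrogate = perturb_label(original, random.Random(0), rate=1.0)
unchanged = perturb_label(original, random.Random(0), rate=0.0)
```

In a meta-task, the same surrogate string would replace the class name in both the support examples and the answer options, so the mapping stays consistent within one task.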

Unsupervised domain adaptation (UDA) refers to a domain adaptation framework in which a learning model is trained on labeled samples from the source domain and unlabeled samples from the target domain. The dominant existing methods in the field, which rely on the classical covariate shift assumption to learn domain-invariant feature representations, have yielded suboptimal performance under label distribution shift. In this paper, we propose a novel Conditional Adversarial SUpport ALignment (CASUAL) method that minimizes the conditional symmetric support divergence between the feature representation distributions of the source and target domains, aiming at a more discriminative representation for the classification task. We also introduce a novel theoretical target risk bound, which justifies the merits of aligning the supports of conditional feature distributions compared to the existing marginal support alignment approach in UDA settings. We then provide a complete training process in which the optimization objectives are derived directly from the proposed target risk bound. Our empirical results demonstrate that CASUAL outperforms other state-of-the-art methods on different UDA benchmark tasks under different label shift conditions.

CASUAL: Conditional Support Alignment for Domain Adaptation with Label Shift

The introduction of the Feature Pyramid Network (FPN) has significantly improved object detection performance. However, substantial challenges persist when detecting tiny objects. The features of tiny objects occupy a very small proportion of the feature maps. Although FPN integrates multi-scale features, it does not directly enhance or enrich the features of tiny objects. Furthermore, FPN lacks spatial perception ability. To address these issues, we propose a novel High Frequency and Spatial Perception Feature Pyramid Network (HS-FPN) containing two innovative modules. First, we design a high frequency perception module (HFP) that generates high frequency responses through high pass filters. These high frequency responses are used as mask weights from both spatial and channel perspectives to enrich and highlight the features of tiny objects in the original feature maps. Second, we develop a spatial dependency perception module (SDP) to capture the spatial dependencies that FPN lacks. Our experiments demonstrate that detectors based on HS-FPN exhibit competitive advantages over state-of-the-art models on the AI-TOD dataset for tiny object detection.

HS-FPN: High Frequency and Spatial Perception FPN for Tiny Object Detection
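
To make the high-frequency idea in the HS-FPN abstract concrete, here is a minimal single-channel sketch. The abstract does not say which high pass filter is used, so this assumes a simple spatial one: the feature map minus its 3x3 local mean. The helper names `box_blur`, `high_pass`, and `spatial_mask` are hypothetical, and only the spatial masking perspective is shown, not the channel one.

```python
def box_blur(image):
    """3x3 mean filter with edge replication (a simple low-pass filter)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Clamp coordinates so border pixels reuse their nearest neighbor.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += image[yy][xx]
            out[y][x] = total / 9.0
    return out


def high_pass(image):
    """High-frequency response: the input minus its low-pass version."""
    low = box_blur(image)
    return [[v - l for v, l in zip(row, low_row)]
            for row, low_row in zip(image, low)]


def spatial_mask(image):
    """Scale each position by 1 + its normalized high-frequency magnitude."""
    hp = high_pass(image)
    peak = max(abs(v) for row in hp for v in row)
    if peak == 0.0:
        return [row[:] for row in image]  # flat map: nothing to highlight
    return [[v * (1.0 + abs(r) / peak) for v, r in zip(row, hp_row)]
            for row, hp_row in zip(image, hp)]


# A flat 5x5 feature map with one small bright spot standing in for a tiny object.
feature_map = [[1.0] * 5 for _ in range(5)]
feature_map[2][2] = 5.0
response = high_pass(feature_map)
enhanced = spatial_mask(feature_map)
```

The flat background produces a zero response while the isolated spot produces the largest one, so the mask amplifies exactly the position a tiny object would occupy.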

Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain the consistent appearance of each subject across images. To address these issues, we pioneer a novel task, *Layout-to-Consistent-Image* (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of *dual energy guidance* with optimization in a dual semantic-latent space and thus propose a training-free pipeline, __SpotActor__, which features a layout-conditioned backward update stage and a consistent forward sampling stage. In the backward stage, we introduce a nuanced layout energy function to mimic the attention activations with a sigmoid-like objective. In the forward stage, we design *Regional Interconnection Self-Attention* (RISA) and *Semantic Fusion Cross-Attention* (SFCA) mechanisms that allow mutual interactions across images. To evaluate performance, we present __ActorBench__, a dedicated benchmark with hundreds of reasonable prompt-box pairs derived from object detection datasets. Comprehensive experiments demonstrate the effectiveness of our method. The results show that SpotActor fulfills the expectations of this task and showcases the potential for practical applications with superior layout alignment, subject consistency, prompt conformity, and background diversity.

SpotActor: Training-Free Layout-Controlled Consistent Image Generation

Identifying the Markov properties or conditional independencies of a collection of random variables is a fundamental task in statistics for modeling and inference. Existing approaches often learn the structure of a probabilistic graph, which encodes these dependencies, by assuming the variables follow a distribution with a simple parametric form. Moreover, the computational cost of many algorithms scales poorly for high-dimensional distributions, as they need to estimate all the edges in the graph simultaneously. In this work, we propose a scalable algorithm to infer the conditional independence relationships of each variable by exploiting the local Markov property. The proposed method, named Localized Sparsity Identification for Non-Gaussian Distributions (SING), estimates the graph without restricting the family of conditional distributions for each variable. We show that the localized SING algorithm includes existing approaches, such as neighborhood selection with Lasso, as a special case. We demonstrate the effectiveness of our algorithm in both Gaussian and non-Gaussian settings compared to existing methods. Lastly, we show the scalability of the proposed approach by applying it to high-dimensional non-Gaussian examples, including a biological dataset with more than 150 variables.

Learning Local Neighborhoods of Non-Gaussian Graphical Models

Time series forecasting aims at predicting future values of a time series and plays a crucial role in many real-world applications, e.g., finance, disease spread, or weather prediction. However, it is also a very challenging task, especially for long-term forecasting. In this paper, we introduce WaveletMixer, an iterative multi-level, multi-resolution, and multi-phase approach that effectively captures long-term dependencies of multivariate time series from both global and local perspectives to improve forecasting accuracy. WaveletMixer differs fundamentally from existing works in the following key aspects. First, it exploits the multi-level property of the wavelet transform to create multiple forecasting models for different frequency domains at different levels of resolution. Second, the relationships among different frequency domains are exploited to iteratively adjust all prediction models at all levels simultaneously, from both local and global perspectives, to reduce prediction errors and biases, thus significantly improving the final prediction accuracy. Third, while WaveletMixer is a general framework that can be used to boost the performance of any deep-learning architecture (e.g., MLP, LSTM, or Transformer), we additionally introduce TS-Learner, an MLP-based model that further enhances performance in long-term forecasting. Extensive experiments have been conducted on nine real-world datasets to demonstrate the performance of WaveletMixer compared to SOTA methods and to reveal its important characteristics. Code and extended experimental results are available in the supplementary material.

WaveletMixer: A Multi-Resolution Wavelets Based MLP-Mixer for Multivariate Long-Term Time Series Forecasting
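
The multi-level wavelet decomposition that the WaveletMixer abstract builds on can be sketched as follows. The abstract does not name a wavelet family, so this assumes the Haar wavelet because it is the simplest; the function names are hypothetical. Each level splits the current approximation into a coarser approximation and a detail band, giving one band per resolution that a separate forecasting model could be trained on.

```python
import math


def haar_step(signal):
    """One Haar level: returns (approximation, detail), each half as long."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return [detail_1, ..., detail_L, approx_L], finest detail band first."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    bands = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands


def haar_reconstruct(bands):
    """Invert haar_decompose by merging bands from coarsest to finest."""
    s = 1.0 / math.sqrt(2.0)
    approx = list(bands[-1])
    for detail in reversed(bands[:-1]):
        merged = []
        for a, d in zip(approx, detail):
            merged.append((a + d) * s)
            merged.append((a - d) * s)
        approx = merged
    return approx


series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
bands = haar_decompose(series, levels=2)
restored = haar_reconstruct(bands)
```

Because the transform is invertible, forecasts made separately in each band can be mapped back to a forecast in the original domain with `haar_reconstruct`.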

Continual Learning (CL) is a highly relevant setting gaining traction in recent machine learning research. Among CL works, architectural and hybrid strategies are particularly effective due to their potential to adapt the model architecture as new tasks are presented. However, many existing solutions do not efficiently exploit model sparsity and are prone to capacity saturation due to their inefficient use of available weights, which limits the number of learnable tasks. In this paper, we propose TinySubNets (TSN), a novel architectural CL strategy that addresses these issues through the unique combination of pruning with different sparsity levels, adaptive quantization, and weight sharing. Pruning identifies a subset of weights that preserve model performance, making less relevant weights available for future tasks. Adaptive quantization allows a single weight to be separated into multiple parts which can be assigned to different tasks. Weight sharing between tasks boosts the exploitation of capacity and task similarity, allowing for the identification of a better trade-off between model accuracy and capacity. These features allow TSN to efficiently leverage the available capacity, enhance knowledge transfer, and reduce computational resource consumption. Experimental results involving common benchmark CL datasets and scenarios show that our proposed strategy achieves better accuracy than existing state-of-the-art CL strategies. Moreover, our strategy is shown to provide significantly improved exploitation of model capacity.
