Spike cameras, as innovative neuromorphic devices, generate continuous spike streams to capture high-speed scenes with lower bandwidth and higher dynamic range than traditional RGB cameras. However, reconstructing high-quality video from these inputs, especially under low-light conditions, remains challenging. Traditional methods often rely on synthetic data or basic reconstructions as supervision signals, but these approaches falter when dealing with noisy or low-quality spike signals, leading to performance degradation. This is primarily due to inadequate noise modeling, the domain gap between synthetic and real datasets, and the reliance on low-quality pseudo labels, resulting in images with unclear textures, excessive noise, and diminished brightness.
To address these challenges, we introduce a novel reconstruction framework that goes beyond traditional training paradigms. Instead of relying solely on visual data, we incorporate textual descriptions and unpaired high-quality datasets as new forms of supervision. Textual descriptions provide additional context that guides the network's feature reconstruction, while high-quality datasets help produce sharp latent images. Our experiments on real-world low-light datasets, such as U-CALTECH and U-CIFAR, demonstrate that this approach significantly enhances texture clarity and luminance balance. Furthermore, the reconstructed images are well-aligned with the broader visual features needed for downstream tasks, ensuring more robust and versatile performance in challenging environments.

AAAI 2025

Rethinking High-speed Image Reconstruction Framework with Spike Camera

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Detecting anomalies in business processes is crucial for ensuring operational success. While many existing methods rely on statistical frequency to detect anomalies, it's important to note that infrequent behavior doesn't necessarily imply undesirability. To address this challenge, detecting anomalies from a semantic viewpoint proves to be a more effective approach. However, current semantic anomaly detection methods treat a trace (i.e., process instance) as multiple event pairs, disrupting long-distance dependencies. In this paper, we introduce DABL, a novel approach for detecting semantic anomalies in business processes using large language models (LLMs). We collect 143,137 real-world process models from various domains. By generating normal traces through the playout of these process models and simulating both ordering and exclusion anomalies, we fine-tune Llama 2 using the resulting log. Through extensive experiments, we demonstrate that DABL surpasses existing state-of-the-art semantic anomaly detection methods in terms of both generalization ability and learning of given processes. Users can directly apply DABL to detect semantic anomalies in their own datasets without the need for additional training. Furthermore, DABL offers the ability to interpret anomalies' causes in natural language, providing valuable insights into the detected anomalies.

DABL: Detecting Semantic Anomalies in Business Processes Using Large Language Models
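The DABL abstract mentions simulating ordering and exclusion anomalies on top of normal traces. The following is a minimal sketch of what such simulation could look like, not the authors' procedure: the function names, the example activities, and the reading of an "exclusion anomaly" as two mutually exclusive activities appearing in the same trace are all assumptions made for illustration.

```python
from typing import List

Trace = List[str]  # a trace is an ordered list of activity names


def ordering_anomaly(trace: Trace, i: int, j: int) -> Trace:
    """Return a copy of `trace` with the events at positions i and j swapped."""
    if i == j:
        raise ValueError("positions must differ to change the order")
    out = list(trace)
    out[i], out[j] = out[j], out[i]
    return out


def exclusion_anomaly(trace: Trace, exclusive_alternative: str, position: int) -> Trace:
    """Return a copy of `trace` with an activity inserted that (by assumption)
    belongs to a branch mutually exclusive with one already taken."""
    out = list(trace)
    out.insert(position, exclusive_alternative)
    return out


# Hypothetical normal trace, as if produced by playing out a process model.
normal = ["receive order", "check credit", "approve order", "ship goods"]

swapped = ordering_anomaly(normal, 1, 3)
both_branches = exclusion_anomaly(normal, "reject order", 3)
```

Here `swapped` ships goods before the credit check, and `both_branches` contains both "approve order" and "reject order", which a semantic detector would be expected to flag even if the pattern were frequent in a log.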

We propose GGS, a Generalizable Gaussian Splatting method for autonomous driving that can achieve realistic rendering under large viewpoint changes. Previous generalizable 3D Gaussian splatting methods are limited to rendering novel views that are very close to the original pair of images and cannot handle large differences in viewpoint. In autonomous driving scenarios in particular, images are typically collected from a single lane, and this limited training perspective makes rendering images of a different lane very challenging. To further improve the rendering capability of GGS under large viewpoint changes, we introduce a novel virtual lane generation module into the GGS method to enable high-quality lane switching even without a multi-lane dataset. In addition, we design a diffusion loss to supervise the generation of virtual lane images, further addressing the lack of data in the virtual lanes. Finally, we propose a depth refinement module to optimize depth estimation in the GGS model. Extensive validation of our method, compared to existing approaches, demonstrates state-of-the-art performance.

GGS: Generalizable Gaussian Splatting for Lane Switching in Autonomous Driving

Multi-object tracking faces a major challenge in handling the variations of tracked targets within complex scenes. In existing transformer-based tracking methods, each tracked target is typically associated with only one track query. However, trajectories in crowded scenes often experience varying levels of occlusion, which makes association brittle when a single track query is used to identify the tracked target. Therefore, we argue that relying on a single track query to track a target in complex scenes is inadequate. In this paper, we introduce TGFormer, with the core idea of designing a Track Query Group for each tracked target. Each group encompasses track queries that handle the same tracked target across scenes with different levels of occlusion. To achieve long-term robust association, we propose a novel updater that integrates temporal memories and occlusion-aware features to update the Track Query Group, ensuring the tracked target can be consistently captured in complex scenes. Additionally, we introduce a Position Predictor that allows TGFormer to forecast motion trends, helping the model accurately locate moving tracklets. Experiments show that our method improves on the baseline method's metrics on the MOT Challenge and DanceTrack datasets and achieves highly competitive performance.

TGFormer: Transformer with Track Query Group for Multi-Object Tracking

Contrastive learning underpins most current self-supervised time series representation methods. The strategy for constructing positive and negative sample pairs significantly affects the final representation quality. However, due to the continuous nature of time series semantics, the modeling approach of contrastive learning struggles to accommodate the characteristics of time series data. This results in issues such as difficulties in constructing hard negative samples and the potential introduction of inappropriate biases during positive sample construction. Although some recent works have developed several scientific strategies for constructing positive and negative sample pairs with improved effectiveness, they remain constrained by the contrastive learning framework. To fundamentally overcome the limitations of contrastive learning, this paper introduces Frequency-masked Embedding Inference (FEI), a novel non-contrastive method that completely eliminates the need for positive and negative samples. The proposed FEI constructs 2 inference branches based on a prompting strategy: 1) Using frequency masking as prompts to infer the embedding representation of the target series with missing frequency bands in the embedding space, and 2) Using the target series as prompts to infer its frequency masking embedding. This enables continuous semantic relationship modeling for time series. Experiments on 8 widely used time series datasets for classification and regression tasks, using linear evaluation and end-to-end fine-tuning, show that FEI significantly outperforms existing contrastive-based methods in terms of generalization. This study provides new insights into self-supervised representation learning for time series.

Frequency-Masked Embedding Inference: A Non-Contrastive Approach for Time Series Representation Learning
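To make the notion of "missing frequency bands" in the FEI abstract concrete, here is a minimal sketch of frequency-band masking on a raw series. It only illustrates the input-side masking step; FEI itself infers embeddings with learned networks, and the mask layout, the function names, and the naive DFT used here are assumptions chosen to keep the example dependency-free.

```python
import cmath
import math
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (fine for short series)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def mask_band(series: List[float], low: int, high: int) -> List[float]:
    """Zero every DFT bin whose frequency index f satisfies low <= f < high.

    Bin k and bin n-k carry the same frequency for real input, so both are
    masked together, which keeps the result real-valued.
    """
    n = len(series)
    spectrum = dft(series)
    masked = [0j if low <= min(k, n - k) < high else c
              for k, c in enumerate(spectrum)]
    return idft(masked)


# A slow sine (frequency 1) plus a faster one (frequency 4), 32 samples.
n = 32
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]
mixed = [s + math.sin(2 * math.pi * 4 * t / n) for t, s in enumerate(slow)]

# Masking the band [3, 6) removes the faster component and leaves the slow one.
filtered = mask_band(mixed, 3, 6)
```

In a setup like the one the abstract describes, the pair (`mixed`, mask) would play the role of target series and frequency-mask prompt, while `filtered` is the series with the band removed.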

Semi-supervised medical image segmentation (SSMIS) uses consistency learning to regularize model training, which alleviates the burden of pixel-wise manual annotation. However, it often suffers from erroneous supervision caused by low-quality pseudo labels. Vision-Language Models (VLMs) have great potential to enhance pseudo labels by introducing text-prompt-guided multimodal supervision information. They nevertheless face a cross-modal problem: the obtained messages tend to correspond to multiple targets. To address the aforementioned problems, we propose a Dual Semantic Similarity-Supervised VLM (DuSSS) for SSMIS. Specifically, 1) Dual Contrastive Learning (DCL) is designed to improve cross-modal semantic consistency by capturing intrinsic representations within each modality and semantic correlations across modalities. 2) To encourage the learning of multiple semantic correspondences, a Semantic Similarity-Supervision strategy (SSS) is proposed and injected into each contrastive learning process in DCL, supervising semantic similarity via distribution-based uncertainty levels. Furthermore, a novel VLM-based SSMIS network is designed to compensate for the quality deficiencies of pseudo labels. It utilizes the pretrained VLM to generate text-prompt-guided supervision information, refining the pseudo labels for better consistency regularization. Experimental results demonstrate that DuSSS achieves outstanding performance, with Dice scores of 82.52%, 74.61% and 78.03% on three public datasets (QaTa-COV19, BM-Seg and MoNuSeg). Our code will be available.

DuSSS: Dual Semantic Similarity-Supervised Vision-Language Model for Semi-Supervised Medical Image Segmentation
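The Dice scores reported in the DuSSS abstract measure overlap between a predicted mask and a reference mask, using the standard definition 2|A ∩ B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Prediction marks two pixels, reference marks one; they share one pixel.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])  # 2 * 1 / (2 + 1)
```

A reported Dice of 82.52% corresponds to a value of 0.8252 on this scale, typically averaged over the test images.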

Referring Multi-Object Tracking (RMOT) is an important topic in the current tracking field. The task is to guide a tracker to track the objects that match a language description. Current research mainly focuses on referring multi-object tracking under a single view, which refers to one view sequence or multiple unrelated view sequences. However, in a single view, parts of an object's appearance can easily be invisible, resulting in incorrect matching of objects with the language description. In this work, we propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces cross-view observation to obtain the appearances of objects from multiple views, avoiding the problem of invisible object appearances in the RMOT task. CRMOT is a more challenging task: it requires accurately tracking the objects that match the language description while maintaining the identity consistency of objects across views. To advance the CRMOT task, we construct a cross-view referring multi-object tracking benchmark based on the CAMPUS and DIVOTrack datasets, named CRTrack. Specifically, it provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker. Extensive experiments on the CRTrack benchmark verify the effectiveness of our method.

Cross-View Referring Multi-Object Tracking

As digital media manipulation becomes increasingly sophisticated, accurately detecting and localizing image forgeries with minimal supervision has become a critical challenge. 
Existing weakly supervised image forgery detection (W-IFD) methods often rely on convolutional neural networks (CNNs) and explore internal relationships only to a limited extent, leading to poor detection and localization performance with only image-level labels.
To address these limitations, we introduce a novel Multi-View and Multi-Level Relation Learning Network (M$^2$RL-Net) for W-IFD. 
M$^2$RL-Net effectively identifies forged images using only image-level annotations by exploring relationships between different views and hierarchical levels within images. 
Specifically, M$^2$RL-Net achieves patch-level self-consistency learning (PSL) and feature-level contrastive learning (FCL) across different views, facilitating more generalized self-supervised learning of forgery features.
In detail, PSL employs self-supervised learning to distinguish consistent and inconsistent regions within images, enhancing its ability to accurately locate tampered areas. 
FCL utilizes feature-level self-view and multi-view contrastive learning to differentiate between genuine and tampered image features, thereby improving the recognition of authentic and manipulated content across different views.
Extensive experiments on various datasets demonstrate that M$^2$RL-Net outperforms existing weakly-supervised methods in both detection and localization accuracy. This research sets a new benchmark for weakly-supervised image forgery detection and lays a robust foundation for future studies in this field.

M²RL-Net: Multi-View and Multi-Level Relation Learning Network for Weakly-Supervised Image Forgery Detection

Prompt instruction tuning is a popular approach to better adjust monolithic, pretrained LLMs for downstream tasks. Yet how to extend this approach to handle multiple tasks and data distributions is an open question. To address this gap, we propose using Mixture of Prompts (MoPs) with smart gating functionality. The latter identifies relevant skills embedded in different groups of prompts and dynamically weighs experts (i.e., collections of prompts) based on the target task. Experiments show that MoPs are resilient to model compression, data source, and task composition, making them highly versatile and applicable in various contexts.
In practice, MoPs can simultaneously mitigate prompt training "interference" in multi-task, multi-source scenarios (e.g., task and data heterogeneity across sources) and possible implications from model approximations. Empirically, MoPs can reduce final perplexity by 9% to 70% in federated cases and by 3% to 30% in centralized cases, compared to baselines.

Sweeping Heterogeneity with Smart MoPs: Mixture of Prompts for LLM Task Adaptation
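The MoPs abstract describes a gate that dynamically weighs prompt "experts" for the target task. As a rough illustration of that idea, the sketch below uses a generic softmax gate over per-expert scores and a weighted sum of prompt vectors. This is a common mixture-of-experts pattern and an assumption here, not the gating function from the paper; `mix_prompts`, the score values, and the two-dimensional prompt vectors are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prompts(gate_scores: Sequence[float],
                prompt_experts: Sequence[Sequence[float]]) -> List[float]:
    """Combine prompt experts into one vector, weighted by the gate.

    gate_scores[i] is an (assumed) task-relevance score for expert i;
    prompt_experts[i] is that expert's prompt embedding.
    """
    if len(gate_scores) != len(prompt_experts):
        raise ValueError("need exactly one score per expert")
    weights = softmax(gate_scores)
    dim = len(prompt_experts[0])
    return [sum(w * expert[d] for w, expert in zip(weights, prompt_experts))
            for d in range(dim)]


experts = [[1.0, 0.0], [0.0, 1.0]]

balanced = mix_prompts([0.0, 0.0], experts)      # equal scores, equal weights
specialised = mix_prompts([100.0, 0.0], experts)  # gate strongly prefers expert 0
```

With equal scores the two experts contribute equally; with a strongly skewed score the mixture is dominated by a single expert, which is the behavior one would want when a task matches one prompt group closely.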

We consider the Coalition Structure Learning (CSL) problem in multi-agent systems, motivated by the existence of coalitions in many real-world systems, e.g., trading platforms and auction systems. In this problem, there is a hidden coalition structure within a set of $n$ agents, which affects the behavior of the agents in games. Our goal is to actively design a sequence of games for the agents to play, such that observations in these games can be used to learn the hidden coalition structure. In particular, we consider the setting where in each round, we design and present a game together with a strategy profile to the agents, and receive a multiple-bit observation -- for each agent, we observe whether or not they would like to deviate from the specified strategy. We show that we can learn the coalition structure in $O(\log n)$ rounds if we are allowed to design any normal-form game, matching the information-theoretical lower bound. For practicality, we extend the result to settings where we can only choose games of a specific format, and design algorithms to learn the coalition structure in these settings. For most settings, our complexity matches the theoretical lower bound up to a constant factor.

Deviate or Not: Learning Coalition Structures with Multiple-bit Observations in Games
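The abstract states that $O(\log n)$ rounds match the information-theoretic lower bound. One counting argument consistent with that claim, offered as an illustration and not necessarily the authors' proof, assumes that a coalition structure is a partition of the $n$ agents. There are then $B_n$ (the Bell number) possible structures, and each round reveals at most $n$ bits, so at least $\log_2(B_n)/n$ rounds are needed. Since $\log B_n$ grows like $n \log n$, this bound is $\Theta(\log n)$. The sketch below computes the bound for small $n$; the function names are hypothetical.

```python
import math


def bell_number(n: int) -> int:
    """Number of partitions of an n-element set, via the Bell triangle."""
    row = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def min_rounds(n: int) -> int:
    """Counting lower bound on rounds when each round yields n bits."""
    return math.ceil(math.log2(bell_number(n)) / n)


for agents in (5, 10, 100, 1000):
    print(agents, min_rounds(agents))
```

The printed bound grows slowly with the number of agents, in line with the logarithmic round complexity stated in the abstract.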

The purpose of partial multi-label feature selection is to select the most representative feature subset, where the data comes from partial multi-label datasets that have label ambiguity issues. For label disambiguation, previous methods mainly focus on utilizing the information inside the labels and the relationship between the labels and features. However, the information existing in the feature space is rarely considered, especially in partial multi-label scenarios where the noise is considered to be concentrated in the label space while the feature information is correct. This paper proposes a method based on latent space alignment, which uses the information mined in the feature space to disambiguate labels in the latent space through the structural consistency between labels and features. In addition, previous methods overestimate the consistency of features and labels in the latent space after convergence. We comprehensively consider the similarity of latent space projections to the feature space and the label space, and propose a new feature selection term. This method also significantly improves the positive label identification ability of the selected features. Comprehensive experiments demonstrate the superiority of the proposed method.
