United States

Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we extract 3 types of positional encoding and 16 commonly used activation functions. Through combinatorial design, we establish the first benchmark for Coordinate-MLPs in audio signal representations. Our benchmark reveals that Coordinate-MLPs require complex hyperparameter tuning and frequency-dependent initialization, limiting their robustness. To address these issues, we propose Fourier-ASR, a novel framework based on the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR introduces Fourier Kolmogorov-Arnold Networks (Fourier-KAN), which leverage periodicity and strong nonlinearity to represent audio signals, eliminating the need for additional positional encoding. Furthermore, a Frequency-adaptive Learning Strategy (FaLS) is proposed to enhance the convergence of Fourier-KAN by capturing high-frequency components and preventing overfitting of low-frequency signals. Extensive experiments conducted on natural speech and music datasets reveal that: (1) well-designed positional encoding and activation functions in Coordinate-MLPs can effectively improve audio representation quality; and (2) Fourier-ASR can robustly represent complex audio signals without extensive hyperparameter tuning. Looking ahead, the continuity and infinite resolution of implicit audio representations make our research highly promising for tasks such as audio compression, synthesis, and generation. The source code will be released publicly to ensure reproducibility.
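To make the idea of a Fourier-series-based Kolmogorov-Arnold layer more concrete, the following is a minimal sketch under stated assumptions. It uses a common formulation in which every input-output edge carries a learnable truncated Fourier series; the abstract does not give the authors' exact parameterization, so the function name `fourier_kan_layer`, the coefficient arrays `a` and `b`, and the choice of integer harmonics are all hypothetical.

```python
import numpy as np

def fourier_kan_layer(x, a, b):
    """Apply one Fourier-KAN-style layer (illustrative, not the authors' code).

    x: (batch, d_in) input coordinates, e.g. time stamps
    a, b: (d_out, d_in, K) cosine and sine coefficients (hypothetical names)
    returns: (batch, d_out) outputs
    """
    num_harmonics = a.shape[-1]
    k = np.arange(1, num_harmonics + 1)   # assumed integer harmonics 1..K
    angles = x[:, :, None] * k            # (batch, d_in, K)
    # Each edge (i -> o) evaluates its own truncated Fourier series;
    # the layer output sums these edge functions over all inputs i.
    return (np.einsum("bik,oik->bo", np.cos(angles), a)
            + np.einsum("bik,oik->bo", np.sin(angles), b))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)[:, None]  # 8 time coordinates
a = rng.normal(scale=0.1, size=(1, 1, 4))
b = rng.normal(scale=0.1, size=(1, 1, 4))
amplitude = fourier_kan_layer(t, a, b)    # one amplitude value per coordinate
```

Because the edge functions are sums of sines and cosines, the layer is periodic in its input by construction, which is one plausible reading of why no separate positional encoding is needed.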

AAAI 2025

Representing Sounds as Neural Amplitude Fields: A Benchmark of Coordinate-MLPs and a Fourier Kolmogorov-Arnold Framework

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Human-object interaction (HOI) detection aims to detect the spatial positions of human-object pairs and recognize their interactions. Existing single-branch, two-branch, and three-branch methods struggle to strike an appropriate trade-off among efficiency, multi-task decoupling, and collaborative learning, and they also fail to identify rare and complex interaction categories effectively. In this work, we propose a novel Efficient Mamba-based Disentangled Progressive Learning framework (HOIMamba) for HOI detection that absorbs the advantages of the three existing approaches and adaptively aggregates multi-level interaction semantics guided by cross-task bidirectional information contexts. Specifically, HOIMamba builds an efficient and effective decoder through cascaded Low-Rank Adaptations (LoRAs), achieving high efficiency, thorough task decoupling, and good multi-task collaborative learning. Furthermore, to ease the recognition of interactions in difficult HOI samples, a novel Mamba-based comprehensive progressive learning strategy with Cross-enhance Mamba (CEM) blocks and Detection Context Propagation (DCP) blocks is designed to gradually excavate interaction-related discriminative cues from four levels. CEM blocks automatically aggregate context to generate diverse task-shared semantics and simultaneously realize cross-task interaction between the human and object branches, guiding the interaction branch to extract more expressive HOI representations. DCP blocks further transfer the comprehensive interaction context to the human and object branches to achieve rich and effective information exchange, helping the model discover more HOI instances. Extensive experimental results on two standard benchmarks demonstrate the effectiveness of HOIMamba.

HOIMamba: Efficient Mamba-based Disentangled Progressive Learning for HOI Detection

When dealing with multi-view data, the heterogeneity of data attributes across different views often leads to label ambiguity. To address this challenge, this paper designs a Multi-View Partial-Label Learning (MVPLL) framework, where each training instance is described by multiple view features and associated with a set of candidate labels, among which only one is correct. The key to dealing with this problem lies in how to effectively fuse multi-view information and accurately disambiguate the candidate labels. In this paper, we propose a novel approach named CFDM, which explores the consistency and complementarity of multi-view data through multi-view contrastive fusion and reduces label ambiguity through multi-class contrastive prototype disambiguation. Specifically, we first extract view-specific representations using multiple view-specific autoencoders, and then integrate multi-view information through both inter-view and intra-view contrastive fusion to enhance the distinctiveness of these representations. Afterwards, we use these distinctive representations to establish and update prototype vectors for each class within each view. Based on these, we apply contrastive prototype disambiguation to learn global class prototypes and accordingly reduce label ambiguity. In our model, multi-view contrastive fusion and multi-class contrastive prototype disambiguation enhance each other within a coherent framework, leading to better classification performance. Experimental results on multiple datasets demonstrate that our proposed method is superior to other state-of-the-art methods.

CFDM: Contrastive Fusion and Disambiguation for Multi-View Partial-Label Learning

This paper considers the problem of *Multi-Hop Video Question Answering (MH-VidQA)* in long-form egocentric videos. This task requires not only answering visual questions but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to mine multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction-tuning. To monitor the progress of this new task, we further curate a high-quality benchmark, **MultiHop-EgoQA**, through meticulous manual verification and refinement. Our experiments reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed **GeLM**, to leverage the world knowledge reasoning capabilities of multi-modal large language models (LLMs), while incorporating a grounding module to retrieve temporal evidence in the video with flexible grounding tokens. Once trained on our constructed visual instruction data, **GeLM** demonstrates enhanced multi-hop grounding and reasoning capabilities, establishing a new baseline for this challenging task. Furthermore, when trained on third-view videos, the same architecture also achieves state-of-the-art performance on the existing single-hop VidQA benchmark, ActivityNet-RTL, showing the architecture's effectiveness.

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatiotemporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end student-teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel ErrOr Recovery (EoR) module, which learns from the student's mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatiotemporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To overcome this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, which leads to coherent temporal detections. We evaluate our approach on four different spatiotemporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of the data, it provides competitive performance compared to the supervised baseline trained on 100% of the annotations on UCF101-24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and on YouTube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain. We will make the code and models publicly available.
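For readers unfamiliar with student-teacher training, the sketch below shows the exponential-moving-average (EMA) weight update that Mean Teacher style methods conventionally use to derive the teacher from the student. The abstract does not describe the update rule of Stable Mean Teacher, so this is the generic technique rather than the authors' implementation; `ema_update`, the `decay` value, and the dictionary-of-arrays parameter format are assumptions made for illustration.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    teacher, student: dicts mapping parameter names to NumPy arrays
    decay: assumed smoothing factor; closer to 1 means a slower teacher
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

teacher = {"w": np.zeros(2)}
student = {"w": np.ones(2)}
teacher = ema_update(teacher, student)   # teacher moves 1% toward the student
```

A slowly moving teacher tends to give smoother pseudo labels than the student itself, which is the usual motivation for this update.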

Stable Mean Teacher for Semi-supervised Video Action Detection

Functional Magnetic Resonance Imaging (fMRI) data is a widely used kind of four-dimensional biomedical data, which requires effective compression. However, fMRI compression poses unique challenges due to its intricate temporal dynamics, low signal-to-noise ratio, and complicated underlying redundancies. This paper reports a novel compression paradigm specifically tailored for fMRI data based on Implicit Neural Representation (INR). The proposed approach focuses on removing the various redundancies among the time series by employing several methods, including (i) conducting spatial correlation modeling for intra-region dynamics, (ii) decomposing reusable neuronal activation patterns, and (iii) using proper initialization together with nonlinear fusion to describe the inter-region similarity. This scheme appropriately incorporates the unique features of fMRI data, and experimental results on publicly available datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art algorithms in both conventional image quality evaluation metrics and fMRI downstream tasks. This work paves the way for sharing massive fMRI data at low bandwidth and high fidelity. The source code will be released upon acceptance of the paper.

A Compact Implicit Neural Representation for Efficient Storage of Massive 4D Functional Magnetic Resonance Imaging

Scene Graph Generation (SGG) aims to detect all objects in a scene and identify their pairwise relationships. Because of the substantial human labor costs, existing scene graph annotations are often sparse and biased, which results in confused training on low-frequency predicates. In this work, we design a Semi-Supervised Clustering framework for Scene Graph Generation (SSC-SGG) that uses the sparse labeled data to guide the generation of effective pseudo-labels from unlabeled object pairs, thus enriching the labeled sample space, especially for low-frequency interaction samples. We approach the task from the perspective of clustering, which reduces the confirmation bias problem in self-training. Specifically, we first enhance the model's robustness in feature extraction via prototype-based clustering, aggregating different augmented relationship features onto the same prototype. Secondly, we design a dynamic pseudo-label assignment algorithm based on a mini-batch, which adjusts the detection sensitivity to samples of different frequencies based on the historical assignments. Finally, we conduct joint training on the pseudo-labels and the labeled data. We conduct experiments on various SGG models and achieve substantial overall performance improvements, demonstrating the effectiveness of SSC-SGG.

Semi-Supervised Clustering Framework for Fine-grained Scene Graph Generation

In multi-dimensional classification (MDC), the classifier chain approach uses a chain structure to model dependencies between class spaces. However, current research on constructing a chain order is usually based on a greedy criterion or random generation, which is highly likely to lead to an incorrect chain order and to fit incorrect class dependencies. Moreover, existing classifier chain-based approaches do not consider the misleading effects of irrelevant input features on the classifiers. To fill these gaps, a classifier chain-based approach incorporating evolutionary chain order optimization and feature selection (ECCO) is proposed. Specifically, this approach designs a meta-heuristic algorithm to optimize the chain order of multiple classifiers. Simultaneously, the approach selects dimension-specific feature combinations that are more conducive to the class prediction of each dimension. These strategies enhance the class prediction capability of the constructed MDC model. Comparative experiments on 14 real datasets validate that ECCO outperforms 7 state-of-the-art MDC approaches.
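The role of the chain order is easier to see in code. The following sketch shows a plain classifier chain in which each classifier receives the input features plus the labels of the dimensions that precede it in `order`. It is a generic illustration, not ECCO: the evolutionary order search and the feature selection are omitted, and the `NearestCentroid` base learner, `fit_chain`, and `predict_chain` are hypothetical helpers chosen only to keep the example self-contained.

```python
import numpy as np

class NearestCentroid:
    """Tiny base classifier (stand-in for any real learner)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]

def fit_chain(X, Y, order):
    """Train one classifier per class dimension, following `order`."""
    models, X_aug = [], X
    for dim in order:
        models.append(NearestCentroid().fit(X_aug, Y[:, dim]))
        # Later classifiers see the true labels of earlier dimensions.
        X_aug = np.column_stack([X_aug, Y[:, dim]])
    return models

def predict_chain(models, X, order, n_dims):
    """Predict all dimensions, feeding earlier predictions forward."""
    Y_pred = np.zeros((X.shape[0], n_dims), dtype=int)
    X_aug = X
    for model, dim in zip(models, order):
        Y_pred[:, dim] = model.predict(X_aug)
        X_aug = np.column_stack([X_aug, Y_pred[:, dim]])
    return Y_pred

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
Y = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])   # two class dimensions
order = [1, 0]                                    # one candidate chain order
models = fit_chain(X, Y, order)
pred = predict_chain(models, X, order, n_dims=Y.shape[1])
```

Changing `order` changes which dimensions can condition on which, and that ordering is the quantity an order-optimization method would search over.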

Evolutionary Classifier Chain for Multi-Dimensional Classification

Open vocabulary semantic segmentation is an active research topic that focuses on segmenting and recognizing a diverse array of categories in varied environments, including previously unknown ones, and thus holds significant practical value. Mainstream studies use the CLIP model for direct semantic segmentation (denoted as "forward methods"), which often struggles to represent underrepresented categories effectively. To address this issue, this paper introduces a novel approach, the **E**xcluding the Impossib**L**e **S**emantic S**E**gmentation Network (**ELSE-Net**), based on reverse thinking. By excluding improbable categories, ELSE-Net narrows the selection range for forward methods, significantly reducing the risk of misclassification.
In implementation, we first draw on leading research to design the **G**eneral **P**rocessing Block (**GP-Block**), which generates inclusion probabilities (the likelihood of belonging to a category) by using the CLIP model in cooperation with a **M**ask **P**roposal **N**etwork (**MPN**). We then present the **EX**cluding the Im**P**ossible Block (**EXP-Block**), which computes exclusion probabilities (the likelihood of not belonging to a category) through the CLIPN model and a custom-designed **R**everse **R**etrieval Adapter (**R$^2$-Adapter**). These exclusion probabilities are subsequently used to refine the inclusion probabilities, which are ultimately employed to annotate class-agnostic masks. Moreover, the core component of our EXP-Block is plug-and-play, enabling it to enhance the capabilities of existing frameworks. Experimental results on four benchmark datasets validate the effectiveness of ELSE-Net and underscore the seamless plug-and-play functionality of the EXP-Block.

Excluding the Impossible for Open Vocabulary Semantic Segmentation

Single-cell transcriptomics describes complex molecular features at the individual cell level, serving various roles in biological research, such as enhancing gene expression and predicting drug responses. Because transcriptomic data structurally resemble sequential data, many researchers have trained numerous transformers on extensive transcriptomic datasets. However, they have consistently neglected to explore the intrinsic properties of the data and the appropriateness of their chosen model architecture. In this paper, we carefully investigate the nature of transcriptomics, identifying three overlooked problems: 1) the long-tailed data problem, 2) the model selection problem, and 3) the evaluation problem. By applying a weighted sampling strategy, we address the long-tailed data problem and achieve consistent improvement across all settings. By adapting different model structures to transcriptomic data, we discover that transformers are not the only option. By developing three downstream tasks and fair evaluation metrics, we establish a simple and comprehensive benchmark to validate the effectiveness of models for transcriptomics. Through extensive experiments, we clarify the misunderstandings in the traditional methods and provide competitive baselines, thereby paving the way for future research in this field.
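As an illustration of how weighted sampling can counter a long-tailed label distribution, the sketch below draws training examples with probability inversely proportional to the frequency of their class. The abstract does not specify the weighting scheme, so inverse-frequency weighting is an assumption here, and the `labels` array (standing in for something like cell-type annotations) and the function name `inverse_frequency_weights` are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-example sampling probabilities, inversely proportional to class size."""
    classes, counts = np.unique(labels, return_counts=True)
    per_class = 1.0 / counts
    # Map each example to the weight of its class, then normalize to sum to 1.
    weights = per_class[np.searchsorted(classes, labels)]
    return weights / weights.sum()

# Toy long-tailed data: 90 examples of class 0, 9 of class 1, 1 of class 2.
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)
probs = inverse_frequency_weights(labels)

rng = np.random.default_rng(0)
batch = rng.choice(len(labels), size=3000, replace=True, p=probs)
```

Under this scheme every class receives the same total sampling probability, so rare classes appear in batches about as often as common ones, at the cost of repeating their few examples.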

A Simple and Comprehensive Benchmark for Single-Cell Transcriptomics

Information popularity prediction, which aims to predict the growth in user participation as a trending topic diffuses, is a fundamental task in social networks. Existing methods usually treat information diffusion as a single independent process for prediction and ignore the "public opinion field effect", which means that several trending topics often coexist in the public opinion space and compete for user attention simultaneously. Inspired by Hawkes theory, we propose a novel Hawkes-process-based learning model for information popularity prediction, which takes into account both the temporal correlation among users' propagation behavior across the diffusion of several topics and the public opinion field effect in social networks. We first propose an improved neural Hawkes process to capture comprehensive propagation laws from multiple dimensions and then propose a novel public opinion field paradigm based on the improved Hawkes process and cascade structure features. We design a representation learning framework incorporating the public opinion field paradigm to extract high-quality representations for information popularity prediction. Extensive experiments on four real-world datasets validate that our model significantly outperforms state-of-the-art competitors.
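For background on the Hawkes theory mentioned above, the sketch below computes the conditional intensity of a classical univariate Hawkes process with an exponential kernel, where each past event temporarily raises the rate of future events. This is the textbook form, not the improved neural Hawkes process proposed in the paper; the function name and the parameter values `mu`, `alpha`, and `beta` are illustrative assumptions.

```python
import numpy as np

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Intensity lambda(t) = mu + alpha * sum_i exp(-beta * (t - t_i)) over t_i < t.

    mu: baseline rate; alpha: jump per past event; beta: decay speed
    """
    past = np.asarray(event_times, dtype=float)
    past = past[past < t]                 # only earlier events contribute
    return mu + alpha * np.exp(-beta * (t - past)).sum()

reposts = [1.0, 1.5, 4.0]                 # hypothetical repost time stamps
rate_soon_after = hawkes_intensity(4.1, reposts)
rate_much_later = hawkes_intensity(20.0, reposts)
```

The intensity is highest shortly after a burst of events and decays back toward the baseline `mu`, which is the self-exciting behavior that makes Hawkes processes a natural fit for diffusion cascades.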
