
Despite the significant role text-to-motion (T2M) generation plays across various applications, current methods involve a large number of parameters and suffer from slow inference speeds, leading to high usage costs.
To address this, we aim to design a lightweight model to reduce usage costs.
First, unlike existing works that focus solely on global information modeling, we recognize the importance of local information modeling in the T2M task by reconsidering the intrinsic properties of human motion, leading us to propose a lightweight Local Information Modeling Module.
Second, we are the first to introduce Mamba to the T2M task, reducing the number of parameters and GPU memory demands, and we have designed a novel Pseudo-bidirectional Scan to replicate the effects of a bidirectional scan without increasing parameter count.
Moreover, we propose a novel Adaptive Textual Information Injector that more effectively integrates textual information into the motion during generation.
By integrating the aforementioned designs, we propose a lightweight and fast model named Light-T2M.
Compared to the state-of-the-art method, MoMask, our Light-T2M model features just **10%** of the parameters (4.48M vs. 44.85M) and achieves a **16%** faster inference time (0.152s vs. 0.180s), while surpassing MoMask with an FID of **0.040** (vs. 0.045) on the HumanML3D dataset and **0.161** (vs. 0.228) on the KIT-ML dataset. The source code will be made public.
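As a quick sanity check on the headline percentages, the short sketch below recomputes them from the raw figures quoted in the abstract. The variable names are illustrative only, and reading "16% faster" as the relative reduction in inference time is an assumption about how the authors computed it.

```python
# Figures quoted in the abstract (Light-T2M vs. MoMask).
light_params_m, momask_params_m = 4.48, 44.85  # parameter counts, in millions
light_time_s, momask_time_s = 0.152, 0.180     # inference times, in seconds

# Fraction of MoMask's parameter count that Light-T2M uses.
param_ratio = light_params_m / momask_params_m

# Relative reduction in inference time with respect to MoMask
# (assumed interpretation of "16% faster").
time_reduction = (momask_time_s - light_time_s) / momask_time_s

print(f"parameter ratio: {param_ratio:.1%}")
print(f"inference time reduction: {time_reduction:.1%}")
```

Under that reading, both values round to the percentages stated in the abstract.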

AAAI 2025

Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In this paper, we propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models by leveraging the representation of Large Language Models (LLMs). It can be seamlessly incorporated into various diffusion models as a plug-and-play component. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features, thereby enhancing text-to-image generation. Additionally, to facilitate and correct entity-attribute relationships in text prompts, we develop an entity-guided regularization loss to further improve generation performance. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. Experiments indicate that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 9.69% and 12.90% in color on T2I-CompBench, respectively. Moreover, it surpasses existing models in terms of sample quality, image-text alignment, and human evaluation.

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Deep multi-view clustering incorporating graph learning has shown tremendous potential. However, most methods incur quadratic time complexity w.r.t. data size. In theory, anchor-based graph learning can alleviate this limitation, but related deep models mainly rely on manual discretization approaches to select anchors, which means that 1) the anchors are fixed during model training and 2) they may deviate from the true cluster distribution. Consequently, the unreliable anchors may corrupt clustering results. In this paper, we propose the Deep Multi-view Anchor Clustering (DMAC) model, which performs clustering in linear time. Concretely, the initial anchors are perturbed by a positive perturbation sampled from a Gaussian distribution, so that they can be optimized with a newly designed anchor learning loss, which promotes a clear relationship between samples and anchors. Afterwards, anchor graph convolution is devised to model the cluster structure formed by the anchors, and a mutual information maximization loss is built to provide cross-view clustering guidance. In this way, the learned anchors can better represent clusters. With the optimal anchors, the full sample graph is calculated to derive a discriminative embedding for clustering. Extensive experiments on several datasets demonstrate the superior performance and efficiency of DMAC compared to state-of-the-art competitors.

Towards Learnable Anchor for Deep Multi-View Clustering

Event cameras, known for their low latency and high dynamic range, show great potential in pedestrian detection applications. However, while recent research has primarily focused on improving detection accuracy, the robustness of event-based visual models against physical adversarial attacks has received limited attention. For example, adversarial physical objects, such as specific clothing patterns or accessories, can exploit inherent vulnerabilities in these systems, leading to misdetections or misclassifications.
This study is the first to explore physical adversarial attacks on event-driven pedestrian detectors, specifically investigating whether certain clothing patterns worn by pedestrians can cause these detectors to fail, effectively rendering them unable to detect the person. To address this, we developed an end-to-end adversarial framework in the digital domain, framing the design of adversarial clothing textures as a 2D texture optimization problem. By crafting an effective adversarial loss function, the framework iteratively generates optimal textures through backpropagation. Our results demonstrate that the textures identified in the digital domain possess strong adversarial properties. Furthermore, we translated these digitally optimized textures into physical clothing and tested them in real-world scenarios, successfully demonstrating that the designed textures significantly degrade the performance of event-based pedestrian detection models. This work highlights the vulnerability of such models to physical adversarial attacks.

Adversarial Attacks on Event-Based Pedestrian Detectors: A Physical Approach

Interactive Recommendation (IR) has attracted much attention recently for its ability to quickly capture dynamic interest and optimize both short- and long-term objectives. IR agents are typically implemented through Deep Reinforcement Learning (DRL), because DRL is inherently compatible with the dynamic nature of IR. However, DRL is not yet a perfect fit for IR. Because of the large action space and the sample inefficiency problem, it is challenging to train DRL recommender agents. The key issue is that useful features cannot be extracted as high-quality representations for the recommender agent to optimize its policy. To tackle this problem, we propose Contrastive Representation for Interactive Recommendation (CRIR). CRIR efficiently extracts latent, high-level preference ranking features from explicit interaction, and leverages these features to enhance users’ representation. Specifically, CRIR provides a representation through one representation network, and refines it through our proposed Preference Ranking Contrastive Learning (PRCL). The key insight of PRCL is that it can perform contrastive learning without relying on computations involving high-level representations or large potential action sets. Furthermore, we also propose a data exploiting mechanism and an agent training mechanism to better adapt CRIR to the DRL backbone. Extensive experiments show that our method delivers a superior improvement in sample efficiency when training a DRL-based IR agent.

Contrastive Representation for Interactive Recommendation

Federated Multi-View Clustering (FMVC) aims to learn a global clustering model from heterogeneous data distributed across different devices, where each device stores only one view of all clustering samples. The key to this problem lies in how to effectively fuse these heterogeneous samples while strictly preserving data privacy across multiple devices. In this paper, we propose a novel structural graph learning framework named MGCD, which leverages both the consistency and the diversity of the multi-view graph structure across a global view-fusion server and local view-specific clients to achieve the desired clustering while better preserving data privacy. Specifically, in each local client, we design a dual autoencoder to extract the latent consensuses and specificities of each view, where self-representation construction is introduced to generate the corresponding view-specific diversity graph. In the global server, the consistency implied in the uploaded diversity graphs is further distilled and then incorporated into the consistency graph for subsequent cross-view contrastive fusion. During training, the server generates a global consistency graph and distributes it to each client to assist in diversity graph construction, while the clients extract view-specific information and upload it to the server for more reliable consistency graph generation. The "server-client" interaction is conducted iteratively, so that the consistency implied in each local client is gradually aggregated into the global consistency graph, and the final clustering results are obtained by spectral clustering on the resulting global consistency graph. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method for clustering federated multi-view data.

Graph Consistency and Diversity Measurement for Federated Multi-View Clustering

Catastrophic forgetting is a key challenge in incremental named entity recognition (INER). Existing methods often address this issue through distillation-based approaches, which transfer previously learned knowledge from the old model to the new one. However, these methods may not fully equip the new model with an adequate understanding of the characteristics of old entity types, leading to confusion when classifying tokens associated with these entity types. To address this challenge, we propose a novel method called **P**rototypical Replay with **O**ld-class **F**ocusing Knowledge Distillation (**POF**) for INER. Our approach focuses on preserving the main characteristics of each previous entity type by storing compact prototypes and replaying them with appropriate frequency. This replay strategy makes the new model review the knowledge of old entity types while minimizing storage needs. Additionally, we introduce an old-class focusing knowledge distillation (OFKD) loss, which distills features only in old-class regions to maintain the quality of old-class prototypes and prevent ineffective prototypical replay while preserving sufficient plasticity for learning new entity types. We conducted experiments on three benchmark datasets (i.e., Few-NERD, I2B2 and OntoNotes5), and the results demonstrate that our method outperforms all previous state-of-the-art methods.

Prototypical Replay with Old-class Focusing Knowledge Distillation for Incremental Named Entity Recognition

Recently, the Large Vision-Language Model (LVLM), which leverages a Large Language Model (LLM) as its cognitive core, has become one of the most representative multimodal model paradigms. However, with the expansion of the unimodal branches, i.e., the visual encoder and the LLM, the storage and computational burdens intensify, posing challenges for deployment. Structured pruning has recently proved promising in compressing large models by trimming a large portion of less important network structures. Nevertheless, most such methods are designed predominantly for LLMs, either relying on unitary importance metrics that fail to deal with modality-wise imbalances or adopting generic pruning and recovery paradigms that overlook the unique calibration status and capability requirements of large models, leading to substantial performance degradation on LVLMs. To address these issues, we propose Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling (UKMP), a novel structured pruning approach for LVLMs. Specifically, it introduces a Unified Knowledge Maintenance Importance metric (UKMI) for pruning, which employs adaptive normalization to balance both block-wise and modality-wise discrepancies, refines gradient-based criteria for enhanced accuracy of importance estimation, and incorporates angle distribution information entropy to maintain knowledge capacity. Moreover, we develop a LoRA-based Progressive Distillation (LPD) process that recalls the pruned weights and performs progressive distillation for more comprehensive recovery.
Extensive experimental results across various vision-language tasks demonstrate the effectiveness of our approach in comparison with state-of-the-art structured pruning methods.

Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling for Large Vision-Language Models

Person Re-IDentification (ReID) aims to identify specific persons from non-overlapping cameras. Recently, some works have suggested using large-scale pre-trained vision-language models like CLIP to boost ReID performance. Unfortunately, existing methods still struggle to address two key issues simultaneously: efficiently transferring the knowledge learned from CLIP and comprehensively extracting the context information from images or videos. To address these issues, we introduce CLIMB-ReID, a pioneering hybrid framework that synergizes the impressive power of CLIP with the remarkable computational efficiency of Mamba. Specifically, we first propose a novel Multi-Memory Collaboration (MMC) strategy to transfer CLIP's knowledge in a parameter-free and prompt-free form. Then, we design a Multi-Temporal Mamba (MTM) to capture multi-granular spatiotemporal information in videos. Finally, with Importance-aware Reorder Mamba (IRM), information from various scales is combined to produce robust sequence features. Extensive experiments show that our proposed method outperforms other state-of-the-art methods on both image and video person ReID benchmarks. We will release the source code for reproduction.

CLIMB-ReID: A Hybrid CLIP-Mamba Framework for Person Re-Identification

Concept-based methods have emerged as a promising direction to develop interpretable neural networks in standard supervised settings. However, most works that study them in incremental settings assume either a static concept set across all experiences or that each experience relies on a distinct set of concepts. In this work, we study concept-based models in a more realistic, dynamic setting where new classes may rely on older concepts in addition to introducing new concepts themselves. We show that concepts and classes form a complex web of relationships, which is susceptible to degradation and needs to be preserved and augmented across experiences. We introduce new metrics to show that existing concept-based models cannot preserve these relationships even when trained using methods to prevent catastrophic forgetting, since they cannot handle forgetting at concept, class, and concept-class relationship levels simultaneously. To address these issues, we propose a novel method, **MuCIL**, that uses multimodal concepts to perform classification without increasing the number of trainable parameters across experiences. The multimodal concepts are aligned to concepts provided in natural language, making them interpretable by design. Through extensive experimentation, we show that our approach obtains state-of-the-art classification performance compared to other concept-based models, achieving over 2× the classification performance in some cases. We also study the ability of our model to perform interventions on concepts, and show that it can localize visual concepts in input images, providing post-hoc interpretations.

Walking the Web of Concept-Class Relationships in Incrementally Trained Interpretable Models

Blended-target domain adaptation (BTDA) leverages learned source knowledge to adapt the model to a blended-target domain that is composed of multiple unlabeled sub-target domains with distinct statistical characteristics. Existing BTDA methods usually overlook semantic correlation information across multiple domains and domain shifts among sub-target domains, resulting in suboptimal adaptation performance. To fully harness semantic knowledge and alleviate domain shifts in the hybrid data distribution, we propose a collaborative semantic consistency alignment (CSCA) method for BTDA. Specifically, we achieve distribution alignment by minimizing the sliced Wasserstein distance between the source and target feature distributions. To alleviate complex domain shifts among all sub-target domains in the hybrid feature space, we design graph networks to propagate and share semantic knowledge across domains, which reduces semantic discrepancies among multiple domains. Additionally, we propose a double consistency regularization method to reduce the susceptibility of the model to domain-specific information, further facilitating semantic alignment and alleviating domain shifts. Extensive experiments on several datasets show that CSCA achieves promising classification performance.
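To make the distribution-alignment term more concrete, here is a minimal sketch of a Monte Carlo estimate of the sliced 2-Wasserstein distance between two equally sized sets of feature vectors. It is not the CSCA implementation: the function name, the number of projections, and the equal-sample-count restriction are all assumptions made for illustration, and a training loss would need a differentiable version in a deep learning framework.

```python
import math
import random


def sliced_wasserstein_distance(source, target, n_projections=64, seed=0):
    """Estimate the sliced 2-Wasserstein distance between two point sets.

    source, target: lists of equal length holding feature vectors (lists of
    floats) of the same dimension. Equal sizes are assumed so that the 1D
    optimal transport reduces to matching sorted projections.
    """
    if not source or len(source) != len(target):
        raise ValueError("source and target must be non-empty and equally sized")
    dim = len(source[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_projections):
        # A normalised Gaussian vector is a uniformly random unit direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both sets onto the direction and sort the 1D values.
        src = sorted(sum(x * w for x, w in zip(p, direction)) for p in source)
        tgt = sorted(sum(x * w for x, w in zip(p, direction)) for p in target)
        # Squared 1D 2-Wasserstein distance: mean squared gap of sorted values.
        total += sum((a - b) ** 2 for a, b in zip(src, tgt)) / len(src)
    # Average over directions, then take the square root.
    return math.sqrt(total / n_projections)


# Toy usage: hypothetical 2D "features" and a shifted copy of them.
source_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_feats = [[x + 2.0, y] for x, y in source_feats]
print(sliced_wasserstein_distance(source_feats, source_feats))
print(sliced_wasserstein_distance(source_feats, target_feats))
```

The estimate is zero when the two sets coincide and grows as the target features drift away from the source, which is the property a method relies on when it minimizes this quantity to align feature distributions.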
