
Pretrained Language Models (PLMs) have become the de facto starting point for fine-tuning on downstream tasks.
However, as model sizes continue to increase, traditional fine-tuning of all parameters becomes challenging.
To address this, parameter-efficient fine-tuning (PEFT) methods have gained popularity as a means to adapt PLMs effectively. In parallel, recent studies have revealed the presence of activation sparsity within the intermediate outputs of the multilayer perceptron (MLP) blocks in transformers. Low activation density enables efficient model inference on sparsity-aware hardware. Building upon this insight, we propose a novel density loss that encourages higher activation sparsity (equivalently, lower activation density) in pre-trained models. We demonstrate the effectiveness of our approach with mainstream PEFT techniques, including QLoRA, LoRA, Adapter, and Prompt/Prefix Tuning, to facilitate efficient model adaptation across diverse downstream tasks. Experiments show that our proposed method, \textbf{DEFT} (Density-Efficient Fine-Tuning), can consistently reduce activation density by up to 44.94% on $RoBERTa_{Large}$ and by 53.19% (encoder density) and 90.60% (decoder density) on $Flan-T5_{XXL}$ (11B) compared to PEFT, using the GLUE and QA (SQuAD) benchmarks respectively, while maintaining competitive performance on downstream tasks. We also introduce \textbf{ADA-DEFT}, an adaptive variant of our DEFT approach, which achieves significant memory and runtime savings during inference for large models. For instance, ADA-DEFT reduces runtime by 8.75% and memory usage by 16.78% in $Flan-T5_{XL}$, and by 2.79% and 2.54% respectively in $Flan-T5_{XXL}$. Additionally, we show that DEFT is complementary to quantized and pruned models.

AAAI 2025

From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers
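As a rough illustration of the density loss described in the DEFT abstract, the sketch below adds a smooth penalty on MLP activations to a task loss. The tanh-based surrogate, the function names, and the `alpha` and `beta` parameters are assumptions made for illustration; the abstract does not state the exact form of the loss.

```python
import math


def hard_density(acts):
    """Fraction of activations that are non-zero (not differentiable)."""
    flat = [a for row in acts for a in row]
    return sum(1 for a in flat if a != 0.0) / len(flat)


def soft_density(acts, beta=10.0):
    """Smooth surrogate for hard_density (assumed form, for illustration).

    tanh(beta * |a|) is 0 when a == 0 and approaches 1 as |a| grows,
    so the mean behaves like a soft count of non-zero activations.
    """
    flat = [a for row in acts for a in row]
    return sum(math.tanh(beta * abs(a)) for a in flat) / len(flat)


def total_loss(task_loss, acts, alpha=1.0, beta=10.0):
    """Task loss plus a weighted density penalty (alpha is hypothetical)."""
    return task_loss + alpha * soft_density(acts, beta)


# Post-ReLU activations of a toy MLP block: one of four entries is active.
example_acts = [[0.0, 2.0], [0.0, 0.0]]
example_total = total_loss(1.0, example_acts, alpha=2.0)
```

In a real fine-tuning setup the penalty would be computed on framework tensors so that gradients flow to the trainable PEFT parameters; the plain-Python version here only shows the arithmetic.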

learning on the edge

model compression

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Monitoring a large population of dynamic processes with limited resources presents a significant challenge across various industrial sectors. This is due to 1) the inherent disparity between the available monitoring resources and the extensive number of processes to be monitored and 2) the unpredictable and heterogeneous dynamics inherent in the progression of these processes. Online learning approaches, commonly referred to as bandit methods, have demonstrated notable potential in addressing this issue by dynamically allocating resources and effectively balancing the exploitation of high-reward processes and the exploration of uncertain ones. However, most online learning algorithms are designed for 1) a centralized setting that requires data sharing across processes for accurate predictions or 2) a homogeneity assumption that estimates a single global model from decentralized data. To overcome these limitations and enable online learning in a heterogeneous population under a decentralized setting, we propose a federated collaborative online monitoring method. Our approach utilizes representation learning to capture the latent representative models within the population and introduces a novel federated collaborative UCB algorithm to estimate these models from sequentially observed decentralized data. This strategy facilitates informed monitoring resource allocation. The efficacy of our method is demonstrated through theoretical analysis, simulation studies, and its application to decentralized cognitive degradation monitoring in Alzheimer’s disease.

FCOM: A Federated Collaborative Online Monitoring Framework via Representation Learning
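To make the exploration and exploitation trade-off mentioned in the FCOM abstract concrete, here is a minimal sketch of the classic single-agent UCB1 rule, where each "arm" stands for one monitored process. It is not the federated collaborative UCB algorithm proposed in the paper, whose details are not given in the abstract; the function names, the Bernoulli reward model, and the constant `c` are illustrative assumptions.

```python
import math
import random


def ucb1_select(counts, sums, t, c=2.0):
    """Pick the arm with the highest upper confidence bound at round t."""
    # Play every arm once before trusting the confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        sums[arm] / counts[arm] + math.sqrt(c * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run(means, horizon, seed=0):
    """Simulate UCB1 on Bernoulli arms; return how often each arm was chosen."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    for t in range(1, horizon + 1):
        arm = ucb1_select(counts, sums, t)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


pulls = run([0.1, 0.9], horizon=2000)
```

The bonus term shrinks as an arm is observed more often, so rarely monitored processes are revisited while high-reward ones receive most of the budget.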

Recent advances in diffusion models focus on efficiently handling conditional generative tasks without extra training. The process decomposes the result into two components: (1) an unconditional sample $u_{t}$, generated in the absence of conditions, and (2) a condition correction $\phi$, which adjusts $u_{t}$ to incorporate the guidance image $I_c$. This adjustment is quantified by the pixel-level measure $\| \mathcal{A}(\mathcal{D}(\mathfrak{C}'_{\phi}(u_t))) - I_c \|_2$, where $\mathcal{D}(z_t)$ decodes the latent code $z_{t} = \mathfrak{C}'_{\phi}(u_t) = u_t + \phi$, and the forward operator $\mathcal{A}(x_t)$ translates the noisy image $x_t=\mathcal{D}(z_t)$ into the guidance domain for comparison with the guidance image $I_c$. To enhance the fidelity of $\phi$, we propose a learnable latent forward operator $\mathfrak{A}_{\theta}(z_t, t)$, focusing on the latent-space consistency $\| \mathfrak{A}_{\theta}(\mathfrak{C}'_{\phi}(u_t), t) - \mathcal{E}(I_c) \|_2$, with the expectation that this latent-space consistency approximates the pixel-level fidelity measure. Here, $\mathcal{E}$ acts as the encoder, mapping the guidance image $I_c$ into the latent space. Furthermore, a correctional operator $\mathfrak{C}''_{\psi}(z_t) = z_t + \psi$ is proposed to rectify model mismatch in the latent guidance model, thereby refining the consistency constraint to $\| \mathfrak{A}_{\theta}(\mathfrak{C}''_{\psi}(\mathfrak{C}'_{\phi}(u_t)), t) - \mathcal{E}(I_c) \|_2$. Determining the condition term $\phi$ and the correction estimate $\psi$ is akin to solving a blind inverse problem. Our EMControl employs the Expectation-Maximization (EM) algorithm to solve this blind inverse problem during the reverse sampling process. This technique ensures that samples, once consistent with the guidance, are accurately mapped back onto the noisy data manifold, adhering to the data's inherent distribution.
EMControl delivers superior performance in conditional diffusion generation tasks compared to previous approaches. Moreover, its application to multiple-condition scenarios underscores its versatility and robustness across a range of generative tasks.

EMControl: Adding Conditional Control to Text-to-Image Diffusion Models Via Expectation-Maximization

Speech-driven 3D facial animation has garnered considerable attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (\textbf{D}ynamic \textbf{E}motion \textbf{E}mbedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (\textbf{T}emporally \textbf{H}ierarchical VQ-VAE) as an expressive and robust motion prior that overcomes the limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, a talking-head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Source code will be made publicly available soon.

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

Given the large volume of side information from different modalities, multimodal recommender systems have become increasingly vital, as they exploit richer semantic information beyond user-item interactions. Recent works highlight that leveraging Graph Convolutional Networks (GCNs) to explicitly model multimodal item-item relations can significantly enhance recommendation performance. However, due to the inherent over-smoothing issue of GCNs, existing models benefit only from shallow GCNs with limited representation power. This drawback is especially pronounced when facing complex and high-dimensional patterns such as multimodal data, which require large-capacity models to accommodate complicated correlations. To this end, we investigate bypassing GCNs when modeling multimodal item-item relationships. More specifically, we propose a Topology-aware Multi-Layer Perceptron (TMLP), which uses MLPs instead of GCNs to model the relationships between items.
TMLP enhances MLPs with topological pruning to denoise item-item relations and intra (inter)-modality learning to integrate higher-order modality correlations.
Extensive experiments on three real-world datasets verify TMLP's superiority over nine baselines. We also find that by discarding the internal message passing in GCNs, which is sensitive to node connections, TMLP achieves significant improvements in both training efficiency and robustness against existing models.

Beyond Graph Convolution: Multimodal Recommendation with Topology-aware MLPs

With the successful transition of Transformers from NLP to CV domains, Vision Transformers (ViTs) have achieved state-of-the-art performance in many CV tasks. However, backdoor attacks, a significant threat in deep learning, also pose a risk to the security of ViT models. Recently, several backdoor attack methods targeting the patch-level self-attention mechanism in ViTs have been proposed, but they are relatively naive in terms of stealthiness and robustness against defensive measures, and they lack in-depth investigation. In this paper, we explore the crucial role of attention-level imperceptibility in backdoor attacks for ViTs and propose Attention-Imperceptible Backdoor Attacks on Vision Transformers (AIBA). In AIBA, a constrained adversarial perturbation is used as the trigger to achieve visual imperceptibility. Additionally, the trigger is designed to be seamlessly implanted into the focal areas of the image, ensuring that it receives enough attention from the model without causing anomalies at the attention level. For the backdoor learning process, we design an efficient constrained bi-level optimization training strategy to implant an effective backdoor in the victim model using the imperceptible trigger. We evaluate the effectiveness of the proposed AIBA across multiple datasets and ViT benchmarks and explore its robustness against current ViT-specific defense methods. The experimental results demonstrate that our backdoor attack method can successfully implant a powerful and stealthy backdoor into ViTs.

Attention-Imperceptible Backdoor Attacks on Vision Transformers

In recent years, Graph Neural Networks (GNNs) have achieved remarkable success in many graph mining tasks.
However, scaling them to large graphs is challenging due to the high computational and storage costs of repeated feature propagation and non-linear transformation during training.
One commonly employed approach to address this challenge is model simplification, which executes the $\textbf{P}$ropagation ($\textbf{P}$) only once during pre-processing, $\textbf{C}$ombines ($\textbf{C}$) the resulting receptive fields in different ways, and then feeds them into a simple model for better performance.
Despite their high predictive performance and scalability, these methods still face two limitations.
First, existing approaches mainly focus on exploring different $\textbf{C}$ methods from the model perspective, neglecting the crucial problem of performance degradation with increasing $\textbf{P}$ depth from the data-centric perspective, known as the over-smoothing problem.
Second, pre-processing overhead takes up most of the end-to-end processing time, especially for large-scale graphs.
To address these limitations, we present random walk with noise masking (RMask), a plug-and-play module compatible with the existing model-simplification works. 
This module enables the exploration of deeper GNNs while preserving their scalability.
Unlike previous model-simplification works, we focus on continuous $\textbf{P}$ and find that the noise within each $\textbf{P}$ causes the over-smoothing issue; we then use an efficient masking mechanism to eliminate it.
Experimental results on six real-world datasets demonstrate that model-simplification works equipped with RMask yield superior performance compared to their original version and can make a good trade-off between accuracy and efficiency.

Towards Scalable and Deep Graph Neural Networks via Noise Masking
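The model-simplification idea in this abstract, where propagation is run once in pre-processing and the resulting receptive fields are handed to a simple model, can be sketched as follows. The sketch also shows the over-smoothing effect that motivates the paper: node features become nearly indistinguishable as propagation depth grows. It does not implement RMask or its noise masking; the row-normalised adjacency, the helper names, and the toy graph are assumptions for illustration.

```python
def row_normalize(adj):
    """Turn an adjacency matrix into a random-walk transition matrix."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def propagate(trans, feats):
    """One propagation step: multiply the transition matrix by the features."""
    n, d = len(trans), len(feats[0])
    return [
        [sum(trans[i][j] * feats[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]


def precompute_hops(adj, feats, depth):
    """Pre-processing: features after 0, 1, ..., depth propagation steps."""
    trans = row_normalize(adj)
    hops = [feats]
    for _ in range(depth):
        hops.append(propagate(trans, hops[-1]))
    return hops


def spread(feats):
    """Largest gap between nodes in the first feature dimension."""
    column = [f[0] for f in feats]
    return max(column) - min(column)


# Toy path graph with self-loops (3 nodes) and one feature per node.
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
hops = precompute_hops(adj, [[1.0], [0.0], [-1.0]], depth=20)
```

A downstream model would then combine entries of `hops` (for example by concatenation or weighted averaging) without touching the graph again, which is what makes these methods scalable.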

Multimodal aspect-based sentiment analysis (MABSA) integrates text and images to perform fine-grained sentiment analysis on specific aspects, enhancing the understanding of user opinions in various applications. Existing methods use modality alignment for information interaction and fusion between images and text, but an inherent gap between these two modalities necessitates a more direct bridging mechanism to effectively connect image understanding with text content. For this, we propose the Descriptions Enhanced Question-Answering Framework (DEQA), which generates descriptions of images using GPT-4, leveraging the multimodal large language model to provide more direct semantic context of images. In DEQA, to help the model better understand the task's purpose, we frame MABSA as a multi-turn question-answering problem to add semantic guidance and hints. We input text, image, and description into separate experts in various combinations, allowing each expert to focus on different features and thereby improving the comprehensive utilization of input information. By integrating these expert outputs within a multi-turn question-answering format, we employ a multi-expert ensemble decision-making approach to produce the final prediction results. Experimental results on two widely-used datasets demonstrate that our method achieves state-of-the-art performance. Furthermore, our framework substantially outperforms GPT-4o and other multimodal large language models, showcasing its superior effectiveness in multimodal sentiment analysis.

DEQA: Descriptions Enhanced Question-Answering Framework for Multimodal Aspect-Based Sentiment Analysis

Federated Learning (FL) enables collaborative learning from distributed data while preserving the privacy of participating clients. While supervised federated learning with labeled data has made notable strides, federated semi-supervised learning (FSSL) lags behind. Existing works on FSSL rely heavily on fully-labeled clients, while ignoring the distribution of pseudo-labels generated from skewed unlabeled data. In this work, we offer empirical and theoretical insights into the challenges encountered when applying conventional semi-supervised algorithms in the federated regime. Specifically, we highlight how the inherent data heterogeneity in FSSL can exacerbate issues within the pseudo-labeling process. Motivated by these observations, we propose federated learning with progressive distribution matching (FedPDM) to regularize the distribution of pseudo-labels, aiming to progressively reshape it to align with the ground-truth distribution. The matching problem can be formulated as an optimal transport (OT) problem and efficiently solved by Sinkhorn-Knopp iteration. Through extensive experiments, we demonstrate the superiority of FedPDM over prior art for FSSL on a variety of models and datasets.

Progressive Distribution Matching for Federated Semi-Supervised Learning
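The abstract above notes that the matching problem can be solved by Sinkhorn-Knopp iteration. Below is a minimal sketch of that iteration for entropy-regularised optimal transport between two discrete marginals. The cost matrix, the marginals, and the `eps` and `iters` values are illustrative assumptions, not the FedPDM configuration.

```python
import math


def sinkhorn(cost, row_marg, col_marg, eps=0.5, iters=200):
    """Return a transport plan whose row/column sums match the marginals."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: a lower cost gives a larger kernel entry.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match each marginal.
        u = [row_marg[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marg[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: move mass from a (0.7, 0.3) distribution to a (0.4, 0.6) one.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.7, 0.3], [0.4, 0.6])
```

In a pseudo-labeling setting, one marginal would typically describe the samples and the other a target class distribution, so that the resulting plan reassigns label mass toward that target.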

In this paper, we link two existing approaches to derive counterfactuals: adaptations based on a causal graph, and optimal transport. We extend "Knothe's rearrangement" and "triangular transport" to probabilistic graphical models, and use this counterfactual approach, referred to as sequential transport, to discuss individual fairness. After establishing the theoretical foundations of the proposed method, we demonstrate its application through numerical experiments on both synthetic and real datasets.

Sequential Conditional Transport on Probabilistic Graphs for Interpretable Counterfactual Fairness
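Knothe's rearrangement builds a multivariate map out of one-dimensional monotone transports applied one variable at a time. The sketch below shows only that one-dimensional building block, in an empirical form: a value is mapped from a source sample to a target sample by matching quantiles. It is not the sequential transport method of the paper; the function name and the toy samples are assumptions for illustration.

```python
from bisect import bisect_right


def transport_1d(x, source, target):
    """Map x to the target value that sits at the same empirical quantile.

    This is the empirical version of T = F_target^{-1}(F_source(x)), the
    monotone optimal transport map between two one-dimensional samples.
    """
    src = sorted(source)
    tgt = sorted(target)
    rank = bisect_right(src, x)  # number of source points <= x
    # Index of the smallest target value whose empirical CDF reaches
    # rank / len(src), computed with integer arithmetic to avoid rounding.
    idx = (rank * len(tgt) + len(src) - 1) // len(src) - 1
    return tgt[max(0, min(len(tgt) - 1, idx))]


# Toy counterfactual: "what would the value 2 be in the other group?"
counterfactual = transport_1d(2, source=[1, 2, 3, 4], target=[10, 20, 30, 40])
```

A sequential scheme would apply such a map to each variable in the order given by the graph, conditioning on the variables already transported.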

Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge due to the contrasting modeling paradigms of discrete sequences. We introduce \textbf{FoldTokenizer} to represent protein sequence-structure as discrete symbols. This approach involves projecting residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We call the learned discrete symbols \textbf{FoldToken}, and the sequence of FoldTokens serves as a new protein language, transforming the protein sequence-structure into a unified modality. We apply the resulting protein language to the general backbone inpainting task, building the first GPT-style model (\textbf{FoldGPT}) for sequence-structure co-generation with promising results. Key to our success is the substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (\textbf{SoftCVQ}). Code will be released upon acceptance.
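To illustrate what projecting continuous features into a discrete space means, the sketch below implements plain nearest-neighbour vector quantization: each continuous vector is replaced by the index of its closest codebook entry. This is the standard hard VQ lookup, not the SoftCVQ module proposed in the paper; the codebook values and function names are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vec, codebook):
    """Return (token index, codebook vector) of the nearest codebook entry."""
    idx = min(range(len(codebook)),
              key=lambda k: squared_distance(vec, codebook[k]))
    return idx, codebook[idx]


def tokenize(vectors, codebook):
    """Turn a sequence of continuous vectors into discrete token indices."""
    return [quantize(v, codebook)[0] for v in vectors]


def reconstruction_error(vectors, codebook):
    """Mean squared error between inputs and their quantized versions."""
    total = sum(squared_distance(v, quantize(v, codebook)[1]) for v in vectors)
    return total / len(vectors)


# Hypothetical 2-D codebook with three entries and three input vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
tokens = tokenize([[0.1, -0.1], [0.9, 1.2], [3.0, 5.0]], codebook)
```

The resulting index sequence plays the role of a token sequence that a language model can consume, while the reconstruction error measures how much information the discretisation loses.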
