The substantial computational and memory demands of Large Language Models (LLMs) present barriers to their deployment. Block Floating Point (BFP) has been instrumental in accelerating linear operations, which are fundamental to LLM workloads. However, as the sequence length of LLMs increases, nonlinear operations have increasingly become performance bottlenecks, with Attention being a typical example because its computational complexity scales quadratically with input length. These nonlinear operations are still predominantly executed in inefficient floating-point formats, which makes it difficult to optimize the system for both software efficiency and hardware overhead. In this paper, we examine the limitations and potential of applying BFP to nonlinear operations. Based on our findings, we introduce a novel hardware-software co-design framework (DB-Attn), comprising: (i) DBFP, an advanced BFP variant that overcomes the challenges of nonlinear operations with a pivot-focus strategy for diverse data and an adaptive grouping strategy for flexible exponent sharing; (ii) DH-LUT, a novel lookup table algorithm dedicated to accelerating nonlinear operations in the DBFP format; and (iii) an RTL-level DBFP-based engine implemented to support DB-Attn, applicable to FPGA and ASIC. Results show that DB-Attn provides significant performance improvements with negligible accuracy loss, achieving 74% GPU speedup on Softmax of LLaMA and 10x low-overhead performance improvement over SOTA ASIC designs.
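As background for the abstract above, plain block floating point can be sketched as follows. This is a minimal illustration of conventional BFP only, not of DBFP: the pivot-focus and adaptive grouping strategies are not specified in the abstract, and the function name `bfp_quantize`, the block size, and the mantissa width are all assumptions chosen for the example.

```python
import numpy as np

def bfp_quantize(x, block_size=4, mantissa_bits=8):
    """Round-trip a 1-D array through a simple block floating point format.

    Each block of `block_size` values shares one exponent, taken from the
    largest magnitude in the block, and stores signed integer mantissas.
    Returns the dequantized values and the per-block shared exponents.
    (Hypothetical helper for illustration; not the DBFP format.)
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    # All-zero blocks get an arbitrary exponent; their mantissas are zero anyway.
    safe = np.where(max_abs > 0, max_abs, 1.0)
    # Smallest integer e such that max_abs < 2**e.
    shared_exp = np.floor(np.log2(safe)) + 1

    # Value of one mantissa step for each block.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(blocks / scale), -limit, limit)

    dequantized = (mantissas * scale).reshape(-1)[: len(x)]
    return dequantized, shared_exp.reshape(-1).astype(int)


# Values of similar magnitude survive the shared exponent well.
values, exps = bfp_quantize([1.0, 0.5, 0.25, 0.125])

# A single outlier forces a large shared exponent and wipes out small values.
with_outlier, _ = bfp_quantize([100.0, 0.01, 0.0, 0.0])
```

The second call shows why a single shared exponent can be a poor fit for data with a wide dynamic range, which is the kind of difficulty the abstract attributes to nonlinear operations.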

AAAI 2025

Pushing the Limits of BFP on Narrow Precision LLM Inference

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Identifying the causal pathways of unfairness is a critical objective in improving policy design and algorithmic decision-making. However, prior work in causal fairness analysis requires knowledge of the causal graph, hindering practical applications in complex or low-knowledge domains. Moreover, relying on global discovery methods to learn causal structure from data can result in unstable performance with finite samples, potentially leading to contradictory fairness conclusions. To mitigate these issues, we introduce *local discovery for direct discrimination* (LD3): an algorithm tailored to uncover structural evidence of direct discrimination by identifying the causal parents of an outcome variable. LD3 performs a linear number of conditional independence tests relative to variable set size, and allows for latent confounding under the sufficient condition that no parent of the outcome is latent. LD3 prevents unnecessary adjustment, resulting in more interpretable adjustment sets for assessing unfairness.  We introduce a graphical criterion for identifying the *weighted controlled direct effect* (WCDE), a qualitative indicator of direct discrimination, and show that the knowledge returned by LD3 satisfies this criterion. We deploy LD3 for causal fairness analyses of two complex decision systems: criminal recidivism prediction and liver transplant allocation. Results on real-world data demonstrate more plausible causal relations than baselines, which took 46$\times$ to 5870$\times$ longer to execute.

Local Causal Discovery for Structural Evidence of Direct Discrimination

Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. 
There are two limitations in existing tools for music error detection: (1) Existing approaches rely on automatic alignment, which is error-prone due to small deviations between alignment targets; (2) There is a lack of sufficient data to train music error detection models, resulting in over-reliance on heuristics. 
To address (1), we propose a novel transformer model, *Muse*, that takes audio inputs and outputs annotated music scores. 
This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent space representations. 
To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Unlike existing transcription methods repurposed for music error detection, our model can handle multiple instruments, which allows it to scale across instruments and generalize across datasets such as MAESTRO and CocoChorales.

Detecting Music Performance Errors with Transformers

Large-scale text-to-image diffusion models (e.g., DALL-E, SDXL) are capable of generating famous persons by simply referring to their names. ***Is it possible to make such models generate generic identities as simply as the famous ones, e.g., by just using a name?*** In this paper, we explore the existence of a “Name Space”, where any point in the space corresponds to a specific identity. Fortunately, we find some clues in the feature space spanned by the text embeddings of celebrities' names. Specifically, we first extract the embeddings of celebrities' names in the LAION-5B dataset with the text encoder of diffusion models. Such embeddings are used as supervision to learn an encoder that can predict the name (actually an embedding) of a given face image. We experimentally find that such name embeddings give the generated images good identity consistency. Note that, like the names of celebrities, our predicted name embeddings are disentangled from the semantics of text inputs, so the original generation capability of text-to-image models is well preserved. Moreover, by simply plugging in such name embeddings, all variants (e.g., from Civitai) derived from the same base model (i.e., SDXL) readily become identity-aware text-to-image models.

MagicNaming: Consistent Identity Generation by Finding a “Name Space” in T2I Diffusion Models

We study a hinted heterogeneous multi-agent multi-armed bandits problem $\texttt{HMA2B}$, where agents can query low-cost observations (hints) in addition to pulling arms. In this framework, each of the $M$ agents has a unique reward distribution over $K$ arms, and in $T$ rounds, they can observe the reward of the arm they pull only if no other agent pulls that arm.
The goal is to maximize the total utility by querying the minimal necessary hints without pulling arms, achieving time-independent regret. We study $\texttt{HMA2B}$ in both centralized and decentralized setups. Centralized algorithms $\texttt{HCLA}$ and $\texttt{GP-HCLA}$ use a central decision-maker for arm-pulling and hint queries, based on empirical means and $\texttt{kl-UCB}$ index, respectively, achieving $O(M^4K)$ regret with $O(MK\log T)$ adaptive hints. In decentralized setups, we propose two algorithms, $\texttt{HD-ETC}$ and $\texttt{EBHD-ETC}$, that allow agents to choose actions independently through collision-based communication and query hints uniformly until stopping, yielding $O(M^3K^2)$ regret with $O(M^3K\log T)$ hints. 
$\texttt{HD-ETC}$ stops hinting based on an assumed minimum gap $ \Delta_{\min}^\text{G} $, while $\texttt{EBHD-ETC}$ adapts hinting without knowing $ \Delta_{\min}^\text{G} $. 
Last, we establish lower bounds to prove the optimality of our results and confirm them through numerical simulations.
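For readers unfamiliar with the $\texttt{kl-UCB}$ index mentioned above, the following is a minimal sketch of the standard index for Bernoulli rewards, computed by bisection. It is a generic illustration, not the authors' $\texttt{GP-HCLA}$ algorithm: the Bernoulli assumption, the plain $\log t$ exploration budget, and the function names are all assumptions made for this example.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, iters=50):
    """Largest q in [mean, 1] such that pulls * KL(mean, q) <= log(t).

    `mean` is the empirical mean reward of an arm, `pulls` the number of
    observations of that arm, and `t` the current round (assumed names).
    """
    if pulls == 0:
        return 1.0  # an unobserved arm gets the most optimistic index
    budget = math.log(max(t, 1)) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid  # mid is still a plausible mean, move up
        else:
            hi = mid
    return lo


# The index shrinks toward the empirical mean as an arm is observed more often.
few = kl_ucb_index(0.5, pulls=10, t=100)
many = kl_ucb_index(0.5, pulls=1000, t=100)
```

Because the KL divergence is increasing in `q` on `[mean, 1]`, bisection is sufficient to locate the boundary of the confidence region.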

Heterogeneous Multi-Agent Bandits with Parsimonious Hints

Graph Masked AutoEncoder (GMAE) has recently attracted vast interest in handling graph-related tasks by adopting the masking-reconstruction learning paradigm. Most existing GMAE-based methods adhere to the homophily assumption, i.e., connected nodes share the same attributes. However, this assumption does not always hold, because most graphs from real-world applications contain a mix of homophilic and heterophilic edges. Therefore, it is necessary to distinguish them to improve the representational ability of GMAE. In this paper, we propose a $\textbf{T}$eacher-guided $\textbf{E}$dge $\textbf{D}$iscriminator for personalized graph $\textbf{M}$asked $\textbf{A}$uto$\textbf{E}$ncoder ($\textbf{TEDMAE}$). Specifically, we design a teacher-guided edge discriminator that distinguishes homophilic and heterophilic edges by leveraging the embeddings from teacher models with structure and attribute knowledge. Then, we present a personalized graph masked autoencoder that individually tailors the masking, encoding, and reconstruction processes for each graph. Finally, we optimize the model by minimizing two types of loss functions, i.e., the scaled cosine error (SCE) loss and the InfoNCE loss. Experimental results on 10 datasets demonstrate TEDMAE's superior performance on the tasks of node classification and node clustering.
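The scaled cosine error named above is commonly defined in the graph masked autoencoder literature as the mean of $(1 - \cos(x_i, z_i))^{\gamma}$ over reconstructed nodes. The sketch below assumes that common form; the abstract does not state the exact variant or the value of $\gamma$ used by TEDMAE, so `gamma=2.0` and the function name are assumptions.

```python
import numpy as np

def sce_loss(x, z, gamma=2.0, eps=1e-12):
    """Scaled cosine error between original features x and reconstructions z.

    x, z: arrays of shape (num_nodes, dim). The loss is the mean over nodes of
    (1 - cosine_similarity) ** gamma, so gamma > 1 down-weights easy nodes.
    """
    x = np.asarray(x, dtype=np.float64)
    z = np.asarray(z, dtype=np.float64)
    x_unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    z_unit = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    # Clip to guard against rounding pushing the cosine slightly outside [-1, 1].
    cos = np.clip(np.sum(x_unit * z_unit, axis=1), -1.0, 1.0)
    return float(np.mean((1.0 - cos) ** gamma))


# Perfect reconstruction gives a loss near 0; orthogonal vectors give 1.
perfect = sce_loss([[1.0, 2.0]], [[1.0, 2.0]])
orthogonal = sce_loss([[1.0, 0.0]], [[0.0, 1.0]])
```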

Teacher-guided Edge Discriminator for Personalized Graph Masked Autoencoder

In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximize its reward must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, and thus scale poorly. In this paper we present a new, scalable method that enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known, and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state-augmentation of the MDP, and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.

Probabilistic Shielding for Safe Reinforcement Learning

In recent years, semantic segmentation has flourished in various applications.
However, the high computational cost remains a significant challenge that hinders its further adoption. 
The filter pruning method for structured network slimming offers a direct and effective solution for the reduction of segmentation networks. 
Nevertheless, we argue that the majority of existing pruning methods overlook the fact that segmentation is a location-sensitive task, which consequently leads to the sub-optimal performance of existing pruning methods originally designed for image classification when applied to segmentation networks. 
To address this issue, this paper proposes a novel approach, denoted as Spatial-aware Information Redundancy Filter Pruning (SIRFP), which aims to reduce feature redundancy between channels. 
First, we formulate the pruning problem as a maximum edge weight clique problem (MEWCP) in graph theory, thereby minimizing the feature redundancy among the remaining features after pruning. 
Within this framework, we introduce a spatial-aware redundancy metric based on feature maps into the consideration of segmentation network pruning, thus endowing the pruning process with location sensitivity to better adapt to segmentation tasks. 
Additionally, based on the MEWCP, we propose a low computational complexity greedy strategy to solve this NP-hard problem, making it feasible and efficient for structured pruning. 
To validate the effectiveness of our method, we conducted extensive comparative experiments on various challenging datasets.
The results demonstrate the superior performance of SIRFP for semantic segmentation tasks.
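To make the clique formulation above more concrete, here is a minimal sketch of a greedy heuristic for choosing `k` vertices with large total pairwise edge weight. It is an illustration under stated assumptions, not SIRFP itself: the abstract does not give the spatial-aware redundancy metric or the exact greedy rule, so the `weights` matrix (read here as pairwise dissimilarity between filters, where a higher value means less redundancy) and the function name are hypothetical.

```python
import numpy as np

def greedy_max_weight_subset(weights, k):
    """Greedily pick k vertices whose pairwise edge weights sum to a large total.

    weights: symmetric (n, n) matrix; weights[i][j] is the edge weight between
    vertices i and j (assumed here to encode low redundancy between filters).
    Returns the sorted indices of the selected vertices.
    """
    w = np.array(weights, dtype=np.float64)
    n = w.shape[0]
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of vertices")
    np.fill_diagonal(w, 0.0)

    # Seed with the heaviest edge, looking only at the strict upper triangle.
    upper_mask = np.triu(np.ones((n, n), dtype=bool), 1)
    upper = np.where(upper_mask, w, -np.inf)
    i, j = np.unravel_index(np.argmax(upper), upper.shape)
    selected = [int(i), int(j)]

    # gain[v] = total weight between v and the vertices selected so far.
    gain = w[:, i] + w[:, j]
    gain[[i, j]] = -np.inf
    while len(selected) < k:
        v = int(np.argmax(gain))
        selected.append(v)
        gain += w[:, v]
        gain[v] = -np.inf  # never pick the same vertex twice
    return sorted(selected)


# Toy example with four filters; keep the three least mutually redundant ones.
toy = [[0, 5, 1, 1],
       [5, 0, 4, 1],
       [1, 4, 0, 1],
       [1, 1, 1, 0]]
kept = greedy_max_weight_subset(toy, k=3)
```

Each step costs time linear in the number of vertices, which is the kind of low complexity that makes a greedy strategy attractive for an NP-hard selection problem, although it carries no optimality guarantee.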

Structural Pruning via Spatial-aware Information Redundancy for Semantic Segmentation

Open-Vocabulary Detection (OVD) aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on known category data tend to assign higher confidence to trained categories and confuse novel categories with background. To resolve this, we propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision. Specifically, we introduce a wildcard matching method that enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy that synthesizes additional noisy query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the challenging OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories, respectively, without the need for additional training data. The code is available in the supplementary materials.

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

Multi-label metric learning, as an extension of metric learning to multi-label scenarios, aims to learn better similarity metrics for objects with rich semantics. Existing multi-label metric learning approaches employ the common assumption of equal labeling-importance, i.e., all associated labels are considered relevant to the training instance, while there is no differentiation in the relative importance of their semantics. However, this common assumption does not reflect the fact that the importance of each relevant label is generally different, even though such importance information is not directly accessible from the training examples. In this paper, we claim that it is beneficial to leverage the implicit Relative Labeling-Importance (RLI) information to facilitate multi-label metric learning. Specifically, the manifold structure within the feature space is exploited by local linear reconstruction, and then the RLIs are recovered by transferring such structure to the label space. Subsequently, a discriminative multi-label metric learning framework is introduced to align the predictive modeling outputs with the recovered RLIs, under which instances with similar RLI are implicitly pulled closer to each other, while those with dissimilar RLI are pushed further apart. Comprehensive experiments on benchmark multi-label datasets validate the superiority of our proposed approach in learning effective similarity metrics between multi-label examples.

Implicit Relative Labeling-Importance Aware Multi-Label Metric Learning

Current knowledge distillation (KD) methods for semantic segmentation focus on guiding the student to imitate the teacher's knowledge within homogeneous architectures. However, these methods overlook the diverse knowledge contained in architectures with different inductive biases, which is crucial for enabling the student to acquire a more precise and comprehensive understanding of the data during distillation. To this end, we propose for the first time a generic knowledge distillation method for semantic segmentation from a heterogeneous perspective, named HeteroAKD. Due to the substantial disparities between heterogeneous architectures, such as CNN and Transformer, directly transferring cross-architecture knowledge presents significant challenges. To eliminate the influence of architecture-specific information, the intermediate features of both the teacher and student are projected into an aligned logits space. Furthermore, to utilize diverse knowledge from heterogeneous architectures and deliver customized knowledge required by the student, a teacher-student knowledge mixing mechanism (KMM) and a teacher-student knowledge evaluation mechanism (KEM) are introduced. These mechanisms operate by assessing the reliability of the heterogeneous teacher and student knowledge and the discrepancy between them. Extensive experiments conducted on three mainstream benchmarks using various teacher-student pairs demonstrate that our HeteroAKD framework outperforms state-of-the-art KD methods in facilitating distillation between heterogeneous architectures.
