Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities.

The typical autoregressive decoding method requires a separate forward pass through the model for each token generated, which is computationally inefficient and poses challenges for deploying LLMs in latency-sensitive scenarios.

The main limitations of current decoding methods stem from their inefficiencies and resource demands. Existing approaches either require fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts.

To address these issues, we introduce a novel methodology called *ADED*, which accelerates LLM decoding without requiring fine-tuning. Our approach involves an adaptive draft-verification process that evolves over time to improve efficiency. We utilize a tri-gram matrix-based LLM representation to dynamically approximate the output distribution of the LLM, allowing the model to adjust to changing token probabilities during the decoding process. Additionally, we implement a draft construction mechanism that effectively balances exploration and exploitation, ensuring that the drafts generated are both diverse and close to the true output distribution of the LLM.

The importance of this design lies in its ability to optimize the draft distribution adaptively, leading to faster and more accurate decoding. Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that ADED significantly accelerates the decoding process while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications.
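The draft-verification loop described in the abstract can be illustrated with a small, self-contained sketch. It is not the ADED implementation: the "LLM" is a hypothetical deterministic toy function (`toy_target_next`), the tri-gram table is a plain dictionary of counts, drafts are built greedily from the most frequent continuation (so the exploration/exploitation balancing mentioned above is not reproduced), and verification is assumed to cost one batched forward pass per draft. All names are illustrative, and the prompt is assumed to contain at least two tokens.

```python
from collections import defaultdict


def toy_target_next(context):
    """Hypothetical stand-in for one greedy forward pass of the LLM."""
    a, b = context[-2], context[-1]
    return (a + 2 * b + 1) % 5


def decode_autoregressive(prompt, n_new):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(toy_target_next(seq))
    return seq, n_new


def decode_with_drafts(prompt, n_new, draft_len=4):
    """Draft from a tri-gram table, verify against the target, then update the table."""
    seq = list(prompt)
    total = len(prompt) + n_new
    trigram = defaultdict(lambda: defaultdict(int))  # (a, b) -> {c: count}
    for a, b, c in zip(seq, seq[1:], seq[2:]):
        trigram[(a, b)][c] += 1

    steps = 0  # number of (assumed batched) verification passes
    while len(seq) < total:
        # 1. Draft: follow the most frequent tri-gram continuation.
        draft, ctx = [], (seq[-2], seq[-1])
        for _ in range(draft_len):
            counts = trigram.get(ctx)
            if not counts:
                break
            tok = max(counts, key=counts.get)
            draft.append(tok)
            ctx = (ctx[1], tok)

        # 2. Verify: keep the longest prefix the target model agrees with.
        steps += 1
        accepted = []
        for tok in draft:
            if tok != toy_target_next(seq + accepted):
                break
            accepted.append(tok)
        # The target always contributes one token (a correction or a bonus).
        accepted.append(toy_target_next(seq + accepted))

        # 3. Adapt: record the verified tokens so later drafts improve.
        for tok in accepted:
            if len(seq) == total:
                break
            trigram[(seq[-2], seq[-1])][tok] += 1
            seq.append(tok)
    return seq, steps


reference, baseline_steps = decode_autoregressive([1, 2], 60)
output, draft_steps = decode_with_drafts([1, 2], 60)
print(output == reference, baseline_steps, draft_steps)
```

Because every accepted token is checked against the target model, the output matches plain autoregressive decoding; any saving comes only from needing fewer verification passes once the tri-gram table has seen the recurring patterns.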

AAAI 2025

Adaptive Draft-Verification for Efficient Large Language Model Decoding

learning optimization for snlp

snlp

technical paper

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

We present EvHDR-NeRF to recover a High Dynamic Range (HDR) radiance field from event streams and a set of low dynamic range (LDR) views with single exposures. Using EvHDR-NeRF, we can generate both novel HDR views and novel LDR views under different exposures. The key to our method is a new model of the relationship between event streams and LDR images that accounts for both the Camera Response Function (CRF) and exposure time. Based on this relationship, we categorize events into inter-frame events and intra-exposure events. The former are used to build the HDR radiance field, and the latter are used to deblur potentially blurred images. Compared to existing methods, this method can effectively reconstruct the HDR radiance field even when the input images are degraded. Experimental results demonstrate that our method achieves state-of-the-art performance in HDR reconstruction, providing a more adaptable and accurate solution for complex imaging applications.

EvHDR-NeRF: Building High Dynamic Range Radiance Fields with Single Exposure Images and Events

Molecular dynamics (MD) has long been the *de facto* choice for simulating intricate physical systems from first principles. Recent efforts utilize the implicit neural representation (INR) to directly learn surface point clouds' signed distance function (SDF) with promising outcomes. However, INR's temporal generalization to unexplored molecular systems remains limited, which poses a significant barrier to applying INR to a broader range of real-world scenarios. This study introduces MoE-DSR, an enhanced version of dynamic surface representations (DSR) that effectively integrates the mixture-of-experts (MoE) strategy. Specifically, the router employs a novel geometric surface cloud network to extract the structural information from the initial static protein conformation as prior knowledge. Meanwhile, the experts, comprising a team of equivariant implicit neural networks (E-INNs), each responsible for distinct protein families, ensure precise SDF estimation across varied protein data landscapes. We showcase the ability of MoE-DSR to model dynamic protein surface shapes using ensembles from ATLAS, the largest available database of protein MD simulations. Extensive experiments validate its effectiveness in analyzing complex molecular systems across continuous space and time domains.

Generalized Implicit Neural Representations for Dynamic Molecular Surface Modeling

Compared to conventional long-tail learning, which focuses on addressing class-wise imbalances, generalized long-tail (GLT) learning considers that samples within each class still conform to long-tailed distributions due to varying attributes, known as attribute imbalance. In the presence of such imbalance, the assumption of equivalence between the class-conditional probability densities of the training and testing sets is no longer tenable. Existing GLT approaches typically employ regularization techniques to avoid directly modeling the class-conditional probability density (CCPD) ratio between training and test data, leading to suboptimal performance. This study aims to directly estimate this ratio, for which a novel class-attribute aware logit-adjusted (CALA) loss incorporating both the CCPD ratio and the class priors is presented. Two new GLT learning methods, named Heuristic-CALA and Meta-CALA, are then proposed, which estimate the CCPD ratio in the CALA loss by leveraging the neighborhood information of samples. Extensive experiments across diverse scenarios susceptible to class and attribute imbalances showcase the state-of-the-art performance of Meta-CALA. Furthermore, while Heuristic-CALA exhibits inferior performance compared to Meta-CALA, it incurs only negligible additional training time compared to the Cross-Entropy loss, yet surpasses existing methods by a significant margin.
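To make the idea of a logit-adjusted loss with class priors and a density-ratio term concrete, here is a minimal sketch. It is not the paper's CALA definition, which the abstract does not give. It shows the standard logit-adjusted cross-entropy (class log-priors added to the logits) and assumes, purely for illustration, that a per-class ratio estimate for the sample enters additively in log space. The function names and the `ratios` argument are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def adjusted_cross_entropy(logits, label, priors, ratios=None):
    """Cross-entropy on logits shifted by log-prior and an assumed log-ratio term.

    priors: class prior probabilities estimated from the training set.
    ratios: hypothetical per-class density-ratio estimates for this sample;
            defaults to 1.0 (no attribute-level correction).
    """
    if ratios is None:
        ratios = [1.0] * len(logits)
    adjusted = [
        z + math.log(p) + math.log(r)
        for z, p, r in zip(logits, priors, ratios)
    ]
    return -log_softmax(adjusted)[label]


# A tail-class sample is penalized more heavily than under plain cross-entropy,
# which pushes the model to produce a larger margin for rare classes.
plain = adjusted_cross_entropy([0.0, 0.0], 1, priors=[0.5, 0.5])
tail = adjusted_cross_entropy([0.0, 0.0], 1, priors=[0.9, 0.1])
print(plain, tail)
```

With uniform priors and ratios of 1.0 the adjustment cancels inside the softmax, so the function reduces to ordinary cross-entropy.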

Class and Attribute-Aware Logit Adjustment for Generalized Long-Tail Learning

We introduce RealPortrait, a framework based on Diffusion Transformers (DiT), designed to generate highly expressive and visually appealing portrait animations. Given a static portrait image, our method can transfer complex facial expressions and head pose movements extracted from a driving video onto the portrait, transforming it into a lifelike video.
Specifically, we exploit the robust spatial-temporal modeling capabilities of DiT, enabling the generation of portrait videos that maintain high-fidelity visual details and ensure temporal coherence. In contrast to conventional image-to-video generation frameworks that necessitate a separate reference network, we incorporate an efficient reference attention within the DiT backbone, thereby obviating the computational overhead and achieving superior reference appearance preservation.
Concurrently, we integrate a parallel ControlNet to precisely regulate intricate facial expressions and head poses. Diverging from prior methods that utilize explicit sparse motion representations, such as facial landmarks or 3DMM coefficients, we adopt a dense implicit motion representation as the control guidance. This implicit motion representation excels in capturing nuanced emotional facial expressions and subtle non-rigid dynamics of the lips.
To further enhance the generalization capability of the model, we augment the training dataset by incorporating a substantial volume of facial image data through random crop augmentation. This strategy ensures the model's robustness across a wide variety of facial appearances and expressions.
Empirical evaluations demonstrate that RealPortrait excels in generating portrait animations with hyper-realistic quality and exceptional temporal coherence in appearance retention. The framework effectively mitigates the uncanny valley effect, significantly narrowing the disparity between synthetic and real portrait animations.

RealPortrait: Realistic Portrait Animation with Diffusion Transformers

Golog is an expressive high-level agent language that includes nondeterministic operators, which allow some decisions to be deferred until execution time. This so-called program realization is typically implemented by means of search, or in an incremental online fashion. In this paper, we consider the more realistic case where parts of the nondeterminism are under the control of the environment. Program realization then becomes a synthesis problem, where a successful realization executes the program and satisfies the temporal goal for all possible environment actions. We consider Golog programs in combination with an expressive class of first-order action theories that allow for an unbounded number of objects and non-local effects, together with a temporal goal specified in a first-order extension of LTLf. We solve the synthesis problem by constructing a game arena that captures all possible executions of the program while tracking the satisfaction of the temporal goal and then solving the resulting two-player game. We evaluate the approach in two domains, showing its general feasibility.

LTLf Synthesis on First-Order Agent Programs in Nondeterministic Environments

This paper investigates the challenging problem of multi-label zero-shot learning (MLZSL), in which a model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge such as semantic information. Existing methods usually analyze the relationships among the seen classes in a sample along spatial or semantic dimensions and transfer the learned model to unseen classes. However, they neglect the integrity of local and global features. Although an attention structure can accurately locate local features, especially objects, it significantly compromises their integrity, and the relationships between classes are also affected. Coarse processing of global features likewise reduces comprehensiveness. This neglect causes the model to lose its grasp of the main components of the image. Relying only on the local existence of seen classes during inference introduces unavoidable bias. In this paper, we propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, to make full use of such properties and enable a more accurate and robust visual-semantic projection. In terms of spatial information, we achieve effective refinement by group-aggregating image features into several semantic prompts, which aggregates semantic information rather than class information and preserves the correlation between semantics. In terms of global semantics, we use global forward propagation to collect as much information as possible to ensure that semantics are not omitted. Experiments on the large-scale MLZSL benchmark datasets NUS-Wide and Open-Images-v4 demonstrate that the proposed Epsilon outperforms other state-of-the-art methods by large margins.

Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning

Reliable failure detection holds paramount importance in safety-critical applications.
Yet, neural networks are known to produce overconfident predictions for misclassified samples. 
As a result, failure detection remains problematic, because existing confidence score functions rely on category-level signals (the logits).
This research introduces an innovative strategy, leveraging human-level concepts for a dual purpose: to reliably detect *when* a model fails and to transparently interpret *why*.
By integrating a nuanced array of signals for each category, our method enables a finer-grained assessment of the model's confidence.
We present a simple yet highly effective approach based on the ordinal ranking of concept activation to the input image.
Without bells and whistles, our method is able to significantly reduce the false positive rate across diverse real-world image classification benchmarks, specifically by 3.7% on ImageNet and 9.0% on EuroSAT.

Interpretable Failure Detection with Human-Level Concepts

The accurate assessment of sperm morphology is crucial in andrological diagnostics, where the segmentation of sperm images presents significant challenges. Existing approaches frequently rely on large annotated datasets and often struggle with the segmentation of overlapping sperm and the presence of dye impurities. To address these challenges, this paper first analyzes the issue of overlapping sperm tails from a geometric perspective and introduces a novel clustering algorithm, Con2Dis, which effectively segments overlapping tails by considering three essential factors: CONnectivity, CONformity, and DIStance. Building on this foundation, we propose an unsupervised method, SpeHeaTal, designed for the comprehensive segmentation of the SPErm HEAd and TAiL. SpeHeaTal employs the Segment Anything Model (SAM) to generate masks for sperm heads while filtering out dye impurities, utilizes Con2Dis to segment tails, and then applies a tailored mask splicing technique to produce complete sperm masks. Experimental results underscore the superior performance of SpeHeaTal, particularly in handling images with overlapping sperm.

SpeHeaTal: A Cluster-Enhanced Segmentation Method for Sperm Morphology Analysis

Self-supervised stereo matching has drawn attention due to its ability to estimate disparity without needing ground-truth data.
However, existing self-supervised stereo matching methods rely heavily on the photometric consistency assumption, which is vulnerable to natural disturbances, resulting in ambiguous supervision and inferior performance compared with supervised methods.
To relax, and even bypass, the photometric consistency assumption, we propose a novel self-supervised framework named ***DualNet***, which consists of two key steps: robust self-supervised teacher learning and pseudo-label supervised student training.
Specifically, the teacher model is first trained in a self-supervised manner with a focus on feature-metric consistency and data augmentation consistency.
Then, the output of the teacher model is geometrically constrained to obtain high-quality pseudo labels. 
Benefiting from these high-quality pseudo labels, the student model can outperform its teacher model by a large margin.
With the two well-designed steps, the proposed framework ***DualNet*** ranks first among all self-supervised methods on multiple benchmarks, surprisingly even outperforming several supervised counterparts.

DualNet: Robust Self-Supervised Stereo Matching with Pseudo-Label Supervision

Logistics and transportation networks require a large amount of resources to realise the necessary connections between locations, and minimizing these resources is a vital aspect of planning research.
Since such networks have dynamic connections that are available only at specific times, intricate models are needed to portray them accurately.
In this paper, we study the problem of minimizing the number of resources needed to realise a dynamic network, using the temporal graph model, in which edges appear only at specific points in time.
Given a temporal graph and a natural number $k$, we ask whether we can cover every temporal edge exactly once using at most $k$ temporal journeys; in a temporal journey consecutive edges have to adhere to the order of time.
We conduct a thorough investigation of the complexity of the problem with respect to four dimensions: 
(a) whether the type of the temporal journeys is a walk, a trail, or a path; 
(b) whether the chronological order of edges in the journey is strict or non-strict; 
(c) whether the temporal graph is directed or undirected; 
(d) whether the start and end points of each journey are given or not.
We almost completely resolve the complexity of all these problems and provide dichotomies for each one of them with respect to $k$.
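The covering question above can be made concrete with a small checker. This is a sketch of how a candidate solution might be verified, not an algorithm from the paper: it handles only the directed case without prescribed endpoints, represents a temporal edge as a hypothetical `(u, v, t)` tuple, and assumes that a trail repeats no underlying static edge and a path repeats no vertex.

```python
def is_temporal_journey(edges, strict=True, kind="walk"):
    """Check that a list of directed temporal edges (u, v, t) forms a journey."""
    if not edges:
        return False
    for (_, v1, t1), (u2, _, t2) in zip(edges, edges[1:]):
        if v1 != u2:
            return False  # consecutive edges must share an endpoint
        if t2 < t1 or (strict and t2 == t1):
            return False  # time labels must respect the chosen ordering
    if kind == "trail":
        static = [(u, v) for u, v, _ in edges]
        return len(set(static)) == len(static)
    if kind == "path":
        vertices = [edges[0][0]] + [v for _, v, _ in edges]
        return len(set(vertices)) == len(vertices)
    return True


def is_exact_cover(temporal_edges, journeys, k, strict=True, kind="walk"):
    """Check that at most k journeys use every temporal edge exactly once."""
    if len(journeys) > k:
        return False
    if not all(is_temporal_journey(j, strict, kind) for j in journeys):
        return False
    used = [e for j in journeys for e in j]
    return sorted(used) == sorted(temporal_edges)


graph = [("a", "b", 1), ("b", "c", 2), ("c", "a", 2), ("a", "d", 3)]
two_journeys = [graph[:2], graph[2:]]
print(is_exact_cover(graph, two_journeys, k=2))
print(is_exact_cover(graph, [graph], k=1, strict=True))
print(is_exact_cover(graph, [graph], k=1, strict=False))
```

In this example the two edges with time label 2 cannot follow each other under the strict ordering, so a single journey suffices only in the non-strict setting, which illustrates why dimension (b) changes the answer.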
