Deployment of Large Language Models (LLMs) has major computational costs due to their rapidly expanding size. Compression of LLMs reduces the memory footprint, latency, and energy required for their inference. Post-training Quantization (PTQ) techniques have been developed to compress LLMs while avoiding expensive re-training. Most PTQ approaches formulate the quantization error based on a layer-wise $\ell_2$ loss, ignoring the model output. Each layer is then calibrated using its layer-wise Hessian to update the weights towards minimizing the $\ell_2$ quantization error. The Hessian is also used to detect the weights most salient to quantization. Such PTQ approaches are prone to accuracy drops in low-precision quantization. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process. We formulate the quantization error based on the distortion of the output cross-entropy loss. OAC approximates the output-adaptive Hessian for each layer under reasonable assumptions to reduce the computational complexity. The output-adaptive Hessians are used to update the weight matrices and detect the salient weights towards maintaining the model output. Our proposed method outperforms state-of-the-art baselines such as SpQR and BiLLM, especially at extreme low-precision (2-bit and binary) quantization.

AAAI 2025

OAC: Output-adaptive Calibration for Accurate Post-training Quantization

poster
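
As a rough illustration of the layer-wise $\ell_2$ formulation that the abstract contrasts OAC with, the sketch below quantizes a weight matrix with plain round-to-nearest and checks that the layer output error equals a quadratic form in the layer-wise Hessian proxy $H = XX^\top$. The function names, the uniform quantizer, and the random calibration data are assumptions made for illustration. This is not the OAC procedure itself, which the abstract describes as replacing this Hessian with an output-adaptive one.

```python
import numpy as np

def quantize_rtn(w, bits):
    """Uniform round-to-nearest quantization of each row to 2**bits levels (illustrative)."""
    levels = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for constant rows
    return np.round((w - w_min) / scale) * scale + w_min

def layerwise_l2_error(w, w_q, x):
    """Squared Frobenius error of the layer output: ||W X - W_q X||_F^2."""
    return float(np.sum((w @ x - w_q @ x) ** 2))

def hessian_quadratic_form(w, w_q, x):
    """The same error written as tr(dW H dW^T) with H = X X^T."""
    h = x @ x.T
    dw = w - w_q
    return float(np.trace(dw @ h @ dw.T))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))    # hypothetical layer weights (out_features x in_features)
X = rng.normal(size=(8, 32))   # hypothetical calibration inputs (in_features x samples)
W_q = quantize_rtn(W, bits=2)
err_direct = layerwise_l2_error(W, W_q, X)
err_hessian = hessian_quadratic_form(W, W_q, X)
```

The two error values agree up to floating-point rounding, which is why Hessian-based calibration can reason about quantization error through $H$ alone without re-running the layer.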

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Instruction tuning has emerged as an effective approach that notably improves the performance of large language models (LLMs), showing particular promise in natural language generation tasks by producing more diverse, coherent, and task-relevant outputs. However, extending instruction tuning to natural language understanding (NLU) tasks presents significant challenges, primarily due to the difficulty of achieving high-precision responses and the scarcity of the large-scale, high-quality instruction data necessary for effective tuning. In this work, we introduce **Adversarial Noisy Instruction Tuning** (ANIT) to improve the NLU performance of LLMs. First, we leverage low-resource techniques to construct noisy instruction datasets. Second, we employ semantic distortion-aware techniques to quantify the intensity of noise within these instructions. Last, we devise an adversarial training method that incorporates a noise response strategy to achieve noisy instruction tuning. ANIT enhances the capacity of LLMs to detect and accommodate semantic distortions in noisy instructions, thereby improving their comprehension of task objectives and their ability to generate more accurate responses. We evaluate our approach across a diverse spectrum of noisy instructions and semantic distortion quantification methods on multiple NLU tasks. Comprehensive empirical results demonstrate that our method consistently outperforms existing approaches across various experimental settings.

Enhancing NLU in Large Language Models Using Adversarial Noisy Instruction Tuning

Micro-video popularity prediction (MVPP) plays a crucial role in various real-world applications, including product marketing and recommendation systems. Recently, multimodal fusion methods that integrate multiple modalities to assess popularity have exhibited impressive performance. However, these methods face two unresolved issues: (1) limited contextual information and (2) incomplete modal semantics. Incorporating relevant videos and performing full fine-tuning on pre-trained models typically addresses these issues effectively. However, this paradigm is not optimal due to its weak transferability and the scarcity of downstream data. Inspired by prompt learning in the language domain, we propose ICPF, a novel In-Context Prompt-augmented Framework to enhance popularity prediction. ICPF maintains a model-agnostic design, facilitating seamless integration with various multimodal fusion models. Specifically, the multi-branch retriever first retrieves similar modal content through within-modality similarities. Next, the in-context prompt generator extracts semantic prior features of the retrieved videos, producing in-context prompts that enhance pre-trained models with rich contextual knowledge. Finally, the knowledge-augmented predictor captures complementary features, including modal semantics and popularity information. Extensive experiments conducted on three datasets demonstrate the superiority of ICPF compared to 14 competitive baselines, while training only 4% of model parameters.

In-context Prompt-augmented Micro-video Popularity Prediction

In this paper, we consider the k-center problem with outliers (the (k,z)-center problem) in the context of Massively Parallel Computation (MPC). Existing MPC algorithms for the (k,z)-center problem typically require Ω(k) local space per machine. While this may be feasible when k is small, these algorithms become impractical for large k, where each machine may lack sufficient space for computation. This motivates the study of fully-scalable algorithms with sublinear local space. We propose the first fully-scalable MPC algorithm for the (k,z)-center problem. The main challenge is to design an MPC algorithm that operates with sublinear local space for finding the inliers close to the optimal clustering centers while ensuring the approximation loss remains bounded. To address this issue, we propose an iterative sampling-based algorithm with local space sublinear in the data size. A key component of our approach is an outlier-removal algorithm that adjusts the sample size in each iteration to select inliers as clustering centers. However, the number of discarded outliers increases with each iteration of the outlier-removal algorithm, making it difficult to bound. To address this, we propose a self-adaptive method that automatically adjusts the sample size to account for different outlier distributions on each machine, ensuring a lower bound on the sampling success probability. With these two techniques, we present an O(log* n)-approximation MPC algorithm for the (k,z)-center problem in constant-dimensional Euclidean space. The algorithm opens k(1+o(1)) centers and discards at most (1+ε)z outliers, completing in O(log log n) computation rounds while using O(n^δ) local space per machine.

Fully-Scalable Massively Parallel Algorithm for k-center with Outliers
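
To make the (k,z)-center objective from the abstract above concrete, here is a minimal sequential sketch of how a candidate solution is scored: the cost is the largest distance from any point to its nearest center after the z farthest points are discarded as outliers. The function name and the toy data are hypothetical, and this only evaluates the objective; it is not the MPC algorithm the paper proposes.

```python
import math

def kz_center_cost(points, centers, z):
    """Radius needed to cover all but the z farthest points with balls around the centers."""
    # Distance from each point to its nearest center, in increasing order.
    nearest = sorted(min(math.dist(p, c) for c in centers) for p in points)
    # Discard the z largest distances (the outliers); the cost is the largest one left.
    kept = nearest[: max(len(nearest) - z, 0)]
    return kept[-1] if kept else 0.0

# Two tight clusters plus one far-away point (toy data).
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (100.0, 100.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
cost_without_outliers = kz_center_cost(points, centers, z=0)
cost_with_one_outlier = kz_center_cost(points, centers, z=1)
```

With z = 0 the single distant point dominates the radius, while allowing one outlier brings the cost down to the radius of the two real clusters. That gap is the reason the outlier budget matters in this problem.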

Document summarization has greatly benefited from advances in large language models (LLMs). In real-world situations, summaries often need to be generated from multiple documents with diverse sources and authors, lacking a clear information flow. Naively concatenating these documents and generating a summary can lead to poorly structured narratives and redundancy. Additionally, attributing each part of the generated summary to a specific source is crucial for reliability. In this study, we address multi-document summarization with attribution using our proposed solution ***MiDAS-PRo***, consisting of three stages: (i) Planning the hierarchical organization of source documents, (ii) Reasoning by generating relevant entities/topics, and (iii) Summary Generation. We treat the first two sub-problems as a code completion task for LLMs. By incorporating well-selected in-context learning examples through a graph attention network, LLMs effectively generate plans and reason topics for a document collection. Experiments on summarizing scientific articles from public datasets show that our approach outperforms state-of-the-art baselines in both automated and human evaluations.

Language Models of Code Are Few-Shot Planners and Reasoners for Multi-Document Summarization with Attribution

Measuring the similarity of the internal representations of deep neural networks is an important and challenging problem. Model stitching has been proposed as a possible approach, where two half-networks are connected by mapping the output of the first half-network to the input of the second one. The representations are considered functionally similar if the resulting stitched network achieves good task-specific performance. The mapping is normally created by training an affine stitching layer on the task at hand while freezing the two half-networks, a method called task loss matching. Here, we argue that task loss matching may be very misleading as a similarity measure. For example, it can indicate very high similarity between very distant layers, whose representations are known to have different functional properties. Moreover, it can indicate very distant layers to be more similar than architecturally corresponding layers. Even more surprisingly, when comparing layers within the same network, task loss matching often indicates that some layers are more similar to a given layer than that layer is to itself. We argue that the main reason behind these problems is that task loss matching tends to create out-of-distribution representations to improve task-specific performance. We demonstrate that direct matching (when the mapping minimizes the distance between the stitched representations) does not suffer from these problems. We compare task loss matching, direct matching, and well-known similarity metrics such as CCA and CKA. We conclude that direct matching strikes a good balance between the structural and functional requirements for a good similarity measure.

How Not to Stitch Representations to Measure Similarity: Task Loss Matching Versus Direct Matching
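
The direct matching idea described in the abstract above can be sketched as a least-squares problem: fit the affine stitching map that minimizes the distance between the mapped source activations and the target activations, with no task loss involved. The choice of squared Euclidean distance, the function names, and the synthetic activations are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fit_direct_matching(source_acts, target_acts):
    """Least-squares affine map (weights, bias) sending source activations to target activations."""
    n = source_acts.shape[0]
    augmented = np.hstack([source_acts, np.ones((n, 1))])  # extra column of ones for the bias
    solution, *_ = np.linalg.lstsq(augmented, target_acts, rcond=None)
    return solution[:-1], solution[-1]

def apply_stitch(source_acts, weights, bias):
    """Map source activations into the target representation space."""
    return source_acts @ weights + bias

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 16))   # hypothetical activations of the first half-network (samples x features)
true_w = rng.normal(size=(16, 12))
true_b = rng.normal(size=12)
B = A @ true_w + true_b          # hypothetical target activations, affinely related by construction
W_fit, b_fit = fit_direct_matching(A, B)
residual = float(np.linalg.norm(apply_stitch(A, W_fit, b_fit) - B))
```

Because the map is fitted to the target representations rather than to the task loss, it has no incentive to push activations out of the target distribution, which is the failure mode the abstract attributes to task loss matching.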

Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on common coding tasks (e.g., bubble sort, greatest common divisor), leaving **domain-specific coding tasks** (e.g., computation, system, cryptography) unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs' coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling push-button construction from code repositories into formatted subjects under study. Evaluating 12 representative LLMs against DOMAINEVAL yields several interesting findings. We notice that LLMs are generally good at **computation** tasks while falling short on **cryptography and system** coding tasks. The performance gap can be as much as 68.94% (80.94% - 12.0%) in some LLMs. We also observe that generating more samples can increase the overall performance of LLMs, while the domain bias may even increase. The contributions of this study include a code generation benchmark dataset, DOMAINEVAL, encompassing six popular domains; a fully automated pipeline for constructing code benchmarks; and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research.

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can yield a retriever with reasonable overall performance, training a model on domain-specific data can yield better results within that domain. Prior work in information retrieval has tackled this problem via multi-task training or by providing domain knowledge to an instruction-following retriever, but combining different domain-specific expert retrievers has remained unexplored, despite the popularity of this approach in language model generation settings. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts and a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of gates without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. To achieve this, we developed a routing mechanism that outperforms other routing techniques (+1.8 on average) that have been successfully employed in language modeling settings. Moreover, the benefit generalizes to other datasets even when there are no experts for that dataset. RouterRetriever is the first work to demonstrate the benefits of using multiple domain-specific expert embedding models with effective routing techniques compared to relying on a single embedding model for all domains in retrieval tasks.

RouterRetriever: Routing over a Mixture of Expert Embedding Models
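
The abstract above does not spell out the routing rule, so the following is only a sketch of one plausible form of per-query expert selection: each expert is represented by a centroid embedding, and the query is sent to the expert whose centroid has the highest cosine similarity. The centroid representation, the expert names, and the vectors are all hypothetical and should not be read as RouterRetriever's actual mechanism.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route(query_embedding, expert_centroids):
    """Return the name of the expert whose centroid is most similar to the query."""
    return max(expert_centroids, key=lambda name: cosine(query_embedding, expert_centroids[name]))

# Hypothetical experts, each summarized by one centroid embedding.
expert_centroids = {
    "biomedical": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.0],
    "general": [0.5, 0.5, 0.5],
}
chosen = route([0.8, 0.2, 0.1], expert_centroids)

# Adding an expert only means adding one more entry; nothing is retrained in this sketch.
expert_centroids["legal"] = [0.0, 0.1, 0.9]
chosen_after_adding = route([0.0, 0.2, 0.9], expert_centroids)
```

Under this assumed design, experts can be added or removed by editing the table of centroids, in the spirit of the training-free extensibility the abstract mentions.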

In recent years, as data and problem sizes have increased, distributed learning has become an essential tool for training high-performance models. However, the communication bottleneck, especially for high-dimensional data, is a challenge. Several techniques have been developed to overcome this problem. These include communication compression and the implementation of local steps, which work particularly well when the local data samples are similar. In this paper, we study the synergy of these two approaches for efficient distributed optimization. Using variance reduction and error feedback frameworks, we present the first theoretically grounded accelerated algorithms with unbiased and biased compression for distributed problems under similarity. In terms of communication time, our theory gives $\tilde{\mathcal{O}} \left(  1+\left[ M^{-\frac{1}{4}} + \omega^{-\frac{1}{2}} \right]\sqrt{\frac{\delta}{\mu}}  \right)$ complexity for unbiased compressors and $\tilde{\mathcal{O}}\left(1+\beta^{\frac{1}{4}}\sqrt{\frac{\delta}{\mu}}\right)$ for biased ones, where $M$ is the number of computational nodes, $\beta$ is the compression power, $\delta$ is the similarity measure, and $\mu$ is the strong convexity parameter of the objective. Our theoretical results are record-setting and are confirmed by experiments on different average losses and datasets.

Accelerated Methods with Compressed Communications for Distributed Optimization Problems Under Data Similarity
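
The distinction between unbiased and biased compression in the abstract above can be illustrated with two widely used sparsifiers. Rand-k keeps k random coordinates rescaled by d/k, so its expectation equals the input, while top-k keeps the k largest-magnitude coordinates unchanged and is therefore biased. These are common textbook examples chosen here as an assumption; the abstract does not say which compressors the paper uses.

```python
from itertools import combinations

def rand_k(x, kept_indices):
    """Rand-k sparsifier for a fixed index subset: keep those coordinates, rescaled by d/k."""
    d, k = len(x), len(kept_indices)
    kept = set(kept_indices)
    return [x[i] * d / k if i in kept else 0.0 for i in range(d)]

def top_k(x, k):
    """Top-k sparsifier: keep the k largest-magnitude coordinates unchanged (biased)."""
    kept = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [x[i] if i in kept else 0.0 for i in range(len(x))]

def expected_rand_k(x, k):
    """Exact expectation of rand-k, averaging over all equally likely index subsets of size k."""
    subsets = list(combinations(range(len(x)), k))
    total = [0.0] * len(x)
    for subset in subsets:
        for i, value in enumerate(rand_k(x, subset)):
            total[i] += value
    return [t / len(subsets) for t in total]

x = [4.0, -2.0, 1.0, 0.5]
mean_rand_k = expected_rand_k(x, k=2)   # matches x coordinate by coordinate
compressed_top_k = top_k(x, k=2)        # small coordinates are always dropped
```

Averaged over all index subsets, rand-k reproduces the input vector, whereas top-k systematically zeroes the small coordinates. This is why biased compressors are usually paired with error feedback, as the abstract indicates.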

The performance of graph neural networks (GNNs) has been shown to depend on whether effective node information is passed sufficiently. Discrete curvature (*Ricci curvature*) is used to study graph connectivity and information propagation efficiency from a geometric perspective, and has been adopted in recent years to explore efficient message-passing structures for GNNs. However, most empirical studies are based on directly observed graph structures or heuristic topological assumptions, and lack in-depth exploration of the underlying optimal information transport structures for downstream tasks. We suggest that graph curvature optimization is more in-depth and essential than directly rewiring or learning the graph structure, offering richer message-passing characterization and better information transport interpretability. From both graph geometry and information theory perspectives, we propose the novel Discrete **Curv**ature **G**raph **I**nformation **B**ottleneck (**CurvGIB**) framework to optimize the information transport structure and learn better node representations simultaneously. CurvGIB advances the *Variational Information Bottleneck* (*VIB*) principle for Ricci curvature optimization to learn the optimal information transport pattern for specific downstream tasks. The learned Ricci curvature is used to refine the optimal transport structure of the graph, so that the node representation is fully and efficiently learned. Moreover, to address the computational complexity of Ricci curvature differentiation, we combine *Ricci flow* and *VIB* to derive a curvature optimization approximation that yields a tractable IB objective function. Extensive experiments on various datasets demonstrate the superior effectiveness and interpretability of CurvGIB.

Discrete Curvature Graph Information Bottleneck

The study of enhancing the robustness against adversarial examples has always been a topic of widespread interest, leading to the development of numerous adversarial defense techniques. These methods aim to mitigate the effects of deliberately introduced perturbations in input data designed to deceive models and reduce their accuracy. Evaluating the effectiveness of these defense strategies poses a significant challenge. The recently introduced AutoAttack technique has been recognized as a standardized method for assessing model robustness. However, the computational demands of the AutoAttack method significantly limit its applicability, underscoring the urgent need for efficient evaluation techniques. Our research indicates that relaxing constraints at specific stages of the attack can lead to the development of models capable of executing more efficient and powerful attacks on deep neural networks. We further introduce an attack method that approximates the size of perturbations from the outside and propose the Constraint Relaxation (CR) attack method. Based on experiments with 105 robust models, our approach demonstrates superiority over AutoAttack in terms of attack success rate, achieving a significant acceleration of 38.3 times in forward propagation and 15.9 times in backward propagation. Additionally, our ablation experiments highlight the significant effectiveness of the constraint relaxation method.
