United States

This paper proposes a sensitivity analysis framework based on set-valued mapping for deep neural networks (DNNs) to understand and compute how the solutions (model weights) of a DNN respond to perturbations in the training data. As a DNN may not exhibit a unique solution (minimum) and the algorithm for solving a DNN may lead to different solutions under minor perturbations to the input data, we focus on the sensitivity of the solution set of the DNN, instead of studying a single solution. In particular, we are interested in the expansion and contraction of the solution set in response to data perturbations. If the change in the solution set can be bounded by the extent of the data perturbation, the model is said to exhibit the Lipschitz-like property. This 'set-to-set' analysis approach provides a deeper understanding of the robustness and reliability of DNNs during training. Our framework incorporates both isolated and non-isolated minima, and critically, does not require the assumption that the Hessian of the loss function is non-singular. By developing set-level metrics such as distance between sets, convergence of sets, derivatives of set-valued mappings, and stability across the solution set, we prove that the solution set of the Fully Connected Neural Network holds Lipschitz-like properties. For general neural networks (e.g., ResNet), we introduce a graphical-derivative-based method to estimate the new solution set following data perturbation without retraining.

AAAI 2025

Set-Valued Sensitivity Analysis of Deep Neural Networks

deep neural architectures and foundation models


poster
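The set-level notions named in the abstract (distance between sets, and a solution set whose change is bounded by the size of the data perturbation) can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy loss `(w1 + w2 - p)**2`, whose minimizers form a whole line (non-isolated minima with a singular Hessian), the sampling windows, and the names `directed_excess`, `hausdorff`, and `solution_set`.

```python
import math

def directed_excess(A, B):
    """Largest distance from a point of A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_excess(A, B), directed_excess(B, A))

def solution_set(p, ts):
    """Sampled minimizers of the toy loss (w1 + w2 - p)**2: the line w1 + w2 = p."""
    return [(t, p - t) for t in ts]

# Hypothetical data perturbation: the target p moves from 1.0 to 1.2.
p_old, p_new = 1.0, 1.2
wide = [i / 100 for i in range(-200, 201)]   # sampling window for S(p_old)
local = [i / 100 for i in range(-100, 101)]  # bounded neighborhood V for S(p_new)

S_old = solution_set(p_old, wide)
S_new_local = solution_set(p_new, local)

# Lipschitz-like check: the part of S(p_new) inside V should lie within
# kappa * |p_new - p_old| of S(p_old) for some finite modulus kappa.
excess = directed_excess(S_new_local, S_old)
kappa_estimate = excess / abs(p_new - p_old)
print(f"excess = {excess:.4f}, estimated modulus = {kappa_estimate:.4f}")
```

In this toy case the two solution sets are parallel lines, so the estimated modulus stays bounded (close to 1/√2) even though no single minimizer is isolated. The excess is measured only on a bounded neighborhood because the Lipschitz-like property is a local statement.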

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.





Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with real-world time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. Specifically, it incorporates two independent LLM agents: one generates a textual summary that captures the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi-modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in-context examples. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art time series event prediction methods, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.

TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents

Advancements in hardware accelerators, such as graphics processing units and neural processing units, have significantly propelled computer vision research. The vision transformer (ViT), leveraging the multi-head self-attention (MHSA) mechanism, has surpassed convolutional neural networks (CNNs) in accuracy but faces challenges in mobile and edge deployment due to its large size and computational demands. In addition, as privacy concerns push for on-device training, research on quantization methods for ViTs, particularly gradient quantization, has gained attention. Unlike CNNs, ViTs face challenges due to outliers and a complex loss landscape. To address this, we propose a gradient quantization framework that stabilizes training by adapting quantization points based on interquartile ranges and constructing an outlier-robust loss function. Additionally, we employ a scaling method to align quantized gradients with original gradients and adaptively assign the learning rate based on quantization error analysis. When quantizing weights, activations, and gradients to INT8, our method improves performance by 0.52% and 0.21% over DeiT-Base and Swin-Base, respectively, and achieves near parity with MobileViT-S with only a 0.09% accuracy drop. Furthermore, a 2.06× speedup was observed when applying our framework to MobileViT in a CUDA 11.8 environment.

GradQ-ViT: Robust and Efficient Gradient Quantization for Vision Transformers
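As a rough illustration of why interquartile ranges help when gradients contain outliers, the sketch below clips values to an IQR-based range before uniform INT8 quantization. This is not the GradQ-ViT algorithm: the clipping rule, the factor `k=1.5`, and the names `iqr_clip_range`, `quantize_int8`, and `dequantize` are assumptions chosen for the example.

```python
import statistics

def iqr_clip_range(values, k=1.5):
    """Clipping range [Q1 - k*IQR, Q3 + k*IQR] computed from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def quantize_int8(values, k=1.5):
    """Symmetric uniform quantization to [-127, 127] with an IQR-based scale."""
    lo, hi = iqr_clip_range(values, k)
    bound = max(abs(lo), abs(hi))
    scale = bound / 127 if bound > 0 else 1.0
    # Values beyond the clipping range saturate at the extreme codes.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

# Hypothetical gradient values: a dense cluster plus one large outlier.
grads = [0.01 * i for i in range(-50, 51)] + [50.0]
codes, scale = quantize_int8(grads)
minmax_scale = max(abs(g) for g in grads) / 127

print(f"IQR-based step: {scale:.5f}, min-max step: {minmax_scale:.5f}")
```

With a min-max scale the single outlier would stretch the quantization step so far that most of the dense cluster collapses into a handful of codes. The IQR-based scale keeps the step small for the bulk of the values and lets the outlier saturate.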

Pairwise learning includes various machine learning tasks, with ranking and metric learning serving as the primary representatives. While randomized coordinate descent (RCD) is popular in various problems, there is much less theoretical analysis on the generalization behavior of models trained by RCD, especially under the pairwise learning framework. In this paper, we consider the generalization of RCD for pairwise learning. We measure the on-average argument stability for both convex and strongly convex objective functions, based on which we develop generalization bounds in expectation. The early-stopping strategy is adopted to quantify the balance between estimation and optimization. Our analysis further incorporates the low-noise setting into the excess risk bounds to achieve the optimistic bound as $O(1/n)$, where $n$ is the sample size. High-probability generalization bounds are also given via uniform stability, which imply better applications of RCD for pairwise learning problems.

Stability-based Generalization Analysis of Randomized Coordinate Descent for Pairwise Learning
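For readers unfamiliar with the setting, the following is a minimal sketch of randomized coordinate descent on a pairwise objective. It only illustrates the algorithm being analyzed, not the paper's stability analysis. The pairwise squared loss, the synthetic data, the step size, and all function names are assumptions made for this example.

```python
import random

def pair_differences(X, y):
    """Feature and label differences for every ordered pair (i, j) with i != j."""
    n = len(X)
    return [([a - b for a, b in zip(X[i], X[j])], y[i] - y[j])
            for i in range(n) for j in range(n) if i != j]

def objective(w, pairs):
    """Average pairwise squared loss (w . (x_i - x_j) - (y_i - y_j))**2."""
    return sum((sum(wk * dk for wk, dk in zip(w, d)) - t) ** 2
               for d, t in pairs) / len(pairs)

def partial(w, pairs, k):
    """Partial derivative of the objective with respect to coordinate k."""
    return sum(2.0 * (sum(wk * dk for wk, dk in zip(w, d)) - t) * d[k]
               for d, t in pairs) / len(pairs)

def rcd(pairs, dim, steps=600, lr=0.3, seed=0):
    """Randomized coordinate descent: update one uniformly chosen coordinate per step."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(steps):  # the step budget plays the role of a stopping time
        k = rng.randrange(dim)
        w[k] -= lr * partial(w, pairs, k)
    return w

# Hypothetical noiseless data generated from a known linear model.
data_rng = random.Random(1)
w_true = [1.0, -2.0, 0.5]
X = [[data_rng.uniform(-1, 1) for _ in w_true] for _ in range(20)]
y = [sum(a * b for a, b in zip(w_true, x)) for x in X]

pairs = pair_differences(X, y)
w_hat = rcd(pairs, dim=len(w_true))
initial = objective([0.0] * len(w_true), pairs)
final = objective(w_hat, pairs)
print(f"objective: {initial:.4f} -> {final:.2e}")
```

Each step touches a single coordinate, so its cost is one partial derivative rather than a full gradient, which is the usual motivation for RCD in high-dimensional problems.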

Aerodynamic coefficient prediction is pivotal in the design, performance evaluation, and motion control of aircraft and vehicles. Integrating artificial neural networks into aerodynamic coefficient prediction offers a promising alternative to traditional numerical methods burdened by extensive computations and high costs. Nevertheless, this data-driven approach faces several critical challenges, which limit its further performance enhancement: i) The current research lacks a profound understanding of the complex interplay between the shape of an object and its aerodynamic characteristics. ii) The scarcity of high-quality aerodynamic data poses a significant barrier. Models trained on limited datasets lack generalization ability, struggling to accurately predict and adapt to diverse aerodynamic performance under new shapes or conditions. To overcome these challenges, we introduce an innovative framework that employs cross-attention to capture the intimate interplay between shape and flow conditions and allows for the direct utilization of pre-trained models on general shape datasets to mitigate the scarcity of aerodynamic data. Furthermore, to bolster the inference capabilities of this data-driven approach, we integrate physical information constraints into the model, leveraging them as guiding principles to enhance the model's predictive power under unknown conditions. Experimental validation demonstrates that our proposed method performs excellently in multiple aerodynamic prediction tasks. This achievement brings a new technological breakthrough to the field of aerodynamic prediction and provides robust support for the design optimization of complex systems such as aircraft and vehicles.

Aerodynamic Coefficients Prediction via Cross-Attention Fusion and Physical-Informed Training

We study the problem of optimizing a guidance policy capable of dynamically guiding the agents for lifelong Multi-Agent Path Finding (LMAPF) based on real-time traffic patterns.
MAPF focuses on moving multiple agents from their start to goal locations without collisions. Its lifelong variant, LMAPF, continuously assigns new goals to agents. To solve LMAPF, replan-based algorithms decompose LMAPF into a series of MAPF problems and solve them sequentially, while rule-based algorithms first plan paths for each agent without considering collisions and then use pre-defined rules to resolve collisions on the fly. Although replan-based algorithms yield higher solution quality than rule-based ones, they scale poorly to challenging LMAPF instances with large numbers of agents and limited planning time. On the other hand, rule-based algorithms scale to extremely challenging instances yet have poor solution quality.
In this work, we focus on improving the solution quality of PIBT, the state-of-the-art rule-based LMAPF algorithm, by optimizing a policy to generate adaptive guidance. We design two pipelines to incorporate guidance in PIBT in two different ways. We demonstrate the superiority of the optimized policy over both static guidance and human-designed policies. Additionally, we explore scenarios where task distribution changes over time, a challenging yet common situation in real-world applications that is rarely explored in the literature.

Online Guidance Graph Optimization for Lifelong Multi-Agent Path Finding

Neural implicit methods have made remarkable progress in 3D reconstruction. However, previous methods often assume view-independent properties of target objects, which fails to accurately reconstruct objects with challenging characteristics, such as transparency and high reflectivity. To address this limitation, we propose a polarimetric implicit 3D reconstruction method that integrates geometric and polarization information, enabling the production of high-quality meshes in complex scenes. For high-fidelity surface reconstruction, we introduce a view-dependent physical representation that thoroughly analyzes the subtle physical properties of reflections. The reconstruction process is further enhanced by a simple yet effective view-dependent detection algorithm and optimized using the principles of ray tracing and polarization. Experimental results demonstrate the superior performance of the proposed method in both real and synthetic scenarios.

High-Fidelity Polarimetric Implicit 3D Reconstruction with View-Dependent Physical Representation

Manipulating human poses based on natural language is an emerging research field that has traditionally focused on coarse commands such as “walking” or “dancing.” However, fine-grained pose manipulation, like instructing “put both hands in front of the stomach,” remains underexplored. In this paper, we introduce PoseLLaVA, a pioneering model that integrates SMPL-based pose representations into the multimodal LLaVA framework. Through a novel pose encoder-decoder mechanism, PoseLLaVA achieves precise alignment between pose, textual, and visual modalities, enabling detailed control over pose manipulation tasks. PoseLLaVA excels in three key tasks: pose estimation, generation, and adjustment, all driven by detailed language instructions. We further introduce a fine-grained pose adjustment dataset, PosePart, where each sample contains an initial pose and a target pose, along with specific instructions for adjustments, mimicking the guidance a human instructor might provide. Extensive evaluations across these tasks demonstrate significant improvements over existing methods, including metrics such as MPJPE and PA-MPJPE, which measure SMPL reconstruction errors, and Recall rates, which assess feature alignment across modalities. Specifically, PoseLLaVA reduces MPJPE errors by more than 20% compared to state-of-the-art methods in pose adjustment and generation tasks. Additionally, we demonstrate the feasibility of combining PoseLLaVA with generative models, such as diffusion, for pose image editing, highlighting its potential applications in language-controlled pose manipulation.

PoseLLaVA: Pose Centric Multimodal LLM for Fine-Grained 3D Pose Manipulation

Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation tasks, across three large-scale evaluation datasets. Additionally, utilizing pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.

SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos

Tabular data, despite its prevalence in various industries, has been under-explored in deep learning research. Self-supervised learning (SSL) techniques have shown promise for pre-training deep neural networks (DNNs) on tabular data. However, their full potential is yet to be realized due to the challenge of designing appropriate data augmentations for this type of data. Unlike image and language domains, where the success of SSL heavily relies on the inherent structure of the data, such as spatial relationships in images or semantic relationships in text, tabular data lacks such explicit structure. This lack of clear structure makes traditional input-level augmentations, like modifying or removing features, less effective, as they struggle to balance preserving critical information with introducing useful variability. In response to these challenges, we propose RaTab, a novel method that shifts the focus from input-level to representation-level augmentation using matrix factorization, specifically truncated SVD. This shift allows for the preservation of essential data structures while generating a richer diversity of representations with the dropout technique. RaTab enhances the effectiveness of SSL for tabular data by focusing on the representation space and utilizing truncated SVD, resulting in significant improvements.

Representation Space Augmentation for Effective Self-Supervised Learning on Tabular Data

In social networks, people influence each other through social links, which can be represented as propagation among nodes in graphs. Influence minimization (IM) is the problem of manipulating the structures of an input graph (e.g., removing edges) to reduce the propagation among nodes. IM can represent time-critical real-world applications, such as rumor blocking, but IM is theoretically difficult and computationally expensive. Moreover, the discrete nature of IM hinders the usage of powerful machine learning techniques, which require differentiable computation. In this work, we propose DiffIM, a novel method for IM with two differentiable schemes for acceleration: (1) surrogate modeling for efficient influence estimation, which avoids time-consuming simulations (e.g., Monte Carlo), and (2) the continuous relaxation of decisions, which avoids the evaluation of individual discrete decisions (e.g., removing an edge). We further propose a third accelerating scheme, gradient-driven selection, that chooses edges instantly based on gradients without optimization (specifically, gradient descent iterations) on each test instance. Through extensive experiments on real-world graphs, we show that each proposed scheme significantly improves speed with little (or even no) IM performance degradation. Our method is Pareto-optimal (i.e., no baseline is faster and more effective than it) and typically several orders of magnitude (specifically, up to 15,160×) faster than the most effective baseline, while being more effective.
