United States

Graduated optimization is a global optimization technique that minimizes a multimodal nonconvex function by smoothing the objective function with noise and gradually refining the solution. This paper experimentally evaluates the performance of the explicit graduated optimization algorithm with an optimal noise schedule derived in a previous study and discusses its limitations. The evaluation uses traditional benchmark functions and empirical loss functions for modern neural network architectures. In addition, this paper extends the implicit graduated optimization algorithm, which is based on the fact that stochastic noise in the optimization process of SGD implicitly smooths the objective function, to SGD with momentum, analyzes its convergence, and demonstrates its effectiveness through experiments on image classification tasks with ResNet architectures. The code is available at https://anonymous.4open.science/r/go-25
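
The smoothing-then-refining idea can be illustrated with a minimal sketch. This is a generic illustration, not the paper's algorithm: the Gaussian smoothing, the Monte Carlo gradient estimate, the noise levels in `deltas`, and the function names `smoothed_grad` and `graduated_descent` are all assumptions made for this example, and the noise schedule studied in the paper is not reproduced here.

```python
import numpy as np


def smoothed_grad(grad_f, x, delta, rng, n_samples=64):
    """Monte Carlo estimate of the gradient of the smoothed objective
    f_delta(x) = E[f(x + delta * u)] with u ~ N(0, I) (assumed smoothing)."""
    u = rng.standard_normal((n_samples, x.size))
    return np.mean([grad_f(x + delta * ui) for ui in u], axis=0)


def graduated_descent(grad_f, x0, deltas, lr=0.01, steps=200, seed=0):
    """Run gradient descent on a sequence of smoothed objectives.

    `deltas` is a hypothetical decreasing list of noise levels; each stage
    starts from the solution of the previous, more heavily smoothed stage.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for delta in deltas:
        for _ in range(steps):
            x = x - lr * smoothed_grad(grad_f, x, delta, rng)
    return x
```

Ending the schedule at `delta = 0.0` makes the last stage plain gradient descent on the original function, so earlier stages only serve to pick a good starting point.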

AAAI 2025

Explicit and Implicit Graduated Optimization in Deep Neural Networks

optimization

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

The infrequent occurrence of overfitting in deep neural networks is perplexing: seemingly at odds with theoretical predictions, expanding models typically only enhances performance in practical applications. However, what if overfitting does occur, albeit confined to specific sub-regions of the data space? Here, we introduce a novel score that captures the forgetting rate of deep models on validation data. We posit that this score quantifies local overfitting: a decline in performance confined to certain regions of the data space. We then show empirically that local overfitting occurs regardless of the presence of traditional overfitting. Using the framework of deep over-parametrized linear models, we offer a certain theoretical characterization of forgotten knowledge, and show that it correlates with knowledge forgotten by real deep models. Finally, we devise a new ensemble method that aims to recover forgotten knowledge, relying solely on the training history of a single network. This method enhances the performance of any trained model without incurring additional training costs. Extensive empirical evaluations demonstrate the efficacy of our method across multiple datasets, contemporary neural network architectures, and training protocols.

On Local Overfitting and Forgetting in Deep Neural Networks

Language steganography in social networks primarily focuses on embedding secret information into social media text efficiently to achieve covert communication. The misuse of such techniques could pose significant potential threats to public cyberspace, such as the spread of malicious code, commands, or viruses. Existing social text steganalysis techniques mainly focus on the analysis of individual social media texts. However, the information content in a single text is very limited, leading to poor detection performance in practical applications. To address this challenge, this paper proposes a social text steganalysis method that combines large-scale language models with common-sense knowledge graphs (STLC-KG). This method first uses knowledge graphs to expand the knowledge contained in the text under investigation, enriching its linguistic expression, and then utilizes large-scale language models to extract the linguistic features of the social text. The results of tests conducted on three mainstream social media platforms demonstrate that the proposed method significantly improves the performance of social text steganalysis.

STLC-KG: A Social Text Steganalysis Method Combining Large-Scale Language Models and Common-Sense Knowledge Graphs

Local intrinsic dimension (LID) estimation methods have received a lot of attention in recent years thanks to the progress in deep neural networks and generative modeling. In contrast to older non-parametric methods, newer methods use generative models to approximate the diffused dataset density and scale to high-dimensional datasets such as images. In this paper, we investigate the recent state-of-the-art parametric LID estimation methods from the perspective of the Wiener process. We explore how these methods behave when their assumptions are not met. We give an extended mathematical description of those methods and their error as a function of the probability density of the data.

A Wiener Process Perspective on Local Intrinsic Dimension Estimation Methods

Large language models (LLMs) have made significant advancements in math problem solving, but their large size and high latency render them impractical for real-world applications in intelligent mathematics education. Recently, compact models have been developed to replace LLMs in general natural language processing tasks. However, these models often struggle to acquire sufficient math-related knowledge from LLMs, leading to unsatisfactory performance in solving math word problems (MWPs). To build a specialized compact model for solving math problems, we develop a knowledge distillation (KD) technique that distills mathematical semantic representations from BERT. Effective knowledge types and distillation strategies are explored through extensive experiments. Our KD algorithm employs multi-knowledge distillation to extract fundamental abstract knowledge from lower layers, as well as mathematics knowledge from higher layers, by leveraging bottleneck linear networks. Pre-training tasks, such as masked language modeling and part-of-speech tagging on MWP datasets, are also utilized to enhance generalization. Additionally, continual learning is employed to prevent catastrophic forgetting of acquired knowledge. Our findings indicate that our approach can reduce the size of a BERT model by 10% while retaining approximately 95% of its performance on MWP datasets, outperforming the mainstream BERT-based compact models. The efficacy of each component has been validated through ablation studies.

A Compact Model for Mathematics Problem Representations Distilled from BERT

With the widespread adoption of AI systems, many of the decisions once made by humans are now delegated to automated systems. Recent works in the literature demonstrate that these automated systems, when used in socially sensitive domains, may exhibit discriminatory behavior based on sensitive characteristics such as gender, sex, religion, or race. In light of this, various notions of fairness and methods to quantify discrimination have been proposed, also leading to the development of numerous approaches for constructing fair predictors. At the same time, imposing fairness constraints may decrease the utility of the decision-maker, highlighting a tension between fairness and utility. This tension is also recognized in legal frameworks, for instance in the disparate impact doctrine of Title VII of the Civil Rights Act of 1964, in which specific attention is given to considerations of *business necessity*, possibly allowing the usage of proxy variables associated with the sensitive attribute in case a high-enough utility cannot be achieved without them. In this work, we analyze the tension between fairness and accuracy from a causal lens for the first time. We introduce the notion of a path-specific excess loss (PSEL) that captures how much the predictor's loss increases when a causal fairness constraint is enforced. We then show that the total excess loss (TEL), defined as the difference between the loss of a predictor that is fair along all causal pathways and that of an unconstrained predictor, can be decomposed into a sum of more local PSELs. At the same time, enforcing a causal constraint often reduces the disparity between demographic groups. Thus, we introduce a quantity that summarizes the fairness-utility trade-off, called the causal fairness/utility ratio, defined as the ratio of the reduction in discrimination to the excess loss incurred by constraining a causal pathway. This quantity is particularly suitable for comparing the fairness-utility trade-off across different causal pathways. Finally, as our approach requires causally-constrained fair predictors, we introduce a new neural approach for causally-constrained fair learning. Our approach is evaluated across multiple real-world datasets, providing new insights into the tension between fairness and accuracy.
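
The two quantities described in this abstract can be written schematically as follows. The notation is an assumption made for illustration only: the index $k$ over causal pathways, the symbol $\Delta\mathrm{Disc}_k$ for the reduction in discrimination, and the abbreviation CFUR do not come from the abstract, and the paper's formal definitions may differ.

```latex
% Assumed notation: k indexes the constrained causal pathways.
\mathrm{TEL} \;=\; \sum_{k} \mathrm{PSEL}_k,
\qquad
\mathrm{CFUR}_k \;=\; \frac{\Delta\mathrm{Disc}_k}{\mathrm{PSEL}_k}
```

Read this way, a pathway with a larger ratio yields more reduction in discrimination per unit of added loss, which is what makes the ratio usable for comparing pathways.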

Fairness-Accuracy Trade-Offs: A Causal Perspective

Graph anomaly detection is crucial for identifying anomalous nodes within graphs and addressing applications like financial fraud detection and social spam detection. Recent spectral graph neural network methods advance graph anomaly detection by focusing on anomalies that notably affect the distribution of graph energy. Such spectrum-based methods rely on two steps: graph wavelet extraction and feature fusion. However, both steps are hand-designed, so they capture incomplete anomaly information in wavelet-specific features and fuse those features inconsistently. To address these problems, we propose a dynamic spectral graph anomaly detection framework, DSGAD, to adaptively capture comprehensive anomaly information and perform consistent feature fusion. DSGAD introduces dynamic wavelets, consisting of trainable wavelets to adaptively learn anomalous patterns and capture wavelet-specific features with comprehensive anomaly information. Furthermore, the consistent fusion of wavelet-specific features achieves dynamic fusion by combining wavelet-specific feature extraction with energy difference and channel convolution fusion using location correlation. Experimental results on four datasets substantiate the efficacy of our DSGAD method, surpassing state-of-the-art methods in both homogeneous and heterogeneous graphs. Our code will be published on GitHub.

Dynamic Spectral Graph Anomaly Detection

Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5' untranslated region (UTR) and coding sequence (CDS) region as the input sequences. We adapt our LM-based approach specifically to mRNA by 1) considering the importance of location on the mRNA sequence with probabilistic masking, and 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements in translation efficiency (TE) and expression level (EL) prediction tasks in UTR compared to SOTA methods such as UTR-LM. It also gives competitive performance on mRNA stability and protein production level tasks in CDS compared to methods such as CodonBERT.

mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design

List Update is a fundamental problem in online algorithms, with a well-known $2$-competitive algorithm that moves every requested element to the front. Randomization can slightly improve the competitive ratio to $1.6$, but not beyond $1.5$. However, practical inputs are not adversarial and one hopes to do better, particularly when additional information from a machine learning oracle is available. With access to predictions, the goal is to incur only a slight overhead compared to the prediction's accuracy, avoiding significant costs in case of substantial deviation.

We propose a $(1+\epsilon)$-smooth randomized algorithm, offering robustness of $O(1/\epsilon^4)$. This guarantees that the algorithm never exceeds a cost greater than $1+\epsilon$ times the prediction cost, while maintaining a bound within $O(1/\epsilon^4)$ of the optimal cost for every possible sequence. In cases where no paid swaps are permitted for the prediction, we can improve robustness to $O(1/\epsilon^2)$ while retaining $1+\epsilon$ smoothness. We complement these findings by demonstrating a lower bound of $\Omega(1/\epsilon)$ on the robustness for deterministic algorithms and $\Omega(\log(1/\epsilon))$ for randomized ones. Finally, our experiments show that our algorithms perform better than the standard competitive algorithms for this problem.
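
For readers unfamiliar with the baseline, the move-to-front rule mentioned at the start of this abstract can be sketched as below. This shows only the classical baseline, not the learning-augmented algorithm the paper proposes. The cost model (accessing the item at 1-based position i costs i, and moving the accessed item forward is free) is the standard one for List Update but is an assumption here, since the abstract does not state it; the function name `move_to_front_cost` is hypothetical.

```python
def move_to_front_cost(initial_list, requests):
    """Serve each request, then move the requested item to the front.

    Assumed cost model: accessing the item at 1-based position i costs i;
    moving the accessed item toward the front is free.
    Returns the total access cost and the final list order.
    """
    items = list(initial_list)
    total = 0
    for r in requests:
        pos = items.index(r)             # 0-based position of the request
        total += pos + 1                 # access cost under the assumed model
        items.insert(0, items.pop(pos))  # move the requested item to the front
    return total, items
```

Calling `move_to_front_cost(["a", "b", "c"], ["c", "c", "b"])` gives a total cost of 7 (3 + 1 + 3) and the final order `["b", "c", "a"]`.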

List Update with Prediction

In this paper, we offer a learning framework in which the agent’s knowledge gaps are overcome through corrective feedback from a teacher whenever the agent explains its (incorrect) predictions. We test it in a low-resource scenario in visual processing, in which the agent must learn to recognize distinct types of toy trucks. The agent starts the learning process with no ontology about what types of trucks exist nor which parts they have, and a deficient model for recognizing those parts from visual data. The teacher’s feedback to the agent’s explanations addresses its lack of relevant knowledge in the ontology via a generic rule (e.g., “dump trucks have dumpers”), whereas an inaccurate part recognition is corrected by a deictic statement (e.g., “this is not a dumper”). The learner utilizes this feedback not only to improve its estimate of the hypothesis space of possible domain ontologies and probability distributions over them, but also uses those estimates to update its visual interpretation of the scene. Our experiments demonstrate that teacher-learner pairs utilizing explanations and corrections are more data-efficient than those without such a faculty.

Learning Visually Grounded Domain Ontologies via Embodied Conversation and Explanation

Multimodal learning with incomplete modality is practical and challenging. Recently, multimodal prompt-based methods that introduce learnable prompts for missing-modality scenarios have exhibited impressive performance. However, these methods face several unresolved limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference; (2) dummy imputation for missing content causes information loss and introduces additional noise; and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various modal conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. Specifically, RAGPT comprises three key modules: the multi-channel retriever, which identifies similar instances through within-modality retrieval relevance; the missing modality generator, designed to recover missing information using retrieved contexts; and the context-aware prompter, which captures contextual knowledge from relevant instances and generates adaptive prompts to substantially enhance the model’s robustness. The framework maintains a model-agnostic design, facilitating seamless integration with various prompt-based models. Extensive experiments conducted on real-world datasets demonstrate that RAGPT, with a mere 3% of the trainable parameters, consistently outperforms all competitive baselines in handling incomplete modality problems.
