United States

The rapid advancement of multi-agent reinforcement learning (MARL) has given rise to diverse training paradigms for learning the policy of each agent in a multi-agent system. The paradigms of decentralized training and execution (DTDE) and centralized training with decentralized execution (CTDE) have been proposed and widely applied. However, as the number of agents increases, the inherent limitations of these frameworks significantly degrade performance metrics such as win rate and total reward. To reduce the influence of the increasing number of agents on these metrics, we propose a novel training paradigm, grouped training with decentralized execution (GTDE). This framework eliminates the need for a centralized module and relies solely on local information, effectively meeting the training requirements of large-scale multi-agent systems. Specifically, we first introduce an adaptive grouping module, which divides agents into different groups based on their observation histories. To implement end-to-end training, GTDE uses Gumbel-Sigmoid for efficient point-to-point sampling on the grouping distribution while ensuring gradient backpropagation. To adapt to the uncertainty in the number of members in a group, two methods are used to implement a group information aggregation module that merges member information within the group. Empirical results show that in a cooperative environment with 495 agents, GTDE increased the total reward by an average of 8,000 compared to the baseline. In a competitive environment with 64 agents, GTDE achieved a 100% win rate against the baseline.
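The Gumbel-Sigmoid sampling mentioned in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the function names, the temperature parameter `tau`, and the 0.5 threshold are assumptions made for illustration. It relies only on the standard fact that the difference of two independent Gumbel(0, 1) variables follows a Logistic(0, 1) distribution.

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gumbel_sigmoid(logits, tau=1.0, rng=random):
    """Draw one relaxed Bernoulli sample per logit (forward pass only).

    `logits`, `tau`, and `rng` are illustrative names, not taken from the paper.
    """
    soft = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise = math.log(u) - math.log(1.0 - u)  # Logistic(0, 1) sample
        soft.append(_sigmoid((logit + noise) / tau))
    # Hard 0/1 membership decisions obtained by thresholding the soft samples.
    hard = [1 if s > 0.5 else 0 for s in soft]
    return soft, hard


# Example: logits for whether three other agents join one agent's group.
soft, hard = gumbel_sigmoid([2.0, -1.5, 0.3], tau=0.5, rng=random.Random(0))
```

In an automatic-differentiation framework, the hard decisions would typically be used in the forward pass while gradients flow through the soft samples (a straight-through estimator). Whether GTDE uses exactly this variant is not stated in the abstract.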

AAAI 2025

GTDE: Grouped Training with Decentralized Execution for Multi-agent Actor-Critic

multiagent learning

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.





In this paper, we explore how to develop salient object detection (SOD) models using adder neural networks (ANNs), which are more energy efficient than convolutional neural networks (CNNs), especially for real-world applications. Based on our empirical studies, we show that directly replacing the convolutions in CNN-based models with adder layers leads to a substantial loss of activations in the decoder part. This makes the feature maps learned in the decoder lack pattern diversity and hence results in a significant performance drop. To alleviate this issue, by investigating the statistics of the feature maps produced by adder layers, we introduce a simple yet effective differential merging strategy to augment the feature representations learned by adder layers and present a simple baseline for SOD using ANNs. Experiments on popular salient object detection benchmarks demonstrate that our proposed method with a simple feature pyramid network (FPN) architecture achieves comparable performance to previous state-of-the-art CNN-based models and consumes much less energy. We hope this work could facilitate the development of ANNs in binary segmentation tasks.

Exploring Salient Object Detection with Adder Neural Networks

The connections between symbolic rules and neural networks have been explored in various directions, including rule mining through neural networks and rule-based explanation for neural networks. These approaches allow symbolic rules to be extracted from neural network models, which offers explainability to the models. However, the plausibility of the extracted rules is rarely analysed. In this paper, we show that the confidence degrees of extracted rules are generally not high, and we propose a new family of Graph Neural Networks that can be trained with the guidance of rules. Hence, the inference of our model simulates rule reasoning. Moreover, rules with high confidence degrees can be extracted from the trained model that aligns with the inference of the model, which verifies the effectiveness of the rule guidance. Experimental evaluation of knowledge graph reasoning tasks further demonstrates the effectiveness of our model.

Rule-Guided Graph Neural Networks for Explainable Knowledge Graph Reasoning

Dataset distillation (DD) allows datasets to be distilled to fractions of their original size while preserving the rich distributional information so that models trained on the distilled datasets can achieve a comparable accuracy while saving significant computational loads. Recent research in this area has been focusing on improving the accuracy of models trained on distilled datasets. In this paper, we aim to explore a new perspective of DD. We study how to embed adversarial robustness in distilled datasets, so that models trained on these datasets maintain the high accuracy and meanwhile acquire better adversarial robustness. We propose a new method that achieves this goal by incorporating curvature regularization into the distillation process with much less computational overhead than standard adversarial training. Extensive empirical experiments suggest that our method not only outperforms standard adversarial training on both accuracy and robustness with less computation overhead but is also capable of generating robust distilled datasets that can withstand various adversarial attacks.

Towards Adversarially Robust Dataset Distillation by Curvature Regularization

Personalized federated learning (PFL) on graphs is an emerging field focusing on collaborative model development across multiple clients, where each client has a distinct graph data distribution while adhering to strict privacy standards. PFL often requires intensive manual intervention with domain knowledge during model design, which hinders the applications of PFL. Recent advances in AutoML and LLMs enable the automatic design of graph neural network architectures by leveraging the power of neural architecture search and the language generation capability of LLMs. However, several technical challenges persist. First, although LLMs are successful in natural language processing, whether they can be used in graph neural architecture search (GNAS) has not been fully explored. Second, while LLMs can guide graph neural architecture search, they do not directly solve the issue of client drift due to heterogeneous data distributions in federated learning. To address these challenges, we introduce a new method, Personalized Federated Graph Neural Architecture Search (PFGNAS). Our approach uses a new set of task-specific prompts to generate and improve the GNN architectures continuously. To solve client drift, PFGNAS uses a supernet weight-sharing strategy, which can improve the accuracy of each local graph architecture while ensuring local personalization. Extensive evaluations show that PFGNAS significantly outperforms traditional PFL methods, confirming the benefit of integrating LLMs into personalized federated learning. All materials are accessible through https://anonymous.4open.science/r/PFGNAS-07EE.

Large Language Models Enhanced Personalized Graph Neural Architecture Search in Federated Learning

In recent years, applying multi-modal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, MLLMs are built on the well-known Transformer network, which has a less efficient quadratic computation complexity. In this study, we introduce Cobra, a multi-modal large-scale language model built upon a state-space model, which has demonstrated significant potential in efficiently handling long sequences with fast inference and linear scalability with respect to sequence length. Specifically, Cobra involves replacing Transformer-based backbone models (e.g., LLaMA or Phi) with pre-trained Mamba language models. We then empirically explore effective strategies for aligning visual and textual modalities and integrating various pre-trained Mamba model variants with visual encoders. Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3× to 4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2. Additionally, its performance is significantly enhanced thanks to the implementation of linear sequential modeling. (ii) Cobra fine-tunes a small number of parameters (∼48% of model parameters), leading to a significant improvement in overall performance compared to LLaVA.

Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Large Language Models (LLMs) are prone to hallucination with non-factual or unfaithful statements, which undermines their application in real-world scenarios. Recent research focuses on uncertainty-based hallucination detection, which utilizes the output probability of LLMs for uncertainty calculation and does not rely on external knowledge or frequent sampling from LLMs. However, most approaches merely consider the uncertainty of each independent token, while the intricate semantic relations among tokens and sentences are not well studied, which limits the detection of hallucination that spans multiple tokens and sentences in the passage. In this paper, we propose a method to enhance uncertainty modeling with a semantic graph for hallucination detection. Specifically, we first construct a semantic graph that captures the relations among entity tokens and sentences. Then, we incorporate the relations between two entities for uncertainty propagation to enhance sentence-level hallucination detection. Given that hallucination occurs due to the conflict between sentences, we further present a graph-based uncertainty calibration method that integrates the contradiction probability of the sentence with its neighbors in the semantic graph for uncertainty calculation. Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain a substantial improvement of 19.78% in passage-level hallucination detection.
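For context, the token-independent baseline that the abstract contrasts itself with can be sketched as follows. This is an illustrative assumption rather than the paper's method: it scores each token by its negative log-probability and aggregates per sentence, with hypothetical function names, and it deliberately ignores the semantic-graph propagation and calibration that the paper proposes.

```python
import math


def token_uncertainty(prob):
    """Negative log-probability of a generated token (assumed uncertainty measure)."""
    return -math.log(prob)


def sentence_uncertainty(token_probs):
    """Return the mean and the maximum token-level uncertainty of one sentence."""
    scores = [token_uncertainty(p) for p in token_probs]
    return sum(scores) / len(scores), max(scores)


# Example: output probabilities of the tokens in one generated sentence.
mean_u, max_u = sentence_uncertainty([0.9, 0.8, 0.2, 0.95])
```

Because every token is scored in isolation, a hallucination that spans several tokens or sentences can go unnoticed, which is the limitation the abstract sets out to address.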

Enhancing Uncertainty Modeling with Semantic Graph for Hallucination Detection

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding, matching objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs like clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks from visual inputs to refer to objects, complementing previous textual methods for VEL. To facilitate research on this task, we have constructed the MaskOven-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains more than 5 million annotations. As far as we know, this dataset is the first multimodal dataset that aligns pixel-level regions with entity-level labels, which will advance fine-grained visual understanding. Moreover, as pixel masks correspond to semantic regions in an image, we enhance previous patch-interacted attention with region-interacted attention through a visual semantic tokenization approach. Manual evaluation results indicate that the reverse annotation framework achieved a 94.8% annotation success rate. Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method demonstrated a 5-point accuracy improvement over the trained baseline. The MaskOven-Wiki dataset will be available at https://github.com/xxx/xxx.

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Split Federated Learning (SFL) splits and collaboratively trains a shared model between clients and server, where clients transmit activations and client-side models to server for updates. Recent SFL studies assume synchronous transmission of activations and client-side models from clients to server. However, due to significant variations in computational and communication capabilities among clients, activations and client-side models arrive at server asynchronously. The delay caused by asynchrony significantly degrades the performance of SFL. To address this issue, we consider an asynchronous SFL framework, where an activation buffer and a model buffer are embedded on the server to manage the asynchronously transmitted activations and client-side models, respectively. Furthermore, as asynchronous activation transmissions cause the buffer to frequently receive activations from resource-rich clients, leading to biased updates of the server-side model, we propose Generative activations-aided Asynchronous SFL (GAS). In GAS, the server maintains an activation distribution for each label based on received activations and generates activations from these distributions according to the degree of bias. These generative activations are then used to assist in updating the server-side model, ensuring more accurate updates. We derive a tighter convergence bound, and our experiments demonstrate the effectiveness of the proposed method.

GAS: Generative Activation-Aided Asynchronous Split Federated Learning

Recent advances in Multimodal Large Language Models (MLLMs) mainly enhance their performance by improving the image resolution, which not only adds to the already high computational overhead, but also significantly increases the number of redundant tokens. Traditional pruning methods typically focus on foreground elements, making them unsuitable for MLLMs where some questions also involve background context. To address this challenge, in this paper, we introduce OncePrune, an innovative, training-free approach for adaptive visual token pruning in MLLMs. This method strategically eliminates redundant tokens by framing the pruning process as a connected component problem. The aim is to identify and retain the most significant token for each visual element, whether located in the foreground or background. In practice, OncePrune constructs an adjacency matrix from visual token similarities, and obtains the most significant tokens for each component by iterating the potential information flow. To validate OncePrune, we apply it to a set of commonly used MLLMs, including LLaVA-1.5 and LLaVA-NeXT, and conduct extensive experiments on a set of vision-language benchmarks. The experimental results demonstrate that OncePrune significantly reduces computation overhead, especially in fine-grained tasks. For instance, OncePrune reduces FLOPs by 63.57% on LLaVA-NeXT for TextVQA with only a 2% accuracy drop. Our code is provided in the appendix.
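The general idea of pruning via connected components can be sketched as below. This is a simplified illustration under stated assumptions, not OncePrune itself: the cosine-similarity threshold, the function names, and the rule for choosing a representative (the best-connected token in each component) are all hypothetical stand-ins, since the abstract does not specify how the "potential information flow" is iterated.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_tokens(tokens, threshold=0.9):
    """Return sorted indices of one representative token per connected component."""
    n = len(tokens)
    # Adjacency list: link tokens whose similarity is at least the threshold.
    adj = [[j for j in range(n)
            if j != i and cosine(tokens[i], tokens[j]) >= threshold]
           for i in range(n)]
    seen = [False] * n
    keep = []
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first search collects every token reachable from `start`.
        stack, component = [start], []
        seen[start] = True
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if not seen[neighbor]:
                    seen[neighbor] = True
                    stack.append(neighbor)
        # Assumed selection rule: best-connected token, ties go to the lowest index.
        keep.append(max(component, key=lambda i: (len(adj[i]), -i)))
    return sorted(keep)


# Example: five 2-D "visual tokens" forming three groups of similar features.
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [0.7, 0.7]]
kept = prune_tokens(tokens, threshold=0.9)
```

Because each component keeps one token regardless of where it lies in the image, background regions are represented alongside foreground ones, which matches the motivation given in the abstract.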

What kind of visual tokens do we need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph

Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, and hence suffer from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task to address ASR and grammar errors and also transfer the informal text into the formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on the CoS2W performance, and propose methods to exploit contexts and auxiliary information to enhance the outputs. Experimental results reveal that LLMs have the potential to excel in the CoS2W task, particularly in grammaticality and formality, and that our methods enable LLMs to make effective use of contexts and auxiliary information. We further investigate the effectiveness of using LLMs as evaluators and find that LLM evaluators show strong correlations with human evaluations on rankings of faithfulness and formality, which validates the reliability of LLM evaluators for the CoS2W task.
