United States


AAAI 2025

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

social cognition and interaction

Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.
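The abstract names inverse planning as the basis of LIMP but does not spell out the computation. As a rough illustration only, the sketch below shows generic Bayesian inverse planning for a single agent: a posterior over goals is computed from observed actions under the assumption that the agent picks actions with softmax (Boltzmann) probability in their utility. The goals, actions, utility table, and the `beta` rationality parameter are all hypothetical and are not taken from MuMA-ToM or LIMP.

```python
import math

# Hypothetical toy utilities: how much each action helps each goal.
UTILITY = {
    ("walk_to_kitchen", "get_apple"): 1.0,
    ("walk_to_bedroom", "get_apple"): 0.0,
    ("walk_to_kitchen", "get_book"): 0.0,
    ("walk_to_bedroom", "get_book"): 1.0,
}
GOALS = ["get_apple", "get_book"]

# Each observation: (actions that were available, action actually chosen).
OBSERVED = [(("walk_to_kitchen", "walk_to_bedroom"), "walk_to_kitchen")] * 2


def utility(action, goal):
    """Look up the toy utility; unknown pairs count as 0."""
    return UTILITY.get((action, goal), 0.0)


def goal_posterior(observations, goals, action_utility, prior=None, beta=2.0):
    """P(goal | actions) is proportional to P(goal) * prod_t P(action_t | goal).

    Assumes a Boltzmann-rational agent: P(a | goal) is a softmax over
    beta * utility(a, goal) across the available actions.
    """
    if prior is None:
        prior = {g: 1.0 / len(goals) for g in goals}
    log_post = {}
    for g in goals:
        lp = math.log(prior[g])
        for available, chosen in observations:
            scores = [beta * action_utility(a, g) for a in available]
            top = max(scores)
            # Log of the softmax normalizer, computed stably.
            log_z = top + math.log(sum(math.exp(s - top) for s in scores))
            lp += beta * action_utility(chosen, g) - log_z
        log_post[g] = lp
    # Normalize in log space to avoid underflow.
    top = max(log_post.values())
    unnorm = {g: math.exp(lp - top) for g, lp in log_post.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}


if __name__ == "__main__":
    print(goal_posterior(OBSERVED, GOALS, utility))
```

In this toy setting, two consecutive moves toward the kitchen shift most of the posterior mass to the hypothetical `get_apple` goal. A multi-agent version would additionally reason about each agent's beliefs about the other's goal, which this sketch does not attempt.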

technical paper

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)


Computing an optimal classification tree that provably maximizes training performance within a given size limit is NP-hard, and in practice most state-of-the-art methods do not scale beyond computing optimal trees of depth three. Therefore, most methods rely on a coarse binarization of continuous features to maintain scalability. We propose a novel algorithm that directly optimizes on the continuous feature data using dynamic programming with branch-and-bound. We develop new lower-bounding techniques that eliminate many sub-optimal splits in the search when they are similar to previously computed splits, and we provide an efficient subroutine for computing optimal depth-two trees. Our experiments demonstrate that these techniques yield a runtime improvement of two orders of magnitude over state-of-the-art optimal methods and improve test accuracy by 5% over greedy heuristics.

Optimal Classification Trees for Continuous Feature Data Using Dynamic Programming with Branch-and-Bound
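To make "optimizing directly on the continuous feature data" concrete, here is a minimal sketch of the simplest building block: finding the single threshold on one continuous feature that minimizes training misclassifications, by scanning every midpoint between consecutive sorted values. This is not the paper's dynamic-programming or branch-and-bound algorithm; the function name and the toy data are assumptions made for illustration, and labels are assumed to be binary (0 or 1).

```python
def best_threshold_split(values, labels):
    """Return (threshold, errors) for the best split `x <= threshold`.

    Each side of the split predicts its majority class. A threshold of
    None means that no split beats predicting the overall majority class.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(y for _, y in pairs)

    best_thr = None
    best_err = min(total_pos, n - total_pos)  # error of a single leaf

    left_pos = 0
    for i in range(n - 1):
        left_pos += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold can separate equal feature values
        left_n = i + 1
        right_n = n - left_n
        right_pos = total_pos - left_pos
        # Minority count on each side is that side's misclassification count.
        err = min(left_pos, left_n - left_pos) + min(right_pos, right_n - right_pos)
        if err < best_err:
            best_err = err
            best_thr = (pairs[i][0] + pairs[i + 1][0]) / 2
    return best_thr, best_err


if __name__ == "__main__":
    xs = [0.1, 0.35, 0.4, 0.8, 0.9]
    ys = [0, 0, 1, 1, 1]
    print(best_threshold_split(xs, ys))
```

Because every distinct midpoint is a candidate, the number of possible splits grows with the number of distinct feature values, which is one reason a fixed coarse binarization is a common shortcut.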

Understanding causal relationships among the variables of a system is paramount to explain and control its behaviour. For many real world systems, however, the true causal graph is not readily available and one must resort to predictions made by algorithms or domain experts. Therefore, metrics that quantitatively assess the goodness of a causal graph provide helpful checks before using it in downstream tasks. Existing metrics provide an *absolute* number of inconsistencies between the graph and the observed data, and without a baseline, practitioners are left to answer the hard question of how many such inconsistencies are acceptable or expected. Here, we propose a novel consistency metric by constructing a baseline through node permutations. By comparing the number of inconsistencies with those on the baseline, we derive an interpretable metric that captures whether the graph is significantly better than random. Evaluating on both simulated and real data sets from various domains, including biology and cloud monitoring, we demonstrate that the true graph is not falsified by our metric, whereas the wrong graphs given by a hypothetical user are likely to be falsified.

Toward Falsifying Causal Graphs Using a Permutation-Based Test
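The idea of building a baseline through node permutations can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not the paper's metric: the inconsistency count is a deliberately crude stand-in (an edge counts as violated when its endpoints have near-zero sample correlation), and the function names, the correlation threshold, and the simulated data are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def count_violations(edges, columns, threshold=0.2):
    """Toy inconsistency count: edges whose endpoints look uncorrelated."""
    return sum(
        1 for i, j in edges if abs(pearson(columns[i], columns[j])) < threshold
    )


def permutation_baseline(edges, columns, n_perm=500, seed=0):
    """Compare the graph's violations with those of node-permuted copies.

    Returns (observed_violations, fraction of permuted graphs that have
    at most as many violations as the given graph).
    """
    rng = random.Random(seed)
    nodes = list(range(len(columns)))
    observed = count_violations(edges, columns)
    as_good = 0
    for _ in range(n_perm):
        perm = nodes[:]
        rng.shuffle(perm)
        permuted = [(perm[i], perm[j]) for i, j in edges]
        if count_violations(permuted, columns) <= observed:
            as_good += 1
    return observed, as_good / n_perm


def simulate(n=500, seed=1):
    """Chain x0 -> x1 -> x2 -> x3 plus two independent noise variables."""
    rng = random.Random(seed)
    cols = [[] for _ in range(6)]
    for _ in range(n):
        x0 = rng.gauss(0, 1)
        x1 = x0 + 0.5 * rng.gauss(0, 1)
        x2 = x1 + 0.5 * rng.gauss(0, 1)
        x3 = x2 + 0.5 * rng.gauss(0, 1)
        row = (x0, x1, x2, x3, rng.gauss(0, 1), rng.gauss(0, 1))
        for col, value in zip(cols, row):
            col.append(value)
    return cols


if __name__ == "__main__":
    data = simulate()
    print(permutation_baseline([(0, 1), (1, 2), (2, 3)], data))  # true chain
    print(permutation_baseline([(0, 4), (1, 5), (2, 4)], data))  # wrong graph
```

A small fraction means few random relabelings do as well as the given graph, so the graph is doing better than chance; a fraction near one means the graph is indistinguishable from a random relabeling under this crude check.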

Anomaly detection on graphs has garnered considerable attention due to its critical applications, such as detecting money laundering in financial systems and identifying fake reviews on social networks. Traditional fraud detection methods typically focus on either fraudsters (node-level) or criminal activities (graph-level) in isolation, while availing the latent information associated with transactions (edge-level). Similarly, fake review detection often considers either individual fake reviews (node-level) or collaborative reviewer groups (graph-level) separately, based on designed relationships (edge-level). However, this separation neglects the interconnections and frequent co-occurrences across different levels, limiting the effective use of complementary information to identify community anomalies that emerge from the collective behavior of individual anomalies. Additionally, the inherent imbalance in anomaly detection, where anomalous instances are outnumbered by normal samples, exacerbates these challenges. To address these issues, we propose UniFORM, a unified self-supervised anomaly detection framework that integrates node-, edge-, and graph-level tasks. First, we extract centralized and decentralized communities as multi-grained contexts and employ an energy-based GNN to reveal anomalous properties. Then, we construct the Phantom Pool, Query Pool, and Support Pool by enhancing meta-learning with contrastive learning. Finally, we design a unified loss from intra- and inter-level perspectives. Comprehensive experiments on real-world datasets substantiate that our framework markedly surpasses state-of-the-art methods across these multi-grained levels.

UniFORM: Towards Unified Framework for Anomaly Detection on Graphs

Large Language Models (LLMs) have shown great promise in vulnerability identification. As C/C++ comprise half of the Open-Source Software (OSS) vulnerabilities over the past decade and updates in OSS mainly occur through commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges.
In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++ programs and LLMs. Based on commits, CLNX efficiently converts the source code into a more natural representation while preserving key details. Specifically, CLNX first applies structure-level naturalization to decompose complex programs, followed by token-level naturalization to interpret complex symbols. We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. The results demonstrate that CLNX substantially improves the ability of LLMs to detect C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new state-of-the-art and identifies 38 OSS vulnerabilities in the real world.

CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification

Multi-agent collaborative perception is expected to significantly improve perception performance by overcoming the limitations of single-agent perception through exchanging complementary information. However, training a robust collaborative perception model requires collecting sufficient training data that covers all possible collaboration scenarios, which is impractical due to intolerable deployment costs. Hence, the trained model is not robust against new traffic scenarios with inconsistent data distributions, which fundamentally restricts its real-world applicability. Further, existing methods, such as domain adaptation, have mitigated this issue by exposing the deployment data during the training stage but incur a high training cost, which is infeasible for resource-constrained agents. In this paper, we propose a Parameter-Efficient Fine-Tuning-based lightweight framework, CoPEFT, for quickly adapting a trained collaborative perception model to new deployment environments under low-cost conditions. CoPEFT develops a Collaboration Adapter and Agent Prompt to perform macro-level and micro-level adaptations separately. Specifically, the Collaboration Adapter utilizes the inherent knowledge from training data and limited deployment data to adapt the feature map to the new data distribution. The Agent Prompt further enhances the Collaboration Adapter by inserting fine-grained contextual information about the environment. Extensive experiments demonstrate that our CoPEFT surpasses existing methods with less than 1% trainable parameters, proving the effectiveness and efficiency of our proposed method. The code will be open-sourced following the acceptance of this paper.

CoPEFT: Fast Adaptation Framework for Multi-Agent Collaborative Perception with Parameter-Efficient Fine-Tuning

In recent years, reconstructing indoor scene geometry from multi-view images has achieved encouraging accomplishments. Current methods incorporate monocular priors into neural implicit surface models to achieve high-quality reconstructions. However, these methods require hundreds of images for scene reconstruction. When only a limited number of views are available as input, the performance of monocular priors deteriorates due to scale ambiguity, leading to the collapse of the reconstructed scene geometry. In this paper, we propose a new method, named Sparis, for indoor surface reconstruction from sparse views. Specifically, we investigate the impact of monocular priors on sparse scene reconstruction, introducing a novel prior based on inter-image matching information. Our prior offers more accurate depth information while ensuring cross-view matching consistency. Additionally, we employ an angular filter strategy and an epipolar matching weight function, aiming to reduce errors due to view matching inaccuracies, thereby refining the inter-image prior for improved reconstruction accuracy. The experiments conducted on widely used benchmarks demonstrate superior performance in sparse-view scene reconstruction.

Sparis: Neural Implicit Surface Reconstruction of Indoor Scenes from Sparse Views

Applying Reinforcement Learning (RL) to Restless Multi-Arm Bandits (RMABs) offers a promising avenue for addressing allocation problems with resource constraints and temporal dynamics. However, classic RMAB models largely overlook the challenges of (systematic) data errors—a common occurrence in real-world scenarios due to factors like varying data collection protocols and intentional noise for differential privacy. We demonstrate that conventional RL algorithms used to train RMABs can struggle to perform well in such settings. To solve this problem, we propose the first communication learning approach in RMABs, where we study which arms, when involved in communication, are most effective in mitigating the influence of such systematic data errors. 
In our setup, the arms receive Q-function parameters from similar arms as messages to guide behavioral policies, steering Q-function updates. We learn communication strategies by considering the joint utility of messages across all pairs of arms and using a Q-network architecture that decomposes the joint utility. Both theoretical and empirical evidence validate the effectiveness of our method in significantly improving RMAB performance across diverse problems.

The Bandit Whisperer: Communication Learning for Restless Bandits

Person re-identification (Re-ID) is crucial for intelligent surveillance systems, facilitating the identification of individuals across multiple camera views. While significant advancements have been made for daytime scenarios, ensuring reliable Re-ID performance during nighttime remains a significant challenge. Given the cost and limited accessibility of infrared cameras, we investigate a critical question: Can RGB cameras be effectively utilized for accurate Re-ID during nighttime? To address this, we introduce NightReID, a large-scale RGB Re-ID dataset collected from a real-world nighttime surveillance system. NightReID includes 1,500 identities and over 53,000 images, capturing diverse scenes with complex lighting and adverse weather conditions. This rich dataset provides a valuable benchmark for advancing nighttime Re-ID research. Moreover, we propose two novel modules to enhance nighttime Re-ID performance. First, an unsupervised Image Enhancement and Denoising (IED) method is designed to improve the quality of nighttime images, preserving critical details while removing noise without requiring paired ground truth. Second, we introduce Data Distribution Alignment (DDA) through statistical priors, aligning the distributions between pre-training data and nighttime data to mitigate domain shift. Extensive experiments on multiple nighttime Re-ID datasets demonstrate the significance of NightReID and validate the efficacy, flexibility, and applicability of our proposed methods.

NightReID: A Large-Scale Nighttime Person Re-Identification Benchmark

The ability to reason at multiple levels of temporal abstraction is a fundamental aspect of intelligence. In reinforcement learning (RL), this attribute is often modelled through temporally extended courses of action called options (Sutton et al., 1999). In this talk, I will introduce a general framework for option discovery, which uses the agent's representation to discover useful options (Machado et al., 2023). By leveraging these options to generate a rich stream of experience, the agent can improve its representations and learn more effectively. This representation-driven option discovery approach creates a virtuous cycle of refinement, continuously improving both the representation and options, and it is particularly effective for problems where agents need to operate at varying levels of abstraction to succeed.

Representation-driven Option Discovery in Reinforcement Learning

Representation learning constructs low-dimensional representations to summarize essential features of high-dimensional data. This learning problem is often approached by describing various desiderata associated with learned representations; e.g., that they be non-spurious, efficient, or disentangled. It can be challenging, however, to turn these intuitive desiderata into formal criteria that can be measured and enhanced based on observed data. In this paper, we take a causal perspective on representation learning, formalizing desiderata like non-spuriousness and demonstrating their practical utility.
