Singapore

Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussians for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When further reducing the number of transmitted Gaussians, our method still achieves a +1.9 improvement in mIoU, using only 34.6% communication volume, highlighting robust performance under limited communication budgets.
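The "simple rigid alignment" mentioned in the abstract can be illustrated with a minimal sketch. It assumes each Gaussian is stored as a 3D mean and a 3x3 covariance, and that the relative pose between two agents is given as a rotation matrix and a translation vector. The abstract does not specify the parameterization, so the names `align_gaussian` and `yaw_rotation` are hypothetical and this is not the authors' implementation. Under a rigid transform the mean maps to `R @ mean + t` and the covariance to `R @ cov @ R.T`; the semantic label attached to a primitive is unchanged.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small dense matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def yaw_rotation(theta: float) -> Matrix:
    """Rotation about the z-axis by theta radians (hypothetical helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def align_gaussian(
    mean: Vector, cov: Matrix, rotation: Matrix, translation: Vector
) -> Tuple[Vector, Matrix]:
    """Move one Gaussian from a sender's frame into a receiver's frame.

    Assumed representation: mean (3,) and covariance (3x3).
    """
    new_mean = [
        sum(rotation[i][k] * mean[k] for k in range(3)) + translation[i]
        for i in range(3)
    ]
    new_cov = matmul(matmul(rotation, cov), transpose(rotation))
    return new_mean, new_cov


# Example: a Gaussian elongated along x, seen by an agent that is rotated
# by 90 degrees and offset by 10 m along x relative to the receiver.
mean, cov = align_gaussian(
    mean=[1.0, 0.0, 0.0],
    cov=[[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    rotation=yaw_rotation(math.pi / 2),
    translation=[10.0, 0.0, 0.0],
)
print(mean)  # approximately [10, 1, 0]
print(cov)   # approximately diag(1, 4, 1): the long axis now points along y
```

Because geometry and semantics live in the same primitive, no per-pixel depth map or feature warping is needed for this step; only the pose between agents is required.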

AAAI 2026

Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction

cv: scene analysis & understanding; cv: vision for robotics & autonomous driving; cv: 3d computer vision

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Safe reinforcement learning (RL) has emerged as a key
paradigm for deploying AI in high-stakes domains such as
autonomous driving, robotics, healthcare, and recommender
systems. By embedding explicit constraints into the
learning process, safe RL enables agents to optimize
performance while satisfying critical requirements,
including collision avoidance, resource limits, and system
reliability. Such guarantees are indispensable for
real-world AI, where failures can cause physical harm,
economic loss, or loss of trust. At the same time, demand
for trustworthy AI continues to grow as machine learning is
increasingly deployed in human-centered applications. This
makes it essential to design RL algorithms that are not
only efficient but also reliable, robust, and aligned with
societal needs.

This talk will survey recent progress on the design of safe
and efficient RL algorithms with theoretical guarantees,
focusing on both online and offline settings. I will begin
by outlining the fundamental differences between standard
RL and safe RL, highlighting unique challenges such as the
absence of a Bellman optimality equation, which
necessitates stochastic policies, and the impracticality of
assuming full dataset coverage in offline settings. These
structural gaps underscore the need for new algorithms that
provide both efficiency and rigorous safety guarantees.

Safe Reinforcement Learning for Trustworthy AI: Theory, Algorithms, and Applications

Institutions are key to creating societies that are efficient, fair, and benevolent. Despite their importance, the complexities of human (networked) societies make it difficult to understand how formal institutions form and how they shape human communities. Artificial intelligence (AI) can potentially raise understanding in this regard. Thus, in this paper, we present a simulation model utilizing AI agents to simulate networked societies that contain formal institutions. We then observe the outputs of the resulting model under different societal conditions and formal institutions, and (where applicable) compare and contrast these outputs with political and economic theories. Our model outputs (a) address how inequality impacts societal prosperity, (b) illuminate how institutions can potentially impact poverty, and (c) give insights into the attributes of formal institutions that individuals are inclined to support. These and future simulation models can potentially inform how AI can support the design and development of institutions that facilitate healthier communities and nations.

Toward Simulating Networked Societies with Formal Institutions Using AI Agents

Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, focus on fact recall and prediction preservation for unrelated inputs to assess locality. However, these typically rely on low-similarity or random input pairs, which can overstate editing success and obscure overfitting effects. To address this limitation, we propose a comprehensive locality evaluation framework for MMED, spanning three key dimensions: **random-image locality, no-image locality,** and **consistent-image locality**. These dimensions are operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. In addition, we introduce **dynamic evaluation for visual question answering (De-VQA)**, which dynamically selects data samples based on the specific edits applied. This exposes limitations in existing locality metrics. Using De-VQA, we uncover a phenomenon we term **transient blindness**, a form of overfitting where edited models overly rely on textual input similar to the edit, while disregarding relevant visual information. We analyze this effect by quantifying cross-modal token contributions, revealing that edits tend to disproportionately affect textual tokens, resulting in excessive dependence on language. To mitigate this problem, we propose locality-aware adversarial losses that encourage a more balanced integration of textual and visual representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality preservation by an average of 17% across multiple models and datasets.

Uncovering and Mitigating Transient Blindness in Multimodal Model Editing

Rectified Flow (RF) has been widely used as an effective generative model. Although RF is primarily based on probability flow Ordinary Differential Equations (ODE), recent studies have shown that injecting noise through reverse-time Stochastic Differential Equations (SDE) for sampling can achieve superior generative performance. Inspired by Positive-incentive Noise ($\pi$-noise), we propose an innovative generative algorithm to train $\pi$-noise generators, namely Rectified Noise ($\Delta$RN), which improves the generative performance by injecting $\pi$-noise into the velocity field of pre-trained RF models. With the Rectified Noise pipeline, pre-trained RF models can be efficiently transformed into $\pi$-noise generators. We validate Rectified Noise by conducting extensive experiments across various model architectures on different datasets. Notably, we find that (1) RF models using Rectified Noise reduce FID from 10.16 to 9.05 on ImageNet-1k, and (2) the $\pi$-noise generators achieve improved performance with only 0.39% additional training parameters.

Rectified Noise: A Generative Model Using Positive-incentive Noise

We study the problem of allocating indivisible goods among agents with additive valuation functions to achieve both fairness and efficiency under the constraint that each agent receives exactly the same number of goods (the balanced constraint). While this constraint is common in real-world scenarios such as team drafts or asset division, it significantly complicates the search for allocations that are both fair and efficient. Envy-freeness up to one good (EF1) is a well-established fairness notion for indivisible goods. Pareto optimality (PO) and its stronger variant, fractional Pareto optimality (fPO), are widely accepted efficiency criteria. Our main contribution establishes both the existence and polynomial-time computability of allocations that are simultaneously EF1 and fPO under balanced constraints in two fundamental cases: (1) when agents have at most two distinct types of valuation functions, and (2) when each agent has a personalized bivalued valuation. Our algorithms leverage novel applications of maximum-weight matching in bipartite graphs and duality theory, providing the first polynomial-time solutions for these cases and offering new insights for constrained fair division problems.
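The two conditions named in the abstract, EF1 and the balanced constraint, can be checked directly for a given allocation. The following is a minimal sketch under additive valuations, using the standard definition of EF1: agent i does not envy agent j once some single good is removed from j's bundle. It only verifies an allocation; it is not the paper's matching-based algorithm and does not test fPO. The function names and the example valuations are hypothetical.

```python
from typing import List

Valuations = List[List[int]]   # valuations[i][g]: value of good g to agent i
Allocation = List[List[int]]   # allocation[i]: goods assigned to agent i


def bundle_value(valuation: List[int], bundle: List[int]) -> int:
    """Additive valuation: the value of a bundle is the sum over its goods."""
    return sum(valuation[g] for g in bundle)


def is_balanced(allocation: Allocation) -> bool:
    """True when every agent receives exactly the same number of goods."""
    return len({len(bundle) for bundle in allocation}) <= 1


def is_ef1(valuations: Valuations, allocation: Allocation) -> bool:
    """Check envy-freeness up to one good for every ordered pair of agents."""
    n = len(valuations)
    for i in range(n):
        own = bundle_value(valuations[i], allocation[i])
        for j in range(n):
            if i == j or not allocation[j]:
                continue
            other = bundle_value(valuations[i], allocation[j])
            # Dropping the good that agent i values most is the most
            # favorable removal, so it is enough to test that one.
            best_removed = max(valuations[i][g] for g in allocation[j])
            if own < other - best_removed:
                return False
    return True


# Two agents with identical valuations over four goods.
vals = [[5, 4, 0, 0], [5, 4, 0, 0]]

split = [[0, 2], [1, 3]]    # each agent gets one valuable good
greedy = [[0, 1], [2, 3]]   # agent 0 takes both valuable goods

print(is_balanced(split), is_ef1(vals, split))    # True True
print(is_balanced(greedy), is_ef1(vals, greedy))  # True False
```

The second allocation shows why the balanced constraint does not imply fairness on its own: both agents hold two goods, yet agent 1 still envies agent 0 after any single good is removed.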

Fair and Efficient Balanced Allocation for Indivisible Goods

Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks (predominantly boxes with numeric identifiers) before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. To address this, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization by up to 11 percentage points.

Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. First, we present CTFJudge, a framework leveraging LLMs as judges to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric, the CTF Competency Index (CCI), for partial correctness, revealing how closely agent solutions align with human-crafted gold standards. Third, we examine how LLM hyperparameters—namely temperature, top-p, and maximum token length—influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi-agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity.

Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional state-space model equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging hour-long video understanding tasks, our approach demonstrates competitive results against state-of-the-art models, while significantly reducing overall token budget. Notably, replacing our state-space model with conventional modules results in substantial performance degradation, highlighting the advantages of the proposed state-space modeling for effectively compressing multi-frame video information. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.

State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models

Today's large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly examine the opinions LLMs represent for demographic groups in the United States or a few other countries. They lack worldwide country samples, do not cover human opinions in different historical periods, and do not discuss using language to steer LLMs. Moreover, they overlook the potential influence of prompt language on the alignment of LLMs' opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs align appropriately, or over-align, with the opinions of only a few countries while under-aligning with the opinions of most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire steers LLMs toward the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions.

On the Alignment of Large Language Models with Global Human Opinion

Delivering judicial decisions requires interpreting complex legal texts, analyzing evidence, and reasoning over jurisprudence and legal principles. Recent advances in Generative Artificial Intelligence, particularly Large Language Models (LLMs), have shown potential to automate parts of this process, yet practical, measurable benefits in real-world judicial settings remain limited. This paper introduces SARA, an LLM-powered legal reasoning platform deployed in a regional Brazilian court, which demonstrates significant efficiency and quality gains through the integration of LLM agents with a Jurisprudential Knowledge Graph (Jur-KG). SARA automatically extracts and structures key elements from legal documents (including claims, requests, and evidence) and generates reasoning grounded in retrieved jurisprudential precedents. The Jur-KG, modeled through an ontology encompassing concepts such as *LegalRelation*, *LegalGrounds*, and *LegalClaims*, enables semantic matching and retrieval of relevant case law. By representing cases according to the Legal Case Ontology for the Brazilian Judicial System, SARA supports traceable reasoning and addresses competence questions to assess coverage, coherence, and justification of AI-generated outputs. Deployment results indicate measurable improvements in processing time, consistency, and explainability, while ensuring compliance with ethical and legal guidelines established by Brazil’s National Council of Justice. This work demonstrates that combining LLM-based agents with domain-specific knowledge graphs can yield both innovative capabilities and proven impact in judicial decision-making.
