Singapore

We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. Our design leverages a bidirectional state-space model equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging hour-long video understanding tasks, our approach demonstrates competitive results against state-of-the-art models, while significantly reducing the overall token budget. Notably, replacing our state-space model with conventional modules results in substantial performance degradation, highlighting the advantages of the proposed state-space modeling for effectively compressing multi-frame video information. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.

AAAI 2026

State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models

nlp: (large) language models

nlp: language grounding & multi-modal nlp

cv: language and vision

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Safe reinforcement learning (RL) has emerged as a key
paradigm for deploying AI in high-stakes domains such as
autonomous driving, robotics, healthcare, and recommender
systems. By embedding explicit constraints into the
learning process, safe RL enables agents to optimize
performance while satisfying critical requirements,
including collision avoidance, resource limits, and system
reliability. Such guarantees are indispensable for
real-world AI, where failures can cause physical harm,
economic loss, or loss of trust. At the same time, demand
for trustworthy AI continues to grow as machine learning is
increasingly deployed in human-centered applications. This
makes it essential to design RL algorithms that are not
only efficient but also reliable, robust, and aligned with
societal needs.

This talk will survey recent progress on the design of safe
and efficient RL algorithms with theoretical guarantees,
focusing on both online and offline settings. I will begin
by outlining the fundamental differences between standard
RL and safe RL, highlighting unique challenges such as the
absence of an optimality Bellman equation, which
necessitates stochastic policies, and the impracticality of
assuming full dataset coverage in offline settings. These
structural gaps underscore the need for new algorithms that
provide both efficiency and rigorous safety guarantees.

Safe Reinforcement Learning for Trustworthy AI: Theory, Algorithms, and Applications

Institutions are key to creating societies that are efficient, fair, and benevolent. Despite their importance, the complexities of human (networked) societies make it difficult to understand how formal institutions form and how they shape human communities. Artificial intelligence (AI) can potentially raise understanding in this regard. Thus, in this paper, we present a simulation model utilizing AI agents to simulate networked societies that contain formal institutions. We then observe the outputs of the resulting model under different societal conditions and formal institutions, and (where applicable) compare and contrast these outputs with political and economic theories. Our model outputs (a) address how inequality impacts societal prosperity, (b) illuminate how institutions can potentially impact poverty, and (c) give insights into the attributes of formal institutions that individuals are inclined to support. These and future simulation models can potentially inform how AI can support the design and development of institutions that facilitate healthier communities and nations.

Toward Simulating Networked Societies with Formal Institutions Using AI Agents

The goal of distributionally robust learning is to learn models capable of performing well against distributional shifts, such as latent heterogeneous subpopulations, unknown covariate shifts, or unmodeled temporal effects. Recently, Duchi and Namkoong (2021) proved an upper bound on the excess risk of distributionally robust learning using a covering number argument. However, there are situations where the covering argument fails. This motivates us to study the generalization bound through the lens of Rademacher complexity. More specifically, we consider the Cressie-Read divergence (Cressie and Read, 1984), $f_k(t)\propto t^k-1$. Our theoretical results indicate that the excess risk is of the order $O_P(n^{-\frac{1}{2k_*}})$, where $k_*=\frac{k}{k-1}$. The decay rate of the excess risk increases with increasing $k$. As illustrative examples, we consider three learning settings: 1) linear classifiers; 2) Gaussian reproducing kernel Hilbert spaces; 3) one-hidden-layer networks. The empirical results validate our theoretical findings.
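
As a quick worked illustration of the stated rate (this is plain arithmetic on the formula in the abstract, not a result quoted from the paper, and it assumes $k>1$ so that $k_*$ is well defined), note that $\frac{1}{2k_*}=\frac{k-1}{2k}$, so:

$$
k=2:\ k_*=2,\ O_P(n^{-1/4});\qquad
k=3:\ k_*=\tfrac{3}{2},\ O_P(n^{-1/3});\qquad
k\to\infty:\ k_*\to 1,\ O_P(n^{-1/2}).
$$

The exponent $\frac{k-1}{2k}$ grows toward $\frac{1}{2}$ as $k$ increases, which is consistent with the claim that the excess risk decays faster for larger $k$.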

Rademacher Complexity for Distributionally Robust Learning

Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address these problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views

Recent advancements in personalized Text-to-Video (T2V) generation have made significant strides in synthesizing character-specific content. However, these methods face a critical limitation: the inability to perform fine-grained control over motion intensity. This limitation stems from an inherent entanglement of action semantics and their corresponding magnitudes within coarse textual descriptions, hindering the generation of nuanced human videos and limiting their applicability in scenarios demanding high precision, such as animating virtual avatars or synthesizing subtle micro-expressions. Furthermore, existing approaches often struggle to preserve high identity fidelity when other attributes are modified. To address these challenges, we introduce MotionCharacter, a framework for high-fidelity human video generation with precise motion control. At its core, MotionCharacter explicitly decouples motion into two independently controllable components: action type and motion intensity. This is achieved through two key technical contributions: (1) a Motion Control Module that leverages textual phrases to specify the action type and a quantifiable metric derived from optical flow to modulate its intensity, guided by a region-aware loss that localizes motion to relevant subject areas; and (2) an ID Content Insertion Module coupled with an ID-Consistency loss to ensure robust identity preservation during dynamic motions. To facilitate training for such fine-grained control, we also curate Human-Motion, a new large-scale dataset with detailed annotations for both motion and facial features. Extensive experiments demonstrate that MotionCharacter achieves substantial improvements over existing methods. Our framework excels in generating videos that are not only identity-consistent but also precisely adhere to specified motion types and intensities. The code, dataset and models will be made publicly available upon acceptance.

MotionCharacter: Fine-Grained Motion Controllable Human Video Generation

Spatial transcriptomics provides unprecedented opportunities to analyze gene expression patterns while preserving spatial tissue architecture. However, traditional deep learning methods face significant challenges in multi-modal data integration, spatial dependency modeling, and biological knowledge incorporation, while existing large language models (LLMs) lack explicit spatial modeling capabilities for transcriptomic data. To address these limitations, we present ST-LLM (Spatial Transcriptomics Embedding with Large Language Models), a novel approach that transforms complex spatial graph structures into structured textual representations suitable for LLMs through innovative prompt engineering. ST-LLM features three key components: dynamic graph adjacency construction using reinforcement learning to adaptively optimize spatial relationships, graph-to-text conversion that creates hierarchical descriptions with spatial context, and comprehensive utilization of pre-trained semantic understanding to generate high-dimensional spatial-aware embeddings. Comprehensive experiments on 14 datasets demonstrate that ST-LLM consistently outperforms state-of-the-art methods in spatial domain clustering and region detection tasks. Our framework establishes LLM embeddings as a simple yet powerful paradigm for encoding spatial transcriptomics biological knowledge, opening new avenues for computational spatial biology research.

ST-LLM: Spatial Transcriptomics Embedding with Large Language Models

Experimental design is critical for evidence-based decision-making in healthcare, marketing, and public policy. However, designing efficient experiments across heterogeneous subgroups presents significant challenges. Existing methods often optimize for statistical power or overall sample efficiency, overlooking crucial fairness considerations across these different subgroups. To address this gap, we introduce a Fairness-Aware Contextual Track-and-Stop Design (F-CTSD) algorithm. The proposed F-CTSD algorithm provides statistical guarantees on subgroup fairness while minimizing required sample sizes. We quantify the fairness-efficiency trade-off and derive the sample complexity bound for the proposed F-CTSD algorithm under its fairness constraints. We further theoretically prove that the proposed F-CTSD algorithm consistently produces accurate treatment effect estimates even under fairness requirements, enhancing statistical reliability. Numerical experiments show that the proposed F-CTSD algorithm outperforms existing methods, achieving higher sample efficiency while reducing subgroup fairness violations by 4.95%.

Fairness-Aware Design for Contextual Experiments: Guaranteeing Reliability and Equity in Heterogeneous Subgroups

Real-world knowledge graphs (KGs) contain not only standard triple-based facts, but also more complex, heterogeneous types of facts, such as hyper-relational facts with auxiliary key-value pairs, temporal facts with additional timestamps, and nested facts that imply relationships between facts. These richer forms of representation have attracted significant attention due to their enhanced expressiveness and capacity to model complex semantics in real-world scenarios. However, most existing studies suffer from two main limitations: (1) they typically focus on modeling only specific types of facts, thus making it difficult to generalize to real-world scenarios with multiple fact types; and (2) they struggle to achieve generalizable hierarchical (inter-fact and intra-fact) modeling due to the complexity of these representations. To overcome these limitations, we propose UniHR, a Unified Hierarchical Representation learning framework, which consists of a learning-optimized Hierarchical Data Representation (HiDR) module and a unified Hierarchical Structure Learning (HiSL) module. The HiDR module unifies hyper-relational KGs, temporal KGs, and nested factual KGs into triple-based representations. Then HiSL incorporates intra-fact and inter-fact message passing, focusing on enhancing both semantic information within individual facts and enriching the structural information between facts. To go beyond the unified method itself, we further explore the potential of unified representation in complex real-world scenarios. Extensive experiments on 9 datasets across 5 types of KGs demonstrate the effectiveness of UniHR and highlight the strong potential of unified representations. Code and data are available at https://github.com/zjukg/UniHR.

UniHR: Hierarchical Representation Learning for Unified Knowledge Graph Link Prediction

Today's large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on the opinions LLMs represent among demographic groups in the United States or a few other countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. Moreover, they also overlook the potential influence of prompt language on the alignment of LLMs' opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs align appropriately or over-align with the opinions of only a few countries while under-aligning with those of most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire steers LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions.

On the Alignment of Large Language Models with Global Human Opinion

Delivering judicial decisions requires interpreting complex legal texts, analyzing evidence, and reasoning over jurisprudence and legal principles. Recent advances in Generative Artificial Intelligence, particularly Large Language Models (LLMs), have shown potential to automate parts of this process, yet practical, measurable benefits in real-world judicial settings remain limited. This paper introduces SARA, an LLM-powered legal reasoning platform deployed in a regional Brazilian court, which demonstrates significant efficiency and quality gains through the integration of LLM agents with a Jurisprudential Knowledge Graph (Jur-KG). SARA automatically extracts and structures key elements from legal documents (including claims, requests, and evidence) and generates reasoning grounded in retrieved jurisprudential precedents. The Jur-KG, modeled through an ontology encompassing concepts such as *LegalRelation*, *LegalGrounds*, and *LegalClaims*, enables semantic matching and retrieval of relevant case law. By representing cases according to the Legal Case Ontology for the Brazilian Judicial System, SARA supports traceable reasoning and addresses competence questions to assess coverage, coherence, and justification of AI-generated outputs. Deployment results indicate measurable improvements in processing time, consistency, and explainability, while ensuring compliance with ethical and legal guidelines established by Brazil’s National Council of Justice. This work demonstrates that combining LLM-based agents with domain-specific knowledge graphs can yield both innovative capabilities and proven impact in judicial decision-making.
