Singapore

Just recognizable distortion (JRD) has been introduced for image compression for machines, aiming to quantify the maximum coding distortion that can be tolerated by a specific perception model, thereby defining the upper bound of machine vision redundancy (MVR). However, existing JRD-based redundancy estimation methods face three key challenges: limited dataset annotation accuracy, low prediction efficiency, and insufficient perception accuracy, all of which hinder their practical deployment. To address these limitations, we propose MVR-Net, an efficient frame-wise JRD prediction method that generates the optimal encoding quantization map in a single inference pass. Furthermore, we refine the annotation standard for JRD datasets based on experimental insights, enhancing the precision of recognizable redundancy measurement. Compared to state-of-the-art methods, MVR-Net achieves a superior balance between bitrate reduction and perception accuracy in JRD-guided compression, while offering up to a 40,000× speed improvement, demonstrating its practicality and efficiency for real-world applications.
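To make the notion of a maximum tolerated distortion concrete, the following is a minimal sketch of a brute-force search over quantization levels for one image. It is not MVR-Net, which the abstract describes as predicting a quantization map in a single inference pass. The function name `just_recognizable_level`, the callback `predict_at`, and the toy threshold are hypothetical and chosen only for illustration.

```python
from typing import Callable, Hashable, Iterable, Optional


def just_recognizable_level(
    levels: Iterable[int],
    predict_at: Callable[[int], Hashable],
    reference_label: Hashable,
) -> Optional[int]:
    """Return the coarsest level that still preserves the reference prediction.

    Assumption: `predict_at(level)` stands in for "encode the image at this
    quantization level, decode it, and run the perception model on the result".
    Levels are scanned from finest to coarsest, and the scan stops at the first
    level where the prediction no longer matches the reference.
    """
    tolerated = None
    for level in sorted(levels):
        if predict_at(level) != reference_label:
            break
        tolerated = level
    return tolerated


# Toy stand-in for a perception model: the label flips once quantization
# becomes coarser than level 37 (an invented threshold).
def toy_predict(level: int) -> str:
    return "cat" if level <= 37 else "dog"


result = just_recognizable_level(range(22, 52, 5), toy_predict, "cat")
print(result)
```

A scan like this needs one encode, decode and inference round per candidate level, which is the kind of per-image cost that a single-pass predictor avoids.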

AAAI 2026

The Last Byte: Learning Just Enough for Machine-Oriented Image Compression

just recognizable distortion

machine-oriented image compression

semantic compression

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Belief-based programming is a probabilistic extension of the GOLOG program family where every action and sensing result can be noisy and every test condition refers to the agent’s subjective beliefs. Inherited from GOLOG programs, the action-centered feature makes belief programs well suited to high-level robot control under uncertainty. An important step before deploying such a program is to verify whether it satisfies certain properties. At least two problems arise in verifying such programs: how to formally specify program properties, and what the complexity of the verification problem is.


In this paper, we propose a formalism for belief programs based on a modal logic of actions and beliefs which allows us to conveniently express PCTL-like temporal properties. We also investigate the decidability and undecidability of the verification problem.

A Framework for Belief-based Programs and Their Verification

We study the computational complexity of winner determination problems in approval-based committee elections under Thiele voting rules. These form a class of rules parameterized by a fixed weight vector that specifies how a voter's satisfaction depends on the number of approved candidates elected. We first analyze the structure of optimal solutions based on the sets of voters who approve each candidate (that is, how voters' approval ballots induce dependencies between candidates), revealing constraints on a winning committee under any fixed Thiele voting rule. Using this, we design a set of FPT algorithms for Proportional Approval Voting (PAV) and other Thiele rules on a natural restricted domain known as the Voter Interval (VI) domain, in which, after a suitable ordering of voters, each candidate is approved by a consecutive interval of voters. In particular, we show that every Thiele rule on VI is FPT with respect to a parameter for which the problem is NP-hard on general instances, even when the parameter takes constant values. Our results advance the understanding of the computational complexity of PAV on Voter Interval instances, which remains one of the central open questions in this area [Peters, AAAI 2018]. We further resolve two open questions from the literature on PAV (and other Thiele voting rules) [Yang and Wang, AAMAS 2018] by providing a polynomial-time algorithm for instances where each candidate is approved by at most two voters, and an FPT algorithm parameterized by the total score of a winning committee.
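As an illustration of how a Thiele rule scores a committee, here is a minimal sketch under stated assumptions. It assumes the common convention that the weight vector lists marginal gains, so a voter with j approved candidates on the committee contributes w_1 + ... + w_j, and that PAV uses the harmonic weights (1, 1/2, 1/3, ...). The helper names and the toy ballots are hypothetical, and the exhaustive search is exponential in the committee size; it is not one of the paper's FPT algorithms.

```python
from fractions import Fraction
from itertools import combinations


def pav_weights(k):
    """Harmonic weight vector (1, 1/2, ..., 1/k) used by PAV."""
    return [Fraction(1, i) for i in range(1, k + 1)]


def thiele_score(ballots, committee, weights):
    """Sum of voter satisfactions for a committee under a Thiele rule.

    A voter who approves j committee members contributes
    weights[0] + ... + weights[j - 1].
    """
    members = set(committee)
    total = Fraction(0)
    for ballot in ballots:
        j = len(members & set(ballot))
        total += sum(weights[:j], Fraction(0))
    return total


def best_committee(ballots, candidates, k, weights):
    """Exhaustive winner determination: try every size-k committee."""
    return max(
        combinations(sorted(candidates), k),
        key=lambda committee: thiele_score(ballots, committee, weights),
    )


# Toy election: three voters approve {a, b}, one voter approves {c}.
ballots = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"c"}]
weights = pav_weights(2)
score_ab = thiele_score(ballots, ("a", "b"), weights)
score_ac = thiele_score(ballots, ("a", "c"), weights)
winner = best_committee(ballots, {"a", "b", "c"}, 2, weights)
print(score_ab, score_ac, winner)
```

In this toy instance the committee {a, b} scores 3 × (1 + 1/2) = 9/2, while {a, c} scores 3 × 1 + 1 = 4, so {a, b} wins under PAV.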

Algorithms for Structured Elections Under Thiele Voting Rules

Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks, where the contextual information from external sources may contradict the model’s internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely unaddressed. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses four types of multimodal knowledge conflicts and includes 1,881 knowledge instances and 3,997 images across 32 broad types, collected through automated pipelines with human verification.
We evaluate four representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems.

Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

Large language models (LLMs) have shown impressive multilingual capabilities through pretraining on diverse corpora. While these models show strong reasoning abilities, their performance varies significantly across languages due to imbalanced training data distribution. Existing approaches using sample-level translation for extensive multilingual pretraining and cross-lingual tuning face scalability challenges and often fail to capture nuanced reasoning processes across languages. In this paper, we introduce **AdaMCoT** (Adaptive Multilingual Chain-of-Thought), a framework that enhances multilingual factual reasoning by dynamically routing thought processes in intermediary “thinking languages” before generating target-language responses. AdaMCoT leverages a language-agnostic core and incorporates an adaptive, reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Our comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. An in-depth analysis of the model’s hidden states and semantic space further elucidates the underlying mechanism of our method. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high- and low-resource languages while maintaining cultural and linguistic nuances.

AdaMCoT: Rethinking Cross-Lingual Factual Reasoning Through Adaptive Multilingual Chain-of-Thought

Infrared and visible image fusion (IVIF) integrates complementary visual information to produce enhanced representations. However, most existing IVIF methods generate fixed outputs, lacking the flexibility to adapt to user-specified requirements. Recent text-guided approaches offer partial controllability but remain limited to global or semantic-level fusion, unable to achieve instance-level control. This limitation primarily arises from two challenges: the absence of datasets linking textual instructions with corresponding spatial annotations, and the use of coarse cross-modal alignment methods incapable of accurately matching textual inputs with visual features. To overcome these challenges, we propose ControlFuse, a controllable IVIF framework enabling multi-granularity fusion guided directly by user instructions. First, we construct an automated multi-granularity dataset that provides explicit textual-mask correspondences at global, semantic, and instance levels. Second, inspired by manifold geometry, we design a multimodal feature interaction module consisting of a Feature Manifold Converter (FMC) and Curvature-Guided Interaction (CGI). FMC projects textual and visual features into a unified manifold space, while CGI leverages manifold curvature as a geometric cue to refine cross-modal alignment. Extensive experiments demonstrate that ControlFuse achieves precise and flexible controllability across different fusion granularities, benefiting high-level tasks.

ControlFuse: Instruction-guided Multi-Granularity Controllable Image Fusion

The rapid advancement of AIGC techniques has unlocked opportunities in generating diverse and compelling advertisement images based on referenced product images and textual scene descriptions.
This capability substantially reduces human labor and production costs in traditional marketing workflows.
However, existing AIGC techniques either demand extensive fine-tuning for each referenced image to achieve high fidelity, or they struggle to maintain fidelity across diverse products, making them impractical for e-commerce and marketing industries.
To tackle this limitation, we first construct AdProd-100K, a large-scale advertising image generation dataset.
A key innovation in its construction is our dual data augmentation strategy, which fosters robust, 3D-aware representations crucial for realistic and high-fidelity image synthesis.
Leveraging this dataset, we propose RefAdGen, which ensures high fidelity by integrating product features into the text-described scene using the Mask-Guided Attention Fusion (MGAF) mechanism.
Extensive experiments conducted on AdProd-100K demonstrate that RefAdGen achieves state-of-the-art performance in both product fidelity and overall generation quality, offering a scalable and cost-effective alternative to traditional workflows while achieving remarkable visual results.

RefAdGen: High-Fidelity Advertising Image Generation

The rapid proliferation of smart-city ecosystems has significantly amplified the demand for Li-ion batteries, which now serve as the primary energy source for sustainable transportation systems such as e-bikes. Ensuring battery safety and optimal performance is crucial, yet challenging due to complex intrinsic dynamics and extrinsic operating conditions. This paper presents LiBrain, an innovative LLM-powered, time-series-aware retrieval-augmented framework designed to simultaneously address both safety and performance challenges through three synergistic components: (1) a distributed IoT-enabled edge network for continuous real-time battery monitoring and data acquisition, (2) a pretrained deep multi-task diagnostic engine capable of comprehensive battery performance forecasting, and (3) a knowledge-base augmentation module that transforms technical diagnostics into clear, actionable guidance tailored for e-bike users. Functioning as an intelligent battery management assistant, LiBrain effectively bridges the gap between expert-level real-time analytics and practical, user-friendly instructions. Extensive validation across a real-world operational e-bike battery-swap network demonstrates LiBrain's exceptional capabilities, achieving a 95% adoption rate in hazardous alarm detection and 92% in battery-status prediction. In real-world deployment, LiBrain has processed over 500 million battery events, managed almost 10 million inquiries and 1 million alarms annually, and identified 10% of on-site batteries daily for proactive replacement, thereby maintaining operational safety and reliability.

LiBrain: LLM-Powered Li-ion Battery Diagnostics with Time-Series-Aware Retrieval-Augmented Framework for E-bikes

Multimodal sentiment analysis remains a challenging task due to the inherent heterogeneity across modalities. Such heterogeneity often manifests as asynchronous signals, imbalanced information between modalities, and interference from task-irrelevant noise, hindering the learning of robust and accurate sentiment representations. To address these issues, we propose a factorized multimodal fusion framework that first disentangles each modality into shared and unique representations, and then suppresses task-irrelevant noise within both to retain only sentiment-critical representations. This fine-grained decomposition improves representation quality by reducing redundancy, promoting cross-modal complementarity, and isolating task-relevant sentiment cues. Rather than manipulating the feature space directly, we adopt a mutual information–based optimization strategy to guide the factorization process in a more stable and principled manner. To further support feature extraction and long-term temporal modeling, we introduce two auxiliary modules: a Mixture of Q-Formers, placed before factorization, which uses learnable queries to extract fine-grained affective features from multiple modalities, and a Dynamic Contrastive Queue, placed after factorization, which stores the latest high-level representations for contrastive learning, enabling the model to capture long-range discriminative patterns and improve class-level separability. Extensive experiments on multiple public datasets demonstrate that our method consistently outperforms existing approaches, validating the effectiveness and robustness of the proposed framework.

FINE: Factorized Multimodal Sentiment Analysis via Mutual INformation Estimation

Major progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs to generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short in instilling explicit reasoning capabilities into reward models. To bridge this gap, we propose a self-training approach that can leverage unlabeled data to scale up reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as policy optimization and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.

GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning

Quantifying and understanding human-AI alignment in high-risk tasks such as traffic accident prediction is crucial for the deployment of AI systems. Existing alignment studies, however, focus mostly on the static domain and neglect the importance of attentional processing. Here, we present Attention‑DADA, a dataset of accident and non-accident traffic situations that contains detailed human prediction and frame-level eye gaze annotations. Using this benchmark, we evaluate open- and closed-source state‑of‑the‑art large vision-language models (VLMs) in terms of their alignment in accident prediction performance and attentional processing in both zero-shot and attention-guided settings. Our results show that human prediction performance and consistency improve as the event time approaches. Similarly, human attentional patterns show dynamic updating throughout the event progression. In contrast, while attention guidance improves VLM prediction performance, both performance and attentional alignment stay significantly below human levels. These results provide the first quantitative evidence of misalignment in terms of both performance and attentional processing during analysis of time-critical, dynamic events, highlighting the need for future improvements in this area. Attention‑DADA and all evaluation code are released on GitHub.
