Current Multimodal Chain-of-Thought (MCoT) methods suffer from low-quality multimodal reasoning: they overthink simple queries and use visual information inefficiently, wasting substantial computation. In this paper, we show that Multimodal Large Language Models (MLLMs) possess inherent, underutilized capabilities to distinguish simple from difficult queries and to enhance task-related visual information. Based on this insight, we propose Self-Driven Refined Multimodal CoT (SDR-MCoT), a training-free framework that mitigates these issues through two self-driven modules. First, a selective thinking module uses entropy-based confidence estimation to decide whether a query requires detailed reasoning, preventing overthinking on simple questions. Second, a step-wise visual enhancement module strengthens attention to relevant visual regions at each reasoning step without inserting additional tokens, achieving fine-grained visual grounding and enhancement with minimal overhead. Moreover, SDR-MCoT can be seamlessly integrated into various MLLMs, offering a practical route to better multimodal reasoning. Comprehensive experiments on eight benchmarks spanning diverse domains (multimodal reasoning, visual understanding, hallucination, and mathematical reasoning) demonstrate that SDR-MCoT consistently outperforms existing MCoT methods on four different base models with reduced overhead. For instance, on Qwen2-VL-7B, our method improves average accuracy by over 6% while reducing token consumption by approximately 60% compared to zero-shot CoT.
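The selective thinking idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the threshold value, function names, and the use of average Shannon entropy over the model's direct-answer token distributions are all illustrative assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_detailed_reasoning(token_distributions, threshold=1.0):
    """Hypothetical selective-thinking gate: flag a query as difficult when
    the model's average uncertainty over its direct-answer tokens exceeds
    a threshold. The threshold of 1.0 nat is an arbitrary illustration."""
    entropies = [token_entropy(p) for p in token_distributions]
    return sum(entropies) / len(entropies) > threshold

# A peaked distribution -> low entropy -> answer directly, skip detailed CoT.
easy = [[0.97, 0.01, 0.01, 0.01]]
# A near-uniform distribution -> high entropy -> trigger full reasoning.
hard = [[0.25, 0.25, 0.25, 0.25]]
print(needs_detailed_reasoning(easy))  # False
print(needs_detailed_reasoning(hard))  # True
```

In this toy setup, confident (low-entropy) predictions bypass chain-of-thought generation entirely, which is one plausible way the reported token savings on simple queries could arise.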
