Singapore

Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks, yet non-transitive preferences (where evaluators prefer A over B, B over C, but C over A) fundamentally undermine ranking reliability. We show that this critical issue stems largely from low-quality data that contains inherently ambiguous preference pairs. To address this challenge, we propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs and systematically identifies problematic training data. ELSPR quantifies non-transitivity through strongly connected component (SCC) analysis and measures overall preference clarity using a novel normalized directed graph structural entropy metric. Our filtering methodology selectively removes preference data that induce non-transitivity while preserving transitive preferences. Extensive experiments on the AlpacaEval benchmark demonstrate that models fine-tuned on ELSPR-filtered data achieve substantial improvements: a 13.8% reduction in non-transitivity, a 0.088 decrease in structural entropy, and significantly enhanced discriminative power in real-world evaluation systems. Human validation confirms that discarded data exhibit dramatically lower inter-annotator agreement (34.4% vs. 52.6%) and model-human consistency (51.2% vs. 80.6%) compared to cleaned data. These findings establish ELSPR as an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems.

AAAI 2026

ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

learning preferences or rankings

learning human values and preferences

applications

poster
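As a rough illustration of the SCC analysis mentioned in the ELSPR abstract above, the Python sketch below builds a directed preference graph and finds its strongly connected components with Kosaraju's algorithm. Any component with more than one member contains a preference cycle. The function names, the toy preferences, and the "share of items caught in a cycle" summary are assumptions made for this example; they are not the paper's implementation or its entropy metric.

```python
def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm, written iteratively. edges are (winner, loser) pairs."""
    graph = {n: [] for n in nodes}
    reverse = {n: [] for n in nodes}
    for winner, loser in edges:
        graph[winner].append(loser)
        reverse[loser].append(winner)

    # First pass: record nodes in order of DFS completion.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: this node is finished.
                order.append(node)
                stack.pop()

    # Second pass: sweep the reversed graph in reverse completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components


# Hypothetical toy preferences: A > B > C > A forms a cycle; everyone beats D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")]

components = strongly_connected_components(nodes, edges)
cyclic = [c for c in components if len(c) > 1]
share_in_cycles = sum(len(c) for c in cyclic) / len(nodes)
print(components, share_in_cycles)
```

In a fully transitive tournament every component has size one, so the responses inside larger components are the natural candidates for closer inspection.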

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 


Advanced image generative models have led to concerns about malicious use, underscoring the necessity for generalizable detection methods. However, existing approaches tend to overfit to domain-specific forgery patterns, while overlooking complementary cues from different domains. Therefore, we introduce DySy-Det (Dynamic Synergy Detector), a novel framework that mines collaborative and robust forgery artifacts from multiple evidence domains. First, DySy-Det fine-tunes a CLIP vision transformer to extract high-level semantics for identifying conceptual inconsistencies, while generating attention maps that pinpoint key discriminative regions. Then, this semantic guidance, in the form of a mask, directs a targeted reconstruction process. By focusing on these salient areas, our approach effectively extracts localized reconstruction errors, thereby filtering out irrelevant background noise. Furthermore, inspired by the intrinsic generative mechanics of diffusion models, we introduce the concept of Reconstruction-Path Consistency (RPC), which quantifies the temporal stability of the denoising trajectory to expose dynamic generative artifacts. We capture this by computing noise alignment scores across multiple timesteps and encode them via a lightweight network. Extensive evaluations on GenImage and UniversalFakeDetect benchmarks demonstrate that DySy-Det outperforms the state-of-the-art detector by 6.14% and 1.57% in mean accuracy, respectively.

DySy-Det: A Synergistic Framework with Dynamic Reconstruction-Path Consistency for AI-Generated Image Detection

Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose **R**ecursive **E**valuation and **A**daptive **P**lanning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP's performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.

REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering

Inequality measures such as the Gini coefficient are used to inform and motivate public policymaking, and are increasingly applied to digital platforms.
We analyze how these measures fare in pseudonymous settings, as is common on internet-based or blockchain-based platforms.
One key challenge that arises is the ability of actors to create multiple fake identities under fictitious names, also known as "Sybils."
While some actors may do so to preserve their privacy, we show that this can inadvertently distort inequality metrics.
We prove a set of impossibilities for Sybil-proof measures that simultaneously satisfy subsets of the literature's canonical set of desired properties, and show that a wide range of commonly used measures are indeed sensitive to Sybil manipulations, including the famous Gini coefficient.
We present several classes of Sybil-proof measures, and, by fully characterizing them, we prove that the structure imposed restricts their ability to assess inequality at a fine-grained level.
In addition, we examine which popular inequality metrics are vulnerable to Sybil manipulations and the dynamics that result in the creation of Sybils, whether in pseudonymous settings or traditional ones.

Inequality in the Age of Pseudonymity
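To make the Sybil sensitivity described in the abstract above concrete, the following Python sketch computes the standard Gini coefficient for a small population, then again after the wealthiest actor splits the same holdings across several pseudonymous identities. The wealth figures and the even four-way split are hypothetical values chosen for illustration, not data from the paper.

```python
def gini(values):
    """Gini coefficient via the sorted-rank formula:
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical holdings: three actors, one of whom owns 8 of the 10 units.
honest = [1, 1, 8]

# The same economy after the large holder splits into four Sybils of 2 units each.
with_sybils = [1, 1, 2, 2, 2, 2]

print(round(gini(honest), 4))       # measured inequality before the split
print(round(gini(with_sybils), 4))  # measured inequality after the split
```

Total wealth and true ownership are identical in both lists, yet the measured coefficient falls from 7/15 to 2/15, which is the kind of distortion the abstract refers to.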

Sharpness-Aware Minimization (SAM) has been proven to be an effective optimization technique for improving generalization in overparameterized models. While prior works have explored the implicit regularization of SAM in simple two-core scale-invariant settings, its behavior in more general tensorized or scale-invariant models remains underexplored. In this work, we leverage scale-invariance to analyze the norm dynamics of SAM in general tensorized models. We introduce the notion of Norm Deviation as a global measure of core norm imbalance, and derive its evolution under SAM using gradient flow analysis. We show that SAM's implicit control of Norm Deviation is governed by the covariance between core norms and their gradient magnitudes. Motivated by these findings, we propose a simple yet effective method, Deviation-Aware Scaling (DAS), which explicitly mimics this regularization behavior by scaling core norms in a data-adaptive manner. Our experiments across tensor completion, noisy training, model compression, and parameter-efficient fine-tuning confirm that DAS achieves competitive or improved performance over SAM, while offering reduced computational overhead.

Unpacking the Implicit Norm Dynamics of Sharpness-Aware Minimization in Tensorized Models
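For readers unfamiliar with the optimizer discussed in the abstract above, the sketch below applies the standard two-step SAM update (perturb the weights along the normalized gradient, then descend using the gradient taken at the perturbed point) to a toy two-factor problem. The loss `0.5 * (a*b - 1)^2`, the starting point, and the hyperparameters are assumptions chosen for illustration; the paper's Norm Deviation measure and its DAS method are not reproduced here.

```python
import math


def loss(a, b):
    # Toy scale-invariant objective: only the product a * b matters.
    return 0.5 * (a * b - 1.0) ** 2


def grad(a, b):
    r = a * b - 1.0
    return r * b, r * a


def sam_step(a, b, lr=0.05, rho=0.05):
    """One SAM update: ascend by rho along the unit gradient, then descend
    from the original point using the gradient at the perturbed point."""
    ga, gb = grad(a, b)
    norm = math.hypot(ga, gb)
    if norm == 0.0:
        return a, b  # already at a stationary point
    ea, eb = rho * ga / norm, rho * gb / norm
    ga_adv, gb_adv = grad(a + ea, b + eb)
    return a - lr * ga_adv, b - lr * gb_adv


# Hypothetical, deliberately imbalanced starting point.
a, b = 2.0, 0.1
initial_loss = loss(a, b)
for _ in range(200):
    a, b = sam_step(a, b)
final_loss = loss(a, b)

# a*a - b*b is the gap between the squared norms of the two factors.
print(initial_loss, final_loss, a * a - b * b)
```

Printing the gap between the squared factor norms alongside the loss lets a reader watch how the imbalance evolves on this toy problem under different values of `rho`.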

Cartesian abstractions can flexibly approximate planning tasks to generate admissible heuristic functions. Constrained abstractions use state constraints, such as mutexes, to eliminate parts of the abstraction that cannot belong to solutions for the original problem. While this has been successfully applied to simple forms of abstraction, no previous work has explored how to do this for Cartesian abstractions.

We introduce constrained Cartesian abstractions, which leverage state constraints in multiple ways: to prune spurious transitions and to simplify or even remove abstract states. Moreover, we also use disambiguation to better guide the counterexample-guided process used to generate the abstractions. Our experimental results show that the resulting constrained Cartesian abstractions induce more informed heuristics than their non-constrained counterpart.

Not Everything Is Permitted: Constrained Cartesian Abstractions for Optimal Classical Planning

We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and objects of similar appearance, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model, to enable automated open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for this novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code will be publicly released.

Open-World Object Counting in Videos

Multi-unit bilateral trade refers to the setting in which there is a buyer and a seller, where the seller holds a finite number of units of an indivisible item. An automated mechanism has to decide how many units are transferred from the seller to the buyer and the corresponding payment from the buyer to the seller. Both the buyer and the seller have valuation functions that are either increasing, or increasing and submodular, in the number of units in possession. The (single-unit) bilateral trade problem arises as a particular case.

We study the problem of social welfare maximisation by establishing the fraction (the approximation ratio) of the optimal social welfare that a fixed-price mechanism can recover. Fixed-price mechanisms, understood as per-unit price in the multi-unit setting, have been characterised as the only truthful, individually rational and strongly budget balanced mechanisms by Gerstgrasser et al. (2019) and Hagerty and Rogerson (1987). We narrow the gap on the approximation ratio of optimal fixed-price mechanisms for bilateral trade, which has been shown to lie between $0.72$ and $0.7381$ by Cai and Wu (2023). We show that it must lie between $0.728$ and $0.73805$, which leads to improved bounds on the approximation ratio of optimal fixed-price mechanisms for multi-unit bilateral trade. In particular, we show that multi-unit bilateral trade is at least as hard as single-unit bilateral trade, and obtain several hardness results for different numbers of units.

On the Approximation Ratio of Optimal Fixed-Price Mechanisms for Single and Multi-Unit Bilateral Trade
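The approximation ratio discussed in the abstract above can be illustrated on a tiny single-unit instance. The Python sketch below assumes independent seller and buyer values with small finite supports, lets trade happen at a fixed price `p` exactly when the seller's value is at most `p` and the buyer's value is at least `p`, and compares the best fixed-price welfare with the optimal welfare `E[max(s, b)]`. The distributions and function names are hypothetical; this is a worked example, not the paper's analysis.

```python
from fractions import Fraction
from itertools import product


def expected_welfare(price, seller_dist, buyer_dist):
    """Expected welfare of posting `price`: the item ends with the buyer if
    seller value <= price <= buyer value, otherwise it stays with the seller."""
    total = Fraction(0)
    for (s, ps), (b, pb) in product(seller_dist.items(), buyer_dist.items()):
        trade = s <= price <= b
        total += ps * pb * (b if trade else s)
    return total


def optimal_welfare(seller_dist, buyer_dist):
    """First-best welfare: the item always goes to whoever values it more."""
    return sum(
        (ps * pb * max(s, b)
         for (s, ps), (b, pb) in product(seller_dist.items(), buyer_dist.items())),
        Fraction(0),
    )


# Hypothetical instance: value -> probability.
seller = {1: Fraction(1, 2), 3: Fraction(1, 2)}
buyer = {2: Fraction(1, 2), 4: Fraction(1, 2)}

# With finite supports it is enough to try prices at the support points,
# because every trade that happens has buyer value >= seller value.
candidates = sorted(set(seller) | set(buyer))
best_price = max(candidates, key=lambda p: expected_welfare(p, seller, buyer))
ratio = expected_welfare(best_price, seller, buyer) / optimal_welfare(seller, buyer)
print(best_price, ratio)
```

On this instance no fixed price captures the pair in which the seller values the item at 1 and the buyer at 2 together with the pair valued at 3 and 4 without losing another trade, so the best fixed price recovers 12/13 of the optimal welfare.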

Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. 

We introduce **Latent Self-Consistency (LSC)**, which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture.

Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC, and WUCS on average on both the short-form and the long-form benchmarks, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats.
Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.

Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
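The Self-Consistency (SC) baseline that the abstract above builds on is exact-string majority voting over sampled answers, which the short Python sketch below illustrates. The whitespace and case normalization and the sample answers are assumptions added for the example; the sketch shows only the baseline and not LSC itself, which the abstract describes as relying on learned token embeddings.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent answer and its vote share.
    Answers are compared as strings after light normalization (an assumption)."""
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Hypothetical final answers extracted from four sampled generations.
samples = ["42", " 42", "41", "42 "]
answer, share = self_consistency(samples)
print(answer, share)
```

Exact matching works when answers are short and canonical, but two long-form responses that say the same thing in different words would count as different votes, which is the gap that semantic selection methods aim to close.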

Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility.

Privacy Preserving In-Context-Learning Framework for Large Language Models

Accurate 3D vehicle pose and shape reconstruction from monocular images remains a formidable challenge for autonomous driving, particularly for distant, occluded, or small objects. Existing methods often suffer from geometric ambiguity in depth estimation and structural hollowness in shape recovery, primarily due to inadequate multi-scale feature aggregation and inflexible prior modeling. To overcome these limitations, a novel framework termed MonoVPR is proposed by integrating dynamic context adaptation and progressive geometry refinement. Specifically, a Hierarchical Dual-Context Attention (HDCA) module is introduced to resolve scale-dependent degradation through gated cross-attention across multi-resolution feature maps, dynamically fusing object-centric geometric cues with scene-centric semantics. For shape refinement, the Bounded Iterative Mesh Refiner (BIMR) is developed, where template-guided deformations are progressively optimized via multi-head deformable attention and a tanh-bounded correction loop, ensuring physically plausible reconstructions. Extensive experiments on the ApolloCar3D benchmark demonstrate MonoVPR achieves state-of-the-art performance, showcasing exceptional capability in reconstructing geometrically consistent shapes and precise poses for challenging long-range and occluded scenarios.

