AAAI 2026

January 24, 2026

Singapore, Singapore


Scalable oversight protocols aim to empower evaluators to verify the output of AI models more capable than themselves. However, human evaluators are subject to biases that can lead to systematic errors. In a reanalysis of prior work that appeared to demonstrate the efficacy of simple protocols, we find that human evaluators possessing knowledge absent from models likely contributed to the positive results, an advantage that diminishes as models continue to scale in capability. We also report the results of two experiments examining the performance of simple oversight protocols in which evaluators know that the model is "correct most of the time, but not all of the time", finding no overall advantage for the tested protocols. In our main experiment, participants in both groups became more confident in the system's answers after conducting online research, even when those answers were incorrect. These findings underscore the importance of testing the degree to which oversight protocols are robust to evaluator biases, whether they outperform a strategy of simple deference to the model being evaluated, and whether their performance scales with increasing problem difficulty and model capability.


