Scalable oversight protocols aim to empower evaluators to verify the output of AI models more capable than themselves. However, human evaluators are subject to biases that can lead to systematic errors. In a reanalysis of prior work that appeared to demonstrate the efficacy of simple protocols, we find that human evaluators possessing knowledge absent from models likely contributed to the positive results, an advantage that diminishes as models continue to scale in capability. We also report the results of two experiments examining the performance of simple oversight protocols where evaluators know that the model is "correct most of the time, but not all of the time", finding no overall advantage for the tested protocols. In our main experiment, participants in both groups became more confident in the system's answers after conducting online research, even when those answers were incorrect. These findings underscore the importance of testing the degree to which oversight protocols are robust to evaluator biases, whether they outperform a strategy of simple deference to the model being evaluated, and whether their performance scales with increasing problem difficulty and model capability.
