poster
Counterfactual Evaluation of Peer Review Assignment Strategies in Computer Science and Artificial Intelligence
keywords:
peer review
statistics
artificial intelligence
Objective Artificial intelligence (AI) is now widely used to
assign reviewers to papers.1
The assignment relies on 3 key
sources of data1: (1) AI-computed similarities between the
text of the submitted paper and reviewers’ past articles, (2)
reviewer-provided preferences expressing which papers they
would like to review, and (3) overlap between the paper’s
topics as specified by authors and reviewers’ self-reported
areas of expertise. However, it is unknown which of these
sources, or which combination thereof, leads to the best
reviewer assignment outcomes.
Design To assign reviewers to papers, 2 venues recently used
randomized algorithms2 designed to combat fraud: the 2021
Theory and Practice of Differential Privacy (TPDP) Workshop
with 35 reviewers and 95 full papers and the Association for
the Advancement of Artificial Intelligence (AAAI) 2022
Conference on Artificial Intelligence with
3145 reviewers and 8450 full papers. To compute overall
similarities between each reviewer-paper pair, TPDP
weighted the AI-computed text similarities by wtext (range,
0-1) and reviewers’ preferences by 1 − wtext; AAAI weighted
the AI-computed text similarities by wtext (range, 0-1) and the
overlap between the paper’s topics and the reviewers’ topical
areas by 1 − wtext (reviewers’ preferences were also used by
AAAI but were not considered in this study). The randomized
assignment2 then maximized the total
similarity of the assigned reviewer-paper pairs, subject to the
constraint that no reviewer was assigned to any given paper
with probability greater than 0.5 in TPDP or 0.52 in AAAI. In
this study, the
randomization in the assignment was leveraged to estimate
the counterfactual quality of alternative assignment
strategies. Specifically, the effect on the overall quality of the
reviewer-paper assignment was evaluated for (1) introducing
randomness into the assignment process and (2) varying the
weights of the different sources of information. The quality of
each counterfactual reviewer-paper assignment was measured
using reviewers’ self-reported expertise and confidence in
their review.
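To illustrate the approach, the following minimal Python sketch shows how an overall similarity could be formed from two weighted sources and how the assignment probabilities of a randomized assignment could be reused to estimate the quality an alternative strategy would have achieved. The function names, data layout, and the inverse-probability-weighting estimator are illustrative assumptions for exposition, not the authors’ actual implementation.
# Illustrative sketch only; names and the inverse-probability-weighting
# estimator are assumptions, not the study's actual method.
def combined_similarity(s_text, s_other, w_text):
    # Overall similarity: weight w_text on the AI-computed text similarity
    # and 1 - w_text on the second source (reviewer preferences for TPDP,
    # reviewer-paper topic overlap for AAAI).
    return w_text * s_text + (1.0 - w_text) * s_other

def counterfactual_quality(assigned_pairs, quality, p_deployed, p_alternative):
    # Reweight the quality observed for each assigned reviewer-paper pair
    # (e.g., the reviewer's self-reported expertise) by the ratio of its
    # assignment probability under the alternative strategy to its
    # probability under the deployed randomized strategy. This is possible
    # because the deployed assignment capped probabilities below 1.
    total = sum(quality[r, p] * p_alternative[r, p] / p_deployed[r, p]
                for r, p in assigned_pairs)
    # Average per observed assignment; assuming the alternative strategy
    # makes the same total number of assignments, this estimates its
    # average per-assignment quality.
    return total / len(assigned_pairs)
For example, reviews collected under a deployed TPDP assignment with wtext = 0.5 and probabilities capped at 0.5 could, under these assumptions, be reweighted to estimate the quality of a wtext = 0.8 assignment without rerunning the review process.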
Results The results are tabulated in Table 26.3 First,
introducing randomness by limiting the probability of any
reviewer-paper assignment led to a marginal reduction in
assignment quality for TPDP and a slightly larger reduction for
AAAI. Second, for TPDP, placing more weight on the AI-
computed text similarities (wtext = 0.8) instead of equally
weighting the text similarities and the reviewers’ preferences
(wtext = 0.5) resulted in a higher reviewer-paper assignment
quality. Third, for AAAI, placing more weight on the AI-
computed text similarities (wtext = 0.75) instead of equally
weighting the text similarity and the reviewer-paper topical
area overlap (wtext = 0.5) led to a similar assignment quality.
Conclusions In addition to its original purpose of mitigating
fraud, randomness in reviewer assignments can help improve
AI-based automated assignment by enabling counterfactual
analysis of alternative assignment strategies, at the cost of a
small reduction in assignment quality.
References
1. Shah N. Challenges, experiments, and computational
solutions in peer review. Commun ACM. 2022;65(6):76-87.
doi:10.1145/3528086
2. Jecmen S, Zhang H, Liu R, Shah N, Conitzer V, Fang F.
Mitigating manipulation in peer review via randomized
reviewer assignments. Adv Neural Inf Process Syst.
2020;33:12533-12545.
3. Imbens GW, Manski CF. Confidence intervals for partially
identified parameters. Econometrica. 2004;72(6):1845-1857.
doi:10.1111/j.1468-0262.2004.00555.x
Conflict of Interest Disclosures None reported.
Funding/Support This work was supported by the US National
Science Foundation CAREER award (1942124), which supports
research on the fundamentals of learning from people with
applications to peer review.
Acknowledgments We thank Gautam Kamath and Rachel
Cummings for allowing us to conduct this study in TPDP and Melisa
Bok and Celeste Martinez Gomez from OpenReview.net for helping
with the APIs of OpenReview.net.