technical paper
Development of a Global Data Set for Peer Review in Astronomy
keywords:
open science
peer review
artificial intelligence
Objective The great astronomical observatories accept
thousands of proposals per year from astronomers hoping to
receive telescope time. Specifically, the Space Telescope
Science Institute receives approximately 1000 proposals per
year for the Hubble Space Telescope, with this number
projected to double now that the James Webb Space Telescope
has safely launched.2 In astronomy, a Time Allocation Committee
(TAC) reviews all proposals submitted for the use of a
telescope and identifies the appropriate expert to review each
proposal. The goal of the study was to develop a database of
all active astronomers and their publications to assist in
the identification of experts for the peer review of observing
proposals, expanding on work done by Kerzendorf et al1 and
Strolger et al.2
Design This database creation and modeling study expanded
the reviewer pool to astronomers worldwide instead of
relying solely on the TAC’s personal networks. The Semantic
Scholar Open Research Corpus (S2ORC) data set allowed for
the creation of a preliminary database consisting of authors,
their full-text publications, and associated metadata.
Experts for peer review were identified systematically from
each astronomer’s body of work (ie, scientific publications).
An author’s publications and the observing proposal were
represented numerically using machine learning models to
identify astronomers whose expertise most closely matched
the proposal. Various name-based methods for disambiguating
author names were compared. However, authors whose full
names contained more than 3 words were excluded owing to
formatting issues (methods to address this issue are
currently under investigation). A
preliminary prototype using machine learning and natural
language processing models was tested using 918 proposals
from the European Southern Observatory (significant metrics
to evaluate expertise are being researched).
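The matching step described above can be illustrated with a minimal sketch. The abstract does not name the models used in the prototype, so this example substitutes a plain TF-IDF bag-of-words representation with cosine similarity as a stand-in. The function names (`tfidf_vectors`, `cosine`, `rank_reviewers`) and the toy texts are hypothetical and are not taken from the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    # Smoothed IDF: terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in counts.items()} for counts in counts_per_doc]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_reviewers(proposal, author_texts):
    """Rank authors by similarity between the proposal and their publications.

    author_texts maps an author label to the concatenated text of that
    author's publications (an assumption made for this sketch).
    """
    names = list(author_texts)
    vectors = tfidf_vectors([proposal] + [author_texts[n] for n in names])
    proposal_vec = vectors[0]
    scores = [(n, cosine(proposal_vec, vec)) for n, vec in zip(names, vectors[1:])]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
authors = {
    "author_a": "supernova light curves type ia supernova cosmology",
    "author_b": "exoplanet transit atmospheres spectroscopy",
}
ranking = rank_reviewers("spectroscopy of exoplanet atmospheres during transit", authors)
```

In this sketch the highest-ranked author is the candidate reviewer. A production system would likely use richer text representations and would need the evaluation metrics that the abstract notes are still being researched.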
Results The S2ORC data set, which consists of 12 million
full-text publications, was filtered to astronomy
publications using their arXiv identifiers. The database
contains 212,839 publications and a total of 1,801,916
nonunique authors from 1991 to 2020. Three author name
disambiguation algorithms were compared: first initial, all
initials, and a hybrid method.3 The 3 methods were validated
using an initial subset of 1538 ORCID identifiers matched to
astronomers. The contamination rate was defined as the
percentage of validated astronomers whose identity was
compromised by the merging or splitting of names. The
contamination rates of the 3 methods were 1.77%, 15.52%,
and 2.02%, respectively.
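As a rough illustration of the initials-based keys and the contamination rate defined above, the following sketch groups author name strings by a key and counts an ORCID-validated author as contaminated if that author's names split across several keys or share a key with a different ORCID identifier. It assumes names are written as "Given [Middle ...] Surname" and are nonempty. The hybrid method is not shown because its rules are not described here. All function names and records are hypothetical.

```python
from collections import defaultdict


def first_initial_key(name):
    """Key made of the surname plus the first given-name initial."""
    parts = name.split()
    return f"{parts[-1].lower()} {parts[0][0].lower()}"


def all_initials_key(name):
    """Key made of the surname plus the initials of all given names."""
    parts = name.split()
    initials = "".join(p[0].lower() for p in parts[:-1])
    return f"{parts[-1].lower()} {initials}"


def contamination_rate(records, key_fn):
    """Percentage of validated authors whose identity is split or merged.

    records is an iterable of (orcid, name_string) pairs, one per
    publication byline (an assumed input format for this sketch).
    """
    keys_by_orcid = defaultdict(set)
    orcids_by_key = defaultdict(set)
    for orcid, name in records:
        key = key_fn(name)
        keys_by_orcid[orcid].add(key)
        orcids_by_key[key].add(orcid)
    contaminated = 0
    for orcid, keys in keys_by_orcid.items():
        split = len(keys) > 1  # one person spread over several keys
        merged = any(len(orcids_by_key[k]) > 1 for k in keys)  # key shared with others
        if split or merged:
            contaminated += 1
    return 100.0 * contaminated / len(keys_by_orcid)


# Toy records, invented for illustration only.
records = [
    ("0001", "Jane A Smith"),
    ("0001", "Jane Smith"),
    ("0002", "John B Smith"),
    ("0003", "Maria Lopez"),
]
rate_first = contamination_rate(records, first_initial_key)
rate_all = contamination_rate(records, all_initials_key)
```

The toy records only exercise the two failure modes (merging under the first-initial key, splitting under the all-initials key). The resulting rates say nothing about the relative performance reported in the Results.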
Conclusions The developed database has expanded the
possible reviewer pool from the several hundred astronomers
known to the TAC to all active astronomers worldwide. A larger pool of
reviewers allows for more accurate expertise matching.
References
1. Kerzendorf WE, Patat F, Bordelon D, van de Ven G,
Pritchard TA. Distributed peer review enhanced with natural
language processing and machine learning. Nat Astron.
2020;4(7):711-717. doi:10.1038/s41550-020-1038-y
2. Strolger LG, Porter S, Lagerstrom J, Weissman S,
Reid IN, Garcia M. The Proposal Auto-Categorizer and
Manager for time allocation review at Space Telescope
Science Institute. AJ. 2017;153(4):181.
doi:10.3847/1538-3881/aa6112
3. Milojević S. Accuracy of simple, initials-based
methods for author name disambiguation. J Informetrics.
2013;7(4):767-773. doi:10.1016/j.joi.2013.06.006
Conflict of Interest Disclosures None reported.
Funding/Support Funding was received from the Space Telescope
Science Institute–Hubble Space Telescope Science Policies Group.
Role of Funder/Sponsor The funder collaborated with members
from the Space Telescope Science Institute to oversee the
development of the database.