poster
Global Gender Estimation From Distribution of First Names
keywords:
diversity and inclusion
publication
bias
Objective By construction, current methods of gender
estimation portray gender-skewed populations as more
gender-balanced than they truly are.1,2 This systematic bias
always understates underrepresentation, in which one gender
makes up less than 50% of the population.
A global method to estimate the gender composition of a
population from correlations with first names was introduced
that is free of systematic errors.3 The method will improve our
understanding of the review process and enhance analytics
tools used in science.
Design Determining gender composition of a group from
first names requires prior knowledge of name-gender
correlations from a reference population. Current gender-
estimation methods assume that name-gender conditional
probabilities can be directly transferred from a reference
population to a target population. This strong assumption
means that one population must be a fair sample of the other,
particularly in gender composition, implying that
conventional methods will fail for strong gender asymmetry.
A global gender estimator method (gGEM) was derived that
instead quantifies how reference conditional probabilities
must transform to best describe the observed list of names.
The transformation, based on a process that morphs one
population into another and seeks a self-consistent solution
using the complete list of names, frees the estimation process
from the fair-sampling assumption while also quantifying the
strength of the otherwise hidden gender-dependent social
process. Public data containing more than 200,000 names
from 3 countries (40% from the US, 35% from Brazil, and
25% from France) were used as reference populations, from
which prescribed fractions of men or women were removed to
construct test populations of various gender compositions.
The estimation method was compared with conventional
approaches using these well-controlled test populations.
A limitation is that the method is only as accurate as the
name-gender correlations provided by the reference
data.
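The contrast between direct transfer and a self-consistent estimate can be illustrated with a minimal sketch. It is not the authors' gGEM implementation, whose transformation is described here only qualitatively. It assumes that the name distribution within each gender is the same in the reference and target populations, and it substitutes a standard expectation-maximization iteration for the mixing fraction. The reference table and all function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy reference data: (women, men) counts per first name.
REFERENCE = {
    "maria": (980, 20),
    "ana": (950, 50),
    "alex": (300, 700),
    "jean": (100, 900),
    "john": (10, 990),
}


def likelihoods(reference):
    """Return {name: (P(name | woman), P(name | man))} from reference counts."""
    total_w = sum(w for w, _ in reference.values())
    total_m = sum(m for _, m in reference.values())
    return {n: (w / total_w, m / total_m) for n, (w, m) in reference.items()}


def direct_transfer_estimate(names, reference):
    """Conventional approach: average the reference P(woman | name) over the list."""
    return sum(reference[n][0] / sum(reference[n]) for n in names) / len(names)


def self_consistent_estimate(names, reference, iterations=200):
    """Iterate the women fraction f until the list of names reproduces it.

    Assumption: P(name | gender) transfers from reference to target;
    only the gender mix f is allowed to differ.
    """
    lik = likelihoods(reference)
    counts = Counter(names)
    total = sum(counts.values())
    f = 0.5  # starting guess
    for _ in range(iterations):
        # Re-weight each name's P(woman | name) using the current f, then average.
        f = sum(
            c * f * lik[n][0] / (f * lik[n][0] + (1 - f) * lik[n][1])
            for n, c in counts.items()
        ) / total
    return f


def sample_population(size, women_fraction, reference, seed=0):
    """Draw a test population with a prescribed fraction of women."""
    rng = random.Random(seed)
    names = list(reference)
    n_women = round(size * women_fraction)
    women = rng.choices(names, weights=[reference[n][0] for n in names], k=n_women)
    men = rng.choices(names, weights=[reference[n][1] for n in names], k=size - n_women)
    return women + men


population = sample_population(20000, 0.05, REFERENCE)
print("direct transfer:", direct_transfer_estimate(population, REFERENCE))
print("self-consistent:", self_consistent_estimate(population, REFERENCE))
```

With this toy table and a test population of 5% women, the direct-transfer estimate lands well above 5%, while the self-consistent estimate stays close to it.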
Results gGEM provided accurate estimates irrespective of
gender composition. It was observed that previous methods
produced estimates that deviated linearly from the correct
values as the gender mix deviated from gender balance. In the
extreme case of a highly skewed test population composed of
1% women (correctly estimated by gGEM), previous methods
estimated a prevalence of women of 3% or 2%, depending on
whether names with unclear gender were included or excluded,
respectively, a systematic error of at least 100% of the correct
prevalence. gGEM showed no observable systematic error for
any gender mix tested. Typically, conventional methods
incur a systematic inaccuracy that grows quickly once the
underrepresented gender falls below 20% of the
population.
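A short derivation suggests why the deviation is linear. It is a sketch under two assumptions: the conventional estimate is the average of the reference probabilities over the target names (the variant that keeps names with unclear gender), and the name distribution within each gender is shared by the reference and target populations. The symbols $f$, $f_{\mathrm{ref}}$, $a$, and $b$ are introduced here for illustration and do not come from the original work.

```latex
% f: true fraction of women in the target; f_ref: fraction of women in the reference.
% P(n|w), P(n|m): name distributions within each gender (assumed shared).
\begin{align}
  a &= \sum_n P(n \mid w)\, P_{\mathrm{ref}}(w \mid n), &
  b &= \sum_n P(n \mid m)\, P_{\mathrm{ref}}(w \mid n), \\
  \mathbb{E}\big[\hat f_{\mathrm{conv}}\big] &= f\,a + (1 - f)\,b. \\
\intertext{Averaging $P_{\mathrm{ref}}(w \mid n)$ over the reference itself returns
$f_{\mathrm{ref}}$, so $f_{\mathrm{ref}}\,a + (1 - f_{\mathrm{ref}})\,b = f_{\mathrm{ref}}$, and therefore}
  \mathbb{E}\big[\hat f_{\mathrm{conv}}\big] - f &= (1 - a + b)\,\big(f_{\mathrm{ref}} - f\big).
\end{align}
```

Under these assumptions the expected error vanishes only when the target has the same gender composition as the reference, and it grows in proportion to the distance from that composition.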
Conclusions When estimating the gender profile from first
names, the global estimation method proposed here, which is
easily implemented, should become the method of choice.
Furthermore, it is argued that merging available reference
populations with little overlap is a good strategy to mitigate
errors stemming from population mismatching.
References
1. Ross CO, Gupta A, Mehrabi N, Muric G, Lerman K. The
leaky pipeline in physics publishing. arXiv. Preprint posted
online October 18, 2020. doi:10.48550/arXiv.2010.08912
2. Squazzoni F, Bravo G, Farjam M, et al. Peer review and
gender bias: a study on 145 scholarly journals. Sci Adv.
2021;7(2):eabd0299. doi:10.1126/sciadv.abd0299
3. gGEM. Home page. https://www.ggem.app
Additional Information Alessandro S. Villar and Hugues Chaté
are co-corresponding authors.