
poster
Basreh or Basra? Geoparsing Historical Locations in the Svoboda Diaries
keywords:
geotagging
alternate names
geoparsing
geocoding
small corpus
digital humanities
named entity linking
named entity recognition
Geoparsing, the task of assigning coordinates to locations extracted from free text, is invaluable for placing locations in time and space. In the historical domain, many geoparsing corpora are drawn from large news collections. We examine the Svoboda Diaries, a small historical corpus written primarily in English, with many location names in transliterated Arabic. We develop a pipeline that uses named entity recognition for geotagging and, for geocoding, a map-based generate-and-rank approach that incorporates name augmentation and clustering of location context words. Our system outperforms existing map-based geoparsers, identifying more locations correctly and achieving the lowest mean distance error. Because location names may differ from those in knowledge bases, we find that augmented candidate generation is instrumental to the system's performance. Among our candidate generation methods, generating translated names contributed the most to increased location matches in the knowledge base. Our main contribution is an integrated pipeline for geoparsing historical corpora using augmented candidate location name generation and clustering methods, an approach that can be generalized to other texts with foreign or non-standard spellings.
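To make the generate-and-rank idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy in-memory gazetteer with approximate coordinates, two invented spelling rules standing in for name augmentation, and a simple "nearest to the centroid of context locations" rule standing in for the clustering step. All names (`GAZETTEER`, `augment`, `generate_candidates`, `geocode`) are hypothetical, and translated-name generation is not covered.

```python
import math

# Hypothetical toy gazetteer: name -> list of (lat, lon). Coordinates are
# approximate and only illustrative; a real system would query a knowledge base.
GAZETTEER = {
    "basra": [(30.51, 47.78)],
    "baghdad": [(33.31, 44.37)],
    "kut": [(32.51, 45.82)],
    "amara": [(31.84, 47.14), (44.62, 27.32)],  # ambiguous: two places share the name
}

def augment(name):
    """Generate spelling variants of a transliterated name (invented rules)."""
    base = name.lower().strip()
    variants = {base}
    if base.endswith("h"):            # assumed rule 1: drop a trailing 'h'
        variants.add(base[:-1])
    for v in list(variants):          # assumed rule 2: normalise 'e' to 'a'
        variants.add(v.replace("e", "a"))
    return variants

def generate_candidates(name):
    """Look up every augmented variant and collect matching gazetteer entries."""
    found = []
    for variant in sorted(augment(name)):
        for lat, lon in GAZETTEER.get(variant, []):
            found.append((variant, lat, lon))
    return found

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def geocode(name, context_names=()):
    """Rank candidates by distance to the centroid of unambiguous context places."""
    candidates = generate_candidates(name)
    if not candidates:
        return None
    anchors = []
    for ctx in context_names:
        ctx_candidates = generate_candidates(ctx)
        if len(ctx_candidates) == 1:  # only trust context names with one match
            anchors.append(ctx_candidates[0][1:])
    if not anchors:
        return candidates[0]          # no usable context: fall back to first match
    centroid = (sum(p[0] for p in anchors) / len(anchors),
                sum(p[1] for p in anchors) / len(anchors))
    return min(candidates, key=lambda c: haversine_km(c[1:], centroid))

if __name__ == "__main__":
    print(geocode("Basreh", ["Baghdad"]))
    print(geocode("Amarah", ["Baghdad", "Kut"]))
```

In this sketch, the diary spelling "Basreh" has no exact gazetteer entry and is matched only through its augmented variant "basra", which mirrors the role the abstract assigns to augmented candidate generation. The ambiguous "Amarah" is resolved toward whichever candidate lies nearest the other places mentioned alongside it.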