Our new article, “Geographical and linguistic perspectives on developing geoparsers with generic resources”, is now published in the International Journal of Geographical Information Science!
We explore what it takes to create a “geoparser”, a system that finds place names in texts and locates them geographically, for a morphologically complex language such as Finnish. We discuss how the language of the input text affects geoparsing and how the resulting hurdles might be overcome with generic tools, such as the NLP library spaCy. Measuring the performance of geoparsers is crucial for knowing what works and what needs improvement, and for comparing geoparsers with each other. We contribute to this by sharing two annotated corpora of Finnish texts, one of tweets and one of news articles, and by proposing a new evaluation metric.
As an outcome of the study, we have also published a Python library and a pretrained language model for geoparsing Finnish texts. You can find Finger (the FINnish GeoparsER), with installation and usage instructions, in this GitHub repository.
This work was carried out in the MOBICON project, funded by the Kone Foundation.
Abstract:
Geoparsers aim to find place names in unstructured texts and locate them geographically. This process produces georeferenced data usable for spatial analyses or visualisations. Much geoparsing research and development has thus far focused on the English language, yet languages are not alike. Geoparsing them may necessitate language-specific processing steps or data for training geoparsing systems. In this article, we applied generic language and GIS resources to geoparsing Finnish texts. We argue that using generic resources can ease the development of geoparsers, and free up resources to other tasks, such as annotating evaluation corpora. A quantitative evaluation on new human-annotated news and tweet corpora indicates robust overall performance. A systematic analysis of the geoparser output reveals errors and their causes at each processing step. Some of the causes are specific to Finnish, and offer insights to geoparsing other morphologically complex languages as well. Our results highlight how the language of the input text affects geoparsing. Additionally, we argue that toponym resolution metrics based on error distance have limitations, and proposed metrics based on spatial intersection with ground-truth polygons.
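To make the abstract's last point more concrete, here is a minimal Python sketch that contrasts an error-distance measure with a check based on spatial intersection. It is an illustration only, not the metric defined in the article or the code in Finger: the function names, the rectangular "ground-truth" polygon and the coordinates are all hypothetical, and a real evaluation would use actual place geometries and a GIS library.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    r = 6371.0088  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical ground truth: a rough rectangle standing in for a city's extent.
city_polygon = [(24.80, 60.10), (25.25, 60.10), (25.25, 60.30), (24.80, 60.30)]
city_reference_point = (60.20, 25.025)  # (lat, lon), centre of the rectangle

# Hypothetical geoparser output for the same toponym.
predicted = (60.17, 24.94)  # (lat, lon)

# Error distance: penalises the prediction although it lies within the place.
error_km = haversine_km(*predicted, *city_reference_point)

# Intersection-based check: the prediction counts as correct if it falls
# inside the ground-truth polygon.
is_hit = point_in_polygon(predicted[1], predicted[0], city_polygon)

print(f"error distance: {error_km:.1f} km, inside polygon: {is_hit}")
```

In this toy case the predicted point is some kilometres away from the reference point yet clearly within the place itself, which is the kind of situation where a distance-based score and an intersection-based score can disagree.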
Citation:
Leppämäki, T., Toivonen, T., & Hiippala, T. (2024). Geographical and linguistic perspectives on developing geoparsers with generic resources. International Journal of Geographical Information Science, 1–22. https://doi.org/10.1080/13658816.2024.2369539