Greetings from! Extracting address information from 100,000 historical picture postcards

:speech_balloon: Speaker: Thomas Smits, Wouter Haverals, Loren Verreyen, Mona Allaert and Mike Kestemont

:classical_building: Affiliation: 1, Antwerp Center for Digital Humanities and Literary Criticism (ACDC), University of Antwerp, Belgium; 2, Institute for the Study of Literature in the Low Countries (ISLN), University of Antwerp, Belgium; 3, Center for Digital Humanities (CDH), Princeton University, USA; 4, Amsterdam School of Historical Studies (ASH), University of Amsterdam, the Netherlands

Title: Greetings from! Extracting address information from 100,000 historical picture postcards

Abstract: This paper details the development and validation of computational methods for creating a comprehensive dataset from a large collection of historical picture postcards. The dataset associated with this research can be accessed at DOI: 10.5281/zenodo.10005566. It is open for everyone to explore and build upon, provided proper attribution to this paper is given. By connecting three distinct locations – the sender’s, the recipient’s, and the depicted – the medium of the picture postcard has contributed to the formation of extensive spatial networks of information exchange. So far, the analysis of these spatial networks has been hampered by the fact that picture postcards are – literally and figuratively – hard to read. Using traditional methods, transcribing and analyzing a sizeable number of postcards would take a lifetime. To address this challenge, this paper presents a pipeline that leverages Computer Vision (CV), Handwritten Text Recognition (HTR), and Large Language Models (LLMs) to extract and disambiguate address information from a collection of 102K historical postcards sent from Belgium, France, Germany, Luxembourg, the Netherlands, and the UK. We report a mAP of 0.94 for the CV model, a character error rate of 7.62% for the HTR model, and the successful extraction of coordinates for 419 of an initial sample of 500 postcards (83.8%) for the LLM. Overall, our pipeline demonstrates reliable address information extraction for a significant proportion of the postcards in our data (with an average distance of 36.95 km between the addresses determined from the HTR output and those determined from the Ground Truth text). Deploying our pipeline on a larger scale, we will be able to reconstruct the spatial networks that the medium of the postcard enabled.
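Two of the figures reported in the abstract, the character error rate and the distance in kilometres between two geocoded addresses, can be illustrated with a minimal sketch. The abstract does not say how either was computed, so this is an assumption rather than the authors' evaluation code: it takes the character error rate to be the Levenshtein edit distance divided by the length of the reference transcription, and measures distance as the great-circle (haversine) distance. The function names and the example coordinates are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


if __name__ == "__main__":
    # Hypothetical example: one wrong character in a nine-character place name.
    print(character_error_rate("Antwerpen", "Antverpen"))
    # Hypothetical example: approximate coordinates of Antwerp and Brussels.
    print(haversine_km(51.2194, 4.4025, 50.8503, 4.3517))
```

Averaging `haversine_km` over all postcards for which both the HTR transcription and the Ground Truth text yield coordinates would give a figure comparable in kind to the 36.95 km reported above, under the assumptions stated here.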

:newspaper: Link to paper