Speaker: David A. Smith, Jacob Murel, Jonathan Parkes Allen and Matthew Thomas Miller
Affiliation: 1, Khoury College of Computer Sciences, Northeastern University, Boston MA, U.S.A.; 2, Roshan Institute for Persian Studies, University of Maryland, College Park MD, U.S.A.
Title: Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition
Abstract: Handwritten text recognition (HTR) has enabled many researchers to gather textual evidence from the human record. One common training paradigm for HTR is to identify an individual manuscript or coherent collection and to transcribe enough data to achieve acceptable performance on that collection. To build generalized models for Arabic-script manuscripts, perhaps one of the largest textual traditions in the pre-modern world, we need an approach that can improve accuracy on unseen manuscripts and hands without linear growth in the amount of manually annotated data. We propose Automatic Collation for Diversifying Corpora (ACDC), which takes advantage of the existence of multiple manuscripts of popular texts. Starting from an initial HTR model, ACDC automatically detects matching passages of popular texts in noisy HTR output and selects high-quality lines for retraining HTR without any manually annotated data. We demonstrate the effectiveness of this approach to distant supervision by annotating a test set drawn from a diverse collection of 59 Arabic-script manuscripts and a training set of 81 manuscripts of popular texts embedded within a larger corpus. After a few rounds of ACDC retraining, character accuracy rates on the test set increased by 19.6 percentage points (absolute), while a supervised model trained on manually annotated data from the same collection increased accuracy by 15.9%. We analyze the variation in ACDC's performance across books and languages and discuss further applications to collating manuscript families.
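To make the line-selection idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how ACDC detects matching passages or scores line quality. The sketch assumes a simple stand-in, approximate substring matching by edit distance against a known copy of a popular text. The function names (`best_substring_distance`, `select_lines`) and the `max_error_rate` threshold are hypothetical.

```python
def best_substring_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`.

    Uses the standard approximate-matching recurrence: the first row is all
    zeros, so a match may start anywhere in `text` at no cost.
    """
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, 1):
        curr = [i]  # matching i pattern characters against nothing costs i
        for j, tc in enumerate(text, 1):
            cost = 0 if pc == tc else 1
            curr.append(min(prev[j] + 1,          # drop a pattern character
                            curr[j - 1] + 1,      # skip a text character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return min(prev)  # a match may also end anywhere in `text`


def select_lines(htr_lines, witness_text, max_error_rate=0.1):
    """Keep HTR lines that closely match some passage of the witness text.

    `max_error_rate` is a hypothetical threshold on edit distance divided
    by line length; the abstract gives no selection criterion.
    """
    selected = []
    for line in htr_lines:
        if not line:
            continue  # skip empty lines to avoid dividing by zero
        rate = best_substring_distance(line, witness_text) / len(line)
        if rate <= max_error_rate:
            selected.append(line)
    return selected


if __name__ == "__main__":
    witness = "the quick brown fox jumps over the lazy dog"
    noisy = ["quick brwn fox", "zzzz qqqq", "over the lazy"]
    print(select_lines(noisy, witness))
```

In this toy setting, lines that align well with the known text are kept as candidates for retraining, and lines with no close match are discarded. The sketch stops at selection; how the kept lines are turned into training labels is outside its scope.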