Querying the Past: Automatic Source Attribution with Language Models

:speech_balloon: Speakers: Ryan Muther, Mathew Barber and David Smith

:classical_building: Affiliation: 1, Northeastern University, 360 Huntington Ave, Boston, MA 02115, USA; 2, Aga Khan University Institute for the Study of Muslim Civilisations, 10 Handyside St, London, GVQG 23, UK

Title: Querying the Past: Automatic Source Attribution with Language Models

Abstract: This paper explores new methods for locating the sources used to write a text by fine-tuning a variety of language models to rerank candidate sources. These methods promise to shed new light on traditions with complex citational practices, such as medieval Arabic writing, where citations are ambiguous and the boundaries of quotation are poorly defined. After retrieving candidate sources with a baseline BM25 retrieval model, we test a variety of reranking methods to see how effective they are at source attribution. We conduct experiments on two datasets, English Wikipedia and medieval Arabic historical writing, and employ a variety of retrieval- and generation-based reranking models. In particular, we seek to understand how the degree of supervision required affects the performance of various reranking models. We find that semi-supervised methods can be nearly as effective as fully supervised methods while avoiding potentially costly span-level annotation of the target and source documents.
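
For readers unfamiliar with the retrieve-then-rerank setup the abstract describes, the following is a minimal sketch and not the authors' implementation. It scores documents with the standard Okapi BM25 formula, keeps the top candidates, and reorders them with a second-stage scorer. The function names, the whitespace tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), and the toy `overlap_scorer` are all assumptions made for illustration; in the paper the second stage is a fine-tuned language model, for which the overlap scorer is only a stand-in.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, docs, top_k=2):
    """First stage: return indices of the top_k documents by BM25 score."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def overlap_scorer(query, doc):
    # Hypothetical stand-in for a fine-tuned reranking model:
    # Jaccard overlap between the query and document token sets.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if (q | d) else 0.0


def rerank(query, docs, candidates, scorer):
    """Second stage: reorder candidate indices by the given scorer."""
    return sorted(candidates, key=lambda i: scorer(query, docs[i]), reverse=True)
```

A call such as `rerank(query, docs, retrieve(query, docs), overlap_scorer)` chains the two stages. Keeping the scorer as a plain function argument means a learned model can be swapped in without touching the retrieval step.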

:newspaper: Link to paper