Corralling corpora and code

ryan.heuser · March 11, 2021, 11:38am

Hi everyone. Following this discussion on Twitter about the difficulty of finding literary corpora for specific purposes, I wanted to clean up a catalogue of corpora I’ve been using and working on for a while.

It’s now here on GitHub: 32 corpora, most of which have free or downloadable components. Some have components locked behind institutional or other paywalls; but, provided you find your own access to those data, there’s code here to wrangle them into more pliable forms.

There’s also code for using each corpus:

#install: pip install -qU git+https://github.com/quadrismegistus/lltk
import lltk

# load a corpus (will automatically download if necessary)
C = lltk.load('TxtLab')

# get metadata
df = C.meta

# get common data, e.g. doc-term matrix of top 1000 nouns
dtm = C.dtm(n=1000, only_pos={'n*'})

The idea is kind of like NLTK’s: corpora on LLTK can live there in standardized form for easy and quick access anywhere (e.g. through Colab, etc). Custom or private corpora also work.
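
For readers new to the term, here is a minimal sketch of what a doc-term matrix over the top n words looks like. It is not how LLTK computes `C.dtm`: the function name `doc_term_matrix` is hypothetical, tokenization is plain whitespace splitting, and there is no part-of-speech filtering like the `only_pos` argument in the snippet above.

```python
from collections import Counter

def doc_term_matrix(docs, n=1000):
    """Return (vocab, rows): the n most frequent words across all documents,
    and, per document, the count of each of those words."""
    # Per-document word counts (naive lowercase + whitespace tokenization).
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

    # Corpus-wide counts decide which words make it into the vocabulary.
    total = Counter()
    for doc_counts in counts.values():
        total.update(doc_counts)
    vocab = [word for word, _ in total.most_common(n)]

    # One row per document, one column per vocabulary word.
    rows = {doc_id: [doc_counts[w] for w in vocab] for doc_id, doc_counts in counts.items()}
    return vocab, rows

# Toy documents, invented for illustration only.
docs = {
    "doc_a": "the whale the whale the sea",
    "doc_b": "the house the whale",
}
vocab, rows = doc_term_matrix(docs, n=2)
```

Each row can then be compared across documents, which is what makes a matrix like this a convenient starting point for most counting-based measurements.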

I’d also love to talk more about how we can connect these and other corpora together through Linked Open Data (to e.g. Wikidata).
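
As one possible starting point for that linking, here is a rough sketch that builds a lookup URL for the public Wikidata SPARQL endpoint, matching a work by its exact label and requiring an author statement (`P50`). The function name `wikidata_title_query_url` is hypothetical, exact label matching will miss many titles, and nothing here is part of LLTK. The sketch only builds the URL; it does not send the request.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def wikidata_title_query_url(title, lang="en", limit=5):
    """Build a query URL for items whose label equals `title` and that have an author."""
    # Escape backslashes and double quotes so the title is a valid SPARQL string literal.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    query = (
        "SELECT ?work ?author WHERE { "
        f'?work rdfs:label "{escaped}"@{lang} ; '
        "wdt:P50 ?author . "
        f"}} LIMIT {int(limit)}"
    )
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

url = wikidata_title_query_url("Middlemarch")
```

Fetching `url` with any HTTP client should return JSON bindings for `?work` and `?author`, which could then be stored alongside a corpus's metadata.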

I’m posting this not as a finished tool but so I can stop procrastinating on it and go back to some other stuff I need to do!

All best,
Ryan

folgert · March 12, 2021, 10:46am

This is great! Thanks for sharing

fpianz · March 15, 2021, 9:02pm

Amazing! Thanks for sharing.

I’m very interested in making them LOD. I’m actually working on linking various databases of metadata about fiction… and I hope to eventually close the loop and link them to Wikidata.

There’s also a project about a Computational Literary Studies infrastructure, but I don’t know the details. Maybe some other people here can help: @mike.kestemont @ash

ash · March 22, 2021, 5:45pm

Not sure about LOD within the project (there are definitely some deliverables revolving around corpus preparation), but I’ll just mention the ELTeC corpora here (~100 novels per European tradition, 19th to mid-20th c.). They have vague genre annotation rules, but they may be integrated into the greater scheme of things at some point.

ryan.heuser · March 23, 2021, 11:10am

Cool, thanks for sharing! When I have time, and if the ELTeC project contributors don’t mind, I’ll try to fuse the various corpora into a single LLTK corpus. Though it would obv be difficult to do a lot of usual NLP stuff on so many languages, it’d still be doable in a lot of cases, and then there are non-language-specific questions and measurements to explore as well.

ash · March 30, 2021, 2:05pm

Just checked your package, Ryan, and wow, this is some good stuff!
I have been getting “KeyError” for some of the collections, though (Hathi, Gale and others). Is that expected?
(Running the library from a cloud notebook)
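
A quick way to narrow down a report like this is to try every corpus name in turn and record which ones raise. The sketch below is only an illustration: `try_load_all` and `fake_loader` are hypothetical names, and the stand-in loader exists so the example runs without LLTK installed. In a notebook, the `lltk.load` call from the first post could be passed as `loader` instead.

```python
def try_load_all(names, loader):
    """Call loader(name) for each name; return (loaded, failed) dictionaries."""
    loaded, failed = {}, {}
    for name in names:
        try:
            loaded[name] = loader(name)
        except KeyError as err:
            # Keep the error text so the failing key can be reported.
            failed[name] = repr(err)
    return loaded, failed

# Stand-in loader (an assumption for this example): only knows one corpus.
def fake_loader(name):
    available = {"TxtLab": "<TxtLab corpus>"}
    return available[name]

loaded, failed = try_load_all(["TxtLab", "Hathi"], fake_loader)
for name, error in failed.items():
    print(f"{name}: {error}")
```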

ryan.heuser · March 31, 2021, 8:57pm

Cool, thanks for testing this out @ash! That was indeed a bug. I’ve spent some time cleaning things up and I think I’ve squashed it. You can test it out here.

The only exception is the “Hathi” corpus, which isn’t really possible to use on its own. (Its metadata file is 1GB zipped!) It’s really just a shell corpus for others (HathiNovels, HathiTales, etc.), which begin by searching through the giant Hathi metadata file to find the texts to start with. I’m going to try to make all that cleaner and friendlier to use soon, though.
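
To illustrate the general idea of pulling a sub-corpus out of one very large metadata file, here is a sketch that streams a compressed file row by row instead of loading it into memory. It is not LLTK's implementation: the gzip-compressed TSV format, the column names (`id`, `genre`, `title`) and the function name `iter_matching_rows` are all assumptions made for the example.

```python
import csv
import gzip
import os
import tempfile

def iter_matching_rows(path, column, keep):
    """Yield rows of a gzip-compressed TSV whose `column` value passes `keep`."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Missing values become "" so `keep` always receives a string.
            if keep(row.get(column) or ""):
                yield row

# Tiny invented file standing in for the real (much larger) metadata.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "meta.tsv.gz")
    with gzip.open(path, "wt", encoding="utf-8", newline="") as out:
        out.write("id\tgenre\ttitle\nh1\tnovel\tA\nh2\tsermon\tB\nh3\tnovel\tC\n")
    novels = [row["id"] for row in iter_matching_rows(path, "genre", lambda g: g == "novel")]
```

Because the rows are yielded one at a time, memory use stays roughly constant however large the file is; the cost is a full pass over the file for each new filter.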