Corpora Obscura

folgert · April 19, 2020, 8:10am

In my research into cultural change, it is important not to focus too much on a few
canonized data sets. That is why I am always looking for interesting, new or forgotten,
relatively unknown data sets. On my recent travels around the Internet, I collected the
following list. I call it “Corpora Obscura”:

Auction Catalogs - Collection of an estimated 30,000 auction catalogs published from 1744 to the present by auction houses around the world.
Amsterdam in WW II - Data set about where and when bombs were dropped on Amsterdam in the 1940-1945 period.
Bigfoot Sightings - Full text and geocoded sighting reports from the Bigfoot Field Researchers Organization (BFRO).
Coin Production in the Low Countries - Datasets on coin production in the Southern and Northern Low Countries (present-day Netherlands, Belgium and Luxembourg) that were compiled by various historians over the past few decades.
Comic book route - Location of comic book walls of the City of Brussels (with characters and authors).
Convict Tattoo Descriptions - Descriptions of tattoos of at least 60,000 convicts in the Old Bailey proceedings.
Danish Folklore Fieldtrips - Historical GIS data of the Danish folklore collection of Evald Tang Kristensen (1843-1929).
Death Row Last Statements - Dataset of the Texas (US) Department of Criminal Justice with Last Statements of people on Death Row.
English Jokes - A dataset of 200k English plaintext jokes.
Pudding data - Data sets created for stories on The Pudding, open to the public.
Red Riding Hood - A corpus of more than 400 Dutch retellings of “Little Red Riding Hood” (1780-2015). (This one is mine, actually.)
South Park Script Data - CSV files containing script information including: season, episode, character, & line.
The Paper Chain Letter Archive - Transcriptions of paper chain letters from a long historical period.
The Tate Collection - Metadata for around 70,000 artworks housed by the Tate galleries.
UFO Sightings - 80,000 UFO sighting reports for approximately a century of data.
Vincent van Gogh: the Letters - All the surviving letters written and received by Vincent van Gogh (1853-1890) in XML.
Witch Trials - Data set on witch trials in Europe.
Women in Film - Data collection on how women are portrayed in film.

Maybe there is something for you here, or perhaps you have some interesting and
preferably obscure additions?

l.fonteyn · April 22, 2020, 9:49am

This is great!

Perhaps nice to mention Nini’s Ripper letters corpus here:

“This corpus consists of the letters or postcards found and transcribed in the Appendix of Evans and Skinner (2001), who claim to have collected all of the texts involved in the Whitechapel murders related to Jack the Ripper from the Metropolitan Police files. These letters were OCR-scanned from the book and the scans were manually checked for scanning errors. The corpus consists of 209 texts and 17,463 word tokens. The average length of a text in the corpus is of eighty-three tokens (min = 7, max = 648, SD = 67.4).”

Obviously, not all letters were written by Jack the Ripper (imagine writing 209 letters in a year’s time!), if there ever even was such a person, as Nini (2018) explains.

melvin.wevers · April 22, 2020, 10:35am

Pitchfork reviews - A dataset with ~18K music reviews and grades from popular music website Pitchfork. There’s also an unofficial API to scrape new reviews.

folgert · April 23, 2020, 6:19am

Nice. Thanks! As the list grows, we might think about turning it into some kind of (awesome) wiki.

alberto.acerbi · April 28, 2020, 12:15pm

I stumbled upon this collection a few weeks ago. Not all datasets are suitable, but there may be something interesting. If someone has time to flag the good ones, that would be great!

folgert · April 28, 2020, 12:28pm

That looks nice. I’ll look into it!

ash · April 28, 2020, 2:55pm

TidyTuesday datasets are published weekly and are mainly for practice in dataviz with R, but there are some fascinating ones (from injuries in US theme-parks to anime!)

mjlavin80 · April 29, 2020, 5:49pm

Hey all, please excuse the self-promotion, but I’ve been maintaining a tagged/searchable collection of links to open datasets for the humanities at https://humanitiesdata.com for a few years now, and you might find it interesting. I’m usually looking for things to include as well. Some of these are new to me, so I’ll gladly index them if no one objects. If anyone wants to collaborate, I’m considering adding a feature to the site called “collections.” These would be smaller groups of datasets by topic, with a brief introductory narrative. Comment or message me if you’re interested.

l.fonteyn · June 9, 2020, 2:56pm

No one should be embarrassed about self-promotion with a contribution this awesome!!

thomas.haider · June 27, 2020, 9:33pm

The literotica corpus with over 110,000 texts of indecent fanfiction that I crawled some time back: https://github.com/tnhaider/literotica-corpus

folgert · September 20, 2020, 7:32am

Found another nice one: all datasets used by pudding.cool: https://github.com/the-pudding/data

Some highlights:

Skate music dataset: https://github.com/the-pudding/data/tree/master/skate-music
The hipster summer reading list: https://github.com/the-pudding/data/tree/master/summer-reading
Colorism in High Fashion (Vogue magazine): https://github.com/the-pudding/data/tree/master/vogue

simon · September 30, 2020, 1:16pm

This article (The Good, the Rad, and the Gnarly) and its accompanying dataset are VERY cool.

melvin.wevers · November 11, 2020, 9:57am

For people working on computer vision, the following databases might be of interest.

I’m thinking of turning the latter into a training set for visual analysis of tobacco advertising through the years. If anyone is up for collaborating on creating this dataset and/or on this topic, please let me know.

zacharykstine · November 25, 2020, 3:10pm

Erowid hosts a database of people’s experiences with a variety of psychoactive substances, which could be of interest to somebody. Some interesting analyses of this data can be seen here. They don’t make the data available in an easily downloaded format, but they seem willing to share it for academic research if you contact them for permission and cite them appropriately. Quote from here on data usage policy:

The reports in Erowid’s Experience Vaults are copyrighted by Erowid Center. Authors have permission to use their own reports as they wish. Researchers and authors may NOT “mine”, distill, or use aggregate data from Experience Vaults without prior permission. Publishing data analysis (in journals, books, or articles) without the prior permission of Erowid Center is a violation of the usage agreement of this website. Please contact us at copyrights@erowid.org to discuss projects and crediting requirements. We generally agree to such use, but misinterpretation of experience report data and improper citation and credit of Erowid in most peer-reviewed articles that use our data has let us to take this step. Permission is required before conducting or publishing data analysis of Erowid’s experience report collection.

folgert · November 25, 2020, 5:29pm

This reminds me of the dream database, @antalvdb