Is there such a thing as Humanities Data?

In a topic on imbalanced training data, @Nanne raised the following point: “This is maybe naive, but whats ‘humanities data’? Initially I was like ‘oh right’, but if I think about it I am not sure if there is such a thing…”

I answered: “I guess what I mean is not so much humanities data, but real-life data, other than gold standard datasets. The latter is often not very representative, too clean, and contains often almost no temporal information. As in it’s heavily biased towards particular periods and sources. I’ll open a discussion, where we can dive into this a bit deeper.”

However, I think @Nanne raised a valuable point that deserves further discussion.


Well, I guess there is data typically discussed/studied by Humanities scholars? The way I see it is a bit like prototype theory and family resemblance: the things we often see as humanities data are connected by a number of features, but no feature is necessarily common to all datasets.

In some contexts “Humanities Data” is simply used as a shorthand for “Data in Humanities research”, at least that’s how it’s used in the title of the Journal of Open Humanities Data (with the added “openness” of course). I agree that they are real-life data, but people may feel the need to add the “Humanities” label to differentiate them from the standard datasets used in computer science/computational linguistics. Maybe there will be a point in time when we won’t feel the need to specify the label “Humanities”, but that will be when we’ll have reached some sort of normalization.

The term ‘real-life’ becomes ambiguous when working with, for example, fiction or visual art data imho, as I’m not sure a dataset of paintings is more ‘real-life’ than ImageNet. Would curated and uncurated be accurate terms to capture this distinction?

I would say that it depends on the goal with which something is curated. If you work with a list of bestsellers curated by The Guardian, for example, or a list of artworks by the MoMA, that is a different type of curation than, for example, ImageNet, or any other visual dataset. In the latter, the selection criteria for inclusion and exclusion are often driven by technological aims, whereas in the former they are driven by cultural/humanistic inclinations. An obvious example is the lack of cultural diversity in many image datasets. This, however, does not mean that there are no technological considerations in data curated for humanistic purposes.

I guess in the end it boils down to the clash between the ambiguity of categories and clear-cut categorization, and to how we actually deal with this ambiguity when modelling.

I guess this is also what @folgert alluded to when he mentioned family resemblance theory.


I would say there are humanities questions in the first place, not necessarily “humanities” data. And these questions are usually unsolvable with tech/engineering benchmark datasets, so we dive deep, as mentioned, into all kinds of historical uncertainty and imbalance. No data are neutral, of course, but the types of questions we ask, I believe, actively resist “normalizing” the world around us?


The question “what is humanities data?” will inevitably lead us to the rabbit hole called “what are humanities?” :smile: And that one is a great mystery of academia.

But, jokes aside, what humanities data is or isn’t largely depends – I think – on the disciplinary borders of the humanities and on the needs and interests of the humanities scholars (defined as people in the departments of history, literature, and others). Because, in principle, art can be (and is) studied by physicists, music – by psychologists, history – by biologists. Sometimes, I’m afraid, more successfully than by humanists…

Take this as a pragmatic definition.

Frankly, I would rather speak of “cultural data”. It is less tied to the disciplinary preferences and more – to the actual subject.


My inclusion criterion is any open data that I think is likely to be of interest to humanities scholars based on disciplinary tie-ins, past use, and sometimes methodological connections (network analysis data, geospatial data, etc.). It’s a broad response, but it works just fine in practice. That said, if I were a team of 50+, my approach would probably collapse in on itself.

Here’s the transcript of a lightning talk I gave on what counts as evidence in the humanities back in 2016. It’s not quite the same thing as what counts as data, but it’s closely related. A thinly edited summation of the main argument as it applies to data is as follows:

Accordingly, I would argue that there is no a priori, metaphysical, or ontological quality of [any data type] that makes [it] count (or not count) in the humanities. Rather, it is how [data] are articulated into the ecology of humanistic argument that allows one to assess not whether they count, but when they count and when they do not count.

Although we typically consider certain cultural forms as humanities data, how data fold into argument is really the critical litmus test. Medical humanists, for example, use all sorts of ostensibly medical data to make humanistic arguments about health justice: the lack of women in RCTs, disproportionate amounts of clinical testing in colonized nations, etc. None of these are “cultural forms,” but they are central to humanist arguments and, as such, become humanities data.


A naive and simple way to interpret this could be “data used for humanities research”. Then again, I think all data are just data. They are neutral and can be used for any research. For example, a social media dataset can be used for research in many fields such as computer science, communications, sociology, linguistics, etc. However, we don’t usually call this dataset computer science data or sociology data, but simply say that the social media dataset is used in our research. The notable difference is perhaps the way different research fields process and analyze this dataset. :slight_smile:


There are some differences that often distinguish the data at the focus of humanities scholars’ work from data in the natural sciences. One important aspect seems to be that humanists are dealing with artifacts, which are often created with a more or less specific intention, so this adds another layer to all the other, more physical aspects. Many artifacts in the humanities are signs (or have been used at some point as signs) or even complex structures of many signs. Not to forget, they have a history and many of them are historical objects. And last but not least, a lot of these artifacts which are signs and which have a history also have a distinct aesthetic quality. I wouldn’t want to say that all humanities data are like this, but I think the prototypical data has 3 or even 4 of these properties.


I would first draw the line between literary and linguistic data.

An example is data on rhyming vs. cognates. Having pairs of words that rhyme is literary data, while cognates would be linguistic data. Both datasets allow you to determine the similarity of sounds (e.g. vowel similarity in imperfect rhyme), and how these similarities changed over time. Both can be used to reconstruct pronunciation and linguistic typology:

List, J. M., Pathmanathan, J. S., Hill, N. W., Bapteste, E., & Lopez, P. (2017). Vowel purity and rhyme evidence in Old Chinese reconstruction. Lingua Sinica, 3(1), 5.

Katz, J. (2015). Hip-hop rhymes reiterate phonological typology. Lingua, 160, 54-73.

Rama, T., & List, J. M. (2019, July). An Automated Framework for Fast Cognate Detection and Bayesian Phylogenetic Inference in Computational Historical Linguistics. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 6225-6235).

List, J. M. (2016). Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution, 1(2), 119-136.

However, with rhyme (or literary data in general) you deal with a considerable amount of poetic license. Poets might expand the notion of what rhyme can be and reinforce this through additional schema consistency in the poem.
Depending on what you want to look at, this makes it somewhat hard to figure out what is licensed and what is not, or which feature is marked, and what is just ordinary language use.

Adam Hammond wrote an interesting article on the divide of the disciplines and how they respectively deal with their data.

Hammond, A., Brooke, J., & Hirst, G. (2013, June). A tale of two cultures: Bringing literary analysis and computational linguistics together. In Proceedings of the Workshop on Computational Linguistics for Literature (pp. 1-8).

He raises the point that literary scholarship might be deliberately interested in ambiguity or polysemy and thus, distinct from other analytic schools, aims not to resolve ambiguity but to describe and explore it (cf. Jakobson, or Empson). I also like Fotis’ point that humanities data mainly deals with artefacts that not only have a historical but also an aesthetic dimension. The problem is then that the humanistic approach is mainly one of ‘criticism’ (to determine aesthetic valuation or historical embedding), rather than of finding universal laws, or probing (computational) methodology.

In Computational Linguistics, by contrast, ambiguity is almost uniformly treated as a problem to be solved. The focus is on disambiguation, with the assumption that one true, correct interpretation exists. I’m assuming this is partly grounded in the computer sciences that traditionally aim at deterministic systems, or at least replicable results and consistent datasets, but also recognizing that there is often a finite number of possibilities to analyze a certain linguistic phenomenon (e.g. scope underspecification in syntax).

Hammond notes that computational work in the humanities either recognizes the challenge of “subjective” annotation, or tries to find aspects of texts which readers would not find particularly ambiguous, for example identifying major narrative threads or distinguishing author gender.

I wonder how this historically grown methodological divide shaped the descriptive inventory (terminology) of the respective disciplines, and how this inventory can be expanded, e.g., by reconciling the different epistemological interests. Also, I am very interested in how we could determine the boundaries of (necessary and sufficient) ambiguity depending on the problem we are studying, and how much (non-)ambiguity certain annotation workflows allow and how models built on these datasets can deal with that.

When we annotated emotions in poetry, we tried to integrate the best of both worlds.

Haider, Thomas, Steffen Eger, Evgeny Kim, Roman Klinger, and Winfried Menninghaus. (2020). “PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry.” LREC 2020. arXiv preprint arXiv:2003.07723.

The first batches showed seemingly conflicting annotation in some places, e.g. two annotators would label some lines/stanzas with ‘Humor and Vitality (found it animating)’, while annotator three annotated ‘Sadness’. Upon inspection, we found a case of Schadenfreude in this particular poem. The annotators did not really recognize that Rilke might have intended a mixed emotional reaction (they were not supposed to interpret the text anyway, because this is difficult and time consuming).

In another case we found that annotators agreed that certain lines of Georg Trakl elicited feelings of both ‘Awe/Sublime’ and ‘Uneasiness’. This reinforced our notion that we need multiple labels per instance (line) to cover the emotional range of poetry, while not losing sight of complexity (thus only allowing two labels per line).
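To make the "at most two labels per line" idea concrete, here is a minimal sketch of how such an annotation could be represented. This is not the PO-EMO data format; the names `LineAnnotation`, `annotate`, `shared_labels` and `MAX_LABELS` are hypothetical and only illustrate the constraint.

```python
from dataclasses import dataclass

MAX_LABELS = 2  # assumption: the cap of two labels per line described above


@dataclass(frozen=True)
class LineAnnotation:
    """One annotator's emotion labels for a single line of a poem."""
    line: str
    labels: frozenset

    def __post_init__(self):
        # Reject annotations that exceed the allowed number of labels.
        if len(self.labels) > MAX_LABELS:
            raise ValueError(
                f"at most {MAX_LABELS} labels per line, got {len(self.labels)}"
            )


def annotate(line, *labels):
    """Hypothetical helper: build an annotation from a line and its labels."""
    return LineAnnotation(line, frozenset(labels))


def shared_labels(a, b):
    """Labels that two annotations of the same line have in common."""
    return a.labels & b.labels
```

Treating the labels as a set rather than a single value means two annotators can partially agree on a line, which is exactly the situation in the Trakl example.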

In the end we decided to have the annotators create a gold standard of 48 poems through majority voting and discussion of how their different views might be reconciled. When they were annotating the rest of the dataset, we instructed them to annotate according to how they feel, and if they were not sure, to annotate according to the gold standard, i.e., how they think the others would annotate. That improved consistency to a point that is usable for computational modeling (.7 kappa).
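For readers who have not computed these numbers before, here is a rough sketch of majority voting and a kappa score on single-label toy data. It is not our actual pipeline: I am assuming Cohen's kappa averaged over annotator pairs (other variants such as Fleiss' kappa exist), and the function names and toy labels are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def majority_label(labels):
    """Return the label chosen by a strict majority, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy data: three annotators, four lines, one label each (made up).
ann1 = ["Sadness", "Sadness", "Humor", "Humor"]
ann2 = ["Sadness", "Sadness", "Humor", "Sadness"]
ann3 = ["Sadness", "Humor", "Humor", "Humor"]

gold = [majority_label(votes) for votes in zip(ann1, ann2, ann3)]
agreement = mean_pairwise_kappa([ann1, ann2, ann3])
```

Lines where `majority_label` returns `None` are the ones that need discussion rather than voting.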

These are my five cents. Hope it helps.
