Ask us questions about the MacBERTh model(s)

We’re happy to announce the release of the first ‘MacBERTh’ model!

We’d like to use this forum to allow people to approach us with any questions. Ask away, and @enrique.manjavacas and @l.fonteyn will try to help you as much as they can.

More information on the project can be found on our website.

MacBERTh is the cover term for a suite of language models (more specifically, BERT models) pre-trained on historical textual material (date range: 1450-1950).

Researchers who interpret and analyse historical textual material are well aware that languages change over time, and that the way in which concepts and discourses of class, gender, norms and prestige function varies across time periods. As such, it is quite important that the interpretation of textual/linguistic material from the past is not approached from a present-day point of view, which is why NLP models pre-trained on present-day language data are less than ideal candidates for the job. That’s where MacBERTh can help.

At present, a model pre-trained on historical English (1450-1950) has been published on the Hugging Face model hub. The release of a Dutch historical model is planned for 2022.
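For readers who want to try the English model straight away, here is a minimal sketch using the `transformers` library. The model id `emanjavacas/MacBERTh` and the example sentence are assumptions on our part; check the Hugging Face hub for the exact identifier.

```python
# Minimal sketch: load MacBERTh from the Hugging Face hub and run
# masked-token prediction. The model id "emanjavacas/MacBERTh" is an
# assumption -- check the hub for the exact identifier.
from transformers import pipeline

fill = pipeline("fill-mask", model="emanjavacas/MacBERTh")

# A historically-spelled example sentence (hypothetical).
predictions = fill("The kynge hath lost his [MASK].")
for pred in predictions:
    print(pred["token_str"], round(pred["score"], 3))
```

By default the pipeline returns the five most probable fillers for the masked position, each with a probability score.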

For publications and talks related to the project, as well as information on how to cite us, please check our website. If you end up using our model for your research, we would love to highlight your work and refer others to it on our website as well, so please do not hesitate to contact us to share it.


Wow, this is fantastic! Thanks for your team’s hard work on this and for releasing the model. (Also for the great name, lol).

Since you asked for questions, my first Q would be: is it possible to somehow isolate a date range from within the model to use (say, 1700-1800)? I’m new to BERT, so I apologize if this is a naïve question. Or, have you found that the model is able to model language from specific slices of its historical range? The time span is quite large and stretches over a lot of spelling reforms within the language: I’m just wondering if that presents a problem for people working after those reforms, or if the fact that the model crosses them is indeed an advantage, because it can connect words pre- and post-reform (vertu, vertue, virtu, virtue, etc.).

Thanks again and I look forward to meeting and working with MacBERTh soon!

All best,


Hi Ryan,

Thanks for the nice words! :slight_smile:

We’ve already had some questions about the periodization of the model. The easiest route would be to train a separate model per period (assuming we have a clear, general and meaningful way of splitting the history of English, which I doubt). But the problem is that you’d be training on smaller portions of data, and that kind of defeats the purpose of training large language models. More interesting would be to fine-tune the general model (using the BERT objective) on whatever portion you are interested in for your particular use case. This has been done before by Hosseini et al., and it’ll probably produce more accurate representations for your period.

But more generally, I think the point is that, in the end, the time of writing is just another contextual factor that these “contextualized” models should be able to take into account. Given sufficient context, these models should (at least hypothetically) be able to infer that a text stems from a period with a particular spelling, or a period in which the target token has one word sense but not another. I’d like to argue (and to test) that, given enough data, isolation shouldn’t be necessary, in the same way that we do not find it necessary to build models for contemporary languages based on, for instance, the genre a text belongs to.


This is very exciting! I very much look forward to hearing more about MacBERTh and giving it a try!
I have more of a request than a question: could you please consider including a detailed tutorial (or tutorials) for using MacBERTh on your website? A Jupyter notebook or a Google Colab notebook with actual examples of usage would be wonderfully helpful. Perhaps you could even consider putting some tutorials on the Programming Historian website.
I make these requests (or suggestions) as someone who is eager to use MacBERTh, but lacks strong coding skills. This past week I’ve been playing with BERT, FlauBERT, and CamemBERT, trying my best to generate and visualize word embeddings. There are tons of online tutorials for BERT, but many (if not most) seem to assume a certain level of technical proficiency and aren’t always easy to follow.
Thank you!
P.S. Your website looks great!


Hi @mriggs!

I have good news: the plan is indeed to complement the model with supplementary materials (the original plan was Jupyter notebooks, but I’ve been looking into Google Colab too). In 2022, we hope to conduct a couple of template studies on humanities subjects (involving, for instance, sense disambiguation, synonymy detection and sentiment analysis), which we will document in notebook form so that future users can adapt the code to their own case study. I can’t tell you how fast we’ll be – I tend to be quite optimistic, and then life happens – but this is a project deliverable, so it should happen (or else I will be chased by some people with pitchforks :slight_smile:)
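In the meantime, since the question above mentioned generating and visualizing word embeddings, here is a small sketch of what such a notebook might start from: extracting contextual embeddings for spelling variants of one word and projecting them to two dimensions for plotting. The model id `emanjavacas/MacBERTh` and the example sentences are assumptions.

```python
# Sketch: contextual embeddings for spelling variants of "virtue",
# projected to 2-D with PCA for a quick scatter plot. The model id
# "emanjavacas/MacBERTh" and the sentences are assumptions.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

model_id = "emanjavacas/MacBERTh"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

examples = [
    ("vertue", "A man of vertue is rewarded in heauen."),
    ("virtue", "A man of virtue is rewarded in heaven."),
    ("virtu",  "A man of virtu deserues his praise."),
]

vectors = []
for word, sentence in examples:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
    # Average the sub-word pieces belonging to the target word
    # (simplistic matching on token ids -- fine for these toy sentences).
    pieces = set(tokenizer(word, add_special_tokens=False)["input_ids"])
    positions = [i for i, t in enumerate(enc["input_ids"][0].tolist())
                 if t in pieces]
    vectors.append(hidden[positions].mean(dim=0))

# Two principal components are enough for a quick 2-D view.
coords = PCA(n_components=2).fit_transform(torch.stack(vectors).numpy())
```

The `coords` array can then be passed to matplotlib (or any plotting tool) to see whether the variants cluster together.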


Absolutely brilliant!!
With these supplementary materials, could I also request that you include something on fine-tuning the model? For example, if I wanted to analyze period-specific or subject-specific texts, I would be eager to know how to go about further training MacBERTh on a period-specific or subject-specific corpus, or whether that would even be necessary.
No need to respond! Enjoy your Sunday! :slight_smile:
