Hey, can anyone recommend readings on the basics of validation? (Esp. within data-intensive but insight-driven research workflows.)
Basics on how, when, and why to check your search/model/algorithm results, building gold standards, manual annotation tips, terminology, synthetic data, etc.?
I'm looking for basic papers for students in an introductory data+dh course, but also to improve my own skills on this. It is one of those topics that is perhaps widely ignored or done unsystematically in DH papers. On the other hand, I'm not sure if simple NLP foundations on this give the most relevant tips. (E.g. NLP usually has ground truth to rely on, and complex insight-driven workflows may need checking at various points in the cycle.) But maybe all the tips are the same…
Although it is mainly about Bayesian statistics, I can wholeheartedly recommend McElreath’s book Statistical Rethinking. Especially for insight-driven quantitative research, this work offers many leads.
Thank you!! I also love Statistical Rethinking and should pick up the slack and finally work my way through it properly.
But indeed, I’m looking for some more general readings about when and how you should check your results when doing this kind of computational, data-intensive humanities analysis. In statistics, these questions are indeed at the forefront.
I have found this really short article by Epstein on modeling, ‘Why Model?’, very illuminating when trying to explain models. I agree that the ways in which validation works in NLP are not always as useful for humanists. I think this article helps to reframe how we view models and the role they have in the production of knowledge.