As a historian, I often deal with time series, that is, data structured along some time scale.
However, since the number of data points is often quite limited, or unequally distributed over time, I often find myself struggling with how to model time.
For example, let’s say you only have information at the level of years. You could model the sequence of years as a continuous variable. Based on this, you can apply linear regression or more advanced smoothing functions.
On the other hand, years can also be approached as ordered, discrete variables, i.e. ordinal variables. This choice clearly impacts the way in which you model processes over time. Related to this is how to approach the distance between discretized values.
Without getting into Bergsonian debates on time and duration (we can have those in the theory section), I am curious to hear how others deal with this issue.
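To make the continuous-versus-ordinal choice concrete, here is a minimal sketch. The years are invented for illustration; the point is only that an ordinal encoding keeps the order of observations and discards the size of the gaps between them.

```python
# Hypothetical observation years with uneven gaps (illustrative only).
years = [1880, 1881, 1900, 1925]

# Continuous encoding: keep the calendar value, so gaps keep their size.
continuous = [float(y) for y in years]

# Ordinal encoding: keep only the rank of each distinct year.
rank_of = {year: rank for rank, year in enumerate(sorted(set(years)))}
ordinal = [rank_of[y] for y in years]


def gaps(values):
    """Differences between consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


continuous_gaps = gaps(continuous)  # 1, 19 and 25 years apart
ordinal_gaps = gaps(ordinal)        # every step counts as 1
```

Any model fitted on `ordinal` treats the 19-year jump and the 1-year jump as the same step, which is exactly the modeling decision at stake.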
Two thoughts off the top of my head. The first is that I often see DH beginners trying to make continuous or ordinal data work as categorical data. I don’t think it’s impossible to do this, but you’re often trying to fit a square peg into a round hole. In a classification task, for example, a text from 1925 assigned to the 1900-1924 group is simply in the wrong category rather than being scored as close to correct. (I’ve seen people put gaps between the date ranges to avoid this issue, e.g. classifying 1900-1920, then jumping to 1925-1945, but I don’t love this approach.) In my work, such as this ACH paper, I treated time as a continuous variable (linear regression) and got good results, but I had data for every year between 1880 and 1925. The advantage of treating time as a continuous variable is that every float value has meaning: 1900.5 is modeled as a value halfway between 1900 and 1901. Further, the distance between any two values is known, so a model that predicts 1898 for a ground truth of 1901 is off by 3 years. This allows you to calculate the mean absolute error, which can be useful.
My second thought is that, yes, there are cases where the data require treating time as ordinal, especially if we have a fuzzier sense of the timeline or if you’re more interested in order than in exact years. One example that comes to mind is modeling authorial periods when conducting stylochronometry, as in van Hulle, Dirk, and Mike Kestemont. “Periodizing Samuel Beckett’s Works: A Stylochronometric Approach.” Style 50, no. 2 (2016): 172-202. doi:10.1353/sty.2016.0003. If memory serves, they use an unsupervised clustering approach that weighed whether texts came directly before or after one another. @mike.kestemont I hope I’m not butchering this summary; feel free to correct me or add any detail I should have included.
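A small sketch of the contrast between scoring dates as distances and scoring them as categories. All years, predictions, and the 25-year bin width are made up for illustration, and `bin_label` is a hypothetical helper, not something from the paper mentioned above.

```python
# Hypothetical ground-truth and predicted publication years (illustrative only).
true_years = [1899, 1901, 1925, 1910]
predicted_years = [1900.5, 1898.0, 1924.0, 1910.0]


def mean_absolute_error(truth, predictions):
    """Average distance, in years, between predictions and ground truth."""
    return sum(abs(t - p) for t, p in zip(truth, predictions)) / len(truth)


def bin_label(year, start=1900, width=25):
    """Label of the fixed-width period containing `year` (assumed binning scheme)."""
    lower = start + ((int(year) - start) // width) * width
    return f"{lower}-{lower + width - 1}"


mae = mean_absolute_error(true_years, predicted_years)

# A text from 1925 predicted as 1924 is one year off in the regression view...
near_miss_error = abs(1925 - 1924.0)
# ...but it falls in a different bin, so a classifier scores it as simply wrong.
same_bin = bin_label(1925) == bin_label(1924.0)
```

With the regression view, the near miss contributes one year to the error; with bins, it counts exactly as much as a prediction that is decades off.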
Jip (and thanks for the mention!): in that paper, the order of the works was more important than the actual (scalar) time between them. We did something similar here, where we used Gries’ variability-based nearest neighbor clustering to mine TIME magazine’s archive for salient cultural breaking points (#1 = the end of WWII = no surprise): https://www.aclweb.org/anthology/W14-0609.pdf
Thanks for the thoughts and the pointers to articles.
You touch upon some of the issues I was referring to. Let’s see if we can dive a bit deeper into them. I’m not so much talking about classification tasks, in which I think binning time is indeed problematic, although sometimes necessary.
One of the difficulties in working with time is that people’s lived experience in a period is based on their perceptions of the past and future. This makes the perception of time more condensed in some periods than in others. This ties into treating time as a continuous variable: while I see that every float then has a meaning, the same distance between two floats does not necessarily carry the same meaning.
I often work with data that is fuzzy, has measurement errors, or is just missing. In those cases, ordinals can help, especially when modeling. Still, it feels weird treating time as such, since it’s not only the order that matters but also the structure of the intervals between points. Happy to hear about possible ways to estimate these measurement errors for time stamps.
In a current project, I approach time using multilevel analysis. I treat data points referring to time on a continuous scale while using varying intercepts for clusters that make sense (years, decades, certain regimes, etc.). In this case, we can (hopefully) account for some of the variance, and compensate for the lack of data in some periods, through such a grouping.
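A rough sketch of the varying-intercepts idea, not the actual project code. It assumes invented data, decades as the grouping, and a fixed `pooling` constant standing in for the variance ratio that a real multilevel fit (for example in brms, lme4, or statsmodels) would estimate from the data.

```python
from collections import defaultdict

# Hypothetical (year, value) observations; the 1890s are sparsely covered.
observations = [
    (1880, 10.0), (1882, 11.0), (1885, 12.5), (1888, 14.0),
    (1893, 20.0),
    (1900, 15.0), (1903, 16.5), (1906, 18.0), (1909, 19.5),
]


def decade(year):
    return (year // 10) * 10


def fit_varying_intercepts(data, origin=1880, pooling=2.0):
    """Common slope on (year - origin) plus a partially pooled intercept per decade.

    `pooling` is an assumed constant: larger values pull small groups harder
    toward the overall intercept.
    """
    groups = defaultdict(list)
    for year, value in data:
        groups[decade(year)].append((year - origin, value))

    # Common slope estimated from within-decade variation only.
    num = den = 0.0
    for points in groups.values():
        mean_x = sum(x for x, _ in points) / len(points)
        mean_y = sum(y for _, y in points) / len(points)
        num += sum((x - mean_x) * (y - mean_y) for x, y in points)
        den += sum((x - mean_x) ** 2 for x, _ in points)
    slope = num / den

    # Overall intercept, then decade intercepts shrunk toward it.
    all_points = [p for points in groups.values() for p in points]
    grand = sum(y - slope * x for x, y in all_points) / len(all_points)
    intercepts = {}
    for key, points in groups.items():
        raw = sum(y - slope * x for x, y in points) / len(points)
        weight = len(points) / (len(points) + pooling)
        intercepts[key] = grand + weight * (raw - grand)
    return slope, grand, intercepts


slope, grand, intercepts = fit_varying_intercepts(observations)
```

The single observation from the 1890s gets an intercept pulled strongly toward the overall level, which is how the grouping borrows strength for periods with little data.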
Another approach that I have applied with @knielbo is Adaptive Fractal Analysis. This algorithm adaptively detrends the time series to extract smoothed trends and then extracts different scaling regimes. In a way, these scaling regimes express the memory function, which we contend might be used as a proxy to study ‘lived experience/cultural memory’. The paper should be published soon; here’s an arXiv link.
Very interesting! The idea of a continuous timeline with varying intercepts is intriguing to me. I’m also interested in the history of advertisements, especially as they pertain to books and readers. I’ll check out that article.
I’m still working on it, but I’ll post the code on here sometime soon.
It’s not about historical data, but – forgive the self-promotion – in a recent paper, @mike.kestemont, @enrique.manjavacas and I experimented with so-called “monotonic effects” (as explained here and implemented in the R package brms), which allow you to model ordinal predictors without assuming them to be equidistant to some response variable. We applied it in the context of responses to a game, and it helped to show that the beginning of the game was more important to participants than the end.
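For readers who have not seen monotonic effects, here is a minimal sketch of the parameterization as I understand it from the brms documentation: an ordinal predictor with levels 0..D contributes b * D * (sum of the first x entries of zeta), where zeta is a simplex of step shares. The values of `b` and `zeta` below are invented, not estimates from the paper.

```python
def monotonic_effect(x, b, zeta):
    """Contribution of ordinal level x (0..D) under a monotonic effect.

    b    : average difference between adjacent levels
    zeta : D non-negative step shares that sum to 1 (a simplex)
    """
    D = len(zeta)
    if not 0 <= x <= D:
        raise ValueError("x must be between 0 and len(zeta)")
    if any(z < 0 for z in zeta) or abs(sum(zeta) - 1.0) > 1e-9:
        raise ValueError("zeta must be non-negative and sum to 1")
    return b * D * sum(zeta[:x])


# Hypothetical values: four ordered levels, with the first step carrying half the change.
b = 2.0
zeta = [0.5, 0.25, 0.25]

effects = [monotonic_effect(x, b, zeta) for x in range(len(zeta) + 1)]
steps = [later - earlier for earlier, later in zip(effects, effects[1:])]
```

The steps between adjacent levels are unequal but all share the sign of `b`, so the levels are ordered without being assumed equidistant.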
If I’m correct, monotonic effects can only capture monotonic trends, either going up or down, but not non-monotonic shapes such as higher-order polynomials. Still, it’s exactly this I’m trying to capture: contractions and expansions of time as a proxy for the lived experience of a period.
Alexander Koplenig has a number of papers demonstrating how things can go wrong if you mistreat linguistic time series data, which could be useful, e.g. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0150771
Also, shameless self-promotion, but our recent paper deals with the consequences of binning corpus time series into discrete chunks (of variable size) in statistical testing: https://www.glossa-journal.org/article/10.5334/gjgl.909/