Automated extraction of page layout features

Hi all,

Not sure whether this falls under ‘data’, ‘modeling’, or both. In any case, I have a student who’s interested in extracting page structure features from late manuscripts and early prints. Think coordinates of images and figures, ratio of area, dimensions, etc. She is considering ImageJ, which would work but would involve a considerable amount of handcrafted work, requiring time which… who has time, right?

So I was wondering what the current state of automating such feature extraction is, and whether existing tools would work somewhat reliably on this kind of historical material.

Thanks for any leads!
Cheers
–Joris


These tools/implementations might be useful.
https://github.com/LibraryOfCongress/newspaper-navigator, https://github.com/dhlab-epfl/dhSegment, and https://github.com/leonlulu/DeepLayout

You might need to annotate your material and fine-tune the models included in these projects. Often you don’t need a lot of training material for these tasks.
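
Once a model gives you regions, getting from there to the features mentioned in the question (coordinates, ratio of area, dimensions) is mostly arithmetic. Here is a minimal sketch in plain Python, assuming the detector output has already been reduced to pixel bounding boxes with a top-left origin. The `Box` class, the function names, and the example numbers are made up for illustration and are not part of any of the projects above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned region in pixels, origin at the page's top-left."""
    label: str
    x: float
    y: float
    width: float
    height: float


def layout_features(box, page_width, page_height):
    """Convert one detected region into page-relative layout features."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    return {
        "label": box.label,
        # Position and size as fractions of the page, so scans of
        # different resolutions can be compared directly.
        "x_rel": box.x / page_width,
        "y_rel": box.y / page_height,
        "width_rel": box.width / page_width,
        "height_rel": box.height / page_height,
        # Share of the page surface taken up by this region.
        "area_ratio": (box.width * box.height) / (page_width * page_height),
        # Width divided by height; None for a degenerate zero-height box.
        "aspect_ratio": box.width / box.height if box.height else None,
    }


def page_summary(boxes, page_width, page_height):
    """Collect per-region features plus simple page-level totals."""
    regions = [layout_features(b, page_width, page_height) for b in boxes]
    return {
        "regions": regions,
        "n_regions": len(regions),
        # Plain sum of region areas: overlapping regions are counted twice.
        "total_area_ratio": sum(r["area_ratio"] for r in regions),
    }


if __name__ == "__main__":
    # Invented example page of 1000 x 1500 pixels with two regions.
    boxes = [
        Box("figure", 100, 200, 400, 300),
        Box("initial", 50, 50, 100, 100),
    ]
    summary = page_summary(boxes, page_width=1000, page_height=1500)
    for region in summary["regions"]:
        print(region)
    print(summary["n_regions"], summary["total_area_ratio"])
```

Dividing by the page size keeps the features comparable across scans of different resolution, which probably matters when the material comes from several collections. If a model returns pixel masks or polygons rather than boxes, they would need to be converted to boxes first.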


I guess this paper might also be helpful?


Thanks, I’m going to check this paper out. I’m trying to get a group of newspaper researchers together to create an annotated set of newspapers from different periods/regions.


In the German DH association DHd, there is a working group for historical newspapers and magazines that might be a good candidate for networking: http://dig-hum.de/ag-zeitungen-zeitschriften


There is also https://grobid.readthedocs.io/en/latest/Introduction/


As this is still of interest to (commercial) research, you may also be interested in the paper ‘A survey of historical document image datasets’.

I learnt over the past days and weeks that there are many similar terms for this task, which are related in different ways. For example, ‘visual element extraction’, ‘document understanding’, and ‘document layout analysis’ are not the same thing, but all may be relevant when searching for useful methods.