Automated extraction of page layout features

joris.van.zundert · April 23, 2020, 9:32am

Hi all,

Not sure whether this files under ‘data’, ‘modeling’, or both. In any case, I have a student who’s interested in extracting page structure features from late manuscripts and early prints. Think coordinates of images and figures, ratio of area, dimensions, etc. She is considering ImageJ, which would work but would involve a considerable amount of manual work, requiring time which… who has time, right?

So I was wondering what the current state of automating such feature extraction is, and whether existing tools will work somewhat reliably on this kind of historical material.

Thanks for any leads!
Cheers
–Joris

melvin.wevers · April 23, 2020, 9:38am

These tools/implementations might be useful.
https://github.com/LibraryOfCongress/newspaper-navigator & https://github.com/dhlab-epfl/dhSegment & https://github.com/leonlulu/DeepLayout

You might need to annotate your material and finetune the models included in these projects. Often you don’t need a lot of training material for these tasks.
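Once a segmentation model has produced regions, the features mentioned in the original question (coordinates, dimensions, ratio of area) are cheap to compute. The following is only a minimal sketch, assuming the model output has already been converted to axis-aligned bounding boxes in pixel coordinates. The `Box` class and `layout_features` function are hypothetical names for illustration and are not part of any of the tools linked above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels (hypothetical format, not a tool's API)."""
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


def layout_features(page_width: int, page_height: int, boxes: list[Box]) -> list[dict]:
    """Return position, size, and area ratio for each detected region."""
    page_area = page_width * page_height
    features = []
    for box in boxes:
        features.append({
            "x": box.x,
            "y": box.y,
            "width": box.width,
            "height": box.height,
            # Share of the page covered by this region.
            "area_ratio": box.area / page_area,
            # Width divided by height of the region.
            "aspect_ratio": box.width / box.height,
        })
    return features


# Assumed example: one figure on a 1000 x 1500 pixel page scan.
example = layout_features(1000, 1500, [Box(x=100, y=200, width=400, height=300)])
```

The sketch treats each region independently, so overlapping regions would be counted twice if the area ratios were summed.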

frederik.elwert · April 28, 2020, 2:27pm

I guess this paper might also be helpful?

melvin.wevers · April 29, 2020, 7:09am

Thanks, I’m going to check this paper out. I’m trying to get a group of newspaper researchers together to create an annotated set of newspapers from different periods/regions.

frederik.elwert · April 29, 2020, 7:47am

In the German DH association DHd, there is a working group for historical newspapers and magazines (http://dig-hum.de/ag-zeitungen-zeitschriften). They might be a good candidate for networking.

nils.reiter · April 29, 2020, 6:24pm

There is also https://grobid.readthedocs.io/en/latest/Introduction/

bencomp · March 1, 2023, 11:07pm

As this is still of interest to (commercial) research, you may also be interested in ‘A survey of historical document image datasets’.

I learnt over the past days and weeks that there are many similar terms for this task, which are related in different ways. For example, ‘Visual element extraction’, ‘document understanding’, ‘document layout analysis’ are not the same, but may be relevant when looking for useful methods.