Dataset

The full dataset is now available! You can download it from Zenodo.

The dataset consists largely of data from the CATMuS dataset, sampled from ~250 European medieval manuscripts written between the 8th and 16th centuries CE. All transcriptions were prepared by qualified paleographers following common transcription norms, which ensures consistency and quality across the corpus.

The data is published as whole-page facsimiles with line-level segmentation in ALTO XML.
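
As a rough illustration of working with line-level ALTO XML, the sketch below reads the text lines of one page using only the Python standard library. The embedded sample page, its IDs, coordinates, and text are invented for the example and are not taken from the dataset. It assumes each line's transcription is stored in the CONTENT attribute of String elements, as is usual in ALTO; the files in the download may use a different ALTO version or carry extra attributes, so treat this as a starting point rather than a reference loader.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO page, made up for this example (not real dataset content).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="page_1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1" HPOS="100" VPOS="200" WIDTH="1500" HEIGHT="80">
            <String CONTENT="In principio erat"/>
          </TextLine>
          <TextLine ID="line_2" HPOS="100" VPOS="300" WIDTH="1400" HEIGHT="80">
            <String CONTENT="verbum"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(alto_xml):
    """Return one dict per TextLine: its ID, bounding box, and transcription."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # A line may hold several String elements; join their CONTENT values.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        # Bounding box as (HPOS, VPOS, WIDTH, HEIGHT); missing values become 0.
        bbox = tuple(
            float(element.get(name, 0))
            for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        lines.append({"id": element.get("ID"), "bbox": bbox, "text": " ".join(words)})
    return lines


if __name__ == "__main__":
    for line in extract_lines(SAMPLE_ALTO):
        print(line["id"], line["bbox"], line["text"])
```

To read a downloaded page instead of the sample, pass the file's contents to `extract_lines`. Matching on local tag names rather than a fixed namespace is a deliberate choice here, since the namespace URI differs between ALTO versions.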