Results
Of the 26 registered teams, 12 submitted results, with all 12 participating in Task 1, 9 in Task 2, and 9 in Task 3. Three teams (Qianfan-OCR, STUDIUM.AI, and nampfiev1995) submitted only to Task 1. Over 300 individual submissions were recorded across the three tasks. Participants were permitted to use proprietary methods as well as additional public or non-public data beyond the competition training set.
The organizers’ baseline was obtained using the kraken OCR engine with the CATMuS Medieval 1.6 model, a general-purpose recognition model for medieval Latin-script manuscripts. No further adaptation or optimization was applied for any of the tasks.
In the tables below, each team’s best submission is reported. CER and WER values are given as percentages (lower is better). The organizers’ baseline is marked with †. One submission by the PERO group was received after the competition deadline, due to confusion about the permissibility of (pre-)training on proprietary datasets; the predictions submitted after the deadline were produced by their production model and are reported separately, marked with ★. Team names link to the method descriptions further down this page.
Task 1: Multilingual Recognition
Macro-averaged CER and WER across French, Latin, and Spanish (%).
| Team | CER (%) | WER (%) |
|---|---|---|
| PERO | 7.71 | 26.43 |
| Swedish National Archives’ AI lab | 7.73 | 27.79 |
| École nationale des chartes | 8.03 | 26.70 |
| PRHLT-tS | 8.23 | 28.43 |
| FamilySearch & GGateway | 9.02 | 30.21 |
| TeamAnannya | 9.09 | 30.50 |
| Baseline† | 9.30 | 32.24 |
| Teodor Bors | 9.61 | 32.39 |
| Qianfan-OCR | 9.94 | 32.29 |
| STUDIUM.AI | 12.87 | 45.95 |
| Devansh Gupta | 14.81 | 38.85 |
| flame_cai (FLAME University) | 15.48 | 48.56 |
| nampfiev1995 | 42.38 | 87.61 |
| PERO★ | 6.94 | 24.11 |
Task 2: Intra-language Family Generalization (Occitan)
CER and WER on the Occitan test set (%).
| Team | CER (%) | WER (%) |
|---|---|---|
| École nationale des chartes | 5.01 | 20.74 |
| PERO | 6.58 | 29.90 |
| Swedish National Archives’ AI lab | 6.83 | 32.35 |
| PRHLT-tS | 7.43 | 34.78 |
| Baseline† | 7.92 | 36.84 |
| FamilySearch & GGateway | 8.50 | 38.11 |
| TeamAnannya | 9.56 | 35.83 |
| Teodor Bors | 9.71 | 41.04 |
| Devansh Gupta | 13.08 | 41.52 |
| flame_cai (FLAME University) | 13.74 | 48.68 |
| PERO★ | 6.08 | 29.85 |
Task 3: Cross-language Family Generalization (Czech)
CER and WER on the Czech test set (%).
| Team | CER (%) | WER (%) |
|---|---|---|
| PERO | 10.27 | 52.44 |
| École nationale des chartes | 10.79 | 52.64 |
| Swedish National Archives’ AI lab | 23.39 | 76.76 |
| FamilySearch & GGateway | 24.04 | 77.07 |
| Baseline† | 25.92 | 78.98 |
| PRHLT-tS | 25.99 | 80.86 |
| TeamAnannya | 27.17 | 80.36 |
| Teodor Bors | 27.38 | 80.96 |
| flame_cai (FLAME University) | 29.04 | 82.46 |
| Devansh Gupta | 33.35 | 84.16 |
| PERO★ | 10.11 | 52.33 |
Method descriptions
Descriptions were provided by the participating teams. Only the nampfiev1995 team did not submit a system description.
PERO
The OCR pipeline is built on the pero-ocr framework. A ParseNet-based text line detector, trained on the PERO layout dataset, is used to refine the geometry of the provided text lines. This refinement is applied both to the competition data and to additional datasets used for training enrichment: CATMuS Medieval, DISTINGUO, LAM, Rodrigo, TRIDIS, collections of Castilian and Sevillian medieval manuscripts, the Padeřov Bible, and TranscriboQuest 2025 Medieval Vernacular Religious Texts. pero-ocr is further used to crop text lines for training and, after training, to transcribe the test data. During inference, prefix search decoding with beam size 16 is applied to construct a confusion network, from which the most probable path is selected as the final transcription.
The text recognition model is an encoder–decoder Transformer capable of processing arbitrarily long text lines with a normalized height of 48 pixels. The encoder consists of a VGG-like convolutional backbone for visual feature extraction, followed by Transformer encoder layers. The decoder is composed of Transformer decoder layers and generates the output sequence autoregressively at the character level. Both encoder and decoder contain 6 Transformer layers with dimensionality 512, 8 attention heads, and an MLP inner dimensionality of 2048. Training was performed for 250k iterations with batch size 64 and learning rate 2 × 10⁻⁴, reduced by half after 50k and 150k iterations. Data augmentation includes blurring, noise injection, local geometric distortions, and masking.
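As a rough sketch of the recognizer's overall shape (not the PERO team's code), the following PyTorch module wires a convolutional backbone into an encoder–decoder Transformer with the dimensions reported above; the backbone layout, the vocabulary size, and the omission of positional encodings are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LineRecognizer(nn.Module):
    """Minimal sketch of a CNN + Transformer encoder-decoder line recognizer with
    the reported dimensions (6+6 layers, d_model=512, 8 heads, FFN 2048). The
    "VGG-like" backbone below is a stand-in, and positional encodings are omitted."""
    def __init__(self, vocab_size=200, d_model=512, nhead=8, ffn=2048, layers=6):
        super().__init__()
        # Placeholder feature extractor: collapses the 48 px line height and
        # yields one d_model-dimensional feature column per horizontal position.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                         # 48 -> 24
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                         # 24 -> 12
            nn.Conv2d(128, d_model, (12, 1)),        # collapse remaining height
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            dim_feedforward=ffn, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, line_img, tgt_tokens):
        # line_img: (B, 1, 48, W); tgt_tokens: (B, T) previously emitted characters
        feats = self.backbone(line_img)              # (B, d_model, 1, W')
        visual_seq = feats.squeeze(2).transpose(1, 2)  # (B, W', d_model)
        tgt = self.embed(tgt_tokens)                 # (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.transformer(visual_seq, tgt, tgt_mask=mask)
        return self.out(dec)                         # (B, T, vocab_size)
```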
Production model (after-deadline submission). The production model pipeline follows the same PERO system — ParseNet-based line refinement, line cropping and line transcription using pero-ocr — with a larger Transformer text recognizer for arbitrarily long line images normalized to a height of 48 pixels. The encoder combines a VGG-like feature extractor with 12 Transformer encoder layers of dimensionality 1024, 16 attention heads, and an MLP inner dimensionality of 4096. The decoder contains 10 Transformer layers with dimensionality 768, 10 attention heads, and an MLP inner dimensionality of 3072. Training was conducted for 25k iterations with batch size 64 and learning rate 1 × 10⁻⁴, reduced by half after 6k iterations. The same augmentation set was used. The production model is available for free through the ScribbleSense web application.
Swedish National Archives' AI lab
The model is TrOCR with two modifications relative to the original. The first is an image resolution of 192 × 1024 instead of the default 384 × 384, which means slightly larger images in terms of pixel count but, more importantly, higher resolution along the reading direction. The second is a byte-level vocabulary (the ByT5 tokenizer from Hugging Face’s transformers library). The team hypothesizes that this change makes the model less reliant on language patterns in the training data and thus better at generalizing to unseen languages.
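A minimal sketch of how such a modification might be set up with the transformers library is shown below; it is not the team's training code, and details such as adapting the ViT encoder's position embeddings to the new resolution or choosing the decoder start token are assumptions.

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, AutoTokenizer

# Start from a public TrOCR checkpoint and swap in a byte-level tokenizer.
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")

# Byte-level vocabulary via the ByT5 tokenizer (raw byte ids plus special tokens),
# intended to reduce reliance on language-specific patterns in the training data.
byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")
processor.tokenizer = byte_tok

# Resize the decoder's embedding/output layers and update special-token ids.
model.decoder.resize_token_embeddings(len(byte_tok))
model.config.decoder.vocab_size = len(byte_tok)
model.config.pad_token_id = byte_tok.pad_token_id
model.config.eos_token_id = byte_tok.eos_token_id
model.config.decoder_start_token_id = byte_tok.pad_token_id  # assumption

# Wider input along the reading direction (192 x 1024 instead of 384 x 384);
# adapting the ViT encoder's position embeddings to this size is omitted here.
processor.image_processor.size = {"height": 192, "width": 1024}
```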
The model was trained on the provided dataset, starting from a checkpoint already trained on historical Swedish, which in turn was based on the microsoft/trocr-large-handwritten checkpoint. This was the most practical choice for the competition: the Swedish TrOCR had been trained with the same modifications, so starting from that checkpoint was expected to speed up convergence. Because the base model was Swedish, the final vocabulary includes the Swedish letters å, ä, ö, Å, Ä, Ö, which were added when training the base model.
Images were cropped based on their polygons’ bounding boxes, with any remaining areas outside the polygon masked out. Randomized image augmentation was applied (brightness/contrast changes, noise, blur, and simulated bleed-through). The model was trained for 4 epochs with a learning rate of 1 × 10⁻⁵ on a 90/10 train/validation split.
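A generic sketch of this line-extraction step is given below, assuming PIL/NumPy and a white fill outside the polygon (the fill value is not specified in the description).

```python
import numpy as np
from PIL import Image, ImageDraw

def crop_line(page: Image.Image, polygon: list[tuple[int, int]]) -> Image.Image:
    """Crop a text line to its polygon's bounding box and mask out everything
    outside the polygon (a generic sketch of the preprocessing described above)."""
    xs, ys = zip(*polygon)
    box = (min(xs), min(ys), max(xs), max(ys))
    crop = page.convert("RGB").crop(box)

    # Rasterize the polygon as a binary mask in the cropped coordinate frame.
    mask = Image.new("L", crop.size, 0)
    shifted = [(x - box[0], y - box[1]) for x, y in polygon]
    ImageDraw.Draw(mask).polygon(shifted, fill=255)

    # Replace pixels outside the polygon (here: with white).
    arr = np.array(crop)
    arr[np.array(mask) == 0] = 255
    return Image.fromarray(arr)
```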
MEDUSA (École nationale des chartes)
MEDUSA (Medieval European Documents Unified System for Automated text recognition) was developed by a team at the École nationale des chartes – PSL. The system’s data pipeline draws on more than twenty repositories of European medieval HTR data, organized into three quality tiers: Platinum (image/ALTO XML pairs following the CATMuS guidelines), Gold (image/ALTO XML pairs following different guidelines), and Silver (text only). An original test set of 63 pages spanning 36 documents across 13 languages was produced to ensure fair comparison, alongside new Platinum training data, including a dedicated Occitan dataset of 55 pages from 10 documents. The total training corpus of Gold and Platinum data amounts to approximately 643,000 lines.
The model architecture is built around multi-stage fine-tuning of the Qwen 3.5 Vision Language Model in its 2B, 4B, and 9B parameter variants, using LoRA fine-tuning via the Unsloth framework. This approach capitalizes on the model’s existing multilingual competence, directing training effort toward adapting to medieval scripts rather than learning transcription from scratch. Prompts are adapted to dataset-specific conventions for Gold data, making the architecture naturally suited to multi-stage training. The best-performing configuration uses the 9B variant trained over three epochs on mixed Gold and Platinum data, followed by three additional Platinum-only epochs.
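The team fine-tuned through Unsloth; the sketch below only illustrates the general shape of LoRA adaptation of a vision-language backbone with the peft library, using a placeholder model identifier and illustrative rank/target settings rather than MEDUSA's actual configuration.

```python
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Placeholder identifier; the team fine-tuned 2B/4B/9B Qwen VLM variants via Unsloth.
BASE_MODEL = "path/to/qwen-vlm-9b"

model = AutoModelForVision2Seq.from_pretrained(BASE_MODEL)

# Generic LoRA setup (rank and target modules are illustrative): only low-rank
# adapters on the attention projections are trained, so the backbone's existing
# multilingual competence is preserved while it adapts to medieval scripts.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```

Multi-stage training then amounts to running this fine-tuning first on the mixed Gold and Platinum data and continuing on Platinum-only data, as described above.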
MEDUSA improves over the kraken/CATMuS 1.6 baseline by 1.3 CER percentage points on Task 1, 2.9 on Task 2, and 15.1 on Task 3, confirming the viability of multi-stage VLM fine-tuning for multilingual medieval HTR. The authors identify several directions for future work, including page-level segmentation and reading-order prediction, studying the effect of varying transcription conventions on model performance, and exploring specialization by language family. A Silver pre-training stage using text-only data and the generation of 500,000 synthetic lines for augmentation are currently ongoing.
Full system description: MEDUSA — Medieval Universal Script Analysis (PDF).
PRHLT-tS
All three tasks share a common visual encoder based on the LaiaCRNN architecture: a compact VGG-style CNN with 3×3 average pre-pooling (stride 3), four convolutional blocks with 3×3 kernels, LeakyReLU activations, batch normalization, and channel sizes [12, 24, 48, 96]. Max pooling 2×2 follows the third block. The resulting feature maps are height-normalized to H′ = 16 by adaptive average pooling and stacked vertically to produce a left-to-right sequence of visual tokens. Input images were rescaled while preserving aspect ratio so that each line contains roughly 3 to 5 frames per token, with the token count first estimated by an initial model before retraining from scratch on normalized images. Dynamic data augmentation through random affine distortions was applied throughout.
For Tasks 1 and 2, context is modeled with a 3-layer bidirectional LSTM (256 units per direction) followed by a linear classifier, with dropout of 0.5 before and after the LSTM layers. Optimization uses RMSProp with ReduceLROnPlateau scheduling on validation CER. Decoding employs an 8-gram character language model trained with SRILM on the training transcriptions.
For Task 3, the bidirectional LSTM was replaced by a 3-layer Transformer encoder with D = 512 and feedforward dimension 2048, followed by a linear classifier. Optimization uses AdamW (also with ReduceLROnPlateau scheduling), and decoding employs a 10-gram character language model. No external data were used for any task, either to train the neural recognizer or to build the language models.
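A PyTorch sketch of the Task 1/2 configuration follows, using the channel sizes, pooling, height normalization, BLSTM size, and dropout reported above; the input channel count, kernel strides, and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class LaiaLikeEncoder(nn.Module):
    """Sketch of the shared visual encoder described above (channel sizes, pooling,
    and height normalization as reported; other details assumed)."""
    def __init__(self, in_ch=1, channels=(12, 24, 48, 96), h_prime=16):
        super().__init__()
        blocks, prev = [], in_ch
        for i, ch in enumerate(channels):
            blocks += [nn.Conv2d(prev, ch, 3, padding=1),
                       nn.BatchNorm2d(ch),
                       nn.LeakyReLU()]
            if i == 2:                                  # 2x2 max pooling after block 3
                blocks.append(nn.MaxPool2d(2))
            prev = ch
        self.pre_pool = nn.AvgPool2d(3, stride=3)       # 3x3 average pre-pooling
        self.cnn = nn.Sequential(*blocks)
        self.h_prime = h_prime

    def forward(self, x):                               # x: (B, 1, H, W) line image
        f = self.cnn(self.pre_pool(x))                  # (B, C, H', W')
        f = nn.functional.adaptive_avg_pool2d(f, (self.h_prime, f.size(-1)))
        b, c, h, w = f.shape
        # Stack the height dimension into the feature axis: one token per column.
        return f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # (B, W', C*H')

class CRNNRecognizer(nn.Module):
    """Encoder + 3-layer BLSTM (256 units per direction) + linear classifier,
    trained with CTC (Task 1/2 configuration)."""
    def __init__(self, vocab_size=150):
        super().__init__()
        self.encoder = LaiaLikeEncoder()
        self.dropout = nn.Dropout(0.5)
        self.blstm = nn.LSTM(96 * 16, 256, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, vocab_size + 1)   # +1 for CTC blank

    def forward(self, x):
        seq = self.dropout(self.encoder(x))
        out, _ = self.blstm(seq)
        return self.classifier(self.dropout(out)).log_softmax(-1)
```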
FamilySearch & GGateway
For Task 1, the images were segmented into text lines and preprocessed using background noise reduction and straightening. A CNN-LSTM+CTC model was then trained on those text lines, together with a KenLM n-gram language model learned from the training transcriptions. The two were combined at decoding time to produce the final predictions.
For Tasks 2 and 3, the team developed a custom model that combines a ResNet backbone with an FPN layer for visual feature extraction, followed by a stack of Conformer blocks for sequence modeling. The Conformer blocks contain multi-head self-attention layers that capture long-range contextual dependencies. For recognition, a combined prediction head consisting of a primary CTC classifier and a lightweight Transformer decoder (attending to the encoder output with causal masking) is used. Since there are two prediction heads, the model was trained with a combined loss function weighted toward the CTC objective, using 80% weight for the CTC loss and 20% weight for the cross-entropy loss of the Transformer decoder. Optimization uses AdamW with cosine learning-rate decay. To improve generalization, epoch-wise random augmentation policies were applied, covering photometric distortion, blur and noise, geometric transforms, morphological operations, and document degradation, so that each epoch exposed the model to different document-style variations while preserving text content. All experiments were conducted on an NVIDIA Tesla T4 GPU with 15 GB VRAM.
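A minimal sketch of the 80/20 weighted two-head objective, assuming standard PyTorch CTC and cross-entropy losses (the team's exact implementation is not described):

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss(ignore_index=-100)

def combined_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                  decoder_logits, decoder_targets,
                  ctc_weight=0.8, ce_weight=0.2):
    """ctc_log_probs: (T, B, V) log-probabilities from the CTC head;
    decoder_logits: (B, T_dec, V) from the lightweight Transformer decoder head."""
    l_ctc = ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths)
    l_ce = ce_loss(decoder_logits.transpose(1, 2), decoder_targets)
    return ctc_weight * l_ctc + ce_weight * l_ce
```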
TeamAnannya
The system combines three structurally different HTR architectures via ROVER character-level alignment to leverage architectural diversity for improved recognition. The primary component is an ensemble of five TrOCR models (ViT-large encoder + RoBERTa medieval decoder), each fine-tuned from the TrIDIs v1 base model on the competition training data with different augmentation strategies (baseline, rotation ±2.5°, elastic distortion, Gaussian blur, and weighted sampling with 5× oversampling for Latin lines). Each model was trained for 5 epochs with batch size 20. At inference time, test-time augmentation (TTA) is applied by running each model on three versions of every line image (original, slight rotation, and brightness perturbation), creating effectively 15 virtual models that vote via majority. Beam search with 4 beams and length penalty 2.0 is used for generation.
The second component is a custom model with a 1000-token byte-level BPE tokenizer trained on the competition corpus, paired with the same pre-trained ViT-large encoder and a fresh 6-layer BERT decoder. The motivation was that medieval Unicode characters (⁊, ꝑ, ꝯ, combining marks) are fragmented into multiple tokens by the generic 50K-vocabulary RoBERTa tokenizer, while a domain-specific BPE encodes them more efficiently. The model was trained in two phases: 5 epochs with the encoder frozen (LR = 1 × 10⁻⁴), followed by 10 epochs of full fine-tuning (LR = 5 × 10⁻⁶). The third component is the publicly available CATMuS Medieval kraken model, which uses a CNN+LSTM+CTC architecture and was used directly without fine-tuning.
The three model outputs are combined via a guarded ROVER ensemble that performs character-level edit-distance alignment and per-position majority voting, with safety checks. Merges are only applied when the source predictions are within 20% edit distance of each other; merges that grow output length beyond max input + 2 characters or change word count by more than 1 are rejected, and the primary model’s prediction is used instead. This guarding prevents the “Frankenstein merge” failure mode, where ROVER combines dissimilar predictions into garbled output.
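A sketch of the guarding logic is shown below, under the assumption that edit distances are measured against the primary model's hypothesis and that the thresholds are applied as stated above.

```python
def edit_distance(a: str, b: str) -> int:
    # Plain Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def guarded_merge(primary: str, others: list[str], merged: str) -> str:
    """Accept the ROVER-merged hypothesis only if it passes the safety checks
    described above; otherwise fall back to the primary model's prediction.
    (A sketch of the guarding logic, not the team's implementation.)"""
    # 1. Sources must be within 20% relative edit distance of the primary prediction.
    for hyp in others:
        if edit_distance(primary, hyp) > 0.2 * max(len(primary), len(hyp), 1):
            return primary
    # 2. The merge must not grow beyond the longest input + 2 characters.
    if len(merged) > max(len(h) for h in [primary, *others]) + 2:
        return primary
    # 3. The merge must not change the word count by more than 1.
    if abs(len(merged.split()) - len(primary.split())) > 1:
        return primary
    return merged
```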
Teodor Bors (NextMamba-OCR)
The final system is a non-autoregressive multilingual HTR pipeline implemented and tuned specifically for medieval line transcription. It follows a monotonic alignment strategy (CTC) and avoids autoregressive text generation at decoding time. Architecturally, a ConvNeXt-V2-Tiny visual encoder (optionally FastViT) is followed by a bidirectional Mamba-2 sequence module and a linear character classifier. Given a line image x ∈ ℝ^(C×H×W), the input is normalized to a fixed height (typically H = 64), converted to grayscale, and passed through optional augmentation and script-preserving Unicode preprocessing before the 2D features are mapped to a temporal sequence [B, L, D]. A low effective stride (convnext_num_stages=2) is kept to maintain enough CTC frames for long medieval lines, and the sequence is then processed with forward/backward Mamba-2 streams whose outputs are fused before log-softmax.
Training uses CTC with average_frames=True, which reduces loss bias toward wider lines and improves stability across manuscripts with highly variable line lengths. Mixed precision (bf16), gradient accumulation, warmup, and validation-driven scheduling / early stopping are used. Inference is available in three equivalent deployment modes: native PyTorch with Mamba kernels, a pure-PyTorch Mamba fallback (for CPU/AMD or environments without Triton/CUDA kernels), and ONNX export for portable runtimes. The pure-PyTorch Mamba path and the ONNX path reproduce the same transcripts as the main model when preprocessing is aligned.
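One way to read average_frames=True is sketched below (an interpretation, not the author's code): the per-line CTC loss is normalized by the number of input frames before batch averaging, so wide lines do not dominate the gradient.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, reduction="none", zero_infinity=True)

def ctc_loss_average_frames(log_probs, targets, input_lengths, target_lengths):
    """log_probs: (T, B, V). Each per-sample loss is divided by its number of
    CTC frames (input length) rather than its target length, then batch-averaged."""
    per_sample = ctc(log_probs, targets, input_lengths, target_lengths)   # (B,)
    return (per_sample / input_lengths.float()).mean()
```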
In parallel to the main system, three LLM-based OCR systems were also tested and submitted (LightOnOCR-2, GLM-OCR, and Chandra OCR 2). Despite competitive behavior in some cases, the generative OCR approaches were observed to be more prone to hallucination-like insertions and substitutions on difficult medieval lines, and less stable under cross-domain manuscript variation. For this reason, the CTC-based implementation was kept as the primary system: it is more controllable, strictly monotonic, and easier to calibrate for paleographic fidelity.
Qianfan-OCR
The system is based on Qianfan-OCR, a 4B-parameter end-to-end document intelligence model developed by the Baidu Qianfan Team. The model adopts a vision-language architecture consisting of three components: a vision encoder (Qianfan-ViT) that dynamically tiles input images into 448 × 448 patches and supports up to 4K resolution, a lightweight two-layer MLP adapter for cross-modal alignment, and a Qwen3-4B language model backbone for text generation and reasoning. Unlike traditional multi-stage OCR pipelines, Qianfan-OCR performs direct image-to-Markdown conversion and supports prompt-driven tasks including structured document parsing, table extraction, chart understanding, and key information extraction, all within a single model. It also features a Layout-as-Thought mechanism, an optional thinking phase triggered by <think> tokens, in which the model generates bounding boxes, element types, and reading order before producing final outputs, which improves accuracy on documents with complex layouts. The base model was trained using a four-stage progressive recipe: cross-modal alignment, foundational OCR training, domain-specific enhancement, and instruction tuning.
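The dynamic tiling can be pictured with a toy calculation; the actual Qianfan-ViT tiling and resizing policy is not detailed here, so the scaling rule below is only an assumption.

```python
import math

def tile_grid(width: int, height: int, tile: int = 448, max_side: int = 4096):
    """Rough illustration of dynamic tiling: downscale so the longest side fits
    the supported resolution, then split into tile x tile patches."""
    scale = min(1.0, max_side / max(width, height))
    w, h = round(width * scale), round(height * scale)
    return math.ceil(h / tile), math.ceil(w / tile), (w, h)

# e.g. a 3000 x 2000 px scan fits under 4K and yields a 5 x 7 grid of 448 px tiles
print(tile_grid(3000, 2000))
```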
For this competition, the model was further fine-tuned at the instruction-tuning stage by incorporating multilingual historical document data. The training data consists of three sources: (1) the official competition training data, (2) open-source OCR training datasets related to historical documents, and (3) synthetic OCR training data, produced by rendering document images from historical text corpora. This combination enhanced the model’s ability to recognize and parse ancient texts across different languages and scripts.
For more details on the base model, see the technical report: https://arxiv.org/abs/2603.13398.
STUDIUM.AI
For Task 1, dedicated to multilingual recognition, a TrOCR model is fine-tuned on the provided multilingual training dataset. To capture language-specific characteristics, eight LoRA-based HTR adapters are trained, each specialized on the subset of the training data corresponding to one of the target languages. During training, a lightweight gating network is applied on top of the encoder representations to predict a distribution over the languages, with a small bias added to encourage the model to leverage prior knowledge about the true language of each line. This predicted distribution is then used to compute a weighted mixture of the LoRA adapters, producing a dynamically constructed adapter that reflects not only the language of the input line image but also the linguistic similarities between that language and the other seven languages in the training data. By training this mechanism end-to-end, the model simultaneously learns to recognize text and infer the language of the input line, allowing it to generate the appropriate mixture weights at inference time when explicit language labels are unavailable.
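A schematic PyTorch sketch of the gated adapter mixture is given below; the placement of the adapters on a single linear projection, the hidden size, and the pooling used by the gate are assumptions rather than details from the team's description.

```python
import torch
import torch.nn as nn

class LoRAMixture(nn.Module):
    """Sketch of a gated LoRA mixture: a gating network predicts a distribution
    over the eight language adapters from encoder features, and the weighted
    combination of their low-rank updates is applied to a base projection."""
    def __init__(self, d_model=1024, rank=8, n_langs=8):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        # One (A, B) low-rank pair per language adapter; B starts at zero so the
        # mixture initially leaves the base projection unchanged.
        self.A = nn.Parameter(torch.randn(n_langs, rank, d_model) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_langs, d_model, rank))
        self.gate = nn.Linear(d_model, n_langs)

    def forward(self, encoder_feats):
        # encoder_feats: (B, T, d_model); pool over time to predict the language mix.
        lang_weights = self.gate(encoder_feats.mean(dim=1)).softmax(-1)   # (B, n_langs)
        # Weighted mixture of per-language low-rank updates: (B, d_out, d_in).
        delta = torch.einsum("bl,lrd,ler->bed", lang_weights, self.A, self.B)
        out = self.base(encoder_feats) + torch.einsum("btd,bed->bte", encoder_feats, delta)
        return out, lang_weights
```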
Devansh Gupta
The system is built on Qwen 3.5-0.8B, a compact 0.8-billion-parameter vision-language model, fine-tuned via a three-phase training pipeline on a single NVIDIA A100 80 GB GPU. The training data is the CMMHWR 2026 dataset (ALTO v4 XML annotations with manuscript page images) spanning eight medieval languages: Castilian, Catalan, French, Galician, Italian, Latin, Navarrese, and Venetian. During data preparation, ALTO polygons are parsed, each text line is cropped with polygon masking (outside regions filled with white), aspect-preserving resize is applied to a fixed 32 × 384 pixel canvas, and CLAHE contrast enhancement is used on the L-channel of LAB colour space to normalise parchment contrast without altering the original ink hue. Low-resource languages (Catalan, Galician, Navarrese, Venetian) are augmented with 2-line and 3-line concatenated samples to improve coverage.
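The contrast-normalization step can be sketched with OpenCV as follows; the CLAHE clip limit and tile grid size are assumed, as they are not given in the description.

```python
import cv2
import numpy as np

def normalise_contrast(line_bgr: np.ndarray) -> np.ndarray:
    """Apply CLAHE to the L channel of the LAB colour space, leaving the colour
    channels (and thus the ink hue) untouched."""
    lab = cv2.cvtColor(line_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```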
Phase 1 (Supervised Fine-Tuning). Eight independent language-specific LoRA adapters (rank 16, alpha 16) are trained using Unsloth’s SFTTrainer with the UnslothVisionDataCollator. Both vision and language layers are fine-tuned, enabling the ViT encoder to adapt to ink degradation, parchment texture, and scribal variation, while the decoder learns medieval orthography and abbreviation conventions. Each adapter is trained for 5 epochs with sequence packing (multiple short OCR lines packed into a 2048-token window), NEFTune noise injection (α = 5.0) for regularisation, and cosine learning-rate scheduling (LR = 2 × 10⁻⁴).
Phase 2 (Adapter Merging). The eight per-language adapters are consolidated into a single universal adapter via TIES-Merging (Yadav et al., 2023): small-magnitude weight deltas are trimmed (bottom 20%), majority sign election resolves cross-language conflicts, and only agreeing parameters are summed and normalised. The merged adapter is further refined with 2 epochs of mixed-language fine-tuning (LR = 1 × 10⁻⁴) on all languages simultaneously to recover any performance loss.
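A toy per-tensor sketch of the TIES-Merging step (trim, sign election, disjoint mean) is shown below; it follows the description above rather than the team's implementation, and the sign-election rule is simplified.

```python
import torch

def ties_merge(deltas: list[torch.Tensor], trim_frac: float = 0.2) -> torch.Tensor:
    """Merge per-language weight deltas (adapter - base): trim small-magnitude
    entries, elect a majority sign per parameter, and average only agreeing deltas."""
    stacked = torch.stack(deltas)                           # (n_adapters, ...)
    # 1. Trim: zero the bottom `trim_frac` of entries by magnitude in each delta.
    k = int(trim_frac * stacked[0].numel())
    for d in stacked:
        if k > 0:
            thresh = d.abs().flatten().kthvalue(k).values
            d[d.abs() < thresh] = 0.0
    # 2. Sign election: majority sign by summed mass per parameter.
    elected = torch.sign(stacked.sum(dim=0))
    # 3. Merge: mean of the entries whose sign agrees with the elected sign.
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    count = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / count
```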
Phase 3 (Reinforcement Learning) & Inference. GRPO-based reinforcement learning is applied to improve performance on failure modes such as minim-character confusions (i/u/n/m), hallucination loops, and word-count mismatches. The Phase 2 adapter is merged into base weights, and a new LoRA (r = 16, α = 32) is trained with vision layers frozen. Training uses a held-out set (~9,280 lines) with a six-component reward function: CER (2.0), positional word accuracy (1.5), word-count bonus (1.0), anti-hallucination penalty (1.0, hard −2.0), length-mismatch penalty (0.5), and minim-confusion bonus (0.3). For inference, greedy decoding is used across all learned language tokens, and predictions are ensembled using character-level and word-level ROVER voting, shortest-output selection, median-length selection, and full-string mode voting, producing task-specific submissions.
flame_cai (FLAME University)
The team fine-tunes the catmus-medieval 1.5.0 model from the kraken OCR system on 80% of the training corpus. Each line image is converted to grayscale and resized to a fixed height of 80 pixels while preserving its aspect ratio. The pretrained model is adapted to the target character set by extending its output layer to accommodate any characters present in the new data but absent from the original training vocabulary, while preserving the weights learned during pretraining. The backbone is frozen during the initial five training epochs, allowing only the output layer to update before the entire network is trained end-to-end.
Training is conducted for up to 40 epochs with early stopping based on validation accuracy, requiring a minimum of 15 epochs before stopping is considered and a patience of 10 epochs without improvement. A learning rate of 5 × 10⁻⁵, a batch size of 48, and data augmentation are applied to improve generalization on the limited manuscript data. Model performance is assessed on the held-out validation set using Character Error Rate (CER), and the best-performing checkpoint is selected for downstream evaluation.
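The freeze-then-unfreeze schedule can be illustrated with a generic PyTorch sketch; this is not the team's actual kraken/ketos configuration, and the output-layer parameter name and optimizer choice are assumptions.

```python
import torch

def make_optimizer(model, lr=5e-5):
    # Collect only the currently trainable parameters.
    return torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)

def set_backbone_frozen(model, frozen: bool, head_name: str = "output"):
    """Freeze or unfreeze everything except the (extended) output layer;
    `head_name` is a hypothetical parameter prefix for the classifier."""
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(head_name) or not frozen

def train(model, train_loader, val_loader, max_epochs=40, warm_epochs=5):
    # Epochs 1-5: only the extended output layer updates; afterwards the whole
    # network is trained end-to-end, as described above.
    set_backbone_frozen(model, frozen=True)
    optimizer = make_optimizer(model)
    for epoch in range(1, max_epochs + 1):
        if epoch == warm_epochs + 1:
            set_backbone_frozen(model, frozen=False)
            optimizer = make_optimizer(model)   # re-collect now-trainable parameters
        for batch in train_loader:
            ...  # forward pass, CTC loss, backward, optimizer.step()
        # validation-CER early stopping (min 15 epochs, patience 10) omitted here
```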