Evaluation
Submissions are expected to be on the line level, i.e., one prediction for each line identifier in the test set.
Submissions will be evaluated using Character Error Rate (CER) and Word Error Rate (WER). For the first task, systems will be ranked by the unweighted average of per-language CER and WER. For the second and third tasks, system output will be evaluated on the corresponding single-language test set.
Metrics
Character Error Rate (CER) is computed at the codepoint level, i.e., each Unicode codepoint counts as one character. Predictions are normalized to NFD (Normalization Form Canonical Decomposition) before comparison to match our ground truth.
Word Error Rate (WER) is computed by splitting the text on whitespace.
Both metrics are computed as the edit distance between the prediction and the ground truth, normalized by the length of the ground truth. We use the torchmetrics implementations CharErrorRate and WordErrorRate.
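The official scores come from torchmetrics, but the definitions above are easy to reproduce. Below is a minimal dependency-free sketch that should agree with CharErrorRate/WordErrorRate on a single line (the official corpus-level scores normalize by the total ground-truth length across all lines, which coincides with this per-line form only for one pair):

```python
import unicodedata

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via the classic
    one-row dynamic-programming recurrence."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (r != h),    # substitution (free if characters match)
            )
    return dp[len(hyp)]

def cer(reference, prediction):
    # NFD-normalize so precomposed and decomposed accents compare equal,
    # matching the normalization applied to the ground truth.
    ref = unicodedata.normalize("NFD", reference)
    hyp = unicodedata.normalize("NFD", prediction)
    return edit_distance(ref, hyp) / len(ref)

def wer(reference, prediction):
    # Word-level edit distance over whitespace-split tokens.
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)
```

Note that CER operates on Unicode codepoints, so a decomposed accent counts as a separate codepoint after NFD normalization on both sides.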
Submission Format
Submissions must be JSON files containing a mapping from line identifiers to predicted text:
{
"line_001": "predicted text for line 1",
"line_002": "predicted text for line 2",
...
}
Line identifiers correspond to the ID attribute of TextLine elements in the
ALTO XML test set files.
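A submission file can be assembled by walking the test-set ALTO files and predicting one string per TextLine ID. The sketch below collects IDs in a namespace-agnostic way (ALTO versions use different namespace URIs); `recognize` is a hypothetical stand-in for your model's inference call:

```python
import json
import xml.etree.ElementTree as ET

def line_ids(alto_path):
    """Collect TextLine IDs from an ALTO XML file, ignoring the namespace."""
    root = ET.parse(alto_path).getroot()
    return [el.attrib["ID"] for el in root.iter() if el.tag.endswith("TextLine")]

def recognize(line_id):
    # Hypothetical placeholder: replace with your model's prediction
    # for the line image identified by line_id.
    return "predicted text"

def write_submission(alto_paths, out_path):
    """Write a {line_id: prediction} JSON file covering every TextLine."""
    predictions = {lid: recognize(lid)
                   for path in alto_paths
                   for lid in line_ids(path)}
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False, indent=2)
```

`ensure_ascii=False` keeps accented characters readable in the output file rather than escaping them.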
Evaluation Script
We provide an evaluation script that computes CER and WER for your predictions against ALTO XML files. You can use it while developing your submission to evaluate on your own validation set.
Installation
The script requires Python 3.10+ and the torchmetrics package:
pip install torchmetrics
Usage
python eval.py predictions.json ground_truth1.xml [ground_truth2.xml ...]
The script will output the CER and WER scores for all lines with matching identifiers between the predictions and ground truth files.
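The matching step is easy to sanity-check before running eval.py. The sketch below assumes the usual ALTO layout, where each TextLine holds String elements whose CONTENT attributes carry the words (joining them with single spaces, which ignores any explicit SP spacing elements); the function names are illustrative, not taken from the provided script:

```python
import json
import xml.etree.ElementTree as ET

def ground_truth_lines(alto_path):
    """Map TextLine IDs to their text, joining String CONTENT attributes."""
    root = ET.parse(alto_path).getroot()
    lines = {}
    for el in root.iter():
        if el.tag.endswith("TextLine"):
            words = [s.attrib["CONTENT"] for s in el if s.tag.endswith("String")]
            lines[el.attrib["ID"]] = " ".join(words)
    return lines

def matched_pairs(pred_path, alto_paths):
    """Return (prediction, ground_truth) pairs for IDs present in both,
    i.e., the lines the evaluation would actually score."""
    with open(pred_path, encoding="utf-8") as f:
        preds = json.load(f)
    gt = {}
    for path in alto_paths:
        gt.update(ground_truth_lines(path))
    return [(preds[i], gt[i]) for i in sorted(preds.keys() & gt.keys())]
```

Comparing `len(matched_pairs(...))` against the number of TextLine elements in the ground-truth files will reveal missing or misnamed line identifiers before scoring.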