Authors: Raphael Baena, Syrine Kalleli, Mathieu Aubry

Affiliation: ENPC Imagine

Description: We employ a transformer-based architecture that detects characters in parallel, ensuring fast and accurate predictions. For each character, it provides a bounding box and its likelihood, which are then used for Optical Character Recognition (OCR). Notably, this approach does not rely on any language prior.

We first pre-train the architecture on synthetic data consisting of text lines with characters from various fonts, using standard classification and bounding box positioning losses.
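The pre-training objective described above can be sketched as a per-character classification term plus a bounding box regression term. This is a minimal illustration, assuming a cross-entropy classification loss, an L1 box loss over (x, y, w, h) coordinates, and an illustrative loss weight; none of these exact choices are stated in the description.

```python
import math

def classification_loss(class_probs, target_idx):
    """Cross-entropy for one character prediction (probs already normalised)."""
    return -math.log(class_probs[target_idx])

def box_loss(pred_box, gt_box):
    """Mean absolute error over the 4 box coordinates (x, y, w, h)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / 4.0

def pretrain_loss(class_probs, target_idx, pred_box, gt_box, box_weight=1.0):
    """Combined synthetic pre-training loss for one detected character."""
    return (classification_loss(class_probs, target_idx)
            + box_weight * box_loss(pred_box, gt_box))

# Example: a confident, slightly mislocalised prediction of class 0.
loss = pretrain_loss([0.7, 0.2, 0.1], 0, (0, 0, 10, 12), (1, 1, 10, 12))
```

On synthetic data both terms are available because the generated text lines come with exact character boxes; on real data only the transcription survives, which is why fine-tuning switches to a CTC objective.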

We then fine-tune the architecture on real datasets. Unlike the synthetic data, these datasets do not include ground-truth bounding boxes, only text transcriptions, so we cannot use the same training losses as before. Instead, we use the pre-trained model to detect the characters' bounding boxes, order the characters according to these boxes, and compute the Connectionist Temporal Classification (CTC) loss. During fine-tuning, our approach demonstrates the ability to learn the bounding boxes of new characters.
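The ordering step above can be sketched as follows: sort detections by the horizontal centre of their boxes, then collapse repeats and blanks in CTC fashion. The (x, y, w, h) box format, the dictionary layout, and the blank symbol are illustrative assumptions; the authors' actual implementation scores a probability lattice with the CTC loss rather than greedily decoding.

```python
BLANK = "-"

def reading_order(detections):
    """Sort detections left to right by horizontal box centre (x + w / 2)."""
    return sorted(detections, key=lambda d: d["box"][0] + d["box"][2] / 2)

def ctc_collapse(labels, blank=BLANK):
    """Remove repeated labels, then blanks, as in CTC greedy decoding."""
    out, prev = [], None
    for lab in labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

detections = [
    {"box": (40, 0, 10, 12), "label": "b"},
    {"box": (5, 0, 10, 12), "label": "a"},
    {"box": (42, 0, 10, 12), "label": "b"},  # duplicate detection of 'b'
    {"box": (80, 0, 10, 12), "label": "c"},
]
ordered = [d["label"] for d in reading_order(detections)]
text = ctc_collapse(ordered)  # duplicate 'b' collapses away
```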

Authors: The HR-Ciphers 2024 organizers

Affiliation: Computer Vision Center

Description: A Long Short-Term Memory (LSTM) recurrent neural network model inspired by Baró et al., "Optical Music Recognition by Long Short-Term Memory Networks", GREC 2017.

Authors: Simon Corbillé, Elisa H Barney Smith

Affiliation: Machine Learning, Luleå Tekniska Universitet

Description: 1 - Data specification
The images are resized and padded to a fixed size in pixels (based on the mean height and width). The training data is randomly divided into a training set (80%) and a validation set (20%). During training, we apply affine augmentation to the training data.
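The padding and the 80/20 split described above can be sketched as follows. The fixed target size, the random seed, and the zero fill value are illustrative assumptions, and images are modelled as plain 2D lists rather than tensors.

```python
import random

def pad_to(img, height, width, fill=0):
    """Crop/pad a 2D pixel grid to exactly height x width."""
    rows = [row[:width] + [fill] * max(0, width - len(row)) for row in img[:height]]
    rows += [[fill] * width for _ in range(height - len(rows))]
    return rows

def train_val_split(items, val_frac=0.2, seed=0):
    """Random split into (train, val) with a fixed seed for reproducibility."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]

padded = pad_to([[1, 2], [3]], height=3, width=4)
train, val = train_val_split(list(range(100)))
```

In practice the affine augmentation (random rotation, shear, scaling) would be applied on the fly to the training images only, e.g. with a standard image augmentation library.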
We found empirically that training on a combination of the cipher datasets improves recognition performance. For task 2A, we train the model on a combination of the Borg and BNF datasets. For task 2B, we train on a combination of Borg, Copiale and BNF. For task 3A, we train on a combination of Copiale and BNF. For task 3B, we train on a combination of the Borg, Copiale and Ramanacoil datasets and consider only classes with more than 10 samples in the training set.

2 - Method
We use a Sequence-to-Sequence model, one of the state-of-the-art architectures for handwriting recognition. It is composed of an encoder, an attention component and a decoder. The encoder uses a CRNN architecture, composed of convolutional layers to extract spatial features and LSTM layers to extract temporal features. The attention module focuses the decoder on a specific part of the features extracted by the encoder to predict the output character by character.
The model is trained with a hybrid loss: CTC loss for the encoder and cross-entropy loss for the decoder.
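The hybrid objective can be sketched in pure Python: a CTC forward pass scores the encoder output against the transcription, and a per-step cross-entropy scores the decoder output; the two terms are summed. The equal weighting and the toy distributions are illustrative assumptions, not the authors' exact configuration.

```python
import math

def ctc_forward_prob(probs, target, blank=0):
    """Probability of `target` under CTC, via the forward algorithm.
    probs: list of T per-timestep distributions; target: label indices."""
    ext = [blank]
    for lab in target:                 # interleave labels with blanks
        ext += [lab, blank]
    S, T = len(ext), len(probs)
    alpha = [probs[0][blank]] + [0.0] * (S - 1)
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s > 0:
                a += alpha[s - 1]
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]      # skip over blank between distinct labels
            new[s] = a * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

def hybrid_loss(enc_probs, target, dec_probs, ce_weight=1.0):
    """CTC negative log-likelihood plus mean decoder cross-entropy."""
    ctc = -math.log(ctc_forward_prob(enc_probs, target))
    ce = -sum(math.log(p[lab]) for p, lab in zip(dec_probs, target)) / len(target)
    return ctc + ce_weight * ce
```

With uniform encoder distributions over three classes for two timesteps and the single target label 1, the three valid CTC alignments each have probability 1/9, giving a total of 1/3.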

3 - Results
We evaluate our model on the validation set with the Character Error Rate (CER) metric. In this case, a character can be a letter (A, B, C …), a symbol (Libra, Saturn …) or a letter with a diacritic. Note that the number of samples in the validation set is low, so the results should be interpreted with caution.
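The CER metric can be sketched as the Levenshtein edit distance between the predicted and reference sequences, divided by the reference length. The example tokens below are illustrative; since a "character" here may be a multi-letter symbol name, the function operates on any sequence of tokens, not just strings.

```python
def cer(reference, hypothesis):
    """Character Error Rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n] / m
```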

We obtain:
Task 2A: 7.23% CER
Task 2B: 0.75% CER
Task 3A: 1.55% CER
Task 3B: 4.12% CER

We can note:
- Symbols are clearly separated and the writing is of good quality.
- Task 2A contains images with a fold at the beginning or at the end of the line.
- The lines are not clearly segmented in task 3B and can contain the previous and/or the next line.
