A high-performance word recognition system for the biological fieldnotes of the Natuurkundige Commissie

Research output: Academic, peer-reviewed

In this research, a high word-recognition accuracy was achieved using an e-Science-friendly deep learning method on a highly multilingual data set. Deep learning requires large training sets. Therefore, in addition to the target data set, which is derived from the Natuurkundige Commissie collection (years 1820-1850), we use an auxiliary historical data set from another writer (van Oort). The method is a compact ensemble of Convolutional Bidirectional Long Short-Term Memory neural networks. A dual-state word-beam search, combined with an adequate label-coding scheme, is used to decode the output of the connectionist temporal classification (CTC) layer. Our approach increased the recognition accuracy for words that the recognizer has never seen, i.e., out-of-vocabulary (OOV) words, by 3.5 percentage points. The use of extraneous training data increased the performance on in-vocabulary words by 1 percentage point. The network architectures in an ensemble are generated randomly and autonomously, so that our system can be deployed on an e-Science server. The OOV capability allows scholars to search for words that did not exist in the original training set.
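
As a rough illustration of the decoding and ensemble steps mentioned in the abstract, the sketch below shows plain best-path (greedy) CTC decoding followed by a plurality vote over the word hypotheses of several networks. It is a deliberately simpler stand-in, not the dual-state word-beam search or the label-coding scheme used in the paper; the function names, the blank index, and the toy alphabet are assumptions made for this example only.

```python
from collections import Counter
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical convention)


def ctc_greedy_decode(frame_labels, alphabet):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(alphabet[label] for label in collapsed if label != BLANK)


def ensemble_vote(hypotheses):
    """Return the most frequent word hypothesis; ties go to the earliest one."""
    return Counter(hypotheses).most_common(1)[0][0]


# Toy alphabet: index 0 is the blank, then the characters 'o', 'r', 't'.
alphabet = "-ort"

# Per-frame argmax labels from three hypothetical ensemble members.
member_outputs = [
    [1, 1, 0, 1, 2, 3, 3],  # "oort" (the blank separates the two o's)
    [1, 0, 1, 2, 2, 3],     # "oort"
    [1, 1, 2, 3],           # "ort"  (no blank, so the repeated o collapses)
]

words = [ctc_greedy_decode(labels, alphabet) for labels in member_outputs]
print(words, "->", ensemble_vote(words))
```

The third member shows why the blank label matters: without a blank between repeated characters, a doubled letter collapses into one, and the vote across members can recover from such an error.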

Original language: English
Title (host publication): Collect and Connect: Archives and Collections in a Digital Age 2020
Editors: Andreas Weber, Maarten Heerlien, Eulàlia Gassó Miracle, Katherine Wolstencroft
Number of pages: 12
Status: Published - 13 Feb 2021
Event: 2020 International Conference Collect and Connect: Archives and Collections in a Digital Age, COLCO 2020 - Virtual, Leiden, Netherlands
Duration: 23 Nov 2020 to 24 Nov 2020

Publication series

Name: CEUR Workshop Proceedings
ISSN (print): 1613-0073


Conference: 2020 International Conference Collect and Connect: Archives and Collections in a Digital Age, COLCO 2020
City: Virtual, Leiden
