A high-performance word recognition system for the biological fieldnotes of the Natuurkundige Commissie

Research output: Academic, peer review


Abstract

In this research, high word-recognition accuracy was achieved using an e-Science-friendly deep learning method on a highly multilingual data set. Deep learning requires large training sets. Therefore, we use an auxiliary data set in addition to the target data set, which is derived from the Natuurkundige Commissie collection (1820-1850). The auxiliary historical data set is from another writer (van Oort). The method uses a compact ensemble of Convolutional Bidirectional Long Short-Term Memory neural networks. A dual-state word-beam search combined with an adequate label-coding scheme is used for decoding the connectionist temporal classification (CTC) layer. Our approach increased the recognition accuracy on out-of-vocabulary (OOV) words, i.e., words that the recognizer has never seen, by 3.5 percentage points. The use of extraneous training data increased the performance on in-vocabulary words by 1 percentage point. The network architectures in an ensemble are generated randomly and autonomously, so that our system can be deployed on an e-Science server. The OOV capability allows scholars to search for words that did not exist in the original training set.
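
To make the decoding step in the abstract concrete, here is a minimal sketch of how the output of a connectionist temporal classification (CTC) layer can be turned into a word and how several networks in an ensemble can be combined. It is not the authors' method: plain best-path (greedy) decoding stands in for the dual-state word-beam search, and a simple plurality vote stands in for whatever combination rule the paper uses. The function names, the blank index, and the toy alphabet and scores are all hypothetical.

```python
from collections import Counter
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank label


def ctc_best_path(frame_scores, alphabet):
    """Greedy CTC decoding: top label per frame, merge repeats, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    collapsed = [label for label, _ in groupby(best)]
    return "".join(alphabet[label] for label in collapsed if label != BLANK)


def ensemble_vote(hypotheses):
    """Plurality vote over word hypotheses; ties go to the earliest hypothesis."""
    counts = Counter(hypotheses)
    return max(hypotheses, key=lambda word: counts[word])


# Toy example: three labels (blank, "a", "b") scored over five time frames.
alphabet = ["-", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],    # "a"
    [0.1, 0.7, 0.2],    # "a" again, merged with the previous frame
    [0.9, 0.05, 0.05],  # blank separates the two "a"s
    [0.1, 0.8, 0.1],    # "a"
    [0.1, 0.1, 0.8],    # "b"
]
print(ctc_best_path(frames, alphabet))  # prints "aab"
```

The blank label is what lets CTC represent doubled letters: without the blank frame in the middle, the two runs of "a" would collapse into one. A lexicon-free decoder like this one can output words that are absent from the training vocabulary, which is the property the abstract refers to as OOV capability.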

Original language: English
Title: Collect and Connect: Archives and Collections in a Digital Age 2020
Editors: Andreas Weber, Maarten Heerlien, Eulàlia Gassó Miracle, Katherine Wolstencroft
Publisher: CEUR-WS.org
Pages: 92-103
Number of pages: 12
Status: Published - 13 Feb 2021
Event: 2020 International Conference Collect and Connect: Archives and Collections in a Digital Age, COLCO 2020 - Virtual, Leiden, Netherlands
Duration: 23 Nov 2020 to 24 Nov 2020

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR
Volume: 2810
ISSN (print): 1613-0073

Conference

Conference: 2020 International Conference Collect and Connect: Archives and Collections in a Digital Age, COLCO 2020
Country/Region: Netherlands
City: Virtual, Leiden
Period: 23 Nov 2020 to 24 Nov 2020
