OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization

Andreas van Cranenburgh*, Gertjan van Noord

*Corresponding author for this work

Research output: Article, Academic, peer-reviewed

Abstract

We present OpenBoek: a corpus of 103k tokens of classic Dutch novels with annotated coreference and entities. The corpus has several properties that are challenging for current coreference models: long documents (fragments of 10k+ words each), domain-specific literary phenomena, and 19th-century Dutch spelling. Spelling normalization is added to the corpus as an additional annotation layer, using a data-driven, rule-based spelling normalization tool. Normalizations are added using meta-annotation, such that evaluation can be performed with annotations on the original texts without losing token alignment. This tool enables the application of parsing and coreference systems originally developed for modern Dutch. We evaluate parsing and coreference systems on the OpenBoek dataset and find that spelling normalization gives a substantial increase in performance. The OpenBoek corpus is available under an open license at https://andreasvc.github.io/openboek/
Original language: English
Pages (from-to): 235–251
Number of pages: 17
Journal: Computational Linguistics in the Netherlands Journal
Volume: 12
Status: Published - 22 Dec 2022
