Abstract
This paper presents a Parallel Universal Dependency (PUD) treebank for the Indo-Aryan
language Bengali. The treebank consists of 1000 Bengali sentences created using
English-Bengali and Hindi-Bengali parallel corpora. The 200 manually annotated
sentences contain 2622 tokens. Both the English and the Hindi corpora
were taken from the Parallel Universal Dependency (PUD) repository, and the English
corpus was subsequently chosen as the source text. The corpus was then translated into
Bengali from scratch by the author, who is a native speaker of the language, and
thereafter annotated with universal part-of-speech tags, language-specific
part-of-speech tags, and at the syntactic level. The paper also presents a linguistic
analysis of the PUD treebank and concludes with the kappa score.
Original language | English |
---|---|
Status | Published - 11 Nov 2021 |
Externally published | Yes |
Event | Widening Natural Language Processing (WiNLP) - Hybrid, Punta Cana, Dominican Republic. Duration: 11 Nov 2021 → 11 Nov 2021 |
Workshop
Workshop | Widening Natural Language Processing (WiNLP) |
---|---|
Country/Region | Dominican Republic |
City | Punta Cana |
Period | 11/11/2021 → 11/11/2021 |