Bengali Parallel Universal Dependency Treebank

Research output: Contribution to conference › Paper › Academic

Abstract

This paper presents a Parallel Universal Dependency (PUD) treebank for Bengali, an
Indo-Aryan language. The treebank consists of 1000 Bengali sentences created using a
parallel corpus of English-Bengali and Hindi-Bengali. The number of tokens reported for
the 200 manually annotated sentences is 2622. Both the English and the Hindi corpora
were taken from the Parallel Universal Dependency (PUD) repository, and the English
corpus was subsequently chosen as the source text. The corpus was then translated into
Bengali from scratch by the author, who is a native speaker of the language, and
thereafter annotated with universal part-of-speech tags, language-specific
part-of-speech tags, and at the syntactic level. The paper also presents a linguistic
analysis of the PUD treebank and concludes with the kappa score.
Original language: English
Publication status: Published - 11-Nov-2021
Externally published: Yes
Event: Widening Natural Language Processing (WiNLP) - Hybrid, Punta Cana, Dominican Republic
Duration: 11-Nov-2021 → 11-Nov-2021

Workshop

Workshop: Widening Natural Language Processing (WiNLP)
Country/Territory: Dominican Republic
City: Punta Cana
Period: 11/11/2021 → 11/11/2021
