Dataset: word2vec model trained on lemmatized French Wikipedia 2018


Description

The base corpus used for training is a dump of the French Wikipedia taken on 20 October 2018. The corpus was then processed to remove, as far as possible, MediaWiki syntax, links, etc. This cleaning is not perfect, but hopefully has little effect on the computed weights.

The corpus was then POS-tagged and lemmatized with TreeTagger. The list of tags can be found here. During POS-tagging, each word is replaced by its lemma and annotated with its tag, in the form `[lemma]_[tag]`. For instance, the sentence "Il a sauté dans sa voiture" produces output along these lines:
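(Illustration only: the exact lemma and tag strings below assume TreeTagger's standard French parameter file and are not copied from the dataset documentation.)

`il_PRO:PER avoir_VER:pres sauter_VER:pper dans_PRP son_DET:POS voiture_NOM`

The trained vectors are therefore indexed by `[lemma]_[tag]` tokens rather than by surface forms. A minimal sketch of loading and querying such a model with gensim follows; the file name and the example tokens are assumptions, not taken from the dataset documentation:

```python
# Minimal sketch: querying a lemmatized French word2vec model with gensim.
# The file name "frwiki-20181020-lemmatized.bin" and the example lemma_tag
# tokens are assumptions, not taken from the dataset documentation.
from gensim.models import KeyedVectors

# Load vectors stored in the word2vec binary format
# (use binary=False if the model is distributed as plain text).
vectors = KeyedVectors.load_word2vec_format(
    "frwiki-20181020-lemmatized.bin", binary=True
)

# Tokens follow the [lemma]_[tag] convention, e.g. the noun "voiture".
print(vectors.most_similar("voiture_NOM", topn=5))

# Cosine similarity between two lemmatized tokens.
print(vectors.similarity("voiture_NOM", "camion_NOM"))
```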
Date made available: 7 Jun 2019
Publisher: University of Groningen
Date of data production: 20 Oct 2018