Dataset: word2vec model trained on lemmatized French Wikipedia 2018



The base corpus used for training is a dump of the French Wikipedia taken on 20 October 2018. The corpus was then processed to remove, as far as possible, the MediaWiki syntax, links, and similar markup. This cleanup is not perfect, but it hopefully has little effect on the computed weights.
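
The description does not say which tool was used for this cleanup. As a rough illustration only, the kind of markup removal described above could be sketched as follows; the function name `strip_wiki_markup` and the regular expressions are assumptions made for this example, and they cover only simple cases (non-nested templates, plain and piped links, bold/italic quotes).

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Remove a few common MediaWiki constructs (illustrative, not exhaustive)."""
    # Drop simple, non-nested templates such as {{refnec}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold/italic markers (runs of two or more apostrophes).
    text = re.sub(r"'{2,}", "", text)
    # Collapse any whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()


sample = "'''Paris''' est la [[capitale]] de la [[France|République française]].{{refnec}}"
cleaned = strip_wiki_markup(sample)
print(cleaned)
```

Nested templates, tables, and reference tags are not handled by this sketch, which is consistent with the caveat that such cleanup is rarely perfect.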

The corpus was then POS-tagged and lemmatized with TreeTagger. The list of tags can be found here. During POS-tagging, each word is replaced with its lemma and written in the form `[lemma]_[tag]`. For instance, the sentence "Il a sauté dans sa voiture" is rewritten as a sequence of such tokens, one per word.
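
A minimal sketch of the `[lemma]_[tag]` format, assuming the tagger yields (word, tag, lemma) triples. The helper names are hypothetical, the lemmas are assumed, and the tag names below are generic placeholders, not the actual TreeTagger French tag set.

```python
def to_token(lemma: str, tag: str) -> str:
    """Join a lemma and its POS tag in the `lemma_tag` form."""
    return f"{lemma}_{tag}"


def format_tagged(triples) -> str:
    """Turn (word, tag, lemma) triples into a space-separated token string."""
    return " ".join(to_token(lemma, tag) for _word, tag, lemma in triples)


def split_token(token: str) -> tuple:
    """Split `lemma_tag` on the last underscore, so lemmas may contain '_'."""
    lemma, _sep, tag = token.rpartition("_")
    return lemma, tag


# Assumed tagger output for "Il a sauté dans sa voiture".
# Tags are illustrative placeholders, NOT the real TreeTagger tags.
tagged = [
    ("Il", "PRON", "il"),
    ("a", "AUX", "avoir"),
    ("sauté", "VERB", "sauter"),
    ("dans", "ADP", "dans"),
    ("sa", "DET", "son"),
    ("voiture", "NOUN", "voiture"),
]

line = format_tagged(tagged)
print(line)
```

When querying the trained model, words would presumably need to be given in this same `lemma_tag` form, using the real tags from the tag list.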

Date made available: 7-Jun-2019
Publisher: University of Groningen
Date of data production: 20-Oct-2018

Keywords on Datasets

  • French
  • word2vec
  • Tree Tagger
