1. Text Classification

CLS The Cross-Lingual Sentiment (CLS) dataset (Prettenhofer and Stein, 2010) consists of Amazon reviews for three product categories (books, DVD, and music) in four languages: English, French, German, and Japanese. Each sample contains a review text and the associated rating from 1 to 5 stars. Following Blitzer et al. (2006) and Prettenhofer and Stein (2010), reviews rated 3 stars are removed. Reviews rated higher than 3 are labelled positive and those rated lower than 3 are labelled negative. There is one train set and one test set for each product category. Both are balanced, each containing around 1 000 positive and 1 000 negative reviews, for a total of 2 000 reviews per set. We take the French portion to create the binary text classification task in FLUE and report the accuracy on the test set.
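
The labelling rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the preprocessing code used for FLUE: the names `rating_to_label` and `accuracy` and the toy reviews are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def rating_to_label(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to a binary label (1 = positive, 0 = negative).

    Reviews rated exactly 3 stars are discarded, signalled here by None.
    """
    if stars == 3:
        return None
    return 1 if stars > 3 else 0


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy reviews (invented for illustration, not taken from the CLS dataset).
reviews: List[Tuple[str, int]] = [
    ("Un livre magnifique", 5),
    ("Décevant", 1),
    ("Correct, sans plus", 3),
    ("Très bon album", 4),
]

# Keep only reviews that receive a label, i.e. drop the 3-star ones.
labelled = [
    (text, rating_to_label(stars))
    for text, stars in reviews
    if rating_to_label(stars) is not None
]
```

With these toy reviews, the 3-star review is dropped and the remaining three receive the labels positive, negative, and positive.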

2. Paraphrasing

PAWS-X The Cross-lingual Adversarial Dataset for Paraphrase Identification, PAWS-X (Yang et al., 2019a), extends the English Paraphrase Adversaries from Word Scrambling dataset, PAWS (Zhang et al., 2019b), to six other languages: French, Spanish, German, Chinese, Japanese and Korean. PAWS consists of English paraphrase identification pairs from Wikipedia and Quora in which the two sentences of a pair have a high lexical overlap ratio; the pairs are generated by LM-based word scrambling and back translation, followed by human judgement. The paraphrasing task is to identify whether the two sentences of a pair are semantically equivalent. As in previous approaches to creating multilingual corpora, Yang et al. (2019a) used machine translation to create the training set for each target language in PAWS-X from the English training set of PAWS. The development and test sets for each language were translated by human translators. We take the French datasets to perform the paraphrasing task and report the accuracy on the test set.

3. Natural Language Inference

XNLI The Cross-lingual NLI (XNLI) corpus (Conneau et al., 2018) extends the development and test sets of the Multi-Genre Natural Language Inference corpus (Williams et al., 2018, MultiNLI) to 15 languages. The development and test sets for each language consist of 7 500 human-annotated examples, making up a total of 112 500 sentence pairs annotated with the labels entailment, contradiction, or neutral. Each sentence pair includes a premise (p) and a hypothesis (h). The Natural Language Inference (NLI) task, also known as recognizing textual entailment (RTE), is to determine whether p entails, contradicts, or neither entails nor contradicts h. We take the French part of the XNLI corpus to form the development and test sets for the NLI task in FLUE. The train set is the machine-translated French version provided in XNLI. Following Conneau et al. (2018), we report the test accuracy.
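
As a rough illustration of the data layout and of the corpus size quoted above, the following sketch represents one NLI example and recomputes the total number of pairs. The class `NLIExample`, the constant names, and the French sentence pair are hypothetical and are not part of the XNLI distribution.

```python
from dataclasses import dataclass

# The three labels used in XNLI.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NLIExample:
    """One premise/hypothesis pair with its gold label."""

    premise: str
    hypothesis: str
    label: str

    def __post_init__(self) -> None:
        # Reject anything outside the three-way label set.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


# Invented example, for illustration only.
example = NLIExample(
    premise="Le chat dort sur le canapé.",
    hypothesis="Un animal se repose.",
    label="entailment",
)

# Size check: 15 languages with 7 500 development and test examples each.
N_LANGUAGES = 15
EXAMPLES_PER_LANGUAGE = 7_500
TOTAL_PAIRS = N_LANGUAGES * EXAMPLES_PER_LANGUAGE  # 112 500
```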

4. Parsing

4.1 Constituency Parsing

Syntactic parsing consists in assigning a tree structure to a sentence in natural language. We perform parsing on the French Treebank (Abeillé et al., 2003), a collection of sentences extracted from the French daily newspaper Le Monde and manually annotated with both constituency and dependency syntactic trees and part-of-speech tags. Specifically, we use the version of the corpus instantiated for the SPMRL 2013 shared task and described by Seddah et al. (2013). This version comes with a standard split of 14 759 sentences for the training corpus, and 1 235 and 2 541 sentences for the development and evaluation sets, respectively.
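
To make "assigning a tree structure to a sentence" concrete, the sketch below reads a bracketed constituency tree into nested tuples and recovers the sentence from its leaves. It is a minimal illustration: the functions `parse_brackets` and `leaves` are hypothetical helpers, the example sentence is invented, and its node labels are illustrative rather than copied from the treebank.

```python
from typing import List, Tuple, Union

# A tree is (label, children); a child is either a subtree or a word.
Tree = Tuple[str, List[Union["Tree", str]]]


def parse_brackets(text: str) -> Tree:
    """Parse a bracketed tree such as '(NP (DET Le) (NC chat))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children: List[Union[Tree, str]] = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())  # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ')'
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the words of the sentence in left-to-right order."""
    words: List[str] = []
    for child in tree[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = parse_brackets("(SENT (NP (DET Le) (NC chat)) (VN (V dort)))")
sentence = leaves(tree)
```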

4.2 Dependency Parsing

Dependency parsing consists in extracting a dependency parse of a sentence, which defines the relationships between words based on their dependencies. The same French Treebank corpus as in Section 4.1 is used for the dependency parsing task, this time with the annotated dependency syntactic trees.
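
A dependency parse is commonly stored as one head index per word. The sketch below assumes that representation (1-based heads, with 0 marking the root) and checks that a head list forms a well-formed tree. The function `is_valid_tree` and the example sentence are hypothetical and only illustrate the data structure.

```python
from typing import List


def is_valid_tree(heads: List[int]) -> bool:
    """Check that heads encodes a single-rooted, acyclic dependency tree.

    heads[i] is the 1-based index of the head of word i + 1; 0 marks the root.
    """
    n = len(heads)
    # Exactly one word must attach to the artificial root.
    if sum(h == 0 for h in heads) != 1:
        return False
    # Every head index must point inside the sentence (or to the root).
    if any(h < 0 or h > n for h in heads):
        return False
    # Following heads from any word must reach the root without looping.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


# "Le chat dort": "Le" depends on "chat", "chat" on "dort", "dort" is the root.
words = ["Le", "chat", "dort"]
heads = [2, 3, 0]
```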

5. Word Sense Disambiguation

Word Sense Disambiguation (WSD) is a classification task which aims to predict the sense of words in a given context according to a specific sense inventory. We used two French WSD tasks: the FrenchSemEval task (Segonne et al., 2019), which targets verbs only, and a modified version of the French part of the Multilingual WSD task of SemEval 2013 (Navigli et al., 2013), which targets nouns.

5.1 Verb Sense Disambiguation

We ran sense disambiguation experiments on French verbs using FrenchSemEval (Segonne et al., 2019, FSE), an evaluation dataset in which verb occurrences were manually annotated with senses from the sense inventory of Wiktionary, a collaboratively edited open-source dictionary. FSE includes both the evaluation data and the sense inventory. The evaluation data consists of 3 199 manual annotations over a selection of 66 verbs, which amounts to roughly 50 sense-annotated occurrences per verb. The sense inventory provided in FSE is a Wiktionary dump (04-20-2018) openly available via Dbnary (Sérasset, 2012). For a given sense of a target key, the sense inventory offers a definition along with one or more examples. For this task, we used the examples of the sense inventory as training examples and tested our model on the evaluation dataset.
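
The idea of using the inventory examples as training data can be sketched as follows. The dictionary layout, the sense identifiers, and the example sentences are all invented for illustration; they do not reflect the actual FSE or Dbnary file format.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: verb lemma -> sense identifier -> example sentences.
toy_inventory: Dict[str, Dict[str, List[str]]] = {
    "adopter": {
        "adopter_sense_1": ["Ils ont adopté un enfant."],
        "adopter_sense_2": [
            "Le parlement a adopté la loi.",
            "Elle adopte une nouvelle méthode.",
        ],
    },
}


def training_pairs(
    inventory: Dict[str, Dict[str, List[str]]]
) -> List[Tuple[str, str, str]]:
    """Flatten the inventory into (lemma, example sentence, sense id) triples."""
    pairs = []
    for lemma, senses in inventory.items():
        for sense_id, examples in senses.items():
            for example in examples:
                pairs.append((lemma, example, sense_id))
    return pairs


pairs = training_pairs(toy_inventory)

# Average number of annotated occurrences per verb in the evaluation data.
avg_occurrences_per_verb = 3199 / 66
```

The last line recomputes the figure quoted in the text: 3 199 annotations over 66 verbs gives about 48.5 occurrences per verb, that is, roughly 50.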

5.2 Noun Sense Disambiguation

We propose a new challenging task for French WSD, based on the French part of the Multilingual WSD task of SemEval 2013 (Navigli et al., 2013), which targets nouns only. We adapted the task to use the WordNet 3.0 sense inventory (Miller, 1995) instead of BabelNet (Navigli and Ponzetto, 2010), by converting the sense keys to WordNet 3.0 when a mapping exists in BabelNet, and removing them otherwise. The result of the conversion process is an evaluation corpus of 306 sentences and 1 445 French nouns annotated with WordNet sense keys and manually verified. For the training data, we followed the method proposed by Hadj Salah (2018) and translated SemCor (Miller et al., 1993) and the WordNet Gloss Corpus into French, using the best English-French machine translation system of the fairseq toolkit (Ott et al., 2019). Finally, we aligned the WordNet sense annotations from the source English words to the translated French words, using the alignment provided by the MT system. We rely on WordNet sense keys instead of the original BabelNet annotations for two reasons. First, WordNet is a resource that is entirely manually verified and widely used in WSD research (Navigli, 2009). Second, there is already a large quantity of sense-annotated data based on the WordNet sense inventory (Vial et al., 2018) that we can use to train our system. We publicly release both our training data and the evaluation data in the UFSAC format (Vial et al., 2018).
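
The two data preparation steps described above, converting sense keys and projecting annotations through a word alignment, can be sketched as follows. This is a simplified illustration under stated assumptions, not the released pipeline: the function names, the BabelNet identifiers, the mapping, and the alignment are hypothetical, and the sense keys only imitate the WordNet key format.

```python
from typing import Dict, List, Optional, Tuple


def convert_keys(
    annotations: List[Tuple[str, str]],
    babelnet_to_wordnet: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Replace BabelNet ids by WordNet keys; drop annotations without a mapping."""
    return [
        (word, babelnet_to_wordnet[bn_id])
        for word, bn_id in annotations
        if bn_id in babelnet_to_wordnet
    ]


def project_senses(
    source_senses: Dict[int, str],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy sense keys from source word positions to aligned target positions.

    alignment holds (source index, target index) pairs, both 0-based.
    Target words with no annotated aligned source word stay None.
    """
    target: List[Optional[str]] = [None] * target_length
    for src, tgt in alignment:
        if src in source_senses and target[tgt] is None:
            target[tgt] = source_senses[src]
    return target


# Step 1: key conversion with a made-up mapping (one id has no WordNet key).
mapping = {"bn:placeholder_1": "chat%1:05:00::"}
evaluation = [("chat", "bn:placeholder_1"), ("canapé", "bn:placeholder_2")]
converted = convert_keys(evaluation, mapping)

# Step 2: projection from "the cat sleeps" to "le chat dort".
english_senses = {1: "cat%1:05:00::"}  # only "cat" is sense-annotated
alignment = [(0, 0), (1, 1), (2, 2)]
french_senses = project_senses(english_senses, alignment, target_length=3)
```

Here the annotation without a mapping is removed in the first step, and in the second step only "chat" inherits a sense key, since "cat" is the only annotated source word.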