The Multilingual Amazon Reviews Corpus

We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., 'books', 'appliances'). The corpus is balanced across the five possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on reviews data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.


Introduction
Text classification is one of the fundamental tasks in natural language processing, and research in this area has been accelerated by the abundance of corpora across different domains (e.g., Twitter sentiment (Pak and Paroubek, 2010), movie ratings (Maas et al., 2011), textual entailment (Bowman et al., 2015), restaurant reviews (Yelp Inc., 2019), among many others).
As with all other areas in NLP, progress in multilingual research relies on the availability of high-quality data. However, large-scale multilingual text classification datasets are surprisingly rare, and existing multilingual datasets have some notable deficiencies.
The proprietary Reuters RCV1 (Lewis et al., 2004) and RCV2 (Reuters Ltd., 2005) corpora and their derivatives, such as MLDoc (Schwenk and Li, 2018), are relatively small; in RCV2, each language has ∼37,000 training examples on average, and the smallest language has only 1,794 examples. RCV1 and RCV2 are also not easily accessible; a researcher who wishes to acquire the data must work with an organization that has obtained legal approval from Reuters Ltd.
The XNLI dataset (Conneau et al., 2018) provides training data only in English; its non-English portions are intended for evaluating cross-lingual transfer. The Yelp corpus (Yelp Inc., 2019) contains reviews from international marketplaces, but the reviews from each marketplace can be written in multiple languages and the language identity is not provided. Furthermore, the Yelp corpus itself is refreshed from time to time, and previous versions are not made available for download, which affects the reproducibility of published results.
Several versions of the Amazon reviews corpus exist today. Neither the version from Ni et al. (2019) nor the one from Amazon Inc. (2015) provides training, development, and test splits, and neither version focuses on the multilingual aspect of the reviews. Prettenhofer and Stein (2010) provide Amazon reviews in 4 languages (i.e., 2,000 training and test reviews, along with a variable number of unlabeled reviews), but the dataset is small by modern standards.
We address many of the above-mentioned limitations by releasing a subset of Amazon reviews specifically tailored for the task of multilingual text classification:
• We provide 200,000 reviews in the training set for each of the languages in the corpus.
• We apply language detection algorithms to ensure reviews are associated with the correct language with high probability.
• We distribute the corpus on AWS Open Datasets for easy access by any research group for non-commercial purposes.
• Unlike previous Amazon reviews datasets, we split the data into clearly defined training, development, and test sets.
The dataset description, code snippets, and license agreement can be retrieved at https://docs.opendata.aws/amazon-reviews-ml/readme.html.
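As a concrete illustration, each record can be parsed as a JSON Lines entry. The field names in this sketch (e.g., `review_body`, `stars`) are illustrative assumptions; the authoritative schema is given in the dataset documentation.

```python
import json

# A hypothetical record in the corpus; the exact field names are an
# assumption for illustration, not confirmed by the paper.
sample = json.dumps({
    "review_body": "Great product, works as described.",
    "review_title": "Five stars",
    "stars": 5,
    "reviewer_id": "reviewer_000001",
    "product_id": "product_000042",
    "product_category": "home",
    "language": "en",
})

def parse_records(lines):
    """Parse JSON Lines records into dictionaries."""
    return [json.loads(line) for line in lines]

records = parse_records([sample])
```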

Data Preparation

Inclusion Criteria
We gathered the reviews from the marketplaces in the US, Japan, Germany, France, Spain, and China for the English, Japanese, German, French, Spanish, and Chinese languages, respectively. We considered reviews that were submitted between November 1, 2015 and November 1, 2019. Only reviews with verified purchases were included.
We took no more than 20 reviews from the same product and no more than 20 reviews from the same reviewer. Only products with at least 2 reviews were included in the dataset, and each review had to be at least 20 characters long.
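The inclusion criteria above can be sketched as a filtering pass. The field names (`product_id`, `reviewer_id`, `body`) and the order in which the rules are applied are assumptions for illustration; the paper does not specify the implementation.

```python
from collections import Counter

MAX_PER_PRODUCT = 20
MAX_PER_REVIEWER = 20
MIN_CHARS = 20
MIN_REVIEWS_PER_PRODUCT = 2

def apply_inclusion_criteria(reviews):
    """Filter candidate reviews according to the inclusion rules.
    Each review is a dict with 'product_id', 'reviewer_id', and
    'body' keys (illustrative field names)."""
    # Minimum-length filter
    reviews = [r for r in reviews if len(r["body"]) >= MIN_CHARS]
    # Keep only products with at least 2 reviews
    counts = Counter(r["product_id"] for r in reviews)
    reviews = [r for r in reviews
               if counts[r["product_id"]] >= MIN_REVIEWS_PER_PRODUCT]
    # Cap reviews per product and per reviewer
    kept, per_product, per_reviewer = [], Counter(), Counter()
    for r in reviews:
        if (per_product[r["product_id"]] < MAX_PER_PRODUCT
                and per_reviewer[r["reviewer_id"]] < MAX_PER_REVIEWER):
            kept.append(r)
            per_product[r["product_id"]] += 1
            per_reviewer[r["reviewer_id"]] += 1
    return kept
```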

Data Processing
The language of a review does not necessarily match the language of its marketplace (e.g., reviews from Amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm (Bojanowski et al., 2017) to determine the language of the review text. Only reviews written in the target language were retained. Based on a manual review of 200 randomly selected reviews per language, we observed 0, 0, 0, 0, 1, and 0 incorrectly classified reviews for English, Japanese, German, French, Spanish, and Chinese, respectively. At a score threshold of 0.8, the language filter removed 4.9%, 0.2%, 1.2%, 2.4%, 3.8%, and 5.3% of the English, Japanese, German, French, Spanish, and Chinese candidate reviews, respectively.
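The language filter described above can be sketched as follows. The `predict` callable is a placeholder for any language detector that returns a (language, score) pair; in practice it could wrap a fastText language-identification model, but the stub below is purely illustrative. The 0.8 threshold matches the score threshold reported above.

```python
def keep_in_language(predict, text, target_lang, threshold=0.8):
    """Return True if the detector assigns `text` to `target_lang`
    with a confidence score of at least `threshold`."""
    lang, score = predict(text)
    return lang == target_lang and score >= threshold

# A toy stub standing in for a real language-identification model.
def stub_detect(text):
    return ("de", 0.95) if "und" in text else ("en", 0.99)
```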
We also applied a vocabulary-based filter on the reviews: if a review contains a token that does not occur in at least 20 other reviews, the review is excluded from the dataset. We used Jieba for Chinese and KyTea for Japanese word segmentation. The segmenters were only used during the filtering process; the text provided in the dataset is not segmented or tokenized.

[Table 2: mBERT classification mean absolute error (MAE×100). The 'fine-grained' classification task predicts the 5-star rating, whereas the 'binarized' task predicts whether the review is negative (i.e., 1-2 stars) or positive (i.e., 4-5 stars). Unless otherwise stated, we use the review body, review title, and product category as mBERT inputs. Panel (d): amount of source-language training data versus same-language and zero-shot transfer performance (fine-grained, MAE×100); the training data comes from the English portion of the corpus only.]
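The vocabulary-based filter can be sketched with document frequencies. This is an illustrative reconstruction, not the authors' implementation; the `min_other` parameter corresponds to the threshold of 20 other reviews stated above.

```python
from collections import Counter

def vocabulary_filter(tokenized_reviews, min_other=20):
    """Drop any review containing a token that appears in fewer than
    `min_other` *other* reviews. `tokenized_reviews` is a list of
    token lists (Chinese and Japanese text would be segmented first)."""
    # Document frequency: number of reviews each token appears in
    df = Counter()
    for tokens in tokenized_reviews:
        df.update(set(tokens))
    kept = []
    for tokens in tokenized_reviews:
        # Subtract 1 so the review itself is not counted
        if all(df[t] - 1 >= min_other for t in set(tokens)):
            kept.append(tokens)
    return kept
```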
We truncated all reviews at 2,000 characters and removed newlines and tabs from the body of the review.
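A minimal sketch of this cleaning step is shown below. The paper does not specify whether newlines and tabs are deleted outright or replaced by spaces, nor the order of the two operations; this sketch replaces them with single spaces before truncating.

```python
MAX_CHARS = 2000

def clean_body(text, max_chars=MAX_CHARS):
    """Replace newlines and tabs with spaces (an assumption; the paper
    only says they were removed), then truncate to `max_chars`."""
    text = text.replace("\n", " ").replace("\t", " ")
    return text[:max_chars]
```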
Some Amazon reviews contain HTML markup. We used Lynx (https://lynx.invisible-island.net) to render the reviews as UTF-8 plain text.
Product and reviewer IDs were anonymized by mapping each ID to a unique randomly generated integer.
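The anonymization step can be sketched as a one-to-one mapping from IDs to random integers. Sampling without replacement guarantees uniqueness; the integer range and the fixed seed are arbitrary choices for this sketch.

```python
import random

def anonymize_ids(ids, seed=0):
    """Map each distinct ID to a unique randomly generated integer.
    Sampling without replacement ensures no two IDs collide."""
    unique = sorted(set(ids))
    rng = random.Random(seed)
    codes = rng.sample(range(10**9), len(unique))
    return dict(zip(unique, codes))
```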
We provide the product category labels for 30 common product types; all other product categories are mapped to 'other'.

Corpus Characteristics
Amazon product ratings are given on a 5-star scale. To avoid any class imbalance issues in the dataset, we downsampled the reviews to ensure that each star rating constituted exactly 20% of the corpus. We provide 200,000, 5,000, and 5,000 reviews for the training, development, and test sets, respectively.
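The balanced downsampling can be sketched as drawing the same number of reviews per star rating. The grouping key and the use of random sampling are assumptions for illustration.

```python
import random
from collections import defaultdict

def balance_by_rating(reviews, per_class, seed=0):
    """Downsample so that every star rating (1-5) contributes exactly
    `per_class` reviews. Each review is a dict with a 'stars' key."""
    by_rating = defaultdict(list)
    for r in reviews:
        by_rating[r["stars"]].append(r)
    rng = random.Random(seed)
    balanced = []
    for stars in range(1, 6):
        balanced.extend(rng.sample(by_rating[stars], per_class))
    return balanced
```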
In Table 1, we compile some of the important statistics for the corpus. The number of unique products and reviewers is broadly similar across different languages.
In Figure 2, we show the distribution of product categories for each language. There is substantial variation in the distribution of product categories by language. Chinese reviews, most notably, are heavily skewed towards books.

Baseline Results
In Table 2, we provide baseline mean absolute error (MAE) results for supervised and zero-shot multilingual text classification with our corpus, where MAE(y, ŷ) = (1/n) ∑_{i=1}^{n} |y_i − ŷ_i| and y_i, ŷ_i ∈ {1, 2, 3, 4, 5} are the true star rating and the predicted rating for the i-th review, respectively. All of our baseline models are initialized with the cased multilingual BERT (mBERT) base model (Devlin et al., 2019), which has 110M parameters. Note that the star ratings for each review are ordinal, and a 2-star prediction for a 5-star review should be penalized more heavily than a 4-star prediction for a 5-star review. However, previous work on Amazon reviews classification (e.g., Yang et al., 2016) used classification accuracy as the primary metric, which ignores the ordinal nature of the labels. We use MAE in our baselines as the primary metric instead. We also report the classification accuracy for completeness (Table 3), but we encourage the use of MAE in future work.
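The metric itself is straightforward to compute, as the sketch below shows. It also illustrates the motivation: a 2-star prediction for a 5-star review incurs a larger MAE than a 4-star prediction, whereas accuracy would treat both errors identically.

```python
def mae(y_true, y_pred):
    """Mean absolute error between true and predicted star ratings."""
    assert len(y_true) == len(y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```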

Experimental Setup
We predict the reviewer's rating using the text of the review (and possibly the product category) as the input. Following the procedure described in Devlin et al. (2019), we used the embedding of the CLS token for prediction. We fine-tuned the model for 15 epochs with the Adam optimizer using a constant learning rate of 8 × 10^-7. We used minibatches of 32 reviews. Each experiment required ∼10 hours to complete with a single GPU on an AWS p3.8xlarge instance with the MXNet GluonNLP framework. We truncated review bodies longer than 180 wordpieces.

Supervised Text Classification
In Table 2a, we report our MAE on the fully supervised classification task, where the languages of the training and evaluation data are the same (e.g., train on French reviews and test on French reviews). We distinguish between the 'fine-grained' classification task, where we predict on the 5-star scale, and the 'binarized' classification task, where we predict whether the reviewer gave 1 to 2 stars or 4 to 5 stars. For the binarized task, we drop the 3-star reviews in the training and evaluation data.
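The label preparation for the binarized task can be sketched as follows. The `stars` and `label` field names are illustrative assumptions.

```python
def binarize(reviews):
    """Map star ratings to binary labels for the binarized task:
    1-2 stars -> 0 (negative), 4-5 stars -> 1 (positive);
    3-star reviews are dropped entirely."""
    out = []
    for r in reviews:
        if r["stars"] in (1, 2):
            out.append({**r, "label": 0})
        elif r["stars"] in (4, 5):
            out.append({**r, "label": 1})
    return out
```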
We also distinguish between the case where the input is the body of the review alone and where the input is the review body combined with the review title and product category. In the latter case, we use mBERT for sentence pair classification, where the first 'sentence' is the review body and the second 'sentence' is the review title concatenated with the product category. The details for sentence pair classification can be found in Devlin et al. (2019).
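A minimal sketch of the sentence-pair input construction is shown below. The exact separator used between the title and the product category is not specified in the paper; a single space is an assumption here.

```python
def make_mbert_inputs(body, title, category):
    """Build the two 'sentences' for mBERT sentence-pair
    classification: the review body is the first segment, and the
    review title concatenated with the product category is the
    second. (Space separator is an assumption.)"""
    return body, f"{title} {category}"
```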

Zero-shot Text Classification
In Tables 2b and 2c, we report zero-shot cross-lingual transfer MAE for fine-grained and binarized classification, respectively, where we fine-tune mBERT on data from one source language only and test on a different target language. Keung et al. (2020) showed that using the source language development set to select the checkpoint can lead to significant variation in zero-shot transfer performance and recommended using the target development sets for checkpoint selection. Our results in Tables 2 and 3 follow their guidance: we use the target development set to select the model checkpoint for each language.
In Table 2d, we vary the amount of English training data used in mBERT fine-tuning and examine the change in English test and non-English zero-shot MAE. Increasing the amount of English training data is generally helpful, although there are clearly diminishing returns.

Conclusion
We present a curated subset of Amazon reviews specifically designed to aid research in multilingual text classification. To the best of our knowledge, this is the largest public benchmark dataset for the training and evaluation of multilingual text classification models. With this work, we systematically address various gaps that we identified in existing multilingual corpora: we apply careful sampling, filtering, and text processing to the documents to minimize noise in the dataset, and we provide a large number of samples for training models in six languages with well-defined training, development, and test splits. We discuss the data preparation steps, analyze the distribution of the important characteristics of the corpus, and present baseline results for supervised and zero-shot cross-lingual text classification. With these contributions, we hope that this corpus will be an important resource to the research community.