The Multilingual Amazon Reviews Corpus

The Multilingual Amazon Reviews Corpus (MARC) contains reviews in English, Japanese, German, French, Spanish, and Chinese. Each record contains the review text, the review title, the star rating, an anonymized reviewer ID, and the coarse-grained product category. The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on reviews data. For each language, there are 200,000
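
To illustrate why MAE can be more informative than accuracy on ordinal star ratings, the following is a minimal sketch in plain Python. The function names and the toy rating lists are hypothetical and are not drawn from the corpus or from the reported baselines; they only show that two sets of predictions with identical accuracy can differ sharply in MAE.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the gold rating."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute distance, in stars, between prediction and gold."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (illustrative only): one gold rating per star value.
gold = [1, 2, 3, 4, 5]
near_misses = [2, 2, 3, 4, 4]  # two errors, each off by one star
far_misses = [5, 2, 3, 4, 1]   # two errors, each off by four stars

for name, pred in [("near", near_misses), ("far", far_misses)]:
    acc = accuracy(gold, pred)
    mae = mean_absolute_error(gold, pred)
    print(f"{name}: accuracy={acc:.2f}, MAE={mae:.2f}")
```

Both prediction lists get three of five ratings exactly right, so accuracy cannot separate them, whereas MAE penalizes the four-star errors more heavily than the one-star errors.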

