Don't Use English Dev: On the Zero-Shot Cross-Lingual Evaluation of Contextual Embeddings

Multilingual contextual embeddings have demonstrated state-of-the-art performance in zero-shot cross-lingual transfer learning. However, English dev accuracy, the usual model-selection criterion in this setting, is often uncorrelated (or even anti-correlated) with target language accuracy, which makes zero-shot results hard to reproduce. These reproducibility issues are also present for other tasks with different pre-trained embeddings (e.g., MLQA with XLM-R). We recommend providing oracle scores alongside zero-shot results to make results more consistent by avoiding arbitrarily bad checkpoints.
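To make the checkpoint-selection issue concrete, the following is a minimal sketch rather than the authors' code. It assumes that an "oracle" score means choosing the checkpoint with target-language dev data, and every accuracy value in it is invented purely for illustration. The helper names `pearson` and `select_checkpoint` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select_checkpoint(dev_scores):
    """Return the index of the checkpoint with the highest dev score."""
    return max(range(len(dev_scores)), key=dev_scores.__getitem__)


# Illustrative, made-up accuracies for four checkpoints of one fine-tuning run.
english_dev = [0.90, 0.92, 0.94, 0.95]
target_dev = [0.80, 0.78, 0.74, 0.70]
target_test = [0.79, 0.77, 0.73, 0.69]

# Selection with English dev (the zero-shot setting).
english_pick = select_checkpoint(english_dev)
zero_shot_score = target_test[english_pick]

# Selection with target dev (assumed meaning of the oracle score).
oracle_pick = select_checkpoint(target_dev)
oracle_score = target_test[oracle_pick]

# Correlation between English dev and target dev across checkpoints.
r = pearson(english_dev, target_dev)

print(f"English-dev pick: checkpoint {english_pick}, target test {zero_shot_score:.2f}")
print(f"Oracle pick:      checkpoint {oracle_pick}, target test {oracle_score:.2f}")
print(f"Pearson r (English dev vs. target dev): {r:.2f}")
```

In this toy run the two dev sets are anti-correlated, so English dev selects the checkpoint that is worst on the target language. Reporting both numbers side by side shows how much of a zero-shot result depends on which checkpoint happened to be chosen.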

