Reference-free evaluation holds the promise of web-scale comparison of MT systems. We find that M-BERT and LASER perform poorly as semantic encoders for reference-free MT evaluation. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity-based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points. We conclude that the failure to punish “translationese”, i.e., low-quality literal translations, is a key problem for multilingual encoders.
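
The two remedies can be illustrated with a minimal sketch. It is not the implementation behind the reported results. It assumes sentence embeddings are already available as NumPy arrays, uses orthogonal Procrustes as a stand-in for the re-alignment step (the abstract does not specify the re-alignment method), and assumes the target-side language-model log-probability is supplied by the caller. The names `procrustes_realign`, `combined_score`, and `lm_weight` are hypothetical.

```python
import numpy as np


def procrustes_realign(src_vecs, tgt_vecs):
    """Return the orthogonal matrix W minimizing ||src_vecs @ W - tgt_vecs||_F.

    src_vecs and tgt_vecs are (n, d) arrays whose rows are embeddings of
    mutual translations (an assumed small parallel sample).
    """
    # Closed-form orthogonal Procrustes solution via SVD of the cross-covariance.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def combined_score(src_vec, hyp_vec, lm_logprob, w, lm_weight=0.1):
    """Couple cross-lingual similarity with a target-side fluency term.

    lm_logprob is a (length-normalized) log-probability of the hypothesis
    under some target-language model; lm_weight is an assumed hyperparameter.
    """
    similarity = cosine(src_vec @ w, hyp_vec)  # source mapped into target space
    return similarity + lm_weight * lm_logprob


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tgt = rng.normal(size=(50, 4))                 # toy "target" embeddings
    rot, _ = np.linalg.qr(rng.normal(size=(4, 4))) # toy misalignment (rotation)
    src = tgt @ rot.T                              # misaligned "source" embeddings
    w = procrustes_realign(src, tgt)
    print(combined_score(src[0], tgt[0], lm_logprob=-1.0, w=w))
```

In this sketch the fluency term lowers the score of a hypothesis that is semantically close to the source but unlikely under the target language model, which is the intuition behind penalizing translationese.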

