Transformer-based methods such as RoBERTa and GPT-3 have led to advances in natural language processing tasks such as question answering and commonsense reasoning. We study whether these models generalize by designing and conducting a rigorous scientific study. Using five common benchmarks, multiple controls, and statistical analysis, we find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup, and may, in fact, be susceptible to dataset bias. We also perform selective studies, including qualitative and consistency analyses, to gain deeper insight into the problem.

Author(s) : Mayank Kejriwal, Ke Shen

Keywords : language - models - generalize - evidence - find

