This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have become highly mature and robust, the recipes for ViT have yet to be built. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failures, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We hope that this work will provide useful data points and experience for future research.
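The MoCo v3 framework benchmarked above trains with a contrastive (InfoNCE) objective, where each image's two augmented views form a positive pair and other images in the batch serve as negatives. Below is a minimal NumPy sketch of that loss, not the authors' implementation: the function name, batch size, and feature dimension are illustrative, and the actual method additionally uses a momentum encoder and a prediction head.

```python
import numpy as np

def info_nce(q, k, tau=0.2):
    """Contrastive InfoNCE loss sketch.

    q, k: (N, D) feature batches from the two augmented views;
    row i of q and row i of k are a positive pair, all other
    rows act as negatives. tau is the softmax temperature.
    """
    # L2-normalize features so similarities are cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # log-softmax over each row; positives sit on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Matching views (q close to k row-wise) drive the diagonal similarities up and the loss toward zero, which is the signal the encoder is trained on.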

Author(s) : Xinlei Chen, Saining Xie, Kaiming He

Links : PDF - Abstract

Code :

Keywords : vit - results - supervised - training - transformers
