Training vanilla vision transformers on vision tasks has been shown to yield sub-optimal results. Recent works improve performance by modifying the transformer structure, for example by incorporating convolutional layers. We instead propose a number of training techniques to alleviate this problem: additional loss functions that prevent loss of information across patch representations, and an additional patch-level classification loss that discriminates different patches under CutMix. We show that our proposed techniques stabilize training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on the ImageNet validation set without introducing extra teachers or additional convolution layers. Our code will be made publicly available at https://github.com/ChengyueGongR/PatchVisionTransformer.
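The abstract does not spell out the exact form of these loss functions. As a rough illustration only, the following PyTorch sketch shows one plausible diversity-style penalty on patch representations; the function patch_cosine_diversity_loss, its formulation, and the assumed tensor layout are illustrative assumptions, not the paper's actual losses.

    import torch
    import torch.nn.functional as F

    def patch_cosine_diversity_loss(patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, dim), e.g. the output of a
        # transformer block. Minimizing the returned scalar discourages
        # different patches from collapsing to similar representations.
        # NOTE: illustrative sketch, not the loss defined in the paper.
        z = F.normalize(patch_embeddings, dim=-1)         # unit-norm patch vectors
        sim = torch.bmm(z, z.transpose(1, 2))             # (B, N, N) pairwise cosine similarities
        n = sim.size(-1)
        off_diag = sim - torch.eye(n, device=sim.device)  # zero out self-similarity
        # mean absolute off-diagonal similarity per sample, averaged over the batch
        return off_diag.abs().sum(dim=(1, 2)).mean() / (n * (n - 1))

In training, a term like this would typically be added to the usual cross-entropy objective with a small weight, e.g. loss = ce_loss + 0.1 * patch_cosine_diversity_loss(features), where the weight 0.1 is an arbitrary placeholder.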

Author(s) : Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, Qiang Liu

Links : PDF - Abstract

Code :

https://github.com/ChengyueGongR/PatchVisionTransformer


