Human learning benefits from multi-modal inputs that often appear as rich semantics. This enables us to learn generalizable concepts from very limited visual examples. However, current few-shot learning (FSL) methods use numerical class labels to denote object classes, which do not provide rich semantic meaning about the learned concepts. In this work, we show that by using class-level language descriptions, which can be acquired with minimal annotation cost, we can improve FSL performance. We develop a Transformer-based forward and backward encoding mechanism to relate visual and semantic tokens, which can encode intricate relationships between the two modalities. Forcing the prototypes to retain semantic information about the class description acts as a regularizer on the visual features, improving their generalization to novel classes at inference. Furthermore, this strategy imposes a human prior on the learned representations, improving the generalization of the models. Our experiments on four datasets and ablation studies show the benefit of the proposed approach.
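At a high level, the forward and backward encoding can be pictured as cross-attention in both directions between visual tokens and the embedded words of a class description. The sketch below is a minimal NumPy illustration under assumed toy dimensions; the function and variable names are hypothetical and do not reflect the authors' actual implementation.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ values                           # (n_q, d)

rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(5, 16))    # e.g. features from a support image
semantic_tokens = rng.normal(size=(8, 16))  # e.g. embedded words of the description

# "Forward" direction: visual tokens attend to the class description,
# injecting semantic information into the visual representation.
visual_enriched = cross_attention(visual_tokens, semantic_tokens, semantic_tokens)

# "Backward" direction: semantic tokens attend back to the visual features,
# grounding the description in the image.
semantic_grounded = cross_attention(semantic_tokens, visual_tokens, visual_tokens)

# A class prototype can then be pooled from the semantically enriched tokens.
prototype = visual_enriched.mean(axis=0)
print(prototype.shape)  # (16,)
```

Because the prototype is pooled from tokens that attended to the description, it is encouraged to retain class-level semantic information, which is the regularization effect described above.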

Author(s) : Mohamed Afham, Salman Khan, Muhammad Haris Khan, Muzammal Naseer, Fahad Shahbaz Khan



