Multimodal self-supervised learning is receiving increasing attention, as it allows networks to search and retrieve data across various modalities. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.
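The abstract's combination of instance-level contrastive learning with a clustering step can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the InfoNCE-style loss over paired video/text embeddings and the toy k-means pseudo-labeling (`info_nce`, `kmeans_labels`) are assumptions standing in for the paper's actual training pipeline.

```python
import numpy as np

def info_nce(v, t, tau=0.07):
    """Instance-level contrastive (InfoNCE-style) loss.

    v, t: L2-normalized (n, d) embeddings of paired video/text clips;
    row i of v and row i of t come from the same instance.
    """
    logits = v @ t.T / tau                        # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(v))
    return -np.log(probs[idx, idx]).mean()        # pull matched pairs together

def kmeans_labels(x, k, iters=10, seed=0):
    """Toy k-means over the joint embedding space, standing in for the
    multimodal clustering step that groups semantically similar instances."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels
```

In a full pipeline, the pseudo-labels from the clustering step would add a second loss term that pulls same-cluster samples together across modalities, on top of the per-instance contrastive term above.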

Author(s) : Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

Links : PDF - Abstract

Code :

Keywords : retrieval - learning - datasets - supervised - networks
