It is challenging to localize undefined segments in untrimmed and unsegmented videos, where exhaustively searching over all possible segments is intractable. To address this, we propose the HierArchical Multi-Modal EncodeR, a model that encodes a video at both the clip level and the fine-grained frame level, extracting information at different scales through multiple subtasks, namely video retrieval, segment temporal localization, and masked language modeling. We conduct extensive experiments on moment localization in video corpus using the ActivityNet Captions and TVR datasets. Our approach outperforms previous methods as well as strong baselines, establishing a new state of the art for this task, and can support applications such as video search, browsing, and navigation.
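The abstract describes encoding a video at two granularities: frames within clips, and clips within the whole video. Below is a minimal, hypothetical PyTorch sketch of that two-level idea. The class names, the mean-pooling of frames into clip summaries, and all dimensions are illustrative assumptions, not the authors' HAMMER implementation; heads for the video retrieval, segment temporal localization, and masked language modeling subtasks would be attached on top of the returned representations.

# Hypothetical sketch of a two-level (frame -> clip) video encoder.
import torch
import torch.nn as nn

class HierarchicalVideoEncoder(nn.Module):
    def __init__(self, frame_dim=2048, hidden_dim=512, num_heads=8, num_layers=2):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        frame_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        # Fine-grained encoder over the frames inside each clip.
        self.frame_encoder = nn.TransformerEncoder(frame_layer, num_layers=num_layers)
        clip_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        # Coarse-grained encoder over clip-level summaries of the video.
        self.clip_encoder = nn.TransformerEncoder(clip_layer, num_layers=num_layers)

    def forward(self, frames):
        # frames: (batch, num_clips, frames_per_clip, frame_dim)
        b, c, f, d = frames.shape
        x = self.frame_proj(frames.reshape(b * c, f, d))
        frame_repr = self.frame_encoder(x)                   # (b*c, f, hidden)
        clip_summary = frame_repr.mean(dim=1)                # pool frames into one vector per clip
        clip_repr = self.clip_encoder(clip_summary.reshape(b, c, -1))  # (b, c, hidden)
        return frame_repr.reshape(b, c, f, -1), clip_repr

if __name__ == "__main__":
    encoder = HierarchicalVideoEncoder()
    dummy_frames = torch.randn(2, 4, 16, 2048)               # 2 videos, 4 clips, 16 frames each
    frame_repr, clip_repr = encoder(dummy_frames)
    print(frame_repr.shape, clip_repr.shape)                 # fine- and coarse-grained features

In this sketch, frame-level outputs would feed a temporal localization head, while clip-level outputs would feed video retrieval; the exact pooling and head design are assumptions for illustration only.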

Author(s) : Bowen Zhang, Hexiang Hu, Joonseok Lee, Ming Zhao, Sheide Chammas, Vihan Jain, Eugene Ie, Fei Sha

Links : PDF - Abstract

Code :


Keywords : video - localization - corpus - segments - model
