We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. A large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels. We present a detailed analysis of the dataset and an extensive evaluation over many strong baselines, providing valuable insights on the challenges of this new task.
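To make the task format concrete, below is a minimal sketch of what one video-and-language inference example might look like as a data record, with a placeholder model interface. All field and function names here are hypothetical illustrations, not the paper's or dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record for one Violin-style example: the premise is a
# video clip plus its aligned subtitles; the label indicates whether
# the hypothesis is entailed (True) or contradicted (False).
@dataclass
class ViolinExample:
    video_id: str          # identifier of the source video clip
    subtitles: List[str]   # dialogue lines aligned with the clip
    hypothesis: str        # natural language statement about the clip
    label: bool            # True = entailed, False = contradicted

def predict_entailment(example: ViolinExample) -> bool:
    """Placeholder for a video-and-language inference model: a real
    system would fuse video features with the subtitle and hypothesis
    text, then output an entailed/contradicted decision."""
    raise NotImplementedError

# Illustrative instance (invented content, for format only).
example = ViolinExample(
    video_id="clip_00001",
    subtitles=["I can't believe you did that.", "I had no choice."],
    hypothesis="The two characters are arguing about a past decision.",
    label=True,
)
```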

Author(s) : Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu

Links : PDF - Abstract

Code :

Keywords : video - language - inference - hypothesis - clips
