In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities. Existing methods use short videos as the visual modality and short summaries as the ground truth, and therefore perform poorly on lengthy videos and long ground-truth summaries. We propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modality dynamics within various input modalities for the task. We use the abstracts of the corresponding research papers as the reference summaries, which ensures adequate quality and uniformity of the ground truth. Extensive experiments show significant improvement over the baselines in both qualitative and quantitative evaluations on the existing How2 dataset for short videos and the newly introduced AVIATE dataset for videos with diverse duration, beating the best baseline on the two datasets by $1.39$ and $2.74$ ROUGE-L points, respectively.

Author(s) : Yash Kumar Atri, Shraman Pramanick, Vikram Goyal, Tanmoy Chakraborty

Keywords : videos - truth - summary - existing - abstractive