MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We show that our approach can be easily extended to visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are publicly available.
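To make the idea of early fusion concrete, here is a minimal sketch in NumPy. It is not the MDETR implementation: the function name `early_fusion`, the feature sizes, the random projection matrices, and the single self-attention pass are all assumptions chosen for illustration. It only shows the general pattern of projecting image and text features to a shared width, concatenating them into one sequence, and letting attention mix the two modalities.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def early_fusion(image_feats, text_feats, w_img, w_txt):
    """Hypothetical helper: fuse two modalities in one sequence.

    image_feats: (n_img, c_img) flattened image positions
    text_feats:  (n_txt, c_txt) text token features
    w_img, w_txt: projections to a shared width d (random here, learned in practice)
    """
    img = image_feats @ w_img                  # (n_img, d)
    txt = text_feats @ w_txt                   # (n_txt, d)
    seq = np.concatenate([img, txt], axis=0)   # (n_img + n_txt, d)
    d = seq.shape[-1]
    # One unparameterized self-attention pass: every position, image or text,
    # attends to every other position in the joint sequence.
    attn = softmax(seq @ seq.T / np.sqrt(d), axis=-1)
    return attn @ seq, attn

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(6, 16))   # assumed: 6 image positions, 16 channels
text_feats = rng.normal(size=(4, 12))    # assumed: 4 text tokens, 12 channels
w_img = rng.normal(size=(16, 8))
w_txt = rng.normal(size=(12, 8))

fused, attn = early_fusion(image_feats, text_feats, w_img, w_txt)
print(fused.shape)  # (10, 8)
```

Because the attention matrix covers the concatenated sequence, each image position receives weight from text tokens and vice versa, which is what distinguishes early fusion from combining the modalities only at the output.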

Author(s) : Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, Nicolas Carion

Links : PDF - Abstract

Keywords : mdetr - code - text - modulated - image
