MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We show that our approach can be easily extended to visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available online.
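
As a rough illustration of what fusing the two modalities at an early stage can look like, the sketch below concatenates image features and text token features into one sequence and runs a single self-attention pass over it, so every image position can attend to the text tokens. This is not the MDETR implementation. The function names (`self_attention`, `early_fusion`), the toy 4-dimensional features, and the use of identity projections in place of learned weights are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head self-attention with identity query/key/value projections.

    A real model would use learned projections and multiple heads; they are
    omitted here to keep the sketch short.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Scaled dot-product score of this position against every position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Output is the attention-weighted average of all positions.
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out


def early_fusion(image_feats, text_feats):
    """Concatenate both modalities along the sequence axis, then attend jointly.

    Assumes both inputs were already projected to the same feature dimension.
    """
    joint = image_feats + text_feats
    return self_attention(joint)


# Toy inputs (hypothetical): 3 image positions and 2 text tokens, dimension 4.
# Only the text tokens are non-zero in the last dimension.
image_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
text_feats = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]

fused = early_fusion(image_feats, text_feats)
```

After the joint attention pass, the image positions in `fused` carry a non-zero last component even though the input image features had none there, which shows text information flowing into the image representation before any detection step.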

Author(s): Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, Nicolas Carion

