Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice, both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. We also show our best models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe. Our best model configuration achieves large gains over the state of the art, e.g., pushing recall-at-one from 21.8% to 33.2%.

Author(s) : Ramon Sanabria, Austin Waters, Jason Baldridge

Keywords : retrieval - speech - based - image - asr
