The unified-modal SpeechT5 framework explores encoder-decoder pre-training for self-supervised speech/text representation learning. SpeechT5 can pre-train on large-scale unlabeled speech and text data to improve both speech and text modeling. To align textual and speech information in a unified semantic space, we propose a random mixing-up method. Extensive evaluations on a wide variety of spoken language processing tasks, including voice conversion, automatic speech recognition, and speaker identification, show the superiority of the proposed framework. For more information, please visit http://www.speecht5.org/

Author(s) : Junyi Ao, Rui Wang, Long Zhou, Shujie Liu, Shuo Ren, Yu Wu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Links : PDF - Abstract

Keywords : speech - recognition - unified - speecht - processing
