Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel. We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS). The three modules are based on basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to tokens to convert the acoustic representations into token-level representations. Finally, the probability distribution over the vocabulary is computed for each token position. Further, we propose a cross-modal transfer learning method to refine semantics from a large-scale pre-trained language model.
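
The PDS step described above can be illustrated with a minimal sketch. It assumes the positional encodings act as attention queries over the encoder outputs, which is one plausible reading of the abstract rather than the authors' confirmed implementation. The function names (`sinusoidal_encoding`, `summarize`), the sinusoidal form of the encodings, the single attention head, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def sinusoidal_encoding(num_positions, dim):
    """Sinusoidal positional encodings, one row per token position (assumed form)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(acoustic, num_tokens):
    """Hypothetical PDS: each token position attends over all acoustic frames.

    acoustic: list of T frames, each a list of `dim` floats (encoder outputs).
    Returns num_tokens vectors of size `dim`. No position depends on a
    previously predicted token, so all positions can be computed in parallel.
    """
    dim = len(acoustic[0])
    queries = sinusoidal_encoding(num_tokens, dim)
    outputs = []
    for query in queries:
        # Scaled dot-product score between this position and every frame.
        scores = [
            sum(q * k for q, k in zip(query, frame)) / math.sqrt(dim)
            for frame in acoustic
        ]
        weights = softmax(scores)
        # Weighted average of the frames gives the token-level representation.
        outputs.append(
            [sum(w * frame[j] for w, frame in zip(weights, acoustic)) for j in range(dim)]
        )
    return outputs


# Toy input: 6 acoustic frames of dimension 4, summarized into 3 token positions.
acoustic = [[0.1 * (t + j) for j in range(4)] for t in range(6)]
token_reprs = summarize(acoustic, num_tokens=3)
```

In this sketch the output length is fixed by `num_tokens` rather than by the number of acoustic frames, which is what lets a position-wise classifier predict every token in a single pass.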

Author(s) : Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang

Keywords : speech - model - token - autoregressive - encoder
