A new training scheme for streaming automatic speech recognition (ASR) based on recurrent neural network transducers (RNN-T) allows the network to benefit from longer audio streams as input. We show that this extension of the acoustic context during training and inference can lead to word error rate reductions of more than 6% in a realistic production setting. We investigate its effect on acoustically challenging data containing background speech and present data points which indicate that this approach helps the network learn both speaker and environment adaptation. Finally, we visualize RNN loss gradients with respect to the input features in order to illustrate the ability of a long short-term memory (LSTM) based ASR

Author(s) : Andreas Schwarz, Ilya Sklyar, Simon Wiesler

Keywords : rnn - asr - network - input - training
