End-to-end (E2E) spoken language understanding (SLU) systems can infer the semantics of a spoken utterance directly from an audio signal. However, training an E2E system remains a challenge, largely due to the scarcity of audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and learn a cross-modal latent space (CMLS) between them. We propose using different multi-modal space losses to explicitly guide the acoustic embeddings to be closer to the text embeddings obtained from a semantically powerful BERT model. We train the CMLS model on two publicly available E2E datasets and show that our proposed triplet loss function achieves the best performance, with relative improvements of 1.4% and 4%, respectively, over an E2E model without a cross-modal space.
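
To make the cross-modal objective concrete, below is a minimal sketch (not the authors' code) of a triplet loss that pulls an utterance-level acoustic embedding toward the BERT embedding of its own transcript and pushes it away from the BERT embedding of a mismatched transcript. The module name, the linear projection, the cosine distance, and the margin value are illustrative assumptions; the paper's exact loss formulation may differ.

# Hypothetical sketch of a cross-modal triplet loss; names and details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalTripletLoss(nn.Module):
    def __init__(self, embed_dim: int = 768, margin: float = 0.5):
        super().__init__()
        # Project acoustic embeddings into the same space as the BERT text embeddings.
        self.audio_proj = nn.Linear(embed_dim, embed_dim)
        self.margin = margin

    def forward(self, audio_emb, text_emb_pos, text_emb_neg):
        # audio_emb:    (batch, embed_dim) utterance-level acoustic embeddings (anchors)
        # text_emb_pos: (batch, embed_dim) BERT embeddings of the matching transcripts
        # text_emb_neg: (batch, embed_dim) BERT embeddings of mismatched transcripts
        anchor = F.normalize(self.audio_proj(audio_emb), dim=-1)
        pos = F.normalize(text_emb_pos, dim=-1)
        neg = F.normalize(text_emb_neg, dim=-1)
        d_pos = 1.0 - (anchor * pos).sum(dim=-1)  # cosine distance to matching text
        d_neg = 1.0 - (anchor * neg).sum(dim=-1)  # cosine distance to mismatched text
        # Hinge: the matching pair should be closer than the mismatched pair by `margin`.
        return F.relu(d_pos - d_neg + self.margin).mean()

if __name__ == "__main__":
    loss_fn = CrossModalTripletLoss()
    audio = torch.randn(8, 768)       # stand-in for acoustic encoder outputs
    text_pos = torch.randn(8, 768)    # stand-in for BERT embeddings (same utterances)
    text_neg = torch.randn(8, 768)    # stand-in for BERT embeddings (other utterances)
    print(loss_fn(audio, text_pos, text_neg).item())

In such a setup the loss is typically added to the SLU classification objective, so the acoustic encoder both predicts the semantics and stays anchored near the pre-trained text space.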

Author(s) : Bhuvan Agrawal, Markus Müller, Martin Radfar, Samridhi Choudhary, Athanasios Mouchtaris, Siegfried Kunzmann

