Pre-training for feature extraction is an increasingly studied approach to obtaining better continuous representations of audio and text content. We use wav2vec and CamemBERT as self-supervised pre-trained models to perform continuous speech emotion recognition (SER) on AlloSat, a large French emotional database describing the satisfaction dimension, and on the state-of-the-art SEWA corpus, which focuses on the valence, arousal and liking dimensions. To the authors' knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is highly relevant to the continuous SER task, which is usually characterized by a small amount of labeled training data.
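As a rough illustration of what "joint use" of audio and text features can look like, the sketch below concatenates each audio frame with the embedding of the word spoken at that time. The abstract does not say which fusion scheme the authors use, so this frame-level concatenation is an assumption, as are the function name `fuse_features`, the 512- and 768-dimensional sizes, and the 100 Hz frame rate. Random arrays stand in for real wav2vec and CamemBERT outputs; no model is loaded.

```python
import numpy as np


def fuse_features(audio_feats, word_embs, word_spans, frame_rate_hz):
    """Concatenate frame-level audio features with time-aligned word embeddings.

    audio_feats:   array of shape (n_frames, d_audio)
    word_embs:     array of shape (n_words, d_text)
    word_spans:    list of (start_s, end_s) times in seconds, one per word
    frame_rate_hz: number of audio frames per second (assumed constant)

    Frames that fall outside every word span keep a zero text vector.
    """
    n_frames = audio_feats.shape[0]
    text_feats = np.zeros((n_frames, word_embs.shape[1]), dtype=audio_feats.dtype)
    for emb, (start_s, end_s) in zip(word_embs, word_spans):
        first = int(round(start_s * frame_rate_hz))
        last = min(int(round(end_s * frame_rate_hz)), n_frames)
        text_feats[first:last] = emb  # repeat the word embedding over its frames
    return np.concatenate([audio_feats, text_feats], axis=1)


# Illustrative stand-ins for real features (hypothetical sizes).
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((100, 512))  # 1 s of audio at 100 frames/s
word_embs = rng.standard_normal((3, 768))      # three words
word_spans = [(0.0, 0.3), (0.3, 0.5), (0.7, 1.0)]

fused = fuse_features(audio_feats, word_embs, word_spans, frame_rate_hz=100)
print(fused.shape)
```

The fused matrix could then be fed to any sequence regressor that predicts one emotion value per frame; the choice of regressor is left open here.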

Author(s) : Manon Macary, Marie Tahon, Yannick Estève, Anthony Rousseau

Links : PDF - Abstract

Keywords : continuous - features - trained - ser - supervised
