A Novel Multi-Task and Ensembled Optimized Parallel Convolutional Autoencoder and Transformer for Speech Emotion Recognition

Document Type: Research Article

Authors

1 Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran

2 Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran

Abstract

Recognizing emotions from speech signals is important in many applications of human-computer interaction (HCI). In this paper, we present a novel model for speech emotion recognition (SER) based on new multi-task parallel convolutional autoencoder (PCAE) and transformer networks. The PCAEs are designed to generate high-level, informative, harmonic sparse features from the input. With the aid of the proposed parallel CAEs, we can extract nonlinear sparse features in an ensemble manner, improving both the accuracy and the generalization of the model. The PCAEs also address the loss of initial sequential information that convolution operations cause in SER tasks. In parallel with the PCAEs, we propose using a transformer to capture long-term dependencies between speech samples through its self-attention mechanism. Finally, we propose a multi-task loss function composed of two terms: a classification loss and an autoencoder (AE) mapper loss. This multi-task loss reduces not only the classification error but also the regression error of the PCAEs, which additionally act as mappers between the input and output Mel-frequency cepstral coefficients (MFCCs). The model can therefore both focus on finding accurate features with the PCAEs and improve the classification results. We evaluate the proposed method on the RAVDESS SER dataset in terms of accuracy, precision, recall, and F1-score. The average accuracy of the proposed model over eight emotions outperforms all recent baselines.
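To make the described pipeline concrete, the sketch below shows one way to realize parallel convolutional autoencoder branches, a parallel transformer branch, and the two-term multi-task loss L = L_classification + lambda * L_reconstruction. It is a minimal illustration assuming PyTorch; the branch count, layer sizes, time pooling, and the weight lambda_rec are assumed values for exposition, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAE(nn.Module):
    """One convolutional autoencoder branch: encodes MFCC frames into latent
    features and reconstructs the input, so it doubles as an MFCC mapper."""
    def __init__(self, n_mfcc=40, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mfcc, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.decoder = nn.Conv1d(hidden, n_mfcc, kernel_size=5, padding=2)

    def forward(self, x):                      # x: (batch, n_mfcc, frames)
        z = self.encoder(x)                    # latent features
        return z, self.decoder(z)              # features, reconstruction

class PCAETransformer(nn.Module):
    """Parallel CAE branches plus a transformer branch; pooled features from
    all branches are concatenated and classified into emotion classes."""
    def __init__(self, n_mfcc=40, hidden=64, n_branches=3, n_classes=8):
        super().__init__()
        self.branches = nn.ModuleList(
            [ConvAE(n_mfcc, hidden) for _ in range(n_branches)])
        layer = nn.TransformerEncoderLayer(
            d_model=n_mfcc, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden * n_branches + n_mfcc, n_classes)

    def forward(self, x):                      # x: (batch, n_mfcc, frames)
        feats, recons = [], []
        for branch in self.branches:
            z, recon = branch(x)
            feats.append(z.mean(dim=2))        # average-pool over time
            recons.append(recon)
        t = self.transformer(x.transpose(1, 2)).mean(dim=1)
        logits = self.classifier(torch.cat(feats + [t], dim=1))
        return logits, recons

def multi_task_loss(logits, labels, recons, x, lambda_rec=0.5):
    """Classification term plus the AE-mapper (reconstruction) term."""
    ce = F.cross_entropy(logits, labels)
    mse = sum(F.mse_loss(r, x) for r in recons) / len(recons)
    return ce + lambda_rec * mse

# Usage: logits, recons = PCAETransformer()(mfccs) on mfccs of shape
# (batch, 40, frames); loss = multi_task_loss(logits, labels, recons, mfccs).
```

Jointly minimizing the classification and reconstruction terms means the PCAE encoders must keep enough information to rebuild the MFCCs, which is the mechanism the abstract credits for more accurate features and better generalization.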
