A Novel Multi-Task and Ensembled Optimized Parallel Convolutional Autoencoder and Transformer for Speech Emotion Recognition

Document Type: Research Article

Authors

1 Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran

2 Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran

Abstract

Recognizing emotions from speech signals is important in many applications of human-computer interaction (HCI). In this paper, we present a novel model for speech emotion recognition (SER) based on new multi-task parallel convolutional autoencoder (PCAE) and transformer networks. The PCAEs are designed to generate high-level, informative, harmonic sparse features from the input. With the aid of the parallel CAEs, we can extract nonlinear sparse features in an ensemble manner, improving both the accuracy and the generalization of the model. These PCAEs also address the loss of initial sequential information that convolution operations cause in SER tasks. We further propose using a transformer in parallel with the PCAEs to capture long-term dependencies between speech samples through its self-attention mechanism. Finally, we propose a multi-task loss function composed of two terms: a classification loss and an autoencoder (AE) mapper loss. This multi-task loss reduces not only the classification error but also the regression error of the PCAEs, which additionally serve as mappers between the input and output Mel-frequency cepstral coefficients (MFCCs). Thus, the model can simultaneously learn accurate features with the PCAEs and improve the classification results. We evaluate the proposed method on the RAVDESS SER dataset in terms of accuracy, precision, recall, and F1-score. The average accuracy of the proposed model over eight emotions outperforms all recent baselines.
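To make the described pipeline concrete, the following is a minimal PyTorch sketch of the idea outlined above: several convolutional autoencoder branches run in parallel with a transformer encoder over MFCC inputs, and a multi-task loss combines the classification error with the MFCC reconstruction (mapper) error. All module names, layer sizes, the number of branches, and the weighting factor lam are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class PCAEBranch(nn.Module):
    """One parallel convolutional autoencoder (PCAE) branch (illustrative).

    The encoder produces a compact high-level feature map; the decoder
    reconstructs the input MFCCs, so the branch also acts as a mapper
    between input and output MFCCs.
    """
    def __init__(self, n_mfcc=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, n_mfcc, kernel_size=3, padding=1),
        )

    def forward(self, x):                # x: (batch, n_mfcc, time)
        z = self.encoder(x)              # high-level features
        x_hat = self.decoder(z)          # MFCC reconstruction
        return z, x_hat

class MultiTaskSER(nn.Module):
    """PCAE branches in parallel with a transformer encoder (illustrative)."""
    def __init__(self, n_mfcc=40, n_branches=3, n_classes=8):
        super().__init__()
        self.branches = nn.ModuleList(PCAEBranch(n_mfcc) for _ in range(n_branches))
        layer = nn.TransformerEncoderLayer(d_model=n_mfcc, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.LazyLinear(n_classes)   # input size inferred at first call

    def forward(self, x):                            # x: (batch, n_mfcc, time)
        feats, recons = zip(*(b(x) for b in self.branches))
        t = self.transformer(x.transpose(1, 2))      # (batch, time, n_mfcc)
        # Pool each branch over time and concatenate with the transformer output.
        pooled = [f.mean(dim=2) for f in feats] + [t.mean(dim=1)]
        logits = self.classifier(torch.cat(pooled, dim=1))
        return logits, recons

def multitask_loss(logits, labels, recons, x, lam=0.5):
    """Classification term plus AE mapper (reconstruction) term; lam is assumed."""
    cls = nn.functional.cross_entropy(logits, labels)
    rec = sum(nn.functional.mse_loss(r, x) for r in recons) / len(recons)
    return cls + lam * rec
```

In a training step, one would compute `logits, recons = model(mfcc)` and backpropagate `multitask_loss(logits, labels, recons, mfcc)`, so that a single gradient update reduces both the classification and the reconstruction terms, as the multi-task objective intends.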

