DeepCNN: Spectro‐temporal feature representation for speech emotion recognition

Bibliographic Details
Title: DeepCNN: Spectro‐temporal feature representation for speech emotion recognition
Authors: Nasir Saleem, Jiechao Gao, Rizwana Irfan, Ahmad Almadhor, Hafiz Tayyab Rauf, Yudong Zhang, Seifedine Kadry
Source: CAAI Transactions on Intelligence Technology, Vol 8, Iss 2, Pp 401-417 (2023)
Publisher Information: Institution of Engineering and Technology (IET), 2023.
Publication Year: 2023
Subject Terms: Artificial neural network, Artificial intelligence, Feature (linguistics), FOS: Political science, Audio Signal Classification and Analysis, Social Sciences, Experimental and Cognitive Psychology, Convolutional neural network, FOS: Law, 02 engineering and technology, Speech recognition, Pattern recognition (psychology), 01 natural sciences, Leverage (statistics), decision making, QA76.75-76.765, Environmental Sound Recognition, Convolution (computer science), 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Psychology, Computer software, Speech Enhancement Techniques, Audio-Visual Speech Recognition, Political science, Audio Event Detection, Politics, deep learning, Linguistics, Computer science, FOS: Philosophy, ethics and religion, FOS: Psychology, Philosophy, Emotion Recognition, Signal Processing, Computer Science, Physical Sciences, Affective Computing, Computational linguistics. Natural language processing, FOS: Languages and literature, Feature extraction, P98-98.5, Representation (politics), Law, Emotion Recognition and Analysis in Multimodal Data
Description: Speech emotion recognition (SER) is an important research problem in human‐computer interaction systems. Feature representation and extraction are significant challenges in SER systems. Despite promising results, recent studies generally do not leverage progressive fusion techniques for effective feature representation and for increasing receptive fields. To address this gap, this article proposes DeepCNN, which fuses spectral and temporal features of emotional speech by running convolutional neural networks (CNNs) in parallel with a convolution layer‐based transformer. Two parallel CNNs extract spectral feature representations (2D‐CNN) and temporal feature representations (1D‐CNN). A 2D‐convolution layer‐based transformer module extracts spectro‐temporal features, which are concatenated with the features from the parallel CNNs. The learnt low‐level concatenated features are then fed to a deep framework of convolutional blocks, which retrieves a high‐level feature representation; an attention gated recurrent unit and a classification layer then categorise the emotional states. This fusion technique yields a deeper hierarchical feature representation at a lower computational cost while simultaneously expanding the filter depth and reducing the feature map size. The Berlin Database of Emotional Speech (EMO‐DB) and Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets are used in the experiments to recognise distinct speech emotions. With efficient spectral and temporal feature representation, the proposed SER model achieves 94.2% accuracy on EMO‐DB and 81.1% accuracy on IEMOCAP. DeepCNN outperforms the baseline SER systems in emotion recognition accuracy on both datasets.
Document Type: Article; Other literature type
Language: English
ISSN: 2468-2322
DOI: 10.1049/cit2.12233
DOI: 10.60692/nnnqq-j2m87
DOI: 10.60692/my169-amz68
Access URL: https://doaj.org/article/9ab1c70d6d5b4ebda06b843270d2ceac
Rights: CC BY
Accession Number: edsair.doi.dedup.....c7e24de2336bdc7194b4e0baf281b40c
Database: OpenAIRE