Machine Learning Methods for Speech Emotion Recognition

Authors

  • Mr. Arun Kumar E, 2nd Sem MTech Student, Department of CSE, RYM Engineering College, VTU Belagavi, India.
  • Dr. Sapna B Kulkarni, Professor, Department of CSE, RYM Engineering College, VTU Belagavi, India.

DOI:

https://doi.org/10.47392/IRJAEH.2025.0550

Keywords:

CNN, SVM, CNN-LSTM

Abstract

Natural human-computer interaction requires the ability to identify human emotions from speech. Due to its many uses in virtual assistants, mental health evaluation, education, entertainment, and customer support systems, speech emotion recognition (SER) has attracted considerable attention lately. This study investigates a machine learning-based method for speech emotion classification using sophisticated feature extraction and classification techniques. In this work, we use acoustic features such as spectral contrast, chroma, and Mel-Frequency Cepstral Coefficients (MFCC) to extract emotional cues from speech signals. Convolutional Neural Networks (CNN), Random Forest (RF), and Support Vector Machines (SVM) are among the classifiers trained and assessed on these features. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) serves as the benchmark dataset for training and testing. According to experimental results, deep learning models, particularly CNN and CNN-LSTM hybrids, outperform conventional machine learning techniques. Combining temporal and spectral features effectively captures emotional nuances in speech, as evidenced by the CNN model's 84.2% accuracy and the CNN-LSTM model's peak accuracy of 86.7%. The proposed model's robustness and capacity for generalization are validated by a thorough analysis employing confusion matrices and precision-recall metrics.

Understanding user emotions can greatly improve the quality of interactions in real-world applications, and this research offers a solid basis for integrating SER systems into such applications. Future research will focus on handling noisy environments, enhancing cross-linguistic performance, and enabling real-time deployment on embedded systems.

This study also emphasizes how crucial it is to choose the right feature combination to accurately represent emotional content. Although MFCCs provide a compact and widely used representation of the speech spectrum, the addition of chroma and spectral contrast improves the model's capacity to identify subtle emotional inflections, especially in similar-sounding classes such as "calm" vs. "happy" or "angry" vs. "fearful". Feature fusion is therefore essential for increasing recognition accuracy across a variety of speaker profiles. The study also contrasts shallow and deep learning classifiers to highlight their respective strengths and weaknesses. Traditional classifiers such as SVM and Random Forest are computationally light and efficient for small-scale systems but perform poorly on raw or complex features. In contrast, automatic feature learning and temporal modeling help the CNN and CNN-LSTM architectures capture the complex prosody, rhythm, and tone patterns linked to emotional expression.
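
As a concrete illustration of the feature-extraction stage described above, the following is a minimal Python sketch using librosa. The 16 kHz sample rate, 40 MFCC coefficients, and per-utterance time-averaging are illustrative assumptions; the paper's exact settings are not stated in the abstract.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=40):
    """Return a fused MFCC + chroma + spectral-contrast vector for one clip."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (40, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # (12, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # (7, T)
    # Feature fusion: average each feature over time and concatenate into
    # one fixed-length (59-dim) vector per utterance.
    return np.concatenate([mfcc.mean(axis=1),
                           chroma.mean(axis=1),
                           contrast.mean(axis=1)])
```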
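
The shallow baselines (SVM and Random Forest) can likewise be sketched with scikit-learn. The hyperparameters and the placeholder X and y arrays below are hypothetical stand-ins, not the authors' configuration; in practice X would come from extract_features() above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for fused 59-dim feature vectors and the
# eight RAVDESS emotion labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 59))
y = rng.integers(0, 8, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
rf = RandomForestClassifier(n_estimators=300, random_state=0)

for name, clf in [("SVM", svm), ("Random Forest", rf)]:
    clf.fit(X_train, y_train)
    print(f"{name} accuracy: {clf.score(X_test, y_test):.3f}")
```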
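
Finally, a hedged sketch of a CNN-LSTM hybrid of the kind the abstract credits with the 86.7% peak accuracy, written with the Keras API. The input shape (128 frames by 59 coefficients), layer sizes, and the reshape that hands the convolutional feature maps to the LSTM are all assumptions for illustration, since the abstract does not specify the architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 59, 1)),
    # Convolutional front end learns local spectral patterns.
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    # Collapse the (32, 14, 64) feature maps into a 32-step sequence so the
    # LSTM can model the temporal (prosodic) dynamics.
    layers.Reshape((32, 14 * 64)),
    layers.LSTM(128),
    layers.Dropout(0.3),
    layers.Dense(8, activation="softmax"),  # 8 RAVDESS emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The division of labor here mirrors the abstract's argument: the convolutional layers capture local spectral structure, while the LSTM models the longer-range rhythm and tone patterns associated with emotional expression.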

Published

2025-09-24

How to Cite

Machine Learning Methods for Speech Emotion Recognition. (2025). International Research Journal on Advanced Engineering Hub (IRJAEH), 3(09), 3793-3798. https://doi.org/10.47392/IRJAEH.2025.0550
