EMOTIONAL SPEECH RECOGNITION BASED ON CNN

Selma OZAYDIN

Authors

Selma OZAYDIN, Department of Computer Programming, Cankaya University, Ankara, Turkey

Keywords:

Emotion Recognition, Audio Recognition, Feature Extraction, Convolutional Neural Network, Deep Learning

Abstract

Recognizing emotional expressions during human–machine interaction has attracted growing interest as its application areas expand. A convolutional neural network (CNN) is a class of deep neural network that exploits patterns in the data. This paper presents a robust speech emotion recognition system based on a CNN. The CNN-based acoustic models are obtained using speech processing and artificial intelligence techniques. During implementation, transfer learning and deep learning procedures are used to extract features from the speech datasets. The proposed system is trained with features extracted from the RAVDESS and SAVEE datasets, and AlexNet is used to implement the emotional speech system. Experiments and performance evaluations are conducted to demonstrate the effectiveness of the proposed speech emotion recognition system.

