Evaluating deep learning architectures for Speech Emotion Recognition

Persian title: ارزیابی معماری یادگیری عمیق برای تشخیص گفتار احساسی
English title: Evaluating deep learning architectures for Speech Emotion Recognition
Journal/Conference: Neural Networks
Related fields of study: Computer Engineering, Information Technology
Related academic specializations: Artificial Intelligence, Computer Networks
Persian keywords: محاسبات عاطفی، یادگیری عمیق، شناخت احساسی، شبکه های عصبی، تشخیص گفتار
English keywords: Affective computing, Deep learning, Emotion recognition, Neural networks, Speech recognition
Article type: Research Article
DOI: https://doi.org/10.1016/j.neunet.2017.02.013
Affiliation: School of Engineering, RMIT University, Melbourne VIC, Australia
Number of pages (English article): 9
Publisher: Elsevier
Presentation type: Journal
Paper type: ISI
Year of publication: 2017
Impact factor: 8.446 (2017)
H-index: 121 (2019)
SJR: 2.359 (2017)
ISSN: 0893-6080
Quartile: Q1 (2017)
English article format: PDF
Translation status: Not translated
Price of the English article: Free
Suitable as a base article: Yes
Product code: E10738
Table of contents (English)

Abstract

1- Introduction

2- Related work

3- Deep learning: An overview

4- Proposed speech emotion recognition system

5- Experimental setup

6- Experiments and results

7- Discussion

8- Conclusion

References

Excerpt from the article (English)

Abstract

Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models’ performances.
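
The frame-based formulation mentioned in the abstract relies on minimal speech processing: the utterance is sliced into short overlapping frames, and low-level spectral features are computed per frame so that a model can emit a prediction for each frame rather than one per utterance. The sketch below illustrates this kind of input preparation in Python with librosa; the window, hop, and mel-band settings are common speech-processing defaults, not values taken from the paper.

```python
# A minimal sketch (not the authors' exact pipeline) of frame-based SER
# input preparation: slice an utterance into short overlapping frames and
# compute log-mel filterbank features. Parameter values are illustrative.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr * 3).astype(np.float32)  # stand-in for a 3 s utterance

# 25 ms windows with a 10 ms hop and 40 mel bands: common speech settings.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40
)
log_mel = librosa.power_to_db(mel)  # shape: (40, num_frames)
frames = log_mel.T                  # one 40-d feature vector per frame

print(frames.shape)  # (num_frames, 40)
```

Each row of `frames` can then be classified independently by a feed-forward model or consumed as a sequence by a recurrent one, which is exactly the comparison the paper sets up.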

Introduction

In recent years, deep learning in neural networks has achieved tremendous success in various domains, which has led to multiple deep learning architectures emerging as effective models across numerous tasks. Feed-forward architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (ConvNets) have been particularly successful in image and video processing as well as speech recognition, while recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) RNNs have been effective in speech recognition and natural language processing (LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015). These architectures process and model information in different ways and have their own advantages and limitations. For instance, ConvNets are able to deal with high-dimensional inputs and learn features that are invariant to small variations and distortions (Krizhevsky, Sutskever, & Hinton, 2012), whereas LSTM-RNNs are able to deal with variable-length inputs and model sequential data with long-range context (Graves, 2008). In this paper, we investigate the application of end-to-end deep learning to Speech Emotion Recognition (SER) and critically explore how each of these architectures can be employed in this task.

SER can be regarded as a static or dynamic classification problem, which has motivated two popular formulations of the task in the literature (Ververidis & Kotropoulos, 2006): turn-based processing (also known as static modeling), which aims to recognize emotions from a complete utterance; and frame-based processing (also known as dynamic modeling), which aims to recognize emotions at the frame level. In either formulation, SER can be employed in stand-alone applications, e.g. emotion monitoring, or integrated into other systems for emotional awareness, e.g. integrating SER into Automatic Speech Recognition (ASR) to improve its capability in dealing with emotional speech (Cowie et al., 2001; Fayek, Lech, & Cavedon, 2016b; Fernandez, 2004). Frame-based processing is more robust since it does not rely on segmenting the input speech into utterances and can model intra-utterance emotion dynamics (Arias, Busso, & Yoma, 2013; Fayek, Lech, & Cavedon, 2015). However, empirical comparisons between frame-based processing and turn-based processing in prior work have demonstrated the superiority of the latter (Schuller, Vlasenko, Eyben, Rigoll, & Wendemuth, 2009; Vlasenko, Schuller, Wendemuth, & Rigoll, 2007).
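
To make the contrast between the two architecture families concrete, here is a minimal sketch in PyTorch (an assumed framework choice; the paper does not prescribe one). The layer sizes and the four-class output, loosely echoing the common four-emotion IEMOCAP setup, are illustrative choices, not the authors' configuration.

```python
# A hedged sketch contrasting the two architecture families discussed above.
# The DNN classifies each frame from a fixed-size feature vector; the LSTM
# consumes a variable-length sequence of frames and carries context across
# the utterance. All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

num_mels, num_emotions = 40, 4

# Feed-forward: fixed-size input (here, a single 40-d frame vector).
dnn = nn.Sequential(
    nn.Linear(num_mels, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, num_emotions),
)

# Recurrent: per-frame outputs over a sequence, modeling intra-utterance dynamics.
class LSTMTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(num_mels, 128, batch_first=True)
        self.head = nn.Linear(128, num_emotions)

    def forward(self, x):       # x: (batch, time, num_mels)
        out, _ = self.lstm(x)   # out: (batch, time, 128)
        return self.head(out)   # per-frame emotion logits

frames = torch.randn(1, 300, num_mels)  # one 3 s utterance at a 10 ms hop
print(dnn(frames[0]).shape)             # (300, 4): frame-wise, no context
print(LSTMTagger()(frames).shape)       # (1, 300, 4): context-aware
```

The feed-forward model sees each frame in isolation, so any temporal context must be stacked into its input, whereas the LSTM carries context across frames; that context is precisely the intra-utterance dynamics the frame-based formulation aims to model.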