Affective Latent Representation of Acoustic and Lexical Features for Emotion Recognition

Sensors (Basel). 2020 May;20(9):E2614. doi: 10.3390/s20092614. Epub 2020-05-04.

Affective Latent Representation of Acoustic and Lexical Features for Emotion Recognition.

Eesung Kim
Hyungchan Song
Jong Won Shin

PMID: 32375342 PMCID: PMC7248815. DOI: 10.3390/s20092614.

Abstract

In this paper, we propose a novel emotion recognition method based on the underlying emotional characteristics extracted from a conditional adversarial auto-encoder (CAAE), in which both acoustic and lexical features are used as inputs. The acoustic features are generated by calculating statistical functionals of low-level descriptors and by a deep neural network (DNN). These acoustic features are concatenated with three types of lexical features extracted from the text: a sparse representation, a distributed representation, and affective lexicon-based dimensions. The CAAE yields two-dimensional latent representations similar to vectors in the valence-arousal space, which can be directly mapped to the emotion classes without the need for a sophisticated classifier. In contrast to the previous attempt at applying a CAAE using only acoustic features, the proposed approach enhances emotion recognition performance because the combined acoustic and lexical features provide enough discriminative power. Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus show that our method outperforms the previously reported best results on the same corpus, achieving an unweighted average recall of 76.72%.
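The unweighted average recall reported in the abstract is the mean of the per-class recalls, so each emotion class counts equally regardless of how many utterances it contains. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `unweighted_average_recall` and the toy labels are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of reference samples per class
    correct = defaultdict(int)  # correctly predicted samples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall for one class = correct predictions / reference samples of that class.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced toy data (not from IEMOCAP).
y_true = ["neutral"] * 6 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["sad", "neutral"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the majority class is recognized perfectly while the minority class is recognized half of the time, so plain accuracy (7 of 8 correct) comes out higher than the unweighted average recall. This is why the latter is often preferred on corpora with imbalanced emotion classes.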