TY - GEN
T1 - Emotion recognition from varying length patterns of speech using CNN-based segment-level pyramid match kernel based SVMs
AU - Gupta, Shikha
AU - De, Kishalaya
AU - Dinesh, Dileep Aroor
AU - Thenkanidiyoor, Veena
PY - 2019/2/1
Y1 - 2019/2/1
N2 - Convolutional Neural Networks (CNNs) and their variants have achieved impressive performance on speech processing tasks such as spoken language identification, speaker verification, and speech emotion recognition. Conventionally, CNNs for speech applications take features from fixed-duration speech segments as input. In this work, we attempt to use features from the complete speech signal as input to the CNN. We propose to use a spatial pyramid pooling (SPP) layer in the CNN architecture to remove the fixed-length constraint and to accept features from varying-length speech signals as input to the CNN for end-to-end training. The proposed architecture also yields a varying-size set of feature maps from the convolution layer. Further, we propose a novel CNN-based segment-level pyramid match kernel (CNN-SLPMK) as a dynamic kernel between a pair of varying-size sets of feature maps, for classification with a support vector machine (SVM) based classifier. We demonstrate that our proposed approach achieves results comparable to state-of-the-art techniques on the speech emotion recognition task.
AB - Convolutional Neural Networks (CNNs) and their variants have achieved impressive performance on speech processing tasks such as spoken language identification, speaker verification, and speech emotion recognition. Conventionally, CNNs for speech applications take features from fixed-duration speech segments as input. In this work, we attempt to use features from the complete speech signal as input to the CNN. We propose to use a spatial pyramid pooling (SPP) layer in the CNN architecture to remove the fixed-length constraint and to accept features from varying-length speech signals as input to the CNN for end-to-end training. The proposed architecture also yields a varying-size set of feature maps from the convolution layer. Further, we propose a novel CNN-based segment-level pyramid match kernel (CNN-SLPMK) as a dynamic kernel between a pair of varying-size sets of feature maps, for classification with a support vector machine (SVM) based classifier. We demonstrate that our proposed approach achieves results comparable to state-of-the-art techniques on the speech emotion recognition task.
UR - http://www.scopus.com/inward/record.url?scp=85067945774&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85067945774&partnerID=8YFLogxK
U2 - 10.1109/NCC.2019.8732191
DO - 10.1109/NCC.2019.8732191
M3 - Conference contribution
T3 - 25th National Conference on Communications, NCC 2019
BT - 25th National Conference on Communications, NCC 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th National Conference on Communications, NCC 2019
Y2 - 20 February 2019 through 23 February 2019
ER -