Functional Analysis of F0 Contours for Recognition of Paralinguistics from Speech

Dmytro Prylipko
Germany, Yandex
For more than a decade, the analysis of human-computer interaction has been extended towards observations of additional information beyond the pure content of communication: affective states, a user’s disposition towards a system, prosody, etc.
The prevalent approach to recognizing affective states from speech combines prosodic information (related to pitch, duration, intensity, voice quality, etc.) and spectral information with general-purpose classifiers. To extract features on the chunk or utterance level, various functionals (extremes, regression coefficients, etc.) are usually applied to the raw contours of these low-level descriptors. However, global statistics are not designed to capture local variations, which can provide useful information about emotionally colored segments. Moreover, the use of functionals assumes that every frame (or at least every voiced frame) is equally important, while recent studies have shown that the relevant information is distributed non-uniformly over the speech signal.
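As an illustration of the functional-based approach described above, the following minimal sketch computes a few global statistics over an F0 contour. The function name `contour_functionals`, the convention that unvoiced frames are marked by 0, and the particular choice of functionals are assumptions made for this example; they are not taken from any specific feature extraction toolkit.

```python
import numpy as np

def contour_functionals(f0):
    """Global statistics of an F0 contour (assumption: unvoiced frames are 0)."""
    f0 = np.asarray(f0, dtype=float)
    voiced_idx = np.flatnonzero(f0 > 0)   # frame indices of voiced frames
    voiced = f0[voiced_idx]
    # First-order regression of F0 over the frame index (voiced frames only).
    slope, intercept = np.polyfit(voiced_idx, voiced, deg=1)
    return {
        "min": voiced.min(),
        "max": voiced.max(),
        "mean": voiced.mean(),
        "std": voiced.std(),              # population standard deviation
        "slope": slope,
        "intercept": intercept,
    }

# Toy contour in Hz: one unvoiced frame at each end.
f0 = [0, 100, 110, 120, 130, 0]
feats = contour_functionals(f0)
```

Every voiced frame contributes with equal weight to each statistic, and the position of a local excursion within the utterance is lost, which is the limitation noted above.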
Motivated by this, Arias et al. proposed a novel framework to detect emotional modulation based on functional data analysis (FDA), a set of methods for the statistical analysis of functional data. This approach has been successfully applied to the classification between neutral and emotionally colored speech on both acted and spontaneous data sets.
In this study I present my results on applying the FDA framework to the recognition of naturalistic user reactions from speech, that is, disposition recognition. I compare a feature set of 10 harmonic scores derived from functional principal component analysis (fPCA) to the reference set of 384 features from the INTERSPEECH 2009 Emotion Challenge. My investigations show that the harmonic scores provide better classification accuracy than statistical functionals extracted from pitch, although they still fall behind the reference feature set. Adding the harmonic scores to the reference set was also found to be beneficial.
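To make the idea of projection-based scores concrete, here is a rough sketch that approximates fPCA by ordinary PCA on contours resampled to a common length. This is a simplification chosen for illustration, not the procedure used in the study: a full FDA pipeline would typically smooth each contour with a basis expansion first. The helper names (`resample_contour`, `fit_basis`, `project`), the 50-point grid and the synthetic training contours are all assumptions; only the number of components, 10, follows the text.

```python
import numpy as np

def resample_contour(f0, n_points=50):
    """Linearly resample a contour onto a common normalized time axis."""
    f0 = np.asarray(f0, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(f0))
    dst = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(dst, src, f0)

def fit_basis(contours, n_components=10, n_points=50):
    """Estimate the mean contour and leading principal components."""
    X = np.vstack([resample_contour(c, n_points) for c in contours])
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(f0, mean, components):
    """Scores of one contour: projections onto the principal components."""
    x = resample_contour(f0, n_points=mean.shape[0])
    return components @ (x - mean)

# Synthetic contours of varying length (hypothetical data, in Hz).
rng = np.random.default_rng(0)
train = [100 + 20 * np.sin(np.linspace(0, np.pi, n)) + rng.normal(0, 2, n)
         for n in rng.integers(30, 80, size=40)]

mean, components = fit_basis(train)
scores = project(train[0], mean, components)   # vector of 10 scores
```

The resulting vector has a fixed length regardless of the utterance duration, so it can be fed to a general-purpose classifier on its own or appended to another feature set.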
In addition to disposition classification, I investigate the applicability of the method to intonation assessment and to the recognition of prosodic peculiarities (word emphasis, lengthening and hyper-articulation).