Conversation is the most common use of speech. Any automatic dialog system that aims to mimic a human must be able to detect the typical sounds and meanings of spontaneous conversational speech. Automatic transcription of the function of linguistic units, sometimes referred to as Dialog Acts (DAs), Cue Phrases or Discourse Markers, is an emerging area of research. This can be done on a purely lexical level, by using prosody alone (Laskowski and Shriberg, 2010; Goto et al., 1999), or by a combination thereof (Sridhar et al., 2009; Gravano et al., 2007). However, it is not straightforward to train a language model for non-verbal content (e.g. “mm”, “mhm”, “eh”, “em”), not only because it is questionable whether these sounds are words, but also because of the lack of standardized annotation schemes. Ward (2000) refers to these tokens as conversational grunts, which are also the scope of this study. Feedback tokens are usually subdivided into yes/no answers, backchannels and acknowledgments. In this study, the focus of interest is the attitude of the response. Thus, the cut is instead made between dispreference, news receiving and general feedback. These are further subdivided by their turn-taking effect: Other speaker, Same speaker and Simultaneous start. This allows us to verify whether conversational grunts are simply carriers of prosodic information. We use a supra-segmental prosodic signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC), introduced in Neiberg et al. (2010), for classification and intuitive visualization of feedback and fillers. The contribution of the end of the interlocutor's left context to predicting the turn-taking effect has long been studied (Duncan, 1972) and is also addressed in this study. In addition, we examine the effect of contextual timing features, which have been shown to be useful in DA recognition (Laskowski and Shriberg, 2010).
We use the Swedish DEAL corpus, which has annotated fillers and feedback attitudes. Classification results using linear discriminant analysis are presented. It was found that feedback tokens followed by a clean floor taking lose some of the prosodic cues that signal attitude, compared to clean continuer feedback. Turn-taking effects can be predicted well above chance level, except for Simultaneous Start, which cannot be predicted at all. However, feedback tokens before Simultaneous Starts were found to be more similar to feedback continuers than to turn-initial feedback tokens, which may be explained as inappropriate floor-stealing attempts by the feedback-producing speaker. An analysis based on the prototypical spectrograms closely follows the results for Bad News (Dispreference) vs Good News (News receiving) found in Freese and Maynard (1998), although the definitions differ slightly.
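As a rough illustration of the classification step, the following is a minimal sketch of linear discriminant analysis with a shared (pooled) covariance matrix, written for two-dimensional features in plain Python. It is not the implementation used in the study: the function names `fit_lda` and `predict_lda`, the two toy features, and the data points are all hypothetical stand-ins for the TVCQCC-based and contextual features described above. Only the three attitude labels are taken from the text.

```python
import math
from collections import defaultdict

def fit_lda(X, y):
    """Fit LDA on 2-D points: class means, priors, inverse pooled covariance."""
    groups = defaultdict(list)
    for point, label in zip(X, y):
        groups[label].append(point)
    n, k = len(X), len(groups)
    means, priors = {}, {}
    sxx = sxy = syy = 0.0
    for label, pts in groups.items():
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        means[label] = (mx, my)
        priors[label] = len(pts) / n
        # Accumulate within-class scatter around each class mean.
        for px, py in pts:
            sxx += (px - mx) ** 2
            sxy += (px - mx) * (py - my)
            syy += (py - my) ** 2
    dof = n - k  # pooled covariance uses n - k degrees of freedom
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return means, priors, inv

def predict_lda(model, x):
    """Return the label with the highest linear discriminant score."""
    means, priors, inv = model

    def score(label):
        mx, my = means[label]
        # w = inverse covariance times class mean
        wx = inv[0][0] * mx + inv[0][1] * my
        wy = inv[1][0] * mx + inv[1][1] * my
        # x.w - 0.5 * mean.w + log prior
        return (x[0] * wx + x[1] * wy
                - 0.5 * (mx * wx + my * wy)
                + math.log(priors[label]))

    return max(means, key=score)

# Hypothetical toy data: two made-up features per feedback token.
X = [(0, 0), (1, 0), (0, 1), (1, 1),
     (5, 5), (6, 5), (5, 6), (6, 6),
     (0, 5), (1, 5), (0, 6), (1, 6)]
y = ["general feedback"] * 4 + ["news receiving"] * 4 + ["dispreference"] * 4

model = fit_lda(X, y)
print(predict_lda(model, (0.4, 0.3)))
print(predict_lda(model, (5.2, 5.9)))
print(predict_lda(model, (0.3, 5.6)))
```

Because the covariance is shared across classes, the decision boundaries are linear, so the discriminant directions can be inspected directly. Real features would have many more dimensions and would need a general matrix inverse rather than the closed-form 2x2 inverse used here.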
Fonetik 2010. Lund, Sweden. June 2–4, 2010