Predicting Speaker Changes and Listener Responses With and Without Eye-contact
2011 (English). In: INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011, pp. 1576-1579. Conference paper (Refereed).
This paper compares turn-taking, in terms of timing and prediction, in human-human conversations under two conditions: when participants have eye-contact and when they do not, as found in the HCRC Map Task corpus. By measuring between-speaker intervals, we found that a larger proportion of speaker shifts occurred in overlap in the no-eye-contact condition. For prediction we used prosodic and spectral features parameterized by time-varying, length-invariant discrete cosine coefficients. With Gaussian mixture modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD) at the end of an utterance (EOU), with a pause lag of 200 ms. The label SC was further split into listener responses (LRs, e.g. back-channels) and other TURN-SHIFTs. Prediction was found to be somewhat easier in the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURN-SHIFT, LR, and SC, respectively.
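The length-invariant discrete cosine parameterization mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, coefficient count, and the choice of `scipy`'s DCT-II are assumptions. The key property shown is that contours of different durations map to fixed-size coefficient vectors, which makes them usable as inputs to a Gaussian mixture classifier.

```python
import numpy as np
from scipy.fftpack import dct

def dct_parameterize(contour, n_coeffs=4):
    """Map a variable-length feature contour (e.g. an F0 or
    energy track over an utterance-final region) to a fixed-size
    vector of its first n_coeffs DCT-II coefficients.

    Because the DCT basis adapts to the input length, the
    resulting vector has the same dimensionality regardless of
    how long the contour is (length invariance).
    """
    x = np.asarray(contour, dtype=float)
    coeffs = dct(x, type=2, norm='ortho')
    return coeffs[:n_coeffs]

# Two contours of different lengths yield same-sized vectors.
short_contour = dct_parameterize(np.linspace(100.0, 120.0, 15))
long_contour = dct_parameterize(np.linspace(100.0, 120.0, 50))
print(short_contour.shape, long_contour.shape)
```

In a setup like the one described, such fixed-size vectors (one per feature stream) would be scored against per-class Gaussian mixture models (e.g. HOLD vs. TURN-SHIFT vs. LR), with the per-stream scores then combined by a fusion scheme.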
Place, publisher, year, edition, pages
Florence, Italy, 2011, pp. 1576-1579.
Computer Science; Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:kth:diva-52195
ISI: 000316502200396
ScopusID: 2-s2.0-84865794088
ISBN: 978-1-61839-270-1
OAI: oai:DiVA.org:kth-52195
DiVA: diva2:465493