Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0002-7801-7617
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0003-1399-6604
2014 (English) In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 2, 607-618 p.
Article in journal (Refereed) Published
Abstract [en]

In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar similar in shape and appearance to the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio, we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition, we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. Compared to the congruent normal speech condition, the Lombard animation yielded a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations and found that some degree of incongruence was generally accepted.

Place, publisher, year, edition, pages
2014. Vol. 28, no 2, 607-618 p.
Keyword [en]
Lombard effect, Motion capture, Speech-reading, Lip-reading, Facial animation, Audio-visual intelligibility
National Category
Language Technology (Computational Linguistics)
URN: urn:nbn:se:kth:diva-141052
DOI: 10.1016/j.csl.2013.02.005
ISI: 000329415400017
ScopusID: 2-s2.0-84890567121
OAI: diva2:695710
Swedish Research Council, VR 2010-4646

QC 20140212

Available from: 2014-02-12 Created: 2014-02-07 Last updated: 2014-02-12 Bibliographically approved

By author/editor
Alexanderson, Simon; Beskow, Jonas