Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0002-7801-7617
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0003-1399-6604
2014 (English). In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 2, p. 607-618. Article in journal (Refereed), Published
Abstract [en]

In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar similar in shape and appearance to the original talker, and used an error minimization procedure to drive the animated version of the talker so that it matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio, we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three production conditions. We found that the visual intelligibility of the animated talker was on par with that of the real talker for the Lombard and whisper conditions. In addition, we created two incongruent conditions in which normal speech audio was paired with animated Lombard speech or whispering. Compared to the congruent normal speech condition, Lombard animation yielded a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations and found that some degree of incongruence was generally accepted.

Place, publisher, year, edition, pages
2014. Vol. 28, no 2, p. 607-618
Keywords [en]
Lombard effect, Motion capture, Speech-reading, Lip-reading, Facial animation, Audio-visual intelligibility
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-141052; DOI: 10.1016/j.csl.2013.02.005; ISI: 000329415400017; Scopus ID: 2-s2.0-84890567121; OAI: oai:DiVA.org:kth-141052; DiVA, id: diva2:695710
Funder
Swedish Research Council, VR 2010-4646
Note

QC 20140212

Available from: 2014-02-12. Created: 2014-02-07. Last updated: 2025-02-07. Bibliographically approved
In thesis
1. Performance, Processing and Perception of Communicative Motion for Avatars and Agents
2017 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Artificial agents and avatars are designed with a large variety of face and body configurations. Some of these (such as virtual characters in films) may be highly realistic and human-like, while others (such as social robots) have considerably more limited expressive means. In both cases, human motion serves as the model and inspiration for the non-verbal behavior displayed. This thesis focuses on increasing the expressive capacities of artificial agents and avatars using two main strategies: 1) improving the automatic capture of the areas most important for human communication, namely the face and the fingers, and 2) increasing communication clarity by proposing novel ways of eliciting clear and readable non-verbal behavior.

The first part of the thesis covers automatic methods for capturing and processing motion data. In paper A, we propose a novel dual sensor method for capturing hands and fingers using optical motion capture in combination with low-cost instrumented gloves. The approach circumvents the main problems with marker-based systems and glove-based systems, and it is demonstrated and evaluated on a key-word signing avatar. In paper B, we propose a robust method for automatic labeling of sparse, non-rigid motion capture marker sets, and we evaluate it on a variety of marker configurations for finger and facial capture. In paper C, we propose an automatic method for annotating hand gestures using Hierarchical Hidden Markov Models (HHMMs).

The second part of the thesis covers studies on creating and evaluating multimodal databases with clear and exaggerated motion. The main idea is that this type of motion is appropriate for agents in certain communicative situations (such as noisy environments) or for agents with reduced expressive degrees of freedom (such as humanoid robots). In paper D, we record motion capture data for a virtual talking head with variable articulation style (from normal to over-articulated). In paper E, we use techniques from mime acting to generate clear non-verbal expressions custom-tailored for three agent embodiments (face-and-body, face-only and body-only).

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2017. p. 73
Series
TRITA-CSC-A, ISSN 1653-5723 ; 24
National Category
Computer and Information Sciences
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-218272 (URN); 978-91-7729-608-9 (ISBN)
Public defence
2017-12-15, F3, Lindstedtsvägen 26, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20171127

Available from: 2017-11-27. Created: 2017-11-24. Last updated: 2022-06-26. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Alexanderson, Simon; Beskow, Jonas
