kth.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Audiovisual-to-articulatory inversion
KTH, School of Computer Science and Communication (CSC), Computer Vision and Active Perception, CVAP.ORCID iD: 0000-0002-5750-9655
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.ORCID iD: 0000-0003-4532-014X
2009 (English)In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 51, no 3, p. 195-209Article in journal (Refereed) Published
Abstract [en]

It has been shown that acoustic-to-articulatory inversion, i.e. estimation of the articulatory configuration from the corresponding acoustic signal, can be greatly improved by adding visual features extracted from the speaker's face. In order to make the inversion method usable in a realistic application, these features should be possible to obtain from a monocular frontal face video, where the speaker is not required to wear any special markers. In this study, we investigate the importance of visual cues for inversion. Experiments with motion capture data of the face show that important articulatory information can be extracted using only a few face measures that mimic the information that could be gained from a video-based method. We also show that the depth cue for these measures is not critical, which means that the relevant information can be extracted from a frontal video. A real video-based face feature extraction method is further presented, leading to similar improvements in inversion quality. Rather than tracking points on the face, it represents the appearance of the mouth area using independent component images. These findings are important for applications that need a simple audiovisual-to-articulatory inversion technique, e.g. articulatory phonetics training for second language learners or hearing-impaired persons.

Place, publisher, year, edition, pages
2009. Vol. 51, no 3, p. 195-209
Keywords [en]
Speech inversion, Articulatory inversion, Computer vision, speech recognition
Identifiers
URN: urn:nbn:se:kth:diva-18162DOI: 10.1016/j.specom.2008.07.005ISI: 000263203900001Scopus ID: 2-s2.0-58149191558OAI: oai:DiVA.org:kth-18162DiVA, id: diva2:336208
Note
QC 20100525Available from: 2010-08-05 Created: 2010-08-05 Last updated: 2022-06-25Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full textScopus

Authority records

Kjellström, Hedvig

Search in DiVA

By author/editor
Kjellström, HedvigEngwall, Olov
By organisation
Computer Vision and Active Perception, CVAPSpeech Communication and Technology
In the same journal
Speech Communication

Search outside of DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 605 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf