Trainable articulatory control models for visual speech synthesis
KTH, Former Departments, Speech, Music and Hearing (Tal, musik och hörsel). ORCID iD: 0000-0003-1399-6604
2004 (English). In: International Journal of Speech Technology, ISSN 1381-2416, E-ISSN 1572-8110, Vol. 7, no. 4, pp. 335-349. Journal article (refereed), published
Abstract [en]

This paper deals with the problem of modelling the dynamics of articulation for a parameterised talking head based on phonetic input. Four different models are implemented and trained to reproduce the articulatory patterns of a real speaker, based on a corpus of optical measurements. Two of the models (“Cohen-Massaro” and “Öhman”) are based on coarticulation models from speech production theory and two are based on artificial neural networks, one of which is specially intended for streaming real-time applications. The different models are evaluated through comparison between predicted and measured trajectories, which shows that the Cohen-Massaro model produces trajectories that best match the measurements. A perceptual intelligibility experiment is also carried out, where the four data-driven models are compared against a rule-based model as well as an audio-alone condition. Results show that all models give significantly increased speech intelligibility over the audio-alone case, with the rule-based model yielding the highest intelligibility score.

Place, publisher, year, edition, pages
Boston: Kluwer Academic Publishers, 2004. Vol. 7, no. 4, pp. 335-349
Keywords [en]
speech synthesis, facial animation, coarticulation, artificial neural networks, perceptual evaluation
Identifiers
URN: urn:nbn:se:kth:diva-12803; Scopus ID: 2-s2.0-4143072802; OAI: oai:DiVA.org:kth-12803; DiVA, id: diva2:318898
Note
QC 20100511. Available from: 2010-05-11. Created: 2010-05-11. Last updated: 2017-12-12. Bibliographically approved.
Part of thesis
1. Talking Heads - Models and Applications for Multimodal Speech Synthesis
2003 (English). Doctoral thesis, comprehensive summary (Other scientific)
Abstract [en]

This thesis presents work in the area of computer-animated talking heads. A system for multimodal speech synthesis has been developed, capable of generating audiovisual speech animations from arbitrary text, using parametrically controlled 3D models of the face and head. A speech-specific direct parameterisation of the movement of the visible articulators (lips, tongue and jaw) is suggested, along with a flexible scheme for parameterising facial surface deformations based on well-defined articulatory targets.

To improve the realism and validity of facial and intra-oral speech movements, measurements from real speakers have been incorporated from several types of static and dynamic data sources. These include ultrasound measurements of tongue surface shape, dynamic optical motion tracking of face points in 3D, as well as electromagnetic articulography (EMA) providing dynamic tongue movement data in 2D. Ultrasound data are used to estimate target configurations for a complex tongue model for a number of sustained articulations. Simultaneous optical and electromagnetic measurements are performed and the data are used to resynthesise facial and intra-oral articulation in the model. A robust resynthesis procedure, capable of animating facial geometries that differ in shape from the measured subject, is described.

To drive articulation from symbolic (phonetic) input, for example in the context of a text-to-speech system, both rule-based and data-driven articulatory control models have been developed. The rule-based model effectively handles forward and backward coarticulation by target under-specification, while the data-driven model uses ANNs to estimate articulatory parameter trajectories, trained on trajectories resynthesised from optical measurements. The articulatory control models are evaluated and compared against other data-driven models trained on the same data. Experiments with ANNs for driving the articulation of a talking head directly from acoustic speech input are also reported.

A flexible strategy for generation of non-verbal facial gestures is presented. It is based on a gesture library organised by communicative function, where each function has multiple alternative realisations. The gestures can be used to signal e.g. turn-taking, back-channelling and prominence when the talking head is employed as output channel in a spoken dialogue system. A device-independent XML-based formalism for non-verbal and verbal output in multimodal dialogue systems is proposed, and it is described how the output specification is interpreted in the context of a talking head and converted into facial animation using the gesture library.

Through a series of audiovisual perceptual experiments with noise-degraded audio, it is demonstrated that the animated talking head provides significantly increased intelligibility over the audio-only case, in some cases not significantly below that provided by a natural face.

Finally, several projects and applications are presented, where the described talking head technology has been successfully employed. Four different multimodal spoken dialogue systems are outlined, and the role of the talking heads in each of the systems is discussed. A telecommunication application where the talking head functions as an aid for hearing-impaired users is also described, as well as a speech training application where talking heads and language technology are used with the purpose of improving speech production in profoundly deaf children.

Place, publisher, year, edition, pages
Institutionen för talöverföring och musikakustik, 2003. pp. viii, 63
Series
Trita-TMH ; 2003:7
Keywords
Talking heads, facial animation, speech synthesis, coarticulation, intelligibility, embodied conversational agents
Identifiers
urn:nbn:se:kth:diva-3561 (URN); 91-7283-536-2 (ISBN)
Public defence
2003-06-11, 00:00 (English)
Note
QC 20100506. Available from: 2003-06-26. Created: 2003-06-26. Last updated: 2010-05-11. Bibliographically approved.

Author (person record)

Beskow, Jonas