Talking Heads - Models and Applications for Multimodal Speech Synthesis
Beskow, Jonas. KTH, Former Departments, Speech, Music and Hearing. ORCID iD: 0000-0003-1399-6604
2003 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis presents work in the area of computer-animated talking heads. A system for multimodal speech synthesis has been developed, capable of generating audiovisual speech animations from arbitrary text, using parametrically controlled 3D models of the face and head. A speech-specific direct parameterisation of the movement of the visible articulators (lips, tongue and jaw) is suggested, along with a flexible scheme for parameterising facial surface deformations based on well-defined articulatory targets.

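To make the parameterisation idea concrete, here is a minimal sketch of how articulatory parameters can drive facial surface deformation as weighted vertex displacements. The tiny mesh, parameter names and displacement fields are invented for illustration; the thesis's actual model is considerably richer.

```python
import numpy as np

# Minimal sketch of parametric facial deformation (illustrative only).
# Each articulatory parameter (e.g. jaw opening, lip rounding) is mapped to a
# weighted displacement of mesh vertices towards a target shape.

def deform(neutral_verts, deformations, params):
    """neutral_verts: (N, 3) neutral face mesh.
    deformations: dict param_name -> (N, 3) displacement field at full activation.
    params: dict param_name -> activation in [0, 1]."""
    verts = neutral_verts.copy()
    for name, value in params.items():
        verts += value * deformations[name]
    return verts

# Example with a hypothetical 3-vertex "mesh" and two hypothetical parameters.
neutral = np.zeros((3, 3))
defs = {
    "jaw_opening":  np.array([[0, -1.0, 0], [0, -0.5, 0], [0, 0, 0]]),
    "lip_rounding": np.array([[0.2, 0, 0.1], [0, 0, 0], [0, 0, 0]]),
}
print(deform(neutral, defs, {"jaw_opening": 0.6, "lip_rounding": 0.3}))
```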
To improve the realism and validity of facial and intra-oral speech movements, measurements from real speakers have been incorporated from several types of static and dynamic data sources. These include ultrasound measurements of tongue surface shape, dynamic optical motion tracking of face points in 3D, as well as electromagnetic articulography (EMA) providing dynamic tongue movement data in 2D. Ultrasound data are used to estimate target configurations for a complex tongue model for a number of sustained articulations. Simultaneous optical and electromagnetic measurements are performed and the data are used to resynthesise facial and intra-oral articulation in the model. A robust resynthesis procedure, capable of animating facial geometries that differ in shape from the measured subject, is described.

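The following sketch illustrates the kind of per-frame error minimisation a resynthesis procedure of this type can use: articulatory parameters are fitted so that model marker positions match measured 3D points. The linear marker model and random data below are stand-ins, not the thesis's actual face model.

```python
import numpy as np
from scipy.optimize import least_squares

# Sketch of frame-by-frame resynthesis: find the articulatory parameter vector
# that minimises the distance between model marker positions and measured
# 3D points. The linear marker model is a stand-in for the real face model.

def model_markers(params, basis, neutral):
    # Hypothetical linear model: markers = neutral + basis @ params
    return neutral + basis @ params

def fit_frame(measured, basis, neutral, p0):
    res = least_squares(
        lambda p: (model_markers(p, basis, neutral) - measured).ravel(),
        p0, bounds=(0.0, 1.0))
    return res.x

rng = np.random.default_rng(0)
neutral = rng.normal(size=(10, 3))      # 10 marker positions at rest
basis = rng.normal(size=(10, 3, 4))     # effect of 4 parameters on each marker
true_p = np.array([0.2, 0.7, 0.1, 0.5])
measured = model_markers(true_p, basis, neutral) + 0.01 * rng.normal(size=(10, 3))
print(fit_frame(measured, basis, neutral, p0=np.full(4, 0.5)))
```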
To drive articulation from symbolic (phonetic) input, forexample in the context of a text-to-speech system, bothrule-based and data-driven articulatory control models havebeen developed. The rule-based model effectively handlesforward and backward coarticulation by targetunder-specification, while the data-driven model uses ANNs toestimate articulatory parameter trajectories, trained ontrajectories resynthesised from optical measurements. Thearticulatory control models are evaluated and compared againstother data-driven models trained on the same data. Experimentswith ANNs for driving the articulation of a talking headdirectly from acoustic speech input are also reported.

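A minimal sketch of coarticulation by target under-specification, the mechanism used in the rule-based model: parameters left unspecified for a phone take their values from interpolation between the nearest specified targets on either side. The phone sequence and target values are illustrative only.

```python
# Sketch of coarticulation by target under-specification: a parameter left
# unspecified (None) for a phone is interpolated between the nearest
# specified targets. Target values below are illustrative.

def fill_targets(track):
    """track: list of values or None, one per phone, for a single parameter."""
    out = list(track)
    known = [i for i, v in enumerate(out) if v is not None]
    for i, v in enumerate(out):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = out[right]
        elif right is None:
            out[i] = out[left]
        else:
            w = (i - left) / (right - left)
            out[i] = (1 - w) * out[left] + w * out[right]
    return out

# Lip rounding specified only for the vowels of a hypothetical 4-phone sequence:
print(fill_targets([None, 0.9, None, 0.1]))   # -> [0.9, 0.9, 0.5, 0.1]
```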
A flexible strategy for generation of non-verbal facial gestures is presented. It is based on a gesture library organised by communicative function, where each function has multiple alternative realisations. The gestures can be used to signal e.g. turn-taking, back-channelling and prominence when the talking head is employed as output channel in a spoken dialogue system. A device-independent XML-based formalism for non-verbal and verbal output in multimodal dialogue systems is proposed, and it is described how the output specification is interpreted in the context of a talking head and converted into facial animation using the gesture library.

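The sketch below illustrates the general idea of such an interpretation step: communicative functions marked up in a device-independent specification are mapped to one of several alternative gesture realisations from a library. The element names, attribute names and gesture labels are invented for illustration and do not reproduce the thesis's actual formalism.

```python
import random
import xml.etree.ElementTree as ET

# Sketch of interpreting a device-independent output specification: communicative
# functions in the mark-up are looked up in a gesture library that offers several
# alternative realisations per function. All names below are hypothetical.

GESTURE_LIBRARY = {
    "prominence":   ["eyebrow_raise", "head_nod_small"],
    "turn_giving":  ["gaze_at_user", "eyebrow_raise"],
    "back_channel": ["head_nod_small", "smile_brief"],
}

spec = """<utterance>
  <speech>The next train <function name="prominence">leaves at noon</function>.</speech>
  <function name="turn_giving"/>
</utterance>"""

def interpret(xml_text):
    root = ET.fromstring(xml_text)
    plan = []
    for elem in root.iter("function"):
        name = elem.get("name")
        plan.append((name, random.choice(GESTURE_LIBRARY[name])))
    return plan

print(interpret(spec))
```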
Through a series of audiovisual perceptual experiments with noise-degraded audio, it is demonstrated that the animated talking head provides significantly increased intelligibility over the audio-only case, in some cases not significantly below that provided by a natural face.

Finally, several projects and applications are presented, where the described talking head technology has been successfully employed. Four different multimodal spoken dialogue systems are outlined, and the role of the talking heads in each of the systems is discussed. A telecommunication application where the talking head functions as an aid for hearing-impaired users is also described, as well as a speech training application where talking heads and language technology are used with the purpose of improving speech production in profoundly deaf children.

Place, publisher, year, edition, pages
Institutionen för talöverföring och musikakustik, 2003, p. viii, 63
Series
Trita-TMH ; 2003:7
Keywords [en]
Talking heads, facial animation, speech synthesis, coarticulation, intelligibility, embodied conversational agents
Identifiers
URN: urn:nbn:se:kth:diva-3561. ISBN: 91-7283-536-2 (printed). OAI: oai:DiVA.org:kth-3561. DiVA, id: diva2:9380
Public defence
2003-06-11, 00:00 (English)
Note
QC 20100506. Available from: 2003-06-26. Created: 2003-06-26. Last updated: 2010-05-11. Bibliographically approved
List of papers
1. RULE-BASED VISUAL SPEECH SYNTHESIS
1995 (English). In: Proceedings of the 4th European Conference on Speech Communication and Technology, Madrid, Spain, 1995, p. 299-302. Conference paper, published paper (Other academic)
Abstract [en]

A system for rule-based audiovisual text-to-speech synthesis has been created. The system is based on the KTH text-to-speech system, which has been complemented with a three-dimensional parameterized model of a human face. The face can be animated in real time, synchronized with the auditory speech. The facial model is controlled by the same synthesis software as the auditory speech synthesizer. A set of rules that takes coarticulation into account has been developed. The audiovisual text-to-speech system has also been incorporated into a spoken man-machine dialogue system that is being developed at the department.

Place, publisher, year, edition, pages
Madrid, Spain, 1995
Identifiers
urn:nbn:se:kth:diva-12693 (URN)
Conference
EUROSPEECH '95. 4th European Conference on Speech Communication and Technology
Note
QC 20100506. Available from: 2010-05-06. Created: 2010-05-06. Last updated: 2010-05-11. Bibliographically approved
2. ANIMATION OF TALKING AGENTS
1997 (English). In: Proceedings of International Conference on Auditory-Visual Speech Processing / [ed] Benoît, C. & Campbell, R., Rhodes, Greece, 1997, p. 149-152. Conference paper, published paper (Other academic)
Abstract [en]

It is envisioned that autonomous software agents that can communicate using speech and gesture will soon be on everybody's computer screen. This paper describes an architecture that can be used to design and animate characters capable of lip-synchronised synthetic speech as well as body gestures, for use in for example spoken dialogue systems. A general scheme for computationally efficient parametric deformation of facial surfaces is presented, as well as techniques for generation of bimodal speech, facial expressions and body gestures in a spoken dialogue system. Results indicating that an animated cartoon-like character can be a significant contribution to speech intelligibility are also reported.

Place, publisher, year, edition, pages
Rhodes, Greece, 1997
National subject category
Natural Sciences
Identifiers
urn:nbn:se:kth:diva-12709 (URN)
Conference
International Conference on Auditory-Visual Speech Processing
Note
QC 20100507. Available from: 2010-05-07. Created: 2010-05-07. Last updated: 2010-05-11. Bibliographically approved
3. RECENT DEVELOPMENTS IN FACIAL ANIMATION: AN INSIDE VIEW
1998 (English). In: Proceedings of International Conference on Auditory-Visual Speech Processing / [ed] Burnham, D., Robert-Ribes, J. & Vatikiotis-Bateson, E., 1998, p. 201-206. Conference paper, published paper (Other academic)
Abstract [en]

We report on our recent facial animation work to improve the realism and accuracy of visual speech synthesis. The general approach is to use both static and dynamic observations of natural speech to guide the facial modeling. One current goal is to model the internal articulators: a highly realistic palate, teeth, and an improved tongue. Because our talking head can be made transparent, we can provide an anatomically valid and pedagogically useful display that can be used in speech training of children with hearing loss [1]. High-resolution models of palate and teeth [2] were reduced to a relatively small number of polygons for real-time animation [3]. For the improved tongue, we are using 3D ultrasound data and electropalatography (EPG) [4] with error minimization algorithms to educate our parametric B-spline based tongue model to simulate realistic speech. In addition, a high-speed algorithm has been developed for detection and correction of collisions, to prevent the tongue from protruding through the palate and teeth, and to enable the real-time display of synthetic EPG patterns.

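As an illustration of the collision-handling idea, here is a much simplified sketch in which tongue surface points that rise above a palate surface are projected back beneath it. The height-field palate and the margin value are assumptions; the paper's high-speed algorithm is not reproduced here.

```python
import numpy as np

# Sketch of a simple collision correction: tongue surface points that end up
# above the palate surface are pushed back down onto it. The height-field
# palate representation is a simplification of the actual algorithm.

def palate_height(x, y):
    # Hypothetical palate surface as a height field z = f(x, y).
    return 1.0 - 0.5 * (x ** 2 + y ** 2)

def correct_collisions(tongue_points, margin=0.01):
    corrected = tongue_points.copy()
    for i, (x, y, z) in enumerate(tongue_points):
        ceiling = palate_height(x, y) - margin
        if z > ceiling:
            corrected[i, 2] = ceiling   # push the point back below the palate
    return corrected

tongue = np.array([[0.0, 0.0, 1.2],    # protrudes through the palate
                   [0.3, 0.1, 0.4]])   # no collision
print(correct_collisions(tongue))
```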
Identifiers
urn:nbn:se:kth:diva-12711 (URN)
Conference
International Conference on Auditory-Visual Speech Processing
Note
QC 20100507. Available from: 2010-05-07. Created: 2010-05-07. Last updated: 2010-05-11. Bibliographically approved
4. Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks
1999 (English). In: Proceedings of International Conference on Auditory-Visual Speech Processing / [ed] Massaro, Dominic W., 1999, p. 133-138. Conference paper, published paper (Other academic)
Abstract [en]

This paper presents an initial implementation and evaluation of a system that synthesizes visual speech directly from the acoustic waveform. An artificial neural network (ANN) was trained to map the cepstral coefficients of an individual's natural speech to the control parameters of an animated synthetic talking head. We trained on two data sets; one was a set of 400 words spoken in isolation by a single speaker and the other a subset of extemporaneous speech from 10 different speakers. The system showed learning in both cases. A perceptual evaluation test indicated that the system's generalization to new words by the same speaker provides significant visible information, but significantly below that given by a text-to-speech algorithm.

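A minimal sketch of the audio-to-visual mapping described above: a window of cepstral frames is fed to a feed-forward network that predicts facial control parameters for the centre frame. The network size, window length and random stand-in data are assumptions, not the paper's actual configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Sketch of audio-to-visual synthesis: map a context window of cepstral
# coefficients to facial control parameters. Data here are random stand-ins.

rng = np.random.default_rng(0)
n_frames, n_cepstra, n_params, window = 2000, 13, 6, 5

cepstra = rng.normal(size=(n_frames, n_cepstra))
params = rng.normal(size=(n_frames, n_params))   # would be measured/resynthesised

# Stack a context window of frames around each centre frame.
half = window // 2
X = np.stack([cepstra[i - half:i + half + 1].ravel()
              for i in range(half, n_frames - half)])
y = params[half:n_frames - half]

net = MLPRegressor(hidden_layer_sizes=(100,), max_iter=200)
net.fit(X[:1500], y[:1500])
print("test score:", net.score(X[1500:], y[1500:]))
```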
Identifiers
urn:nbn:se:kth:diva-12710 (URN)
Conference
International Conference on Auditory-Visual Speech Processing
Note
QC 20100507. Available from: 2010-05-07. Created: 2010-05-07. Last updated: 2010-05-11. Bibliographically approved
5. A Model for Multimodal Dialogue System Output Applied to an Animated Talking Head
2005 (English). In: SPOKEN MULTIMODAL HUMAN-COMPUTER DIALOGUE IN MOBILE ENVIRONMENTS / [ed] Minker, Wolfgang; Bühler, Dirk; Dybkjær, Laila, Dordrecht: Springer, 2005, p. 93-113. Chapter in book, part of anthology (Refereed)
Abstract [en]

We present a formalism for specifying verbal and non-verbal output from a multimodal dialogue system. The output specification is XML-based and provides information about communicative functions of the output, without detailing the realisation of these functions. The aim is to let dialogue systems generate the same output for a wide variety of output devices and modalities. The formalism was developed and implemented in the multimodal spoken dialogue system AdApt. We also describe how facial gestures in the 3D-animated talking head used within this system are controlled through the formalism.

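To illustrate the device-independence aim, the sketch below renders the same communicative-function mark-up on two hypothetical output devices: a talking head (as gestures) and a text-only display (as emphasis). The function names and renderers are invented for illustration and are not taken from the AdApt system.

```python
# Sketch of device-independent rendering: the same communicative-function
# mark-up is realised differently depending on the output device.
# Function names and renderers below are hypothetical.

def render_talking_head(text, functions):
    # Map functions to facial gestures (cf. the gesture-library sketch earlier).
    gestures = {"prominence": "eyebrow_raise", "turn_giving": "gaze_at_user"}
    return {"speech": text, "gestures": [gestures[f] for f in functions]}

def render_text_only(text, functions):
    # A text-only device might realise prominence as emphasis markup instead.
    if "prominence" in functions:
        text = "*" + text + "*"
    return {"text": text}

for renderer in (render_talking_head, render_text_only):
    print(renderer("leaves at noon", ["prominence"]))
```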
Place, publisher, year, edition, pages
Dordrecht: Springer, 2005
Series
Text Speech and Language Technology, ISSN 1386-291X ; 28
Keywords
GESOM, AdApt, Standards, XML, 3D-animation, Gesture, Turn-taking, Lip synchronisation
National subject category
Computer and Information Sciences; General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-12751 (URN); 10.1007/1-4020-3075-4_6 (DOI); 000270447900008 (); 978-1-4020-3075-8 (ISBN)
Note
QC 20100510. ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee, Germany, 2002. Available from: 2010-05-10. Created: 2010-05-10. Last updated: 2018-01-12. Bibliographically approved
6. Evaluation of a Multilingual Synthetic Talking Face as a Communication Aid for the Hearing Impaired
2003 (English). In: Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS'03), Barcelona, Spain, 2003, p. 131-134. Conference paper, published paper (Other academic)
Place, publisher, year, edition, pages
Barcelona, Spain, 2003
Identifiers
urn:nbn:se:kth:diva-12769 (URN); 1-876346-48-5 (ISBN)
Conference
15th International Congress of Phonetic Sciences (ICPhS'03)
Note
QC 20100510. Available from: 2010-05-10. Created: 2010-05-10. Last updated: 2010-05-11. Bibliographically approved
7. Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements
2003 (English). In: Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS'03), Adelaide: Causal Productions, 2003. Conference paper, published paper (Other academic)
Abstract [en]

Simultaneous measurements of tongue and facial motion, using a combination of electromagnetic articulography (EMA) and optical motion tracking, are analysed to improve the articulation of an animated talking head and to investigate the correlation between facial and vocal tract movement. The recorded material consists of VCV and CVC words and 270 short everyday sentences spoken by one Swedish subject. The recorded articulatory movements are re-synthesised by a parametrically controlled 3D model of the face and tongue, using a procedure involving minimisation of the error between measurement and model. Using linear estimators, tongue data is predicted from the face and vice versa, and the correlation between measurement and prediction is computed.

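The following sketch shows the form such a linear estimation can take: an ordinary least-squares estimator predicts tongue channels from facial motion parameters, and the correlation between prediction and measurement is computed per channel. The data are random stand-ins for the recorded corpus, and the channel counts are assumptions.

```python
import numpy as np

# Sketch of linear estimation: predict tongue (EMA) channels from facial
# motion parameters with least squares, then report per-channel correlation
# between prediction and measurement. Data are synthetic stand-ins.

rng = np.random.default_rng(0)
n_frames, n_face, n_tongue = 5000, 12, 6
face = rng.normal(size=(n_frames, n_face))
mixing = rng.normal(size=(n_face, n_tongue))
tongue = face @ mixing + 0.3 * rng.normal(size=(n_frames, n_tongue))

train, test = slice(0, 4000), slice(4000, None)
# Least-squares linear estimator with a bias term.
A = np.hstack([face, np.ones((n_frames, 1))])
W, *_ = np.linalg.lstsq(A[train], tongue[train], rcond=None)
pred = A[test] @ W

for ch in range(n_tongue):
    r = np.corrcoef(pred[:, ch], tongue[test, ch])[0, 1]
    print(f"tongue channel {ch}: r = {r:.2f}")
```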
Place, publisher, year, edition, pages
Adelaide: Causal Productions, 2003
National subject category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-12798 (URN); 1-876346-49-3 (ISBN)
Conference
15th International Congress of Phonetic Sciences (ICPhS'03)
Note
QC 20100511. Available from: 2010-05-11. Created: 2010-05-11. Last updated: 2010-05-11. Bibliographically approved
8. Trainable articulatory control models for visual speech synthesis
2004 (English). In: International Journal of Speech Technology, ISSN 1381-2416, E-ISSN 1572-8110, Vol. 7, no. 4, p. 335-349. Article in journal (Refereed), Published
Abstract [en]

This paper deals with the problem of modelling the dynamics of articulation for a parameterised talking head based on phonetic input. Four different models are implemented and trained to reproduce the articulatory patterns of a real speaker, based on a corpus of optical measurements. Two of the models (“Cohen-Massaro” and “Öhman”) are based on coarticulation models from speech production theory and two are based on artificial neural networks, one of which is specially intended for streaming real-time applications. The different models are evaluated through comparison between predicted and measured trajectories, which shows that the Cohen-Massaro model produces the trajectories that best match the measurements. A perceptual intelligibility experiment is also carried out, where the four data-driven models are compared against a rule-based model as well as an audio-alone condition. Results show that all models give significantly increased speech intelligibility over the audio-alone case, with the rule-based model yielding the highest intelligibility score.

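A small sketch of the objective part of such an evaluation: each model's predicted parameter trajectory is compared against the measured one using RMSE and correlation. The trajectories and the two model entries below are synthetic stand-ins, not results from the paper.

```python
import numpy as np

# Sketch of objective trajectory evaluation: compare predicted articulatory
# parameter trajectories against a measured one using RMSE and correlation.
# Trajectories are synthetic stand-ins.

def rmse(pred, meas):
    return float(np.sqrt(np.mean((pred - meas) ** 2)))

def correlation(pred, meas):
    return float(np.corrcoef(pred, meas)[0, 1])

t = np.linspace(0, 2 * np.pi, 500)
measured = np.sin(t)
models = {
    "cohen_massaro": np.sin(t - 0.05),   # small phase error
    "ann":           np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size),
}
for name, pred in models.items():
    print(f"{name}: RMSE={rmse(pred, measured):.3f}  r={correlation(pred, measured):.3f}")
```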
Place, publisher, year, edition, pages
Boston: Kluwer Academic Publishers, 2004
Keywords
speech synthesis, facial animation, coarticulation, artificial neural networks, perceptual evaluation
Identifiers
urn:nbn:se:kth:diva-12803 (URN); 2-s2.0-4143072802 (Scopus ID)
Note
QC 20100511. Available from: 2010-05-11. Created: 2010-05-11. Last updated: 2017-12-12. Bibliographically approved

Open Access in DiVA

fulltext (FULLTEXT01.pdf, 2361 kB, application/pdf)
