Bringing the avatar to life: Studies and developments in facial communication for virtual agents and robots
Al Moubayed, Samer (KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing (TMH), Speech Communication)
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The work presented in this thesis pursues the ultimate goal of building spoken and embodied human-like interfaces that are able to interact with humans on human terms. Such interfaces need to employ the subtle, rich and multidimensional signals of communicative and social value that complement the stream of words – the signals humans typically use when interacting with each other.

The studies presented in the thesis concern facial signals used in spoken communication and can be divided into two connected groups. The first is targeted towards exploring and verifying models of facial signals that occur in synchrony with speech and its intonation. We refer to this as visual prosody, and within visual prosody we take prominence as a case study. We show that the use of prosodically relevant gestures in animated faces results in more expressive and human-like behaviour. We also show that animated faces supported by these gestures produce more intelligible speech, which in turn can aid communication, for example in noisy environments.

The other group of studies targets facial signals that complement speech. Spoken language is a relatively poor system for communicating spatial information, since such information is visual in nature. Hence, the use of visual movements of spatial value, such as gaze and head movements, is important for efficient interaction. The use of such signals is especially important when the interaction between the human and the embodied agent is situated – that is, when they share the same physical space and this space is taken into account in the interaction.

We study the perception, the modelling, and the interaction effects of gaze and head pose in regulating situated and multiparty spoken dialogues in two conditions: the typical case where the animated face is displayed on a flat surface, and the case where it is displayed on a physical three-dimensional model of a face. The results from the studies show that projecting the animated face onto a face-shaped mask results in an accurate perception of the direction of gaze generated by the avatar, and hence allows these movements to be used in multiparty spoken dialogue.

Driven by these findings, the Furhat back-projected robot head is developed. Furhat employs state-of-the-art facial animation that is projected onto a 3D printout of that face, and a neck that allows for head movements. Although the mask in Furhat is static, the fact that the animated face matches the design of the mask results in a physical face that is perceived to “move”.

We present studies that show how this technique renders a more intelligible, human-like and expressive face. We further present experiments in which Furhat is used as a tool to investigate properties of facial signals in situated interaction.

Furhat is built to study, implement, and verify models of situated, multiparty, multimodal human-machine spoken dialogue, a line of study that requires the face to be physically situated in the interaction environment rather than on a two-dimensional screen. It has also received much interest from several communities and has been showcased at several venues, including a robot exhibition at the London Science Museum. We present an evaluation study of Furhat at the exhibition, where it interacted with several thousand people in multiparty conversation. The analysis of the data from this setup further shows that Furhat can accurately regulate multiparty interaction using gaze and head movements.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2012, pp. xxvi, 96
Series
Trita-CSC-A, ISSN 1653-5723 ; 2012:15
Keywords [en]
Avatar, Speech Communication, Facial animation, Nonverbal, Social, Robot, Human-like, Face-to-face, Prosody, Pitch, Prominence, Furhat, Gaze, Head-pose, Dialogue, Interaction, Multimodal, Multiparty
National subject category
Human-Computer Interaction (Interaction Design)
Research subject
SRA - Information and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-105605; ISBN: 978-91-7501-551-4 (print); OAI: oai:DiVA.org:kth-105605; DiVA id: diva2:571532
Public defence
2012-12-07, F3, Lindstedtsvägen 26, KTH, Stockholm, 13:30 (English)
Opponent
Supervisors
Note

QC 20121123

Available from: 2012-11-23 Created: 2012-11-22 Last updated: 2018-01-12 Bibliographically approved
List of papers
1. Auditory visual prominence: From intelligibility to behavior
2009 (English). In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 3, no. 4, pp. 299-309. Article in journal (Refereed) Published
Abstract [en]

Auditory prominence is defined as the property by which an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded, the fundamental frequency is removed from the signal, and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raise gestures synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking and questionnaires with 10 moderately hearing-impaired subjects, the gaze data show that, when gestures are coupled with pitch accents, users look at the face in a fashion similar to how they look at a natural face, as opposed to when the face carries no gestures. The questionnaires also show that these gestures significantly increase the naturalness and the understanding of the talking head.

Keywords
Prominence, Visual prosody, Gesture, ECA, Eye gaze, Head nod, eyebrows
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52119 (URN); 10.1007/s12193-010-0054-0 (DOI); 000208480400005 (); 2-s2.0-78649632880 (Scopus ID)
Research funder
Vetenskapsrådet, 2005-3488
Note

QC 20140926

Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
2. Automatic Prominence Classification in Swedish
2010 (English). In: Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence, Chicago, USA, 2010. Conference paper, Published paper (Refereed)
Abstract [en]

This study aims at automatically classifying levels of acoustic prominence in a dataset of 200 Swedish sentences of read speech by one male native speaker. Each word in the sentences was categorized by four speech experts into one of three groups depending on the level of prominence perceived. Six acoustic features at the syllable level and seven features at the word level were used. Two machine learning algorithms, namely Support Vector Machines (SVM) and Memory-Based Learning (MBL), were trained to classify the sentences into their respective classes. MBL gave an average word-level accuracy of 69.08% and SVM gave an average accuracy of 65.17% on the test set. These values were comparable with the average accuracy of the human annotators with respect to the average annotations. In this study, word duration was found to be the most important feature for classifying prominence in Swedish read speech.
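
To make the classification setup above concrete, here is a minimal sketch of word-level prominence classification with an SVM, assuming a scikit-learn-style workflow; the feature matrix, labels, and train/test split below are synthetic placeholders and not the study's actual features or data.

    # Minimal sketch of word-level prominence classification with an SVM.
    # The study used six syllable-level and seven word-level acoustic features
    # and three perceived prominence levels; the data below is synthetic.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 7))      # placeholder: 7 word-level acoustic features per word
    y = rng.integers(0, 3, size=1000)   # placeholder: prominence level 0, 1, or 2

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = SVC(kernel="rbf")             # Support Vector Machine classifier
    clf.fit(X_train, y_train)
    print("word-level accuracy:", accuracy_score(y_test, clf.predict(X_test)))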

Place, publisher, year, edition, pages
Chicago, USA, 2010
Keywords
Swedish prominence, SVM, MBL, syllable and word level features, word duration
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52120 (URN)
Conference
Speech Prosody 2010, Workshop on Prosodic Prominence, Chicago, USA
Note
tmh_import_11_12_14. QC 20111220. Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
3. Prominence Detection in Swedish Using Syllable Correlates
2010 (English). In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 1784-1787. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents an approach to estimating word-level prominence in Swedish using syllable-level features. The paper discusses the mismatch problem between word-level perceptual prominence annotations and their acoustic correlates, as well as context and data scarcity. 200 sentences are annotated by 4 speech experts with prominence on 3 levels. A linear model for feature extraction is proposed over syllable-level features, and the weights of these features are optimized to match the word-level annotations. We show that using syllable-level features and estimating weights for the acoustic correlates so as to minimize the word-level estimation error gives better detection accuracy than word-level features, and that both feature sets exceed the baseline accuracy.
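
The linear weighting scheme described above can be written out as a brief sketch; the notation is illustrative and not taken from the paper, and the squared-error criterion is assumed as a stand-in for the paper's error measure. Let $\phi_k(w)$ denote the $k$-th syllable-level acoustic feature aggregated over the syllables of word $w$, and let $p_w$ be the expert-annotated word-level prominence. A word-level prominence estimate is then

$$\hat{p}_w = \sum_{k} \beta_k \, \phi_k(w),$$

with the weights chosen to minimize the word-level estimation error against the annotations:

$$\hat{\beta} = \arg\min_{\beta} \sum_{w} \Big( p_w - \sum_{k} \beta_k \, \phi_k(w) \Big)^2.$$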

Place, publisher, year, edition, pages
Makuhari, Japan, 2010
Keywords
Accent, Focus, Prominence, Syllable emphasis
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52160 (URN); 000313086500058 (); 2-s2.0-79959856954 (Scopus ID); 978-1-61782-123-3 (ISBN)
Conference
11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Chiba, Japan, September 26-30, 2010
Note

tmh_import_11_12_14. QC 20111220

Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
4. Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections
2012 (English). In: ACM Transactions on Interactive Intelligent Systems, ISSN 2160-6455, Vol. 1, no. 2, pp. 25-, article id 11. Article in journal (Refereed) Published
Abstract [en]

The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five-subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2012
Keywords
3D projected avatars, Embodied conversational agents, Gaze perception, Mona Lisa gaze effect, Multiparty dialogue, Robot head, Situated interaction
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-106994 (URN); 2-s2.0-84983580854 (Scopus ID)
Note

 QC 20121210

Available from: 2012-12-05 Created: 2012-12-05 Last updated: 2018-01-12 Bibliographically approved
5. Turn-taking Control Using Gaze in Multiparty Human-Computer Dialogue: Effects of 2D and 3D Displays
2011 (English). In: Proceedings of the International Conference on Audio-Visual Speech Processing 2011, Stockholm: KTH Royal Institute of Technology, 2011, pp. 99-102. Conference paper, Published paper (Refereed)
Abstract [en]

In a previous experiment we found that the perception of gaze from an animated agent on a two-dimensional display suffers from the Mona Lisa effect, which means that exclusive mutual gaze cannot be established if there is more than one observer. By using a three-dimensional projection surface, this effect can be eliminated. In this study, we investigate whether this difference also holds for the turn-taking behaviour of subjects interacting with the animated agent in a multi-party dialogue. We present a Wizard-of-Oz experiment where five subjects talk to an animated agent in a route direction dialogue. The results show that the subjects to some extent can infer the intended target of the agent’s questions, in spite of the Mona Lisa effect, but that the accuracy of gaze when it comes to selecting an addressee is still significantly lower in the 2D condition, as compared to the 3D condition. The response time is also significantly longer in the 2D condition, indicating that the inference of intended gaze may require additional cognitive effort.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2011
Series
Proceedings of the International Conference on Audio-Visual Speech Processing, ISSN 1680-8908 ; 2011
Keywords
Turn-taking, Multi-party Dialogue, Gaze, Facial Interaction, Mona Lisa Effect, Facial Projection, Wizard of Oz
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-52205 (URN); 978-91-7501-080-9 (ISBN); 978-91-7501-079-3 (ISBN)
Conference
International Conference on Audio-Visual Speech Processing 2011, Aug 31 - Sep 3 2011, Volterra, Italy
Note
tmh_import_11_12_14. QC 20111222. Available from: 2011-12-14 Created: 2011-12-14 Last updated: 2018-01-12 Bibliographically approved
6. Furhat at Robotville: A Robot Head Harvesting the Thoughts of the Public through Multi-party Dialogue
2012 (English). In: Proceedings of the Workshop on Real-time Conversation with Virtual Agents IVA-RCVA, 2012. Conference paper, Oral presentation only (Refereed)
National subject category
Human-Computer Interaction (Interaction Design)
Identifiers
urn:nbn:se:kth:diva-105608 (URN)
Conference
International Conference on Intelligent Virtual Agents
Research funder
ICT - The Next Generation
Note

QC 20121123

Available from: 2012-11-22 Created: 2012-11-22 Last updated: 2018-01-12 Bibliographically approved
7. Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction
2012 (English). In: Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers / [ed] Anna Esposito, Antonietta M. Esposito, Alessandro Vinciarelli, Rüdiger Hoffmann, Vincent C. Müller, Springer Berlin/Heidelberg, 2012, pp. 114-130. Conference paper, Published paper (Refereed)
Abstract [en]

In this chapter, we first present a summary of findings from two previous studies on the limitations of using flat displays with embodied conversational agents (ECAs) in the context of face-to-face human-agent interaction. We then motivate the need for a three-dimensional display of faces to guarantee accurate delivery of gaze and directional movements, and present Furhat, a novel, simple, highly effective, and human-like back-projected robot head that utilizes computer animation to deliver facial movements, and is equipped with a pan-tilt neck. After presenting a detailed summary of why and how Furhat was built, we discuss the advantages of using optically projected animated agents for interaction. We discuss using such agents in terms of situatedness, environment, context awareness, and social, human-like face-to-face interaction with robots where subtle nonverbal and social facial signals can be communicated. At the end of the chapter, we present a recent application of Furhat as a multimodal multiparty interaction system that was presented at the London Science Museum as part of a robot festival. We conclude the paper by discussing future developments, applications and opportunities of this technology.

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2012
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 7403
Keywords
Avatar, Back Projection, Dialogue System, Facial Animation, Furhat, Gaze, Gaze Perception, Mona Lisa Effect, Multimodal Interaction, Multiparty Interaction, Robot Heads, Situated Interaction, Talking Heads
National subject category
Human-Computer Interaction (Interaction Design)
Identifiers
urn:nbn:se:kth:diva-105606 (URN); 10.1007/978-3-642-34584-5_9 (DOI); 2-s2.0-84870382387 (Scopus ID); 978-364234583-8 (ISBN)
Conference
International Training School on Cognitive Behavioural Systems, COST 2102; Dresden; 21 February 2011 through 26 February 2011
Research funder
ICT - The Next Generation
Note

QC 20121123

Available from: 2012-11-22 Created: 2012-11-22 Last updated: 2018-01-12 Bibliographically approved
8. Lip-reading: Furhat audio visual intelligibility of a back projected animated face
2012 (English). In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Berlin/Heidelberg, 2012, pp. 196-203. Conference paper, Published paper (Refereed)
Abstract [en]

Back-projecting a computer-animated face onto a three-dimensional static physical model of a face is a promising technology that is gaining ground as a solution for building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back-projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; we then motivate the need to investigate the contribution to speech intelligibility that Furhat's face offers. We present an audio-visual speech intelligibility experiment in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility from lip-reading a face visualized on a 2D screen with that from a 3D back-projected face, from different viewing angles. The results show that the audio-visual speech intelligibility holds when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds it. This means that despite the movement limitations that back-projected animated face models bring about, their audio-visual speech intelligibility is equal to, or even higher than, that of the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception for 3D projected faces.

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2012
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743 ; 7502 LNAI
Keywords
Furhat, Lip reading, Robot Heads, Talking Head, Visual Speech
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-104969 (URN); 2-s2.0-84867509147 (Scopus ID)
Conference
12th International Conference on Intelligent Virtual Agents, IVA 2012, 12 September 2012 through 14 September 2012, Santa Cruz, CA
Research funder
ICT - The Next Generation
Note

QC 20121114

Available from: 2012-11-14 Created: 2012-11-14 Last updated: 2018-01-12 Bibliographically approved
9. Perception of Gaze Direction for Situated Interaction
2012 (English). In: Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, Gaze-In 2012, ACM, 2012. Conference paper, Published paper (Refereed)
Abstract [en]

Accurate human perception of robots' gaze direction is crucial for the design of natural and fluent situated multimodal face-to-face interaction between humans and machines. In this paper, we present an experiment with 18 test subjects targeted at quantifying the effects of different gaze cues, synthesized using the Furhat back-projected robot head, on the accuracy with which humans perceive the spatial direction of gaze. The study first quantifies the accuracy of perceived gaze direction in a human-human setup, and compares that to the use of synthesized gaze movements in different conditions: viewing the robot's eyes frontally or from a 45-degree side view. We also study the effect of 3D gaze, achieved by controlling both eyes to indicate the depth of the focal point (vergence), the use of gaze versus head pose, and the use of static versus dynamic eyelids. The findings of the study are highly relevant to the design and control of robots and animated agents in situated face-to-face interaction.

Place, publisher, year, edition, pages
ACM, 2012
Keywords
ECA, eyelids, furhat, gaze perception, head pose, robot head, situated interaction, talking head
National subject category
Human-Computer Interaction (Interaction Design)
Identifiers
urn:nbn:se:kth:diva-105607 (URN); 10.1145/2401836.2401839 (DOI); 2-s2.0-84871552405 (Scopus ID); 978-145031516-6 (ISBN)
Conference
4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, Gaze-In 2012; Santa Monica, CA; 26 October 2012 through 26 October 2012
Research funder
ICT - The Next Generation
Note

QC 20121123

Available from: 2012-11-22 Created: 2012-11-22 Last updated: 2018-01-12 Bibliographically approved

Open Access in DiVA

fulltext (1932 kB)

Other links

http://www.speech.kth.se/prod/publications/files/3814.pdf
