  • 1. Abrahamsson, M.
    et al.
    Sundberg, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Subglottal pressure variation in actors’ stage speech (2007). In: Voice and Gender: Journal for the Voice and Speech Trainers Association / [ed] Rees, M., VASTA Publishing, 2007, pp. 343-347. Chapter in book (Refereed)
  • 2. Addessi, Anna Rita
    et al.
    Anelli, Filomena
    Benghi, Diber
    Friberg, Anders
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Child-Computer Interaction at the Beginner Stage of Music Learning: Effects of Reflexive Interaction on Children's Musical Improvisation (2017). In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 8, article 65. Article in journal (Refereed)
    Abstract [en]

    In this article, children's musical improvisation is investigated through the reflexive interaction paradigm. We used a particular system, the MIROR-Impro, implemented in the framework of the MIROR project (EC-FP7), which is able to reply to the child playing a keyboard with a reflexive output, mirroring (with repetitions and variations) her/his inputs. The study was conducted in a public primary school with 47 children, aged 6-7. The experimental design used the convergence procedure, based on three sample groups, allowing us to verify whether reflexive interaction using the MIROR-Impro is necessary and/or sufficient to improve the children's ability to improvise. The following conditions were used as independent variables: playing only the keyboard; playing the keyboard with the MIROR-Impro with a non-reflexive reply; and playing the keyboard with the MIROR-Impro with a reflexive reply. As dependent variables we estimated the children's ability to improvise in solos and in duets. Each child carried out a training program consisting of 5 weekly individual 12-min sessions. The control group played the complete package of independent variables; Experimental Group 1 played the keyboard and the keyboard with the MIROR-Impro with a non-reflexive reply; Experimental Group 2 played only the keyboard with the reflexive system. One week later, the children were asked to improvise a musical piece on the keyboard alone (Solo task) and in pairs with a friend (Duet task). Three independent judges assessed the Solo and Duet tasks by means of a grid based on the TAI-Test for Ability to Improvise rating scale. EG2, which trained only with the reflexive system, reached the highest average results, and the difference with EG1, which did not use the reflexive system, is statistically significant when the children improvise in a duet. The results indicate that in this sample of participants reflexive interaction alone could be sufficient to increase improvisational skills, and necessary when the children improvise in duets. However, these results are in general not statistically significant. The correlation between reflexive interaction and the ability to improvise is statistically significant. The results are discussed in the light of the recent literature in neuroscience and music education.

  • 3.
    Agelfors, Eva
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Karlsson, Inger
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Kewley, Jo
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Thomas, Neil
    User evaluation of the SYNFACE talking head telephone (2006). In: Computers Helping People With Special Needs, Proceedings / [ed] Miesenberger, K.; Klaus, J.; Zagler, W.; Karshmer, A., 2006, Vol. 4061, pp. 579-586. Conference paper (Refereed)
    Abstract [en]

    The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden, in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.

  • 4.
    Ahmad, Rezgar
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Con curdino – Utökning av tvärflöjtens tonförråd till kvartstoner genom perturbation av tonhålens geometri [Con curdino – Extending the transverse flute's tone repertoire to quarter tones by perturbation of the tone-hole geometry] (2011). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    The thesis describes the development and evaluation of a component ('kurdin') that gives flutists the opportunity to play quarter tones on the regular orchestral flute (Böhm flute). The flutist needs neither a new instrument nor a new playing technique.

    Quarter tones were obtained by reducing the diameter of some of the tone holes. The hole diameter is reduced by using the kurdin, which is inserted into the flute. The kurdin consists of two parts which can reduce the diameter of two tone holes at the same time. Flutists can thereby extend the available set of scale notes by two quarter tones without changing the position of the kurdin.

    Systematic measurements of the fundamental frequency of four tones (E4, F#4, A4 and B4) using an artificially blown flute and different versions of the kurdin showed that the pitch of these tones can be lowered by a quarter tone (50 cents) with the same reduced hole diameter. The four quarter tones cover almost all scales called Maqamat in oriental music.
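
    As a worked example of the 50-cent target mentioned above (the reference pitch below is illustrative, not taken from the thesis): lowering a pitch by 50 cents multiplies its frequency by 2^(-50/1200), since cents are defined by cents = 1200 · log2(f2/f1).

        # Frequency ratio for a quarter-tone (50 cent) flattening.
        # The cent is defined by: cents = 1200 * log2(f2 / f1).
        A4 = 440.0                   # illustrative reference pitch in Hz
        ratio = 2 ** (-50 / 1200)    # ratio for -50 cents, about 0.9715
        print(f"A4 lowered by a quarter tone: {A4 * ratio:.2f} Hz")  # ~427.47 Hz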

  • 5. Al Moubayed, S.
    et al.
    Bohus, D.
    Esposito, A.
    Heylen, D.
    Koutsombogera, M.
    Papageorgiou, H.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    UM3I 2014 chairs' welcome (2014). In: UM3I 2014 - Proceedings of the 2014 ACM Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, Co-located with ICMI 2014, 2014, p. iii. Conference paper (Other (popular science, discussion, etc.))
  • 6.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bringing the avatar to life: Studies and developments in facial communication for virtual agents and robots (2012). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    The work presented in this thesis comes in pursuit of the ultimate goal of building spoken and embodied human-like interfaces that are able to interact with humans on human terms. Such interfaces need to employ the subtle, rich and multidimensional signals of communicative and social value that complement the stream of words – the signals humans typically use when interacting with each other.

    The studies presented in the thesis concern facial signals used in spoken communication, and can be divided into two connected groups. The first is targeted towards exploring and verifying models of facial signals that come in synchrony with speech and its intonation. We refer to this as visual prosody, and as part of visual prosody, we take prominence as a case study. We show that the use of prosodically relevant gestures in animated faces results in more expressive and human-like behaviour. We also show that animated faces supported with these gestures result in more intelligible speech, which in turn can be used to aid communication, for example in noisy environments.

    The other group of studies targets facial signals that complement speech. Spoken language is a relatively poor system for the communication of spatial information, since such information is visual in nature. Hence, the use of visual movements of spatial value, such as gaze and head movements, is important for an efficient interaction. The use of such signals is especially important when the interaction between the human and the embodied agent is situated – that is, when they share the same physical space and this space is taken into account in the interaction.

    We study the perception, the modelling, and the interaction effects of gaze and head pose in regulating situated and multiparty spoken dialogues in two conditions. The first is the typical case where the animated face is displayed on a flat surface, and the second is where it is displayed on a physical three-dimensional model of a face. The results from the studies show that projecting the animated face onto a face-shaped mask results in an accurate perception of the direction of gaze that is generated by the avatar, and hence can allow for the use of these movements in multiparty spoken dialogue.

    Driven by these findings, the Furhat back-projected robot head is developed. Furhat employs state-of-the-art facial animation that is projected on a 3D printout of that face, and a neck to allow for head movements. Although the mask in Furhat is static, the fact that the animated face matches the design of the mask results in a physical face that is perceived to “move”.

    We present studies that show how this technique renders a more intelligible, human-like and expressive face. We further present experiments in which Furhat is used as a tool to investigate properties of facial signals in situated interaction.

    Furhat is built to study, implement, and verify models of situated and multiparty, multimodal human-machine spoken dialogue, a study that requires that the face is physically situated in the interaction environment rather than on a two-dimensional screen. It has also received much interest from several communities, and has been showcased at several venues, including a robot exhibition at the London Science Museum. We present an evaluation study of Furhat at the exhibition, where it interacted with several thousand persons in multiparty conversation. The analysis of the data from the setup further shows that Furhat can accurately regulate multiparty interaction using gaze and head movements.

  • 7.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Prosodic Disambiguation in Spoken Systems Output (2009). In: Proceedings of Diaholmia'09: 2009 Workshop on the Semantics and Pragmatics of Dialogue / [ed] Jens Edlund, Joakim Gustafson, Anna Hjalmarsson, Gabriel Skantze, Stockholm, Sweden, 2009, pp. 131-132. Conference paper (Refereed)
    Abstract [en]

    This paper presents work on using prosody in the output of spoken dialogue systems to resolve possible structural ambiguity of output utterances. An algorithm is proposed to discover ambiguous parses of an utterance and to add prosodic disambiguation events to deliver the intended structure. A pilot experiment shows that the automatic prosodic grouping applied to ambiguous sentences is able to deliver the intended interpretation of the sentences.

  • 8.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards rich multimodal behavior in spoken dialogues with embodied agents (2013). In: 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Proceedings, IEEE Computer Society, 2013, pp. 817-822. Conference paper (Refereed)
    Abstract [en]

    Spoken dialogue frameworks have traditionally been designed to handle a single stream of data: the speech signal. Research on human-human communication has provided ample evidence of, and quantified the effects and importance of, a multitude of other multimodal nonverbal signals that people use in their communication and that shape and regulate their interaction. Driven by findings from multimodal human spoken interaction, and by the advancement of capture devices and of robotics and animation technologies, new possibilities are arising for the development of multimodal human-machine interaction that is more affective, social, and engaging. In such face-to-face interaction scenarios, dialogue systems can have a large set of signals at their disposal to infer context and to enhance and regulate the interaction through the generation of verbal and nonverbal facial signals. This paper summarizes several design decisions, and experiments that we have carried out in attempts to build rich and fluent multimodal interactive systems using a newly developed hybrid robotic head called Furhat, and discusses issues and challenges that this effort is facing.

  • 9.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A robotic head using projected animated faces (2011). In: Proceedings of the International Conference on Audio-Visual Speech Processing 2011 / [ed] Salvi, G.; Beskow, J.; Engwall, O.; Al Moubayed, S., Stockholm: KTH Royal Institute of Technology, 2011, p. 71. Conference paper (Refereed)
    Abstract [en]

    This paper presents a setup which employs virtual animated agents for robotic heads. The system uses a laser projector to project animated faces onto a three-dimensional face mask. This approach of projecting animated faces onto a three-dimensional head surface, as an alternative to using flat, two-dimensional surfaces, eliminates several deteriorating effects and illusions that come with flat surfaces for interaction purposes, such as the loss of exclusive mutual gaze and of situated and multi-partner dialogues. In addition, it provides robotic heads with a flexible solution for facial animation which takes advantage of the advancements of computer-graphics facial animation over mechanically controlled heads.

  • 10.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Acoustic-to-Articulatory Inversion based on Local Regression (2010). In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 937-940. Conference paper (Refereed)
    Abstract [en]

    This paper presents an Acoustic-to-Articulatory inversion method based on local regression. Two types of local regression, a non-parametric and a local linear regression, have been applied on a corpus containing simultaneous recordings of positions of articulators and the corresponding acoustics. A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied on the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the non-parametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models using the same acoustic and articulatory features.
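
    A minimal sketch of the local linear regression step described in this abstract, assuming hypothetical arrays X_train (acoustic frames) and Y_train (articulator positions); the maximum-likelihood trajectory smoothing is not shown.

        import numpy as np

        def local_linear_predict(X_train, Y_train, x, k=50):
            """Predict articulator positions for one acoustic frame x by
            fitting a linear map on its k nearest training frames."""
            dists = np.linalg.norm(X_train - x, axis=1)       # distance to every frame
            idx = np.argsort(dists)[:k]                       # k nearest neighbours
            Xk = np.hstack([X_train[idx], np.ones((k, 1))])   # add a bias column
            W, *_ = np.linalg.lstsq(Xk, Y_train[idx], rcond=None)  # local fit
            return np.append(x, 1.0) @ W                      # evaluate at the query frame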

  • 11.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Ananthakrishnan, Gopal
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Enflo, Laura
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Automatic Prominence Classification in Swedish (2010). In: Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence, Chicago, USA, 2010. Conference paper (Refereed)
    Abstract [en]

    This study aims at automatically classifying levels of acoustic prominence on a dataset of 200 Swedish sentences of read speech by one male native speaker. Each word in the sentences was categorized by four speech experts into one of three groups depending on the level of prominence perceived. Six acoustic features at the syllable level and seven features at the word level were used. Two machine learning algorithms, namely Support Vector Machines (SVM) and Memory-Based Learning (MBL), were trained to classify the sentences into their respective classes. The MBL gave an average word-level accuracy of 69.08% and the SVM gave an average accuracy of 65.17% on the test set. These values were comparable with the average accuracy of the human annotators with respect to the average annotations. In this study, word duration was found to be the most important feature for classifying prominence in Swedish read speech.
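
    A hedged sketch of this classification setup in scikit-learn, approximating memory-based learning with k-nearest-neighbours; the feature matrix and labels below are random stand-ins for the paper's annotated data.

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 7))       # stand-in word-level acoustic features
        y = rng.integers(0, 3, size=200)    # stand-in prominence levels (3 classes)

        for name, clf in [("SVM", SVC(kernel="rbf")),
                          ("MBL (k-NN)", KNeighborsClassifier(n_neighbors=5))]:
            acc = cross_val_score(clf, X, y, cv=5).mean()   # 5-fold accuracy
            print(f"{name}: {acc:.2%} mean accuracy")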

  • 12.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Baklouti, M.
    Chetouani, M.
    Dutoit, T.
    Mahdhaoui, A.
    Martin, J. -C
    Ondas, S.
    Pelachaud, C.
    Urbain, J.
    Yilmaz, M.
    Generating Robot/Agent Backchannels During a Storytelling Experiment (2009). In: ICRA 2009: IEEE International Conference on Robotics and Automation, Vols 1-7, 2009, pp. 3749-3754. Conference paper (Refereed)
    Abstract [en]

    This work presents the development of a real-time framework for research on multimodal feedback of robots/talking agents in the context of Human-Robot Interaction (HRI) and Human-Computer Interaction (HCI). For evaluating the framework, a multimodal corpus was built (ENTERFACE_STEAD), and a study of the important multimodal features was done for building an active robot/agent listener for a storytelling experience with humans. The experiments show that even when building the same reactive behavior models for robots and talking agents, the interpretation and realization of the communicated behavior differ due to the different communicative channels robots and agents offer: physical but less human-like in robots, and virtual but more expressive and human-like in talking agents.

  • 13.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A novel Skype interface using SynFace for virtual speech reading support (2011). In: Proceedings from Fonetik 2011, June 8 - June 10, 2011: Speech, Music and Hearing, Quarterly Progress and Status Report, TMH-QPSR, Volume 51, Stockholm, Sweden, 2011, pp. 33-36. Conference paper (Other academic)
    Abstract [en]

    We describe in this paper a support client interface to the IP telephony application Skype. The system uses a variant of SynFace, a real-time speech reading support system using facial animation. The new interface is designed for use by elderly persons, and tailored for use in systems supporting touch screens. The SynFace real-time facial animation system has previously been shown to enhance speech comprehension for hearing impaired persons. In this study we report on at-home field studies with five subjects in the EU project MonAMI. We present insights from interviews with the test subjects on the advantages of the system, and on the limitations such real-time speech reading technology faces in reaching the homes of the elderly and the hard of hearing.

  • 14.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Effects of Visual Prominence Cues on Speech Intelligibility (2009). In: Proceedings of Auditory-Visual Speech Processing AVSP'09, Norwich, England, 2009. Conference paper (Refereed)
    Abstract [en]

    This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permuted into different gestural conditions, were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally accented (prominent) words are supplemented with head-nod or eyebrow-raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of non-verbal movements in the visual modality to support audio-visual speech perception.

  • 15.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Perception of Nonverbal Gestures of Prominence in Visual Speech Animation (2010). In: Proceedings of the ACM/SSPNET 2nd International Symposium on Facial Analysis and Animation, Edinburgh, UK, 2010, p. 25. Conference paper (Refereed)
    Abstract [en]

    It has long been recognized that visual speech information is important for speech perception [McGurk and MacDonald 1976] [Summerfield 1992]. Recently there has been increasing interest in the verbal and non-verbal interaction between the visual and acoustic modalities from production and perception perspectives. One of the prosodic phenomena which attracts much focus is prominence, defined as when a linguistic segment is made salient in its context.

  • 16.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Prominence Detection in Swedish Using Syllable Correlates (2010). In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 1784-1787. Conference paper (Refereed)
    Abstract [en]

    This paper presents an approach to estimating word-level prominence in Swedish using syllable-level features. The paper discusses the mismatch problem between word-level perceptual prominence annotations and their acoustic correlates, context, and data scarcity. 200 sentences were annotated by 4 speech experts with prominence on 3 levels. A linear model for feature extraction is proposed on syllable-level features, and the weights of these features are optimized to match the word-level annotations. We show that using syllable-level features, and estimating weights for the acoustic correlates to minimize the word-level estimation error, gives better detection accuracy than word-level features, and that both feature sets exceed the baseline accuracy.
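
    One way to read the weight optimization, sketched under assumptions the abstract leaves open (the syllable-to-word pooling is not specified; here each word's syllable feature vectors are averaged and the weights are fit by least squares against the word-level labels):

        import numpy as np

        def fit_word_weights(syll_feats, syll_word_id, word_labels):
            """Least-squares feature weights so that the pooled syllable
            score of each word matches its annotated prominence level."""
            n_words = len(word_labels)
            word_feats = np.vstack([syll_feats[syll_word_id == w].mean(axis=0)
                                    for w in range(n_words)])   # pool syllables per word
            w, *_ = np.linalg.lstsq(word_feats,
                                    np.asarray(word_labels, dtype=float),
                                    rcond=None)
            return w   # one weight per syllable-level feature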

  • 17.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Blomberg, Mats
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Mirning, N.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Talking with Furhat - multi-party interaction with a back-projected robot head (2012). In: Proceedings of Fonetik 2012, Gothenburg, Sweden, 2012, pp. 109-112. Conference paper (Other academic)
    Abstract [en]

    This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We describe some of the design strategies and give a preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum.

  • 18.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bollepalli, Bajibabu
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hussen-Abdelaziz, A.
    Johansson, Martin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Koutsombogera, M.
    Lopes, J. D.
    Novikova, J.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Stefanov, Kalin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Varol, G.
    Human-robot Collaborative Tutoring Using Multiparty Multimodal Spoken Dialogue (2014). Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonverbal tutoring strategies in multiparty spoken interactions with robots which are capable of spoken dialogue. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the participants perform the task, and organizes and balances their interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies, such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used to build a situated model of the interaction based on the participants' personalities, their state of attention, their conversational engagement and verbal dominance, and how these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. Driven by the analysis of the corpus, we will also show the detailed design methodologies for an affective and multimodally rich dialogue system that allows the robot to measure incrementally the attention states and the dominance of each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation that attempts to maximize the agreement and the contribution to solving the task. This project takes the first steps towards exploring the potential of using multimodal dialogue systems to build interactive robots that can serve in educational, team building, and collaborative task solving applications.

  • 19.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Bollepalli, Bajibabu
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Hussen-Abdelaziz, A.
    Johansson, Martin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Koutsombogera, M.
    Lopes, J.
    Novikova, J.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Stefanov, Kalin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Varol, G.
    Tutoring Robots: Multiparty Multimodal Social Dialogue With an Embodied Tutor (2014). Conference paper (Refereed)
    Abstract [en]

    This project explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. With the participants sits a tutor that helps the participants perform the task and organizes and balances their interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies were coupled with manual annotations to build a situated model of the interaction based on the participants' personalities, their temporally changing state of attention, their conversational engagement and verbal dominance, and the way these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. At the end of this chapter we discuss the potential areas of research and development this work opens up, and some of the challenges that lie in the road ahead.

  • 20.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Animated Faces for Robotic Heads: Gaze and Beyond (2011). In: Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues / [ed] Anna Esposito, Alessandro Vinciarelli, Klára Vicsi, Catherine Pelachaud and Anton Nijholt, Springer Berlin/Heidelberg, 2011, pp. 19-35. Conference paper (Refereed)
    Abstract [en]

    We introduce an approach to using animated faces for robotics where a static physical object is used as a projection surface for an animation. The talking head is projected onto a 3D physical head model. In this chapter we discuss the different benefits this approach adds over mechanical heads. After that, we investigate a phenomenon commonly referred to as the Mona Lisa gaze effect. This effect results from the use of 2D surfaces to display 3D images and causes the gaze of a portrait to seemingly follow the observer no matter where it is viewed from. The experiment investigates the perception of gaze direction by observers. The analysis shows that the 3D model eliminates the effect, and provides an accurate perception of gaze direction. We discuss at the end the different requirements of gaze in interactive systems, and explore the different settings these findings give access to.

  • 21.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Auditory visual prominence: From intelligibility to behavior (2009). In: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 3, no. 4, pp. 299-309. Article in journal (Refereed)
    Abstract [en]

    Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that have been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded, the fundamental frequency is removed from the signal, and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires on 10 moderately hearing impaired subjects, the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. The questionnaires also show that these gestures significantly increase the naturalness and the understanding of the talking head.

  • 22.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Mirning, Nicole
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tscheligi, Manfred
    Furhat goes to Robotville: a large-scale multiparty human-robot interaction data collection in a public space (2012). In: Proc of LREC Workshop on Multimodal Corpora, Istanbul, Turkey, 2012. Conference paper (Refereed)
    Abstract [en]

    In the four days of the Robotville exhibition at the London Science Museum, UK, during which the back-projected head Furhat in a situated spoken dialogue system was seen by almost 8 000 visitors, we collected a database of 10 000 utterances spoken to Furhat in situated interaction. The data collection is an example of a particular kind of corpus collection of human-machine dialogues in public spaces that has several interesting and specific characteristics, both with respect to the technical details of the collection and with respect to the resulting corpus contents. In this paper, we take the Furhat data collection as a starting point for a discussion of the motives for this type of data collection, its technical peculiarities and prerequisites, and the characteristics of the resulting corpus.

  • 23.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC).
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence (2010). In: 3rd COST 2102 International Training School on Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues / [ed] Esposito A; Esposito AM; Martone R; Muller VC; Scarpetta G, 2010, Vol. 6456, pp. 55-71. Conference paper (Refereed)
    Abstract [en]

    In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted, where speech quality is acoustically degraded and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing impaired subjects, the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements, as opposed to when the face carries no gestures. The questionnaires also show that these gestures significantly increase the naturalness and helpfulness of the talking head.

  • 24.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech (2008). In: Proceedings of The Second Swedish Language Technology Conference (SLTC), Stockholm, Sweden, 2008, pp. 3-6. Conference paper (Other academic)
    Abstract [en]

    In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing impaired. Preliminary results show high recognition accuracy compared to other languages.

  • 25.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Spontaneous spoken dialogues with the Furhat human-like robot head (2014). In: HRI '14: Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, Bielefeld, Germany, 2014, p. 326. Conference paper (Refereed)
    Abstract [en]

    We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is an anthropomorphic robot head that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations. The dialogue design is performed using the IrisTK [4] dialogue authoring toolkit developed at KTH. The system will also be able to act as a moderator in a quiz game, showing different strategies for regulating spoken situated interactions.

  • 26.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The Furhat Social Companion Talking Head (2013). In: Interspeech 2013 - Show and Tell, 2013, pp. 747-749. Conference paper (Refereed)
    Abstract [en]

    In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of a complex and advanced dialogue toolkit designed to facilitate rich and fluent multimodal, multiparty, situated human-machine spoken dialogue. The demonstrator will present a social dialogue system with Furhat that allows for several simultaneous interlocutors, takes advantage of several verbal and nonverbal input signals such as speech input, real-time multi-face tracking, and facial analysis, and communicates with its users in a mixed-initiative dialogue, using state-of-the-art speech synthesis with rich prosody, lip-animated facial synthesis, eye and head movements, and gestures.

  • 27.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction (2012). In: Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers / [ed] Anna Esposito, Antonietta M. Esposito, Alessandro Vinciarelli, Rüdiger Hoffmann, Vincent C. Müller, Springer Berlin/Heidelberg, 2012, pp. 114-130. Conference paper (Refereed)
    Abstract [en]

    In this chapter, we first present a summary of findings from two previous studies on the limitations of using flat displays with embodied conversational agents (ECAs) in the context of face-to-face human-agent interaction. We then motivate the need for a three-dimensional display of faces to guarantee accurate delivery of gaze and directional movements, and present Furhat, a novel, simple, highly effective, and human-like back-projected robot head that utilizes computer animation to deliver facial movements and is equipped with a pan-tilt neck. After presenting a detailed summary of why and how Furhat was built, we discuss the advantages of using optically projected animated agents for interaction. We discuss using such agents in terms of situatedness, environment, context awareness, and social, human-like face-to-face interaction with robots where subtle nonverbal and social facial signals can be communicated. At the end of the chapter, we present a recent application of Furhat as a multimodal multiparty interaction system that was presented at the London Science Museum as part of a robot festival. We conclude the paper by discussing future developments, applications and opportunities of this technology.

  • 28.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Öster, Anne-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    van Son, Nic
    Viataal, Nijmegen, The Netherlands.
    Ormel, Ellen
    Viataal, Nijmegen, The Netherlands.
    Herzke, Tobias
    HörTech gGmbH, Germany.
    Studies on Using the SynFace Talking Head for the Hearing Impaired (2009). In: Proceedings of Fonetik'09: The XXIIth Swedish Phonetics Conference, June 10-12, 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, pp. 140-143. Conference paper (Other academic)
    Abstract [en]

    SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large-scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when SynFace is used by hearing impaired people, where groups of hearing impaired subjects with different impairment levels, from mild to severe and cochlear implants, are tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. But looking at the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble.

  • 29.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Öster, Ann-Marie
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Salvi, Giampiero
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    van Son, Nic
    Ormel, Ellen
    Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-Media Setting (2009). In: INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association, Baixas: ISCA - International Speech Communication Association, 2009, pp. 1443-1446. Conference paper (Refereed)
    Abstract [en]

    In this paper we present recent results on the development of the SynFace lip-synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large-scale hearing impaired user studies carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold in Noise when using SynFace, and on measuring the effort scaling when SynFace is used by hearing impaired people. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. But looking at inter-subject variability, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble noise.

  • 30.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT. KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    De Smet, Michael
    Van Hamme, Hugo
    Lip Synchronization: from Phone Lattice to PCA Eigen-projections using Neural Networks (2008). In: INTERSPEECH 2008: 9th Annual Conference of the International Speech Communication Association, Baixas: ISCA - International Speech Communication Association, 2008, pp. 2016-2019. Conference paper (Refereed)
    Abstract [en]

    Lip synchronization is the process of generating natural lip movements from a speech signal. In this work we address the lip-sync problem using an automatic phone recognizer that generates a phone lattice carrying posterior probabilities. The acoustic feature vector contains the posterior probabilities of all the phones over a time window centered at the current time point. Hence this representation characterizes the phone recognition output including the confusion patterns caused by its limited accuracy. A 3D face model with varying texture is computed by analyzing a video recording of the speaker using a 3D morphable model. Training a neural network using 30 000 data vectors from an audiovisual recording in Dutch resulted in a very good simulation of the face on independent data sets of the same or of a different speaker.
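
    A rough sketch of the mapping this abstract describes, with hypothetical shapes and random stand-in data: phone posteriors stacked over a window centred on each frame are regressed onto PCA face coefficients with a small neural network.

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        def stack_context(posteriors, width=5):
            """Stack posterior frames over a window centred on each frame."""
            T, _ = posteriors.shape
            padded = np.pad(posteriors, ((width, width), (0, 0)), mode="edge")
            return np.hstack([padded[i:i + T] for i in range(2 * width + 1)])

        rng = np.random.default_rng(1)
        post = rng.dirichlet(np.ones(40), size=3000)  # stand-in posteriors: 3000 frames, 40 phones
        pca = rng.normal(size=(3000, 10))             # stand-in PCA eigen-projections
        net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=200)
        net.fit(stack_context(post), pca)             # learn the frame-to-face mapping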

  • 31.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections (2012). In: ACM Transactions on Interactive Intelligent Systems, ISSN 2160-6455, Vol. 1, no. 2, article 11, 25 pp. Article in journal (Refereed)
    Abstract [en]

    The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five-subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirm the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.

  • 32.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Analysis of gaze and speech patterns in three-party quiz game interaction2013In: Interspeech 2013, 2013, 1126-1130 p.Conference paper (Refereed)
    Abstract [en]

    In order to understand and model the dynamics between interaction phenomena such as gaze and speech in face-to-face multiparty interaction between humans, we need large quantities of reliable, objective data of such interactions. To date, this type of data is in short supply. We present a data collection setup using automated, objective techniques in which we capture the gaze and speech patterns of triads deeply engaged in a high-stakes quiz game. The resulting corpus consists of five one-hour recordings, and is unique in that it makes use of three state-of-the-art gaze trackers (one per subject) in combination with a state-of-the-art conical microphone array designed to capture roundtable meetings. Several video channels are also included. In this paper we present the obstacles we encountered and the possibilities afforded by a synchronised, reliable combination of large-scale multi-party speech and gaze data, and an overview of the first analyses of the data. Index Terms: multimodal corpus, multiparty dialogue, gaze patterns, multiparty gaze.

  • 33.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Heylen, D.
    Bohus, D.
    Koutsombogera, Maria
    Papageorgiou, H.
    Esposito, A.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    UM3I 2014: International workshop on understanding and modeling multiparty, multimodal interactions2014In: ICMI 2014 - Proceedings of the 2014 International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2014, 537-538 p.Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a brief summary of the international workshop on Understanding and Modeling Multiparty, Multimodal Interactions (UM3I 2014), held in conjunction with the ICMI 2014 conference. The workshop highlights recent developments and adopted methodologies in the analysis and modeling of multiparty and multimodal interactions, the design and implementation principles of related human-machine interfaces, as well as the identification of potential limitations and ways of overcoming them.

  • 34.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Effects of 2D and 3D Displays on Turn-taking Behavior in Multiparty Human-Computer Dialog2011In: SemDial 2011: Proceedings of the 15th Workshop on the Semantics and Pragmatics of Dialogue / [ed] Ron Artstein, Mark Core, David DeVault, Kallirroi Georgila, Elsi Kaiser, Amanda Stent, Los Angeles, CA, 2011, 192-193 p.Conference paper (Refereed)
    Abstract [en]

    The perception of gaze from an animated agent on a 2D display has been shown to suffer from the Mona Lisa effect, which means that exclusive mutual gaze cannot be established if there is more than one observer. In this study, we investigate this effect when it comes to turn-taking control in a multi-party human-computer dialog setting, where a 2D display is compared to a 3D projection. The results show that the 2D setting results in longer response times and lower turn-taking accuracy.

  • 35.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Perception of Gaze Direction for Situated Interaction2012In: Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, Gaze-In 2012, ACM , 2012Conference paper (Refereed)
    Abstract [en]

    Accurate human perception of robots' gaze direction is crucial for the design of a natural and fluent situated multimodal face-to-face interaction between humans and machines. In this paper, we present an experiment targeted at quantifying the effects of different gaze cues synthesized using the Furhat back-projected robot head, on the accuracy of perceived spatial direction of gaze by humans using 18 test subjects. The study first quantifies the accuracy of the perceived gaze direction in a human-human setup, and compares that to the use of synthesized gaze movements in different conditions: viewing the robot eyes frontal or at a 45 degrees angle side view. We also study the effect of 3D gaze by controlling both eyes to indicate the depth of the focal point (vergence), the use of gaze or head pose, and the use of static or dynamic eyelids. The findings of the study are highly relevant to the design and control of robots and animated agents in situated face-to-face interaction.
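    One concrete piece of the setup above is the vergence control: for a focal point on the midline at distance d, each eye must rotate inward by atan((interocular / 2) / d). A small worked example, with an assumed interocular distance:

```python
import math

def vergence_inward_deg(focal_dist_m, interocular_m=0.065):
    """Inward rotation of each eye for a midline target at focal_dist_m."""
    return math.degrees(math.atan2(interocular_m / 2.0, focal_dist_m))

for d in (0.3, 1.0, 3.0):
    print(f"target at {d} m -> each eye rotates {vergence_inward_deg(d):.2f} deg inward")
```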

  • 36.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Turn-taking Control Using Gaze in Multiparty Human-Computer Dialogue: Effects of 2D and 3D Displays2011In: Proceedings of the International Conference on Audio-Visual Speech Processing 2011, Stockholm: KTH Royal Institute of Technology, 2011, 99-102 p.Conference paper (Refereed)
    Abstract [en]

    In a previous experiment we found that the perception of gaze from an animated agent on a two-dimensional display suffers from the Mona Lisa effect, which means that exclusive mutual gaze cannot be established if there is more than one observer. By using a three-dimensional projection surface, this effect can be eliminated. In this study, we investigate whether this difference also holds for the turn-taking behaviour of subjects interacting with the animated agent in a multi-party dialogue. We present a Wizard-of-Oz experiment where five subjects talk to an animated agent in a route direction dialogue. The results show that the subjects to some extent can infer the intended target of the agent's questions, in spite of the Mona Lisa effect, but that the accuracy of gaze when it comes to selecting an addressee is still significantly lower in the 2D condition, as compared to the 3D condition. The response time is also significantly longer in the 2D condition, indicating that the inference of intended gaze may require additional cognitive effort.

  • 37.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Lip-reading: Furhat audio visual intelligibility of a back projected animated face2012In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Berlin/Heidelberg, 2012, 196-203 p.Conference paper (Refereed)
    Abstract [en]

    Back-projecting a computer-animated face onto a three-dimensional static physical model of a face is a promising technology that is gaining ground as a solution for building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back-projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; we then motivate the need to investigate the contribution that Furhat's face offers to speech intelligibility. We present an audio-visual speech intelligibility experiment in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility from lip reading a face visualized on a 2D screen with that from a 3D back-projected face, and from different viewing angles. The results show that audio-visual speech intelligibility holds when the avatar is projected onto a static face model (as in Furhat), and even, rather surprisingly, exceeds that of the flat display. This means that despite the movement limitations that back-projected animated face models bring about, their audio-visual speech intelligibility is equal to, or even higher than, that of the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception in 3D-projected faces.

  • 38.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    The Furhat Back-Projected Humanoid Head-Lip Reading, Gaze And Multi-Party Interaction2013In: International Journal of Humanoid Robotics, ISSN 0219-8436, Vol. 10, no 1, 1350005- p.Article in journal (Refereed)
    Abstract [en]

    In this paper, we present Furhat - a back-projected human-like robot head using state-of-the-art facial animation. Three experiments are presented where we investigate how the head might facilitate human-robot face-to-face interaction. First, we investigate how the animated lips increase the intelligibility of the spoken output, and compare this to an animated agent presented on a flat screen, as well as to a human face. Second, we investigate the accuracy of the perception of Furhat's gaze in a setting typical for situated interaction, where Furhat and a human are sitting around a table. The accuracy of the perception of Furhat's gaze is measured depending on eye design, head movement and viewing angle. Third, we investigate the turn-taking accuracy of Furhat in a multi-party interactive setting, as compared to an animated agent on a flat screen. We conclude with some observations from a public setting at a museum, where Furhat interacted with thousands of visitors in a multi-party interaction.

  • 39.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Stefanov, Kalin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Multimodal Multiparty Social Interaction with the Furhat Head2012Conference paper (Refereed)
    Abstract [en]

    In this demonstrator we show an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is a human-like interface that realizes facial animation on a physical robot head using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator showcases a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations.

  • 40.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions2014In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 2, 607-618 p.Article in journal (Refereed)
    Abstract [en]

    In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.
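    The error-minimization step that drives the avatar can be sketched, under strong simplifying assumptions, as a per-frame least-squares fit: find blendshape weights w minimizing ||Bw - (x - x0)||^2, where the columns of B are blendshape displacement bases, x0 the neutral-pose marker positions, and x the captured frame. This is an assumed formulation, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_markers, n_shapes = 30, 8
B = rng.normal(size=(3 * n_markers, n_shapes))   # stacked xyz displacement bases
x0 = rng.normal(size=3 * n_markers)              # neutral-pose marker positions
true_w = np.array([0.5, 0, 0.2, 0, 0, 0.1, 0, 0])
x = x0 + B @ true_w                              # synthetic captured frame

w, *_ = np.linalg.lstsq(B, x - x0, rcond=None)   # per-frame least squares
print(np.round(w, 2))                            # recovers the driving weights
```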

  • 41.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Can Anybody Read Me? Motion Capture Recordings for an Adaptable Visual Speech Synthesizer2012In: Proceedings of The Listening Talker, Edinburgh, UK, 2012, 52-52 p.Conference paper (Refereed)
  • 42.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar2015In: ACM Transactions on Accessible Computing, ISSN 1936-7228, Vol. 7, no 2, 7:1-7:17 p.Article in journal (Refereed)
    Abstract [en]

    Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.

  • 43.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Aspects of co-occurring syllables and head nods in spontaneous dialogue2013In: Proceedings of 12th International Conference on Auditory-Visual Speech Processing (AVSP2013), 2013, 169-172 p.Conference paper (Refereed)
    Abstract [en]

    This paper reports on the extraction and analysis of head nods taken from motion capture data of spontaneous dialogue in Swedish. The head nods were extracted automatically and then manually classified in terms of gestures having a beat function or multifunctional gestures. Prosodic features were extracted from syllables co-occurring with the beat gestures. While the peak rotation of the nod is on average aligned with the stressed syllable, the results show considerable variation in fine temporal synchronization. The syllables co-occurring with the gestures generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. A functional analysis shows that the majority of the syllables belong to words bearing a focal accent.
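    The comparison reported above can be mimicked on toy data: syllables whose span contains a nod's rotation peak are pooled and compared with the dialogue-wide mean. The data layout below is an assumption for illustration.

```python
import numpy as np

# per-syllable rows: (start_s, end_s, mean_F0_Hz, mean_intensity_dB)
syllables = np.array([
    [0.10, 0.35, 190.0, 62.0],
    [0.40, 0.70, 230.0, 70.0],   # stressed; co-occurs with the nod peak
    [0.80, 1.05, 185.0, 60.0],
])
nod_peaks = [0.55]               # peak head-rotation times (s) from mocap

co = np.array([any(s <= t <= e for t in nod_peaks) for s, e, *_ in syllables])
print("co-occurring mean F0:", syllables[co, 2].mean())
print("dialogue mean F0:   ", syllables[:, 2].mean())
```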

  • 44.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Automatic annotation of gestural units in spontaneous face-to-face interaction2016In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, 2016, 15-19 p.Conference paper (Refereed)
    Abstract [en]

    Speech and gesture co-occur in spontaneous dialogue in a highly complex fashion. There is a large variability in the motion that people exhibit during a dialogue, and different kinds of motion occur during different states of the interaction. A wide range of multimodal interface applications, for example in the fields of virtual agents or social robots, can be envisioned where it is important to be able to automatically identify gestures that carry information and discriminate them from other types of motion. While it is easy for a human to distinguish and segment manual gestures from a flow of multimodal information, the same task is not trivial to perform for a machine. In this paper we present a method to automatically segment and label gestural units from a stream of 3D motion capture data. The gestural flow is modeled with a 2-level Hierarchical Hidden Markov Model (HHMM) where the sub-states correspond to gesture phases. The model is trained based on labels of complete gesture units and self-adaptive manipulators. The model is tested and validated on two datasets differing in genre and in method of capturing motion, and outperforms a state-of-the-art SVM classifier on a publicly available dataset.
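    No off-the-shelf Python library offers the paper's two-level HHMM, so the sketch below deliberately flattens it to a single Gaussian HMM whose hidden states stand in for gesture phases (rest, preparation, stroke, retraction); the hand-speed feature and state count are assumptions, not the authors' configuration.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(2)
speed = np.abs(np.concatenate([rng.normal(0.1, 0.05, 200),   # rest
                               rng.normal(2.0, 0.50, 50),    # stroke-like burst
                               rng.normal(0.1, 0.05, 200)])) # rest
X = speed.reshape(-1, 1)          # per-frame hand speed from motion capture

hmm = GaussianHMM(n_components=4, covariance_type="diag", random_state=0)
hmm.fit(X)
phases = hmm.predict(X)           # per-frame phase labels; contiguous runs of
print(phases[195:260])            # non-rest states form candidate gestural units
```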

  • 45.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Extracting and analysing co-speech head gestures from motion-capture data2013In: Proceedings of Fonetik 2013 / [ed] Eklund, Robert, Linköping University Electronic Press, 2013, 1-4 p.Conference paper (Refereed)
  • 46.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Extracting and analyzing head movements accompanying spontaneous dialogue2013In: Conference Proceedings TiGeR 2013: Tilburg Gesture Research Meeting, 2013Conference paper (Refereed)
    Abstract [en]

    This paper reports on a method developed for extracting and analyzing head gestures taken from motion capture data of spontaneous dialogue in Swedish. Candidate head gestures with beat function were extracted automatically and then manually classified using a 3D player which displays time-synced audio and 3D point data of the motion capture markers together with animated characters. Prosodic features were extracted from syllables co-occurring with a subset of the classified gestures. The beat gestures show considerable variation in temporal synchronization with the syllables, while the syllables generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. Additional features for further analysis and automatic classification of the head gestures are discussed.

  • 47.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    O'Sullivan, C.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Robust online motion capture labeling of finger markers2016In: Proceedings - Motion in Games 2016: 9th International Conference on Motion in Games, MIG 2016, ACM Digital Library, 2016, 7-13 p.Conference paper (Refereed)
    Abstract [en]

    Passive optical motion capture is one of the predominant technologies for capturing high fidelity human skeletal motion, and is a workhorse in a large number of areas such as bio-mechanics, film and video games. While most state-of-the-art systems can automatically identify and track markers on the larger parts of the human body, the markers attached to fingers provide unique challenges and usually require extensive manual cleanup. In this work we present a robust online method for identification and tracking of passive motion capture markers attached to the fingers of the hands. The method is especially suited for large capture volumes and sparse marker sets of 3 to 10 markers per hand. Once trained, our system can automatically initialize and track the markers, and the subject may exit and enter the capture volume at will. By using multiple assignment hypotheses and soft decisions, it can robustly recover from a difficult situation with many simultaneous occlusions and false observations (ghost markers). We evaluate the method on a collection of sparse marker sets commonly used in industry and in the research community. We also compare the results with two of the most widely used motion capture platforms: Motion Analysis Cortex and Vicon Blade. The results show that our method is better at attaining correct marker labels and is especially beneficial for real-time applications.
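    The core data-association step can be sketched as a single-hypothesis simplification of the paper's method: match predicted marker positions to the current frame's observations by minimizing total squared distance, and gate out far-away observations as probable ghost markers. The gating threshold and toy coordinates are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

predicted = np.array([[0.00, 0.00, 0.0],      # where each labelled marker
                      [1.00, 0.00, 0.0],      # is expected this frame
                      [0.00, 1.00, 0.0]])
observed  = np.array([[0.02, 0.01, 0.0],
                      [0.00, 0.98, 0.01],
                      [1.01, 0.00, 0.0],
                      [5.00, 5.00, 5.0]])     # far away: likely ghost marker

cost = cdist(predicted, observed, "sqeuclidean")
rows, cols = linear_sum_assignment(cost)      # optimal one-to-one matching
for r, c in zip(rows, cols):
    if cost[r, c] < 0.1:                      # gating threshold (assumption)
        print(f"marker {r} -> observation {c}")
```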

  • 48. Alku, Paavo
    et al.
    Airas, Matti
    Björkner, Eva
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Sundberg, Johan
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    An amplitude quotient based method to analyze changes in the shape of the glottal pulse in the regulation of vocal intensity2006In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 120, no 2, 1052-1062 p.Article in journal (Refereed)
    Abstract [en]

    This study presents an approach to visualizing intensity regulation in speech. The method expresses a voice sample in a two-dimensional space using amplitude-domain values extracted from the glottal flow estimated by inverse filtering. The two-dimensional presentation is obtained by expressing a time-domain measure of the glottal pulse, the amplitude quotient (AQ), as a function of the negative peak amplitude of the flow derivative (d_peak). The regulation of vocal intensity was analyzed with the proposed method from voices varying from extremely soft to very loud, with an SPL range of approximately 55 dB. When vocal intensity was increased, the speech samples first showed a rapidly decreasing trend as expressed on the proposed AQ-d_peak graph. When intensity was further raised, the location of the samples converged toward a horizontal line, the asymptote of a hypothetical hyperbola. This behavior of the AQ-d_peak graph indicates that the intensity regulation strategy changes from laryngeal to respiratory mechanisms, and the chosen method makes it possible to quantify how the control mechanisms underlying the regulation of vocal intensity change gradually between the two means. The proposed presentation constitutes an easy-to-implement method to visualize the function of voice production in intensity regulation, because the only information needed is the glottal flow waveform estimated by inverse filtering the acoustic speech pressure signal.
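    The two quantities spanning the proposed plane are easy to state: AQ = f_ac / d_peak, where f_ac is the peak-to-peak amplitude of the glottal flow pulse and d_peak is the magnitude of the negative peak of its derivative, so AQ has the dimension of time. A worked micro-example on a synthetic pulse (illustrative only, not inverse-filtered speech):

```python
import numpy as np

fs = 16000
t = np.arange(0, 0.01, 1 / fs)                            # one 10 ms cycle
flow = np.maximum(0.0, np.sin(2 * np.pi * 100 * t)) ** 2  # toy glottal pulse

f_ac = flow.max() - flow.min()        # peak-to-peak flow amplitude
d = np.diff(flow) * fs                # flow derivative
d_peak = -d.min()                     # |negative peak| of the derivative
print(f"AQ = {f_ac / d_peak * 1000:.3f} ms")
```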

  • 49. Allwood, Jens
    et al.
    Cerrato, Loredana
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT.
    Jokinen, Kristiina
    Navarretta, Costanza
    Paggio, Patrizia
    The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena2007In: Language resources and evaluation, ISSN 1574-020X, E-ISSN 1572-0218, Vol. 41, no 3-4, 273-287 p.Article in journal (Refereed)
    Abstract [en]

    This paper deals with a multimodal annotation scheme dedicated to the study of gestures in interpersonal communication, with particular regard to the role played by multimodal expressions for feedback, turn management and sequencing. The scheme has been developed under the framework of the MUMIN network and tested on the analysis of multimodal behaviour in short video clips in Swedish, Finnish and Danish. The preliminary results obtained in these studies show that the reliability of the categories defined in the scheme is acceptable, and that the scheme as a whole constitutes a versatile analysis tool for the study of multimodal communication behaviour.

  • 50. Altmann, U.
    et al.
    Oertel, Catharine
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Campbell, N.
    Conversational Involvement and Synchronous Nonverbal Behaviour2012In: Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers / [ed] Anna Esposito, Antonietta M. Esposito, Alessandro Vinciarelli, Rüdiger Hoffmann, Vincent C. Müller, Springer Berlin/Heidelberg, 2012, 343-352 p.Conference paper (Refereed)
    Abstract [en]

    Measuring the quality of an interaction by means of low-level cues has been the topic of many studies in recent years. In this study we propose a novel method for conversation quality assessment. We first test whether manual ratings of conversational involvement and automatic estimates of the synchronisation of facial activity are correlated. We hypothesise that the higher the synchrony, the higher the involvement. We compare two different synchronisation measures. The first measure is defined as the similarity of facial activity at a given point in time. The second is based on dependence analysis between the facial-activity time series of the two interlocutors. We found that the dependence measure correlates more strongly with conversational involvement than the similarity measure does.
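    The two synchronisation measures contrasted above can be sketched, under assumed definitions, as (1) a same-time similarity of the two facial-activity signals and (2) a lagged-dependence score from cross-correlation over a range of lags; a lag-tolerant measure scores the lagged coupling below higher than the pointwise one.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=500)                          # facial activity, speaker A
b = np.roll(a, 5) + 0.3 * rng.normal(size=500)    # B follows A with a lag

similarity = np.corrcoef(a, b)[0, 1]              # (1) same-time similarity

def max_lagged_corr(x, y, max_lag=20):            # (2) dependence measure
    return max(np.corrcoef(np.roll(x, k), y)[0, 1]
               for k in range(-max_lag, max_lag + 1))

print(f"similarity={similarity:.2f}, dependence={max_lagged_corr(a, b):.2f}")
```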
