Endre søk
Begrens søket
12 1 - 50 of 57
RefereraExporteraLink til resultatlisten
Permanent link
Referera
Referensformat
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annet format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annet språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Treff pr side
  • 5
  • 10
  • 20
  • 50
  • 100
  • 250
Sortering
  • Standard (Relevans)
  • Forfatter A-Ø
  • Forfatter Ø-A
  • Tittel A-Ø
  • Tittel Ø-A
  • Type publikasjon A-Ø
  • Type publikasjon Ø-A
  • Eldste først
  • Nyeste først
  • Skapad (Eldste først)
  • Skapad (Nyeste først)
  • Senast uppdaterad (Eldste først)
  • Senast uppdaterad (Nyeste først)
  • Disputationsdatum (tidligste først)
  • Disputationsdatum (siste først)
  • Standard (Relevans)
  • Forfatter A-Ø
  • Forfatter Ø-A
  • Tittel A-Ø
  • Tittel Ø-A
  • Type publikasjon A-Ø
  • Type publikasjon Ø-A
  • Eldste først
  • Nyeste først
  • Skapad (Eldste først)
  • Skapad (Nyeste først)
  • Senast uppdaterad (Eldste først)
  • Senast uppdaterad (Nyeste først)
  • Disputationsdatum (tidligste først)
  • Disputationsdatum (siste først)
Merk
Maxantalet träffar du kan exportera från sökgränssnittet är 250. Vid större uttag använd dig av utsökningar.
  • 1.
    Abou Zliekha, M.
    et al.
    Damascus University/Faculty of Information Technology.
    Al Moubayed, Samer
    Damascus University/Faculty of Information Technology.
    Al Dakkak, O.
    Higher Institute of Applied Science and Technology (HIAST).
    Ghneim, N.
    Higher Institute of Applied Science and Technology (HIAST).
    Emotional Audio-Visual Arabic Text to Speech2006Inngår i: Proceedings of the XIV European Signal Processing Conference (EUSIPCO), Florence, Italy, 2006Konferansepaper (Fagfellevurdert)
    Abstract [en]

    The goal of this paper is to present an emotional audio-visual. Text to speech system for the Arabic Language. The system is based on two entities: un emotional audio text to speech system which generates speech depending on the input text and the desired emotion type, and un emotional Visual model which generates the talking heads, by forming the corresponding visemes. The phonemes to visemes mapping, and the emotion shaping use a 3-paramertic face model, based on the Abstract Muscle Model. We have thirteen viseme models and five emotions as parameters to the face model. The TTS produces the phonemes corresponding to the input text, the speech with the suitable prosody to include the prescribed emotion. In parallel the system generates the visemes and sends the controls to the facial model to get the animation of the talking head in real time.

  • 2. Agarwal, Priyanshu
    et al.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Robotik, perception och lärande, RPL.
    Alspach, Alexander
    Kim, Joohyung
    Carter, Elizabeth J.
    Lehman, Jill Fain
    Yamane, Katsu
    Imitating Human Movement with Teleoperated Robotic Head2016Inngår i: 2016 25TH IEEE INTERNATIONAL SYMPOSIUM ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION (RO-MAN), 2016, s. 630-637Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Effective teleoperation requires real-time control of a remote robotic system. In this work, we develop a controller for realizing smooth and accurate motion of a robotic head with application to a teleoperation system for the Furhat robot head [1], which we call TeleFurhat. The controller uses the head motion of an operator measured by a Microsoft Kinect 2 sensor as reference and applies a processing framework to condition and render the motion on the robot head. The processing framework includes a pre-filter based on a moving average filter, a neural network-based model for improving the accuracy of the raw pose measurements of Kinect, and a constrained-state Kalman filter that uses a minimum jerk model to smooth motion trajectories and limit the magnitude of changes in position, velocity, and acceleration. Our results demonstrate that the robot can reproduce the human head motion in real time with a latency of approximately 100 to 170 ms while operating within its physical limits. Furthermore, viewers prefer our new method over rendering the raw pose data from Kinect.

  • 3.
    Al Dakkak, O.
    et al.
    Higher Institute of Applied Sciencenand Technology (HIAST).
    Ghneim, N.
    Higher Institute of Applied Sciencenand Technology (HIAST).
    Abou Zliekha, M.
    Damascus University/Faculty of Information Technology.
    Al Moubayed, Samer
    Damascus University/Faculty of Information Technology.
    Emotional Inclusion in An Arabic Text-To-Speech2005Inngår i: Proceedings of the 13th European Signal Processing Conference (EUSIPCO), Antalya, Turkey, 2005Konferansepaper (Fagfellevurdert)
    Abstract [en]

    The goal of this paper is to present an emotional audio-visua lText to speech system for the Arabic Language. The system is based on two entities: un emotional audio text to speech system which generates speech depending on the input text and the desired emotion type, and un emotional Visual model which generates the talking heads, by forming the corresponding visemes. The phonemes to visemes mapping, and the emotion shaping use a 3-paramertic face model, based on the Abstract Muscle Model. We have thirteen viseme models and five emotions as parameters to the face model. The TTS produces the phonemes corresponding to the input text, the speech with the suitable prosody to include the prescribed emotion. In parallel the system generates the visemes and sends the controls to the facial model to get the animation of the talking head in real time.

  • 4.
    Al Dakkak, O.
    et al.
    HIAST, Damascus, Syria.
    Ghneim, N.
    HIAST, Damascus, Syria.
    Abou Zliekha, M.
    Damascus University.
    Al Moubayed, Samer
    Damascus University.
    Prosodic Feature Introduction and Emotion Incorporation in an Arabic TTS2006Inngår i: Proceedings of IEEE International Conference on Information and Communication Technologies, Damascus, Syria, 2006, s. 1317-1322Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Text-to-speech is a crucial part of many man-machine communication applications, such as phone booking and banking, vocal e-mail, and many other applications. In addition to many other applications concerning impaired persons, such as: reading machines for blinds, talking machines for persons with speech difficulties. However, the main drawback of most speech synthesizers in the talking machines, are their metallic sounds. In order to sound naturally, we have to incorporate prosodic features, as close as possible to natural prosody, this helps to improve the quality of the synthetic speech. Actual researches in the world are towards better "automatic prosody generation".

  • 5.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Bringing the avatar to life: Studies and developments in facial communication for virtual agents and robots2012Doktoravhandling, med artikler (Annet vitenskapelig)
    Abstract [en]

    The work presented in this thesis comes in pursuit of the ultimate goal of building spoken and embodied human-like interfaces that are able to interact with humans under human terms. Such interfaces need to employ the subtle, rich and multidimensional signals of communicative and social value that complement the stream of words – signals humans typically use when interacting with each other.

    The studies presented in the thesis concern facial signals used in spoken communication, and can be divided into two connected groups. The first is targeted towards exploring and verifying models of facial signals that come in synchrony with speech and its intonation. We refer to this as visual-prosody, and as part of visual-prosody, we take prominence as a case study. We show that the use of prosodically relevant gestures in animated faces results in a more expressive and human-like behaviour. We also show that animated faces supported with these gestures result in more intelligible speech which in turn can be used to aid communication, for example in noisy environments.

    The other group of studies targets facial signals that complement speech. As spoken language is a relatively poor system for the communication of spatial information; since such information is visual in nature. Hence, the use of visual movements of spatial value, such as gaze and head movements, is important for an efficient interaction. The use of such signals is especially important when the interaction between the human and the embodied agent is situated – that is when they share the same physical space, and while this space is taken into account in the interaction.

    We study the perception, the modelling, and the interaction effects of gaze and head pose in regulating situated and multiparty spoken dialogues in two conditions. The first is the typical case where the animated face is displayed on flat surfaces, and the second where they are displayed on a physical three-dimensional model of a face. The results from the studies show that projecting the animated face onto a face-shaped mask results in an accurate perception of the direction of gaze that is generated by the avatar, and hence can allow for the use of these movements in multiparty spoken dialogue.

    Driven by these findings, the Furhat back-projected robot head is developed. Furhat employs state-of-the-art facial animation that is projected on a 3D printout of that face, and a neck to allow for head movements. Although the mask in Furhat is static, the fact that the animated face matches the design of the mask results in a physical face that is perceived to “move”.

    We present studies that show how this technique renders a more intelligible, human-like and expressive face. We further present experiments in which Furhat is used as a tool to investigate properties of facial signals in situated interaction.

    Furhat is built to study, implement, and verify models of situated and multiparty, multimodal Human-Machine spoken dialogue, a study that requires that the face is physically situated in the interaction environment rather than in a two-dimensional screen. It also has received much interest from several communities, and been showcased at several venues, including a robot exhibition at the London Science Museum. We present an evaluation study of Furhat at the exhibition where it interacted with several thousand persons in a multiparty conversation. The analysis of the data from the setup further shows that Furhat can accurately regulate multiparty interaction using gaze and head movements.

  • 6.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Prosodic Disambiguation in Spoken Systems Output2009Inngår i: Proceedings of Diaholmia'09: 2009 Workshop on the Semantics and Pragmatics of Dialogue / [ed] Jens Edlund, Joakim Gustafson, Anna Hjalmarsson, Gabriel Skantze, Stockholm, Sweden., 2009, s. 131-132Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This paper presents work on using prosody in the output of spoken dialogue systems to resolve possible structural ambiguity of output utterances. An algorithm is proposed to discover ambiguous parses of an utterance and to add prosodic disambiguation events to deliver the intended structure. By conducting a pilot experiment, the automatic prosodic grouping applied to ambiguous sentences shows the ability to deliver the intended interpretation of the sentences.

  • 7.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Towards rich multimodal behavior in spoken dialogues with embodied agents2013Inngår i: 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Proceedings, IEEE Computer Society, 2013, s. 817-822Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Spoken dialogue frameworks have traditionally been designed to handle a single stream of data - the speech signal. Research on human-human communication has been providing large evidence and quantifying the effects and the importance of a multitude of other multimodal nonverbal signals that people use in their communication, that shape and regulate their interaction. Driven by findings from multimodal human spoken interaction, and the advancements of capture devices and robotics and animation technologies, new possibilities are rising for the development of multimodal human-machine interaction that is more affective, social, and engaging. In such face-to-face interaction scenarios, dialogue systems can have a large set of signals at their disposal to infer context and enhance and regulate the interaction through the generation of verbal and nonverbal facial signals. This paper summarizes several design decision, and experiments that we have followed in attempts to build rich and fluent multimodal interactive systems using a newly developed hybrid robotic head called Furhat, and discuss issues and challenges that this effort is facing.

  • 8.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Alexanderson, Simon
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    A robotic head using projected animated faces2011Inngår i: Proceedings of the International Conference on Audio-Visual Speech Processing 2011 / [ed] Salvi, G.; Beskow, J.; Engwall, O.; Al Moubayed, S., Stockholm: KTH Royal Institute of Technology, 2011, s. 71-Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This paper presents a setup which employs virtual animatedagents for robotic heads. The system uses a laser projector toproject animated faces onto a three dimensional face mask. This approach of projecting animated faces onto a three dimensional head surface as an alternative to using flat, two dimensional surfaces, eliminates several deteriorating effects and illusions that come with flat surfaces for interaction purposes, such as exclusive mutual gaze and situated and multi-partner dialogues. In addition to that, it provides robotic heads with a flexible solution for facial animation which takes into advantage the advancements of facial animation using computer graphics overmechanically controlled heads.

  • 9.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Ananthakrishnan, Gopal
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Acoustic-to-Articulatory Inversion based on Local Regression2010Inngår i: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, s. 937-940Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This paper presents an Acoustic-to-Articulatory inversionmethod based on local regression. Two types of local regression,a non-parametric and a local linear regression have beenapplied on a corpus containing simultaneous recordings of positionsof articulators and the corresponding acoustics. A maximumlikelihood trajectory smoothing using the estimated dynamicsof the articulators is also applied on the regression estimates.The average root mean square error in estimating articulatorypositions, given the acoustics, is 1.56 mm for the nonparametricregression and 1.52 mm for the local linear regression.The local linear regression is found to perform significantlybetter than regression using Gaussian Mixture Modelsusing the same acoustic and articulatory features.

  • 10.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Ananthakrishnan, Gopal
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Enflo, Laura
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Musikakustik. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Automatic Prominence Classification in Swedish2010Inngår i: Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence, Chicago, USA, 2010Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This study aims at automatically classifying levels of acoustic prominence on a dataset of 200 Swedish sentences of read speech by one male native speaker. Each word in the sentences was categorized by four speech experts into one of three groups depending on the level of prominence perceived. Six acoustic features at a syllable level and seven features at a word level were used. Two machine learning algorithms, namely Support Vector Machines (SVM) and memory based Learning (MBL) were trained to classify the sentences into their respective classes. The MBL gave an average word level accuracy of 69.08% and the SVM gave an average accuracy of 65.17 % on the test set. These values were comparable with the average accuracy of the human annotators with respect to the average annotations. In this study, word duration was found to be the most important feature required for classifying prominence in Swedish read speech

  • 11.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT. KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Baklouti, M.
    Chetouani, M.
    Dutoit, T.
    Mahdhaoui, A.
    Martin, J. -C
    Ondas, S.
    Pelachaud, C.
    Urbain, J.
    Yilmaz, M.
    Generating Robot/Agent Backchannels During a Storytelling Experiment: 2009 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, VOLS 1-72009Inngår i: ICRA: 2009 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, 2009, s. 3749-3754Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This work presents the development of a real-time framework for the research of Multimodal Feedback of Robots/Talking Agents in the context of Human Robot Interaction (HRI) and Human Computer Interaction (HCI). For evaluating the framework, a Multimodal corpus is built (ENTERFACE_STEAD), and a study on the important multimodal features was done for building an active Robot/Agent listener of a storytelling experience with Humans. The experiments show that even when building the same reactive behavior models for Robot and Talking Agents, the interpretation and the realization of the behavior communicated is different due to the different communicative channels Robots/Agents offer be it physical but less-human-like in Robots, and virtual but more expressive and human-like in Talking agents.

  • 12.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    A novel Skype interface using SynFace for virtual speech reading support2011Inngår i: Proceedings from Fonetik 2011, June 8 - June 10, 2011: Speech, Music and Hearing, Quarterly Progress and Status Report, TMH-OPSR, Volume 51, 2011, Stockholm, Sweden, 2011, s. 33-36Konferansepaper (Annet vitenskapelig)
    Abstract [en]

    We describe in this paper a support client interface to the IP telephony application Skype. The system uses a variant of SynFace, a real-time speech reading support system using facial animation. The new interface is designed for the use by elderly persons, and tailored for use in systems supporting touch screens. The SynFace real-time facial animation system has previously shown ability to enhance speech comprehension for the hearing impaired persons. In this study weemploy at-home field studies on five subjects in the EU project MonAMI. We presentinsights from interviews with the test subjects on the advantages of the system, and onthe limitations of such a technology of real-time speech reading to reach the homesof elderly and the hard of hearing.

  • 13.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Effects of Visual Prominence Cues on Speech Intelligibility2009Inngår i: Proceedings of Auditory-Visual Speech Processing AVSP'09, Norwich, England, 2009Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permutated into different gestural conditions were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally-accented (prominent) words are supplemented with head-nods or with eye-brow raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of the non-verbal movements in the visual modality to support audio-visual speech perception.

  • 14.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Perception of Nonverbal Gestures of Prominence in Visual Speech Animation2010Inngår i: Proceedings of the ACM/SSPNET 2nd International Symposium on Facial Analysis and Animation, Edinburgh, UK, 2010, s. 25-Konferansepaper (Fagfellevurdert)
    Abstract [en]

    It has long been recognized that visual speech information is important for speech perception [McGurk and MacDonald 1976] [Summerfield 1992]. Recently there has been an increasing interest in the verbal and non-verbal interaction between the visual and the acoustic modalities from production and perception perspectives. One of the prosodic phenomena which attracts much focus is prominence. Prominence is defined as when a linguistic segment is made salient in its context.

  • 15.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Prominence Detection in Swedish Using Syllable Correlates2010Inngår i: Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, Makuhari, Japan, 2010, s. 1784-1787Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This paper presents an approach to estimating word level prominence in Swedish using syllable level features. The paper discusses the mismatch problem of annotations between word level perceptual prominence and its acoustic correlates, context, and data scarcity. 200 sentences are annotated by 4 speech experts with prominence on 3 levels. A linear model for feature extraction is proposed on a syllable level features, and weights of these features are optimized to match word level annotations. We show that using syllable level features and estimating weights for the acoustic correlates to minimize the word level estimation error gives better detection accuracy compared to word level features, and that both features exceed the baseline accuracy.

  • 16.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Blomberg, Mats
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Mirning, N.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Talking with Furhat - multi-party interaction with a back-projected robot head2012Inngår i: Proceedings of Fonetik 2012, Gothenberg, Sweden, 2012, s. 109-112Konferansepaper (Annet vitenskapelig)
    Abstract [en]

    This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We will describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum

  • 17.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Bollepalli, Bajibabu
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Hussen-Abdelaziz, A.
    Johansson, Martin
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Koutsombogera, M.
    Lopes, J. D.
    Novikova, J.
    Oertel, Catharine
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Stefanov, Kalin
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Varol, G.
    Human-robot Collaborative Tutoring Using Multiparty Multimodal Spoken Dialogue2014Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In this paper, we describe a project that explores a novel experi-mental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robotinteraction setup is designed, and a human-human dialogue corpus is collect-ed. The corpus targets the development of a dialogue system platform to study verbal and nonverbaltutoring strategies in mul-tiparty spoken interactions with robots which are capable of spo-ken dialogue. The dialogue task is centered on two participants involved in a dialogueaiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the par-ticipants perform the task, and organizes and balances their inter-action. Differentmultimodal signals captured and auto-synchronized by different audio-visual capture technologies, such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used build a situated model of the interaction based on the participants personalities, their state of attention, their conversational engagement and verbal domi-nance, and how that is correlated with the verbal and visual feed-back, turn-management, and conversation regulatory actions gen-erated by the tutor. Driven by the analysis of the corpus, we will show also the detailed design methodologies for an affective, and multimodally rich dialogue system that allows the robot to meas-ure incrementally the attention states, and the dominance for each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation, that attempts to maximize the agreement and the contribution to solve the task. This project sets the first steps to explore the potential of us-ing multimodal dialogue systems to build interactive robots that can serve in educational, team building, and collaborative task solving applications.

  • 18.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Bollepalli, Bajibabu
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Hussen-Abdelaziz, A.
    Johansson, Martin
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Koutsombogera, M.
    Lopes, J.
    Novikova, J.
    Oertel, Catharine
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Stefanov, Kalin
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Varol, G.
    Tutoring Robots: Multiparty Multimodal Social Dialogue With an Embodied Tutor2014Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This project explores a novel experimental setup towards building spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. With the participants sits a tutor that helps the participants perform the task and organizes and balances their interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies were coupled with manual annotations to build a situated model of the interaction based on the participants personalities, their temporally-changing state of attention, their conversational engagement and verbal dominance, and the way these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. At the end of this chapter we discuss the potential areas of research and developments this work opens and some of the challenges that lie in the road ahead.

  • 19.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Edlund, Jens
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Animated Faces for Robotic Heads: Gaze and Beyond2011Inngår i: Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues / [ed] Anna Esposito, Alessandro Vinciarelli, Klára Vicsi, Catherine Pelachaud and Anton Nijholt, Springer Berlin/Heidelberg, 2011, s. 19-35Konferansepaper (Fagfellevurdert)
    Abstract [en]

    We introduce an approach to using animated faces for robotics where a static physical object is used as a projection surface for an animation. The talking head is projected onto a 3D physical head model. In this chapter we discuss the different benefits this approach adds over mechanical heads. After that, we investigate a phenomenon commonly referred to as the Mona Lisa gaze effect. This effect results from the use of 2D surfaces to display 3D images and causes the gaze of a portrait to seemingly follow the observer no matter where it is viewed from. The experiment investigates the perception of gaze direction by observers. The analysis shows that the 3D model eliminates the effect, and provides an accurate perception of gaze direction. We discuss at the end the different requirements of gaze in interactive systems, and explore the different settings these findings give access to.

  • 20.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Auditory visual prominence From intelligibility to behavior2009Inngår i: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 3, nr 4, s. 299-309Artikkel i tidsskrift (Fagfellevurdert)
    Abstract [en]

    Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted, speech quality is acoustically degraded and the fundamental frequency is removed from the signal, then the speech is presented to 12 subjects through a lip synchronized talking head carrying head-nods and eyebrows raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires on 10 moderately hearing impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and the understanding of the talking head.

  • 21.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Mirning, Nicole
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Tscheligi, Manfred
    Furhat goes to Robotville: a large-scale multiparty human-robot interaction data collection in a public space2012Inngår i: Proc of LREC Workshop on Multimodal Corpora, Istanbul, Turkey, 2012Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In the four days of the Robotville exhibition at the London Science Museum, UK, during which the back-projected head Furhat in a situated spoken dialogue system was seen by almost 8 000 visitors, we collected a database of 10 000 utterances spoken to Furhat in situated interaction. The data collection is an example of a particular kind of corpus collection of human-machine dialogues in public spaces that has several interesting and specific characteristics, both with respect to the technical details of the collection and with respect to the resulting corpus contents. In this paper, we take the Furhat data collection as a starting point for a discussion of the motives for this type of data collection, its technical peculiarities and prerequisites, and the characteristics of the resulting corpus.

  • 22.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC).
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence2010Inngår i: 3rd COST 2102 International Training School on Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues / [ed] Esposito A; Esposito AM; Martone R; Muller VC; Scarpetta G, 2010, Vol. 6456, s. 55-71Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study a speech intelligibility experiment is conducted, where speech quality is acoustically degraded, then the speech is presented to 12 subjects through a lip synchronized talking head carrying head-nods and eyebrow raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.

  • 23.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Salvi, Giampiero
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech2008Inngår i: Proceedings of The second Swedish Language Technology Conference (SLTC), Stockholm, Sweden., 2008, s. 3-6Konferansepaper (Annet vitenskapelig)
    Abstract [en]

    In this paper, we present new results and comparisons of the real-time lips synchronized talking head SynFace on different Swedish databases and bandwidth. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are getting established for SynFace as an audio visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.

  • 24.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Spontaneous spoken dialogues with the Furhat human-like robot head2014Inngår i: HRI '14 Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, Bielefeld, Germany, 2014, s. 326-Konferansepaper (Fagfellevurdert)
    Abstract [en]

    We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is an anthropomorphic robot head that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously with rich output signals such as eye and head coordination, lips synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations. The dialogue design is performed using the IrisTK [4] dialogue authoring toolkit developed at KTH. The system will also be able to perform a moderator in a quiz-game showing different strategies for regulating spoken situated interactions.

  • 25.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    The Furhat Social Companion Talking Head2013Inngår i: Interspeech 2013 - Show and Tell, 2013, s. 747-749Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of a complex and advanced dialogue toolkits designed to facilitate rich and fluent multimodal multiparty human-machine situated and spoken dialogue. The demonstrator will present a social dialogue system with Furhat that allows for several simultaneous interlocutors, and takes advantage of several verbal and nonverbal input signals such as speech input, real-time multi-face tracking, and facial analysis, and communicates with its users in a mixed initiative dialogue, using state of the art speech synthesis, with rich prosody, lip animated facial synthesis, eye and head movements, and gestures.

  • 26.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction2012Inngår i: Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers / [ed] Anna Esposito, Antonietta M. Esposito, Alessandro Vinciarelli, Rüdiger Hoffmann, Vincent C. Müller, Springer Berlin/Heidelberg, 2012, s. 114-130Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In this chapter, we first present a summary of findings from two previous studies on the limitations of using flat displays with embodied conversational agents (ECAs) in the contexts of face-to-face human-agent interaction. We then motivate the need for a three dimensional display of faces to guarantee accurate delivery of gaze and directional movements and present Furhat, a novel, simple, highly effective, and human-like back-projected robot head that utilizes computer animation to deliver facial movements, and is equipped with a pan-tilt neck. After presenting a detailed summary on why and how Furhat was built, we discuss the advantages of using optically projected animated agents for interaction. We discuss using such agents in terms of situatedness, environment, context awareness, and social, human-like face-to-face interaction with robots where subtle nonverbal and social facial signals can be communicated. At the end of the chapter, we present a recent application of Furhat as a multimodal multiparty interaction system that was presented at the London Science Museum as part of a robot festival,. We conclude the paper by discussing future developments, applications and opportunities of this technology.

  • 27.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Öster, Anne-Marie
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Salvi, Giampiero
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    van Son, Nic
    Viataal, Nijmegen, The Netherlands.
    Ormel, Ellen
    Viataal, Nijmegen, The Netherlands.
    Herzke, Tobias
    HörTech gGmbH, Germany.
    Studies on Using the SynFace Talking Head for the Hearing Impaired2009Inngår i: Proceedings of Fonetik'09: The XXIIth Swedish Phonetics Conference, June 10-12, 2009 / [ed] Peter Branderud, Hartmut Traunmüller, Stockholm: Stockholm University, 2009, s. 140-143Konferansepaper (Annet vitenskapelig)
    Abstract [en]

    SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper wepresent the large scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focuson measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing impaired people, where groups of hearing impaired subjects with different impairment levels from mild to severe and cochlear implants are tested. Preliminaryanalysis of the results does not show significant gain in SRT or in effort scaling. But looking at large cross-subject variability in both tests, it isclear that many subjects benefit from SynFace especially with speech with stereo babble.

  • 28.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Öster, Ann-Marie
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Salvi, Giampiero
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    van Son, Nic
    Ormel, Ellen
    Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-Media Setting2009Inngår i: INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2009, s. 1443-1446Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In this paper we present recent results on the development of the SynFace lip synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large scale hearing impaired user studies carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold in Noise when using SynFace, and on measuring the effort scaling when using SynFace by hearing impaired people. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at inter-subject variability, it is clear that many subjects benefit from SynFace especially with speech with stereo babble noise.

  • 29.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT. KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    De Smet, Michael
    Van Hamme, Hugo
    Lip Synchronization: from Phone Lattice to PCA Eigen-projections using Neural Networks2008Inngår i: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2008, s. 2016-2019Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Lip synchronization is the process of generating natural lip movements from a speech signal. In this work we address the lip-sync problem using an automatic phone recognizer that generates a phone lattice carrying posterior probabilities. The acoustic feature vector contains the posterior probabilities of all the phones over a time window centered at the current time point. Hence this representation characterizes the phone recognition output including the confusion patterns caused by its limited accuracy. A 3D face model with varying texture is computed by analyzing a video recording of the speaker using a 3D morphable model. Training a neural network using 30 000 data vectors from an audiovisual recording in Dutch resulted in a very good simulation of the face on independent data sets of the same or of a different speaker.

  • 30.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Edlund, Jens
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections2012Inngår i: ACM Transactions on Interactive Intelligent Systems, ISSN 2160-6455, Vol. 1, nr 2, s. 25-, artikkel-id 11Artikkel i tidsskrift (Fagfellevurdert)
    Abstract [en]

    The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five subject gaze perception experiment. The experiment confirms that a 3Dprojection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in a human-ECA and a human-robot settings made possible by this technology.

  • 31.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Edlund, Jens
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Analysis of gaze and speech patterns in three-party quiz game interaction2013Inngår i: Interspeech 2013, 2013, s. 1126-1130Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In order to understand and model the dynamics between interaction phenomena such as gaze and speech in face-to-face multiparty interaction between humans, we need large quantities of reliable, objective data of such interactions. To date, this type of data is in short supply. We present a data collection setup using automated, objective techniques in which we capture the gaze and speech patterns of triads deeply engaged in a high-stakes quiz game. The resulting corpus consists of five one-hour recordings, and is unique in that it makes use of three state-of-the-art gaze trackers (one per subject) in combination with a state-of-theart conical microphone array designed to capture roundtable meetings. Several video channels are also included. In this paper we present the obstacles we encountered and the possibilities afforded by a synchronised, reliable combination of large-scale multi-party speech and gaze data, and an overview of the first analyses of the data. Index Terms: multimodal corpus, multiparty dialogue, gaze patterns, multiparty gaze.

  • 32.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Heylen, D.
    Bohus, D.
    Koutsombogera, Maria
    Papageorgiou, H.
    Esposito, A.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    UM3I 2014: International workshop on understanding and modeling multiparty, multimodal interactions2014Inngår i: ICMI 2014 - Proceedings of the 2014 International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2014, s. 537-538Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In this paper, we present a brief summary of the international workshop on Modeling Multiparty, Multimodal Interactions. The UM3I 2014 workshop is held in conjunction with the ICMI 2014 conference. The workshop will highlight recent developments and adopted methodologies in the analysis and modeling of multiparty and multimodal interactions, the design and implementation principles of related human-machine interfaces, as well as the identification of potential limitations and ways of overcoming them.

  • 33.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Effects of 2D and 3D Displays on Turn-taking Behavior in Multiparty Human-Computer Dialog2011Inngår i: SemDial 2011: Proceedings of the 15th Workshop on the Semantics and Pragmatics of Dialogue / [ed] Ron Artstein, Mark Core, David DeVault, Kallirroi Georgila, Elsi Kaiser, Amanda Stent, Los Angeles, CA, 2011, s. 192-193Konferansepaper (Fagfellevurdert)
    Abstract [en]

    The perception of gaze from an animated agenton a 2D display has been shown to suffer fromthe Mona Lisa effect, which means that exclusive mutual gaze cannot be established if there is more than one observer. In this study, we investigate this effect when it comes to turntaking control in a multi-party human-computerdialog setting, where a 2D display is compared to a 3D projection. The results show that the 2D setting results in longer response times andlower turn-taking accuracy.

  • 34.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Perception of Gaze Direction for Situated Interaction2012Inngår i: Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, Gaze-In 2012, ACM , 2012Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Accurate human perception of robots' gaze direction is crucial for the design of a natural and fluent situated multimodal face-to-face interaction between humans and machines. In this paper, we present an experiment targeted at quantifying the effects of different gaze cues synthesized using the Furhat back-projected robot head, on the accuracy of perceived spatial direction of gaze by humans using 18 test subjects. The study first quantifies the accuracy of the perceived gaze direction in a human-human setup, and compares that to the use of synthesized gaze movements in different conditions: viewing the robot eyes frontal or at a 45 degrees angle side view. We also study the effect of 3D gaze by controlling both eyes to indicate the depth of the focal point (vergence), the use of gaze or head pose, and the use of static or dynamic eyelids. The findings of the study are highly relevant to the design and control of robots and animated agents in situated face-to-face interaction.

  • 35.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Turn-taking Control Using Gaze in Multiparty Human-Computer Dialogue: Effects of 2D and 3D Displays2011Inngår i: Proceedings of the International Conference on Audio-Visual Speech Processing 2011, Stockholm: KTH Royal Institute of Technology, 2011, s. 99-102Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In a previous experiment we found that the perception of gazefrom an animated agent on a two-dimensional display suffersfrom the Mona Lisa effect, which means that exclusive mutual gaze cannot be established if there is more than one observer. By using a three-dimensional projection surface, this effect can be eliminated. In this study, we investigate whether this difference also holds for the turn-taking behaviour of subjects interacting with the animated agent in a multi-party dialogue. We present a Wizard-of-Oz experiment where five subjects talk toan animated agent in a route direction dialogue. The results show that the subjects to some extent can infer the intended target of the agent’s questions, in spite of the Mona Lisa effect, but that the accuracy of gaze when it comes to selecting an addressee is still significantly lower in the 2D condition, ascompared to the 3D condition. The response time is also significantly longer in the 2D condition, indicating that the inference of intended gaze may require additional cognitive efforts.

  • 36.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Lip-reading: Furhat audio visual intelligibility of a back projected animated face2012Inngår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Berlin/Heidelberg, 2012, s. 196-203Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Back projecting a computer animated face, onto a three dimensional static physical model of a face, is a promising technology that is gaining ground as a solution to building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; and then motivate the need to investigating the contribution to speech intelligibility Furhat's face offers. We present an audio-visual speech intelligibility experiment, in which 10 subjects listened to short sentences with degraded speech signal. The experiment compares the gain in intelligibility between lip reading a face visualized on a 2D screen compared to a 3D back-projected face and from different viewing angles. The results show that the audio-visual speech intelligibility holds when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds it. This means that despite the movement limitations back projected animated face models bring about; their audio visual speech intelligibility is equal, or even higher, compared to the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception 3D projected faces.

  • 37.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    The Furhat Back-Projected Humanoid Head-Lip Reading, Gaze And Multi-Party Interaction2013Inngår i: International Journal of Humanoid Robotics, ISSN 0219-8436, Vol. 10, nr 1, s. 1350005-Artikkel i tidsskrift (Fagfellevurdert)
    Abstract [en]

    In this paper, we present Furhat - a back-projected human-like robot head using state-of-the art facial animation. Three experiments are presented where we investigate how the head might facilitate human - robot face-to-face interaction. First, we investigate how the animated lips increase the intelligibility of the spoken output, and compare this to an animated agent presented on a flat screen, as well as to a human face. Second, we investigate the accuracy of the perception of Furhat's gaze in a setting typical for situated interaction, where Furhat and a human are sitting around a table. The accuracy of the perception of Furhat's gaze is measured depending on eye design, head movement and viewing angle. Third, we investigate the turn-taking accuracy of Furhat in a multi-party interactive setting, as compared to an animated agent on a flat screen. We conclude with some observations from a public setting at a museum, where Furhat interacted with thousands of visitors in a multi-party interaction.

  • 38.
    Al Moubayed, Samer
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Stefanov, Kalin
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Multimodal Multiparty Social Interaction with the Furhat Head2012Konferansepaper (Fagfellevurdert)
    Abstract [en]

    We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is a human-like interface that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously with rich output signals such as eye and head coordination, lips synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations.

  • 39.
    Beskow, Jonas
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Perception of Gaze Direction in 2D and 3D Facial Projections2010Inngår i: The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation, New York, USA: ACM Press, 2010, s. 24-24Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In human-human communication, eye gaze is a fundamental cue in e.g. turn-taking and interaction control [Kendon 1967]. Accurate control of gaze direction is therefore crucial in many applications of animated avatars striving to simulate human interactional behaviors. One inherent complication when conveying gaze direction through a 2D display, however, is what has been referred to as the Mona Lisa effect; if the avatar is gazing towards the camera, the eyes seem to "follow" the beholder whatever vantage point he or she may assume [Boyarskaya and Hecht 2010]. This becomes especially problematic in applications where multiple persons are interacting with the avatar, and the system needs to use gaze to address a specific person. Introducing 3D structure in the facial display, e.g. projecting the avatar face on a face mask, makes the percept of the avatar's gazechange with the viewing angle, as is indeed the case with real faces. To this end, [Delaunay et al. 2010] evaluated two back-projected displays - a spherical "dome" and a face shaped mask. However, there may be many factors influencing gaze directionpercieved from a 3D facial display, so an accurate calibration procedure for gaze directionis called for.

  • 40.
    Beskow, Jonas
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Alexanderson, Simon
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Edlund, Jens
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    House, David
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation2011Inngår i: Proceedings of International Conference on Audio-Visual Speech Processing 2011 / [ed] Salvi, G.; Beskow, J.; Engwall, O.; Al Moubayed, S., Stockholm: KTH Royal Institute of Technology, 2011, s. 103-106Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Spoken face to face interaction is a rich and complex form of communication that includes a wide array of phenomena thatare not fully explored or understood. While there has been extensive studies on many aspects in face-to-face interaction, these are traditionally of a qualitative nature, relying on hand annotated corpora, typically rather limited in extent, which is a natural consequence of the labour intensive task of multimodal data annotation. In this paper we present a corpus of 60 hours of unrestricted Swedish face-to-face conversations recorded with audio, video and optical motion capture, and we describe a new project setting out to exploit primarily the kinetic data in this corpus in order to gain quantitative knowledge on humanface-to-face interaction.

  • 41.
    Beskow, Jonas
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT. KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT. KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Nordqvist, Peter
    KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT. KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Salvi, Giampiero
    KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Centrum för Talteknologi, CTT. KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Herzke, Tobias
    Schulz, Arne
    Hearing at Home: Communication support in home environments for hearing impaired persons2008Inngår i: INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, BAIXAS: ISCA-INST SPEECH COMMUNICATION ASSOC , 2008, s. 2203-2206Konferansepaper (Fagfellevurdert)
    Abstract [en]

    The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.

  • 42.
    Beskow, Jonas
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Salvi, Giampiero
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    SynFace: Verbal and Non-verbal Face Animation from Audio2009Inngår i: Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09 / [ed] Barry-John Theobald, Richard Harvey, Norwich, England, 2009Konferansepaper (Fagfellevurdert)
    Abstract [en]

    We give an overview of SynFace, a speech-drivenface animation system originally developed for theneeds of hard-of-hearing users of the telephone. Forthe 2009 LIPS challenge, SynFace includes not onlyarticulatory motion but also non-verbal motion ofgaze, eyebrows and head, triggered by detection ofacoustic correlates of prominence and cues for interactioncontrol. In perceptual evaluations, both verbaland non-verbal movmements have been found to havepositive impact on word recognition scores.

  • 43. Bisitz, T.
    et al.
    Herzke, T.
    Zokoll, M.
    Öster, Anne-Marie
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Ormel, E.
    Van Son, N.
    Tanke, R.
    Noise Reduction for Media Streams2009Inngår i: NAG/DAGA'09 International Conference on Acoustics: including the 35th German Annual Conference on Acoustics (DAGA) / [ed] Marinus M. Boone, Red Hook, USA: Curran Associates, Inc., 2009Konferansepaper (Fagfellevurdert)
  • 44.
    Blomberg, Mats
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Skantze, Gabriel
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Granström, Björn
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Children and adults in dialogue with the robot head Furhat - corpus collection and initial analysis2012Inngår i: Proceedings of WOCCI, Portland, OR, 2012Konferansepaper (Fagfellevurdert)
  • 45.
    Edlund, Jens
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Co-present or Not?: Embodiment, Situatedness and the Mona Lisa Gaze Effect2013Inngår i: Eye gaze in intelligent user interfaces: gaze-based analyses, models and applications / [ed] Nakano, Yukiko; Conati, Cristina; Bader, Thomas, London: Springer London, 2013, s. 185-203Kapittel i bok, del av antologi (Fagfellevurdert)
    Abstract [en]

    The interest in embodying and situating computer programmes took off in the autonomous agents community in the 90s. Today, researchers and designers of programmes that interact with people on human terms endow their systems with humanoid physiognomies for a variety of reasons. In most cases, attempts at achieving this embodiment and situatedness has taken one of two directions: virtual characters and actual physical robots. In addition, a technique that is far from new is gaining ground rapidly: projection of animated faces on head-shaped 3D surfaces. In this chapter, we provide a history of this technique; an overview of its pros and cons; and an in-depth description of the cause and mechanics of the main drawback of 2D displays of 3D faces (and objects): the Mona Liza gaze effect. We conclude with a description of an experimental paradigm that measures perceived directionality in general and the Mona Lisa gaze effect in particular.

  • 46.
    Edlund, Jens
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Beskow, Jonas
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    The Mona Lisa Gaze Effect as an Objective Metric for Perceived Cospatiality2011Inngår i: Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011) / [ed] Vilhjálmsson, Hannes Högni; Kopp, Stefan; Marsella, Stacy; Thórisson, Kristinn R., Springer , 2011, s. 439-440Konferansepaper (Fagfellevurdert)
    Abstract [en]

    We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected.

  • 47.
    Edlund, Jens
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Tånnander, Christina
    Swedish Agency for Accessible Media, MTM, Stockholm, Sweden.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Audience response system based annotation of speech2013Inngår i: Proceedings of Fonetik 2013, Linköping: Linköping University , 2013, s. 13-16Konferansepaper (Annet vitenskapelig)
    Abstract [en]

    Manual annotators are often used to label speech. The task is associated with high costs and with great time consumption. We suggest to reach an increased throughput while maintaining a high measure of experimental control by borrowing from the Audience Response Systems used in the film and television industries, and demonstrate a cost-efficient setup for rapid, plenary annotation of phenomena occurring in recorded speech together with some results from studies we have undertaken to quantify the temporal precision and reliability of such annotations.

  • 48.
    Edlund, Jens
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Tånnander, Christina
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Temporal precision and reliability of audience response system based annotation2013Inngår i: Proc. of Multimodal Corpora 2013, 2013Konferansepaper (Fagfellevurdert)
  • 49.
    Edlund, Jens
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Heldner, Mattias
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Gravano, Agustín
    Computer Science Department, University of Buenos Aires.
    Hirschberg, Julia
    Department of Computer Science, Columbia University.
    Very short utterances in conversation2010Inngår i: Proceedings from Fonetik 2010, Lund, June 2-4, 2010 / [ed] Susanne Schötz, Gilbert Ambrazaitis, Lund, Sweden: Lund University , 2010, s. 11-16Konferansepaper (Annet vitenskapelig)
    Abstract [en]

    Faced with the difficulties of finding an operationalized definition of backchannels, we have previously proposed an intermediate, auxiliary unit – the very short utterance (VSU) – which is defined operationally and is automatically extractable from recorded or ongoing dialogues. Here, we extend that work in the following ways: (1) we test the extent to which the VSU/NONVSU distinction corresponds to backchannels/non-backchannels in a different data set that is manually annotated for backchannels – the Columbia Games Corpus; (2) we examine to the extent to which VSUS capture other short utterances with a vocabulary similar to backchannels; (3) we propose a VSU method for better managing turn-taking and barge-ins in spoken dialogue systems based on detection of backchannels; and (4) we attempt to detect backchannels with better precision by training a backchannel classifier using durations and inter-speaker relative loudness differences as features. The results show that VSUS indeed capture a large proportion of backchannels – large enough that VSUs can be used to improve spoken dialogue system turntaking; and that building a reliable backchannel classifier working in real time is feasible.

  • 50.
    Haider, Fasih
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Al Moubayed, Samer
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH, Tal-kommunikation.
    Towards speaker detection using lips movements for human-machine multiparty dialogue2012Inngår i: Proceedings of Fonetik'12, Gothenberg, Sweden, 2012, s. 117-120Konferansepaper (Fagfellevurdert)
12 1 - 50 of 57
RefereraExporteraLink til resultatlisten
Permanent link
Referera
Referensformat
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annet format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annet språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf