Publications (10 of 104)
Kontogiorgos, D., Avramova, V., Alexanderson, S., Jonell, P., Oertel, C., Beskow, J., . . . Gustafson, J. (2018). A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paper presented at International Conference on Language Resources and Evaluation (LREC 2018) (pp. 119-127). Paris
2018 (English). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a corpus of multiparty situated interaction in which participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains a variety of multimodal data: speech, eye gaze and gesture captured with a multisensory setup (wearable eye trackers, motion capture and audio/video). In describing the corpus, we investigate four types of social gaze, namely referential gaze, joint attention, mutual gaze and gaze aversion, from the perspectives of both speaker and listener. We annotated the groups' object references during object manipulation tasks and analysed each group's proportional referential eye gaze with regard to the referent object. Examining the distribution of gaze during and before referring expressions, we corroborate the timing differences between speakers' and listeners' eye gaze reported in earlier studies. The corpus is of particular interest to researchers studying social eye-gaze patterns in turn-taking and referring language in situated multiparty interaction.

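The proportional referential gaze measure mentioned in the abstract can be illustrated with a short script. This is a minimal sketch under assumed data formats, not the authors' analysis code: the gaze-sample and referring-expression structures below are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical formats: a gaze sample is (time_s, fixated_object) and a referring
# expression is (start_s, end_s, referent_object).
GazeSample = Tuple[float, str]
RefExpression = Tuple[float, float, str]

def proportional_referential_gaze(gaze: List[GazeSample],
                                  expressions: List[RefExpression]
                                  ) -> List[Tuple[str, Optional[float]]]:
    """Share of gaze samples on the referent during each referring expression."""
    results = []
    for start, end, referent in expressions:
        window = [obj for t, obj in gaze if start <= t <= end]
        if not window:
            results.append((referent, None))   # no gaze samples in this window
            continue
        results.append((referent, sum(obj == referent for obj in window) / len(window)))
    return results

# Toy example: four gaze samples around one annotated reference to "cup".
gaze_log = [(0.00, "cup"), (0.02, "cup"), (0.03, "moderator"), (0.05, "cup")]
print(proportional_referential_gaze(gaze_log, [(0.0, 0.06, "cup")]))  # [('cup', 0.75)]
```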
Place, publisher, year, edition, pages
Paris, 2018
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-230238 (URN), 979-10-95546-00-9 (ISBN)
Conference
International Conference on Language Resources and Evaluation (LREC 2018)
Jonell, P., Oertel, C., Kontogiorgos, D., Beskow, J. & Gustafson, J. (2018). Crowdsourced Multimodal Corpora Collection Tool. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paper presented at The Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 728-734). Paris
2018 (English). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 728-734. Conference paper, Published paper (Refereed)
Abstract [en]

In recent years, more and more multimodal corpora have been created, but to our knowledge there is no publicly available tool for acquiring controlled multimodal data from people in a rapid and scalable fashion. We therefore propose (1) a novel tool that enables researchers to rapidly gather large amounts of multimodal data spanning a wide demographic range, and (2) an example of how we used this tool to collect our "Attentive listener" multimodal corpus. The code is released under the Apache License 2.0 and is available as an open-source repository at https://github.com/kth-social-robotics/multimodal-crowdsourcing-tool. The tool allows researchers to quickly set up their own multimodal data collection systems and create their own corpora. Finally, the paper discusses the advantages and disadvantages of a crowd-sourced data collection tool, especially in comparison to lab-recorded corpora.

Place, publisher, year, edition, pages
Paris, 2018
National Category
Engineering and Technology
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-230236 (URN), 979-10-95546-00-9 (ISBN)
Conference
The Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Malisz, Z., Berthelsen, H., Beskow, J. & Gustafson, J. (2017). Controlling prominence realisation in parametric DNN-based speech synthesis. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017. Paper presented at 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017 (pp. 1079-1083). International Speech Communication Association, 2017
2017 (English). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, International Speech Communication Association, 2017, Vol. 2017, p. 1079-1083. Conference paper, Published paper (Refereed)
Abstract [en]

This work aims to improve text-to-speech synthesis for Wikipedia by advancing and implementing models of prosodic prominence. We propose a new system architecture with explicit prominence modeling and test its first component. We automatically extract a phonetic feature related to prominence from the speech signal in the ARCTIC corpus, modify the label files accordingly, and train an experimental TTS system on the feature using Merlin, a statistical-parametric DNN-based engine. Test sentences with contrastive word-level prominence are synthesised, and separate listening tests are conducted to evaluate (a) the degree of prominence control in the generated speech and (b) naturalness. Our results show that the prominence feature-enhanced system successfully places prominence on the appropriate words and increases perceived naturalness relative to the baseline.

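For readers who want a concrete picture of what a signal-derived prominence feature might look like before it is written into TTS label files, here is a minimal sketch. It is not the feature or the Merlin pipeline used in the paper; the forced-alignment format and the RMS-energy proxy are assumptions.

```python
import numpy as np

def word_prominence_proxy(wav: np.ndarray, sr: int, words):
    """Return (word, z-scored RMS energy) pairs as a crude per-word prominence proxy.

    `words` is a list of (word, start_s, end_s) tuples, e.g. from forced alignment.
    """
    rms = []
    for _, start, end in words:
        segment = wav[int(start * sr):int(end * sr)]
        rms.append(np.sqrt(np.mean(segment ** 2)) if segment.size else 0.0)
    rms = np.asarray(rms)
    z = (rms - rms.mean()) / (rms.std() + 1e-9)   # normalise within the utterance
    return [(w, float(score)) for (w, _, _), score in zip(words, z)]

# Hypothetical usage: a synthetic signal that grows louder, plus a made-up alignment.
sr = 16000
wav = np.random.randn(2 * sr) * np.linspace(0.2, 1.0, 2 * sr)
alignment = [("the", 0.0, 0.4), ("red", 0.5, 1.0), ("ball", 1.2, 1.9)]
print(word_prominence_proxy(wav, sr, alignment))   # the last word scores highest
```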
Place, publisher, year, edition, pages
International Speech Communication Association, 2017
Series
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X
Keywords
Deep neural networks, Prosodic prominence, Speech synthesis
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-222092 (URN), 10.21437/Interspeech.2017-1355 (DOI), 2-s2.0-85039164235 (Scopus ID)
Conference
18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, Stockholm, Sweden, 20 August 2017 through 24 August 2017
Zhang, Y., Beskow, J. & Kjellström, H. (2017). Look but Don’t Stare: Mutual Gaze Interaction in Social Robots. In: 9th International Conference on Social Robotics, ICSR 2017. Paper presented at 9th International Conference on Social Robotics, ICSR 2017, Tsukuba, Japan, 22 November 2017 through 24 November 2017 (pp. 556-566). Springer, 10652
2017 (English). In: 9th International Conference on Social Robotics, ICSR 2017, Springer, 2017, Vol. 10652, p. 556-566. Conference paper, Published paper (Refereed)
Abstract [en]

Mutual gaze is a powerful cue for communicating social attention and intention. Numerous studies have demonstrated its fundamental role in establishing communicative links between humans and in enabling non-verbal communication of social attention and intention. The amount of mutual gaze between two partners regulates human-human interaction and is a sign of social engagement. This paper investigates whether implementing mutual gaze in robotic systems can achieve similar social effects and thereby improve human-robot interaction. Based on insights from studies of human face-to-face interaction, we implemented an interactive mutual gaze model in an embodied agent, the social robot head Furhat, and evaluated the prototype with 24 participants across three applications. Our results show that the mutual gaze model improves social connectedness between robots and users.

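A "look but don't stare" behaviour can be thought of as a timing policy: hold mutual gaze while the user looks at the robot, but break it off after a bounded period before re-engaging. The sketch below is a hypothetical illustration of such a policy, not the model that was implemented on Furhat; the state names and duration parameters are assumptions.

```python
class MutualGazeController:
    """Toy gaze policy: meet the user's gaze, but avert after a bounded hold time."""

    def __init__(self, max_hold_s: float = 3.0, avert_s: float = 1.0):
        self.max_hold_s = max_hold_s
        self.avert_s = avert_s
        self.state = "idle"
        self.timer = 0.0

    def update(self, dt: float, user_looking_at_robot: bool) -> str:
        """Advance the policy by dt seconds and return a gaze command."""
        self.timer += dt
        if self.state == "idle" and user_looking_at_robot:
            self.state, self.timer = "mutual", 0.0
        elif self.state == "mutual" and (not user_looking_at_robot or self.timer > self.max_hold_s):
            self.state, self.timer = "avert", 0.0
        elif self.state == "avert" and self.timer > self.avert_s:
            self.state, self.timer = "idle", 0.0
        return {"idle": "look_ahead", "mutual": "look_at_user", "avert": "look_away"}[self.state]

# Hypothetical 50 Hz control loop with a user who keeps looking at the robot.
controller = MutualGazeController()
commands = [controller.update(dt=0.02, user_looking_at_robot=True) for _ in range(250)]
print(commands.count("look_at_user"), commands.count("look_away"))  # gaze is held, then broken
```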
Place, publisher, year, edition, pages
Springer, 2017
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743 ; 10652
National Category
Interaction Technologies
Identifiers
urn:nbn:se:kth:diva-219664 (URN), 10.1007/978-3-319-70022-9_55 (DOI), 2-s2.0-85035749029 (Scopus ID), 9783319700212 (ISBN)
Conference
9th International Conference on Social Robotics, ICSR 2017, Tsukuba, Japan, 22 November 2017 through 24 November 2017
Alexanderson, S., House, D. & Beskow, J. (2016). Automatic annotation of gestural units in spontaneous face-to-face interaction. In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction. Paper presented at 2016 Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI 2016, 12 November 2016 through 16 November 2016 (pp. 15-19).
2016 (English). In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, 2016, p. 15-19. Conference paper, Published paper (Refereed)
Abstract [en]

Speech and gesture co-occur in spontaneous dialogue in a highly complex fashion. There is a large variability in the motion that people exhibit during a dialogue, and different kinds of motion occur during different states of the interaction. A wide range of multimodal interface applications, for example in the fields of virtual agents or social robots, can be envisioned where it is important to be able to automatically identify gestures that carry information and discriminate them from other types of motion. While it is easy for a human to distinguish and segment manual gestures from a flow of multimodal information, the same task is not trivial to perform for a machine. In this paper we present a method to automatically segment and label gestural units from a stream of 3D motion capture data. The gestural flow is modeled with a 2-level Hierarchical Hidden Markov Model (HHMM) where the sub-states correspond to gesture phases. The model is trained based on labels of complete gesture units and self-adaptive manipulators. The model is tested and validated on two datasets differing in genre and in method of capturing motion, and outperforms a state-of-the-art SVM classifier on a publicly available dataset.

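The segmentation idea can be sketched with an off-the-shelf HMM library. The snippet below is a simplified stand-in for the paper's 2-level HHMM: it fits a flat Gaussian HMM (hmmlearn is an assumed dependency) to hand-motion features such as speed, and reads the decoded states as candidate gesture phases (rest vs. movement). The feature choice and number of states are assumptions, not the authors' configuration.

```python
import numpy as np
from hmmlearn import hmm   # assumed dependency: pip install hmmlearn

# Hypothetical features from motion capture: per-frame hand speed and acceleration.
rng = np.random.default_rng(0)
rest = rng.normal([0.01, 0.0], [0.005, 0.01], size=(300, 2))    # low-motion frames
stroke = rng.normal([0.5, 0.1], [0.1, 0.05], size=(100, 2))     # gesture strokes
features = np.vstack([rest[:200], stroke, rest[200:]])

# A flat 2-state Gaussian HMM as a simplified stand-in for the 2-level HHMM:
# one state is expected to capture rest poses, the other gestural movement.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50, random_state=0)
model.fit(features)
states = model.predict(features)

# Turn the decoded state sequence into labelled segments (start_frame, end_frame, state).
segments = []
start = 0
for i in range(1, len(states) + 1):
    if i == len(states) or states[i] != states[start]:
        segments.append((start, i - 1, int(states[start])))
        start = i
print(segments[:5])
```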
Keywords
Gesture recognition, Motion capture, Spontaneous dialogue, Hidden Markov models, Man machine systems, Markov processes, Online systems, 3D motion capture, Automatic annotation, Face-to-face interaction, Hierarchical hidden markov models, Multi-modal information, Multi-modal interfaces, Classification (of information)
National Category
Robotics
Identifiers
urn:nbn:se:kth:diva-202135 (URN), 10.1145/3011263.3011268 (DOI), 2-s2.0-85003571594 (Scopus ID), 9781450345620 (ISBN)
Conference
2016 Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI 2016, 12 November 2016 through 16 November 2016
Funder
Swedish Research Council, 2010-4646
Note

Funding text: The work reported here is carried out within the projects: "Timing of intonation and gestures in spoken communication," (P12-0634:1) funded by the Bank of Sweden Tercentenary Foundation, and "Large-scale massively multimodal modelling of non-verbal behaviour in spontaneous dialogue," (VR 2010-4646) funded by Swedish Research Council.

Skantze, G., Johansson, M. & Beskow, J. (2015). A Collaborative Human-Robot Game as a Test-bed for Modelling Multi-party, Situated Interaction. In: INTELLIGENT VIRTUAL AGENTS, IVA 2015. Paper presented at 15th International Conference on Intelligent Virtual Agents (IVA), AUG 26-28, 2015, Delft, NETHERLANDS (pp. 348-351).
2015 (English). In: INTELLIGENT VIRTUAL AGENTS, IVA 2015, 2015, p. 348-351. Conference paper, Published paper (Refereed)
Abstract [en]

In this demonstration we present a test-bed for collecting data and testing models of multi-party, situated interaction between humans and robots. Two users play a collaborative card sorting game together with the robot head Furhat. The cards are shown on a touch table between the players and thus constitute a target for joint attention. The system was exhibited at the Swedish National Museum of Science and Technology for nine days, resulting in a rich multi-modal corpus with users of mixed ages.

Series
Lecture Notes in Artificial Intelligence, ISSN 0302-9743 ; 9238
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-177091 (URN), 10.1007/978-3-319-21996-7_37 (DOI), 000363485400037 (), 2-s2.0-84943635137 (Scopus ID), 978-3-319-21996-7 (ISBN), 978-3-319-21995-0 (ISBN)
Conference
15th International Conference on Intelligent Virtual Agents (IVA), AUG 26-28, 2015, Delft, NETHERLANDS
Skantze, G., Johansson, M. & Beskow, J. (2015). Exploring Turn-taking Cues in Multi-party Human-robot Discussions about Objects. In: Proceedings of the 2015 ACM International Conference on Multimodal Interaction. Paper presented at ACM International Conference on Multimodal Interaction, ICMI 2015; Seattle, United States; 9-13 November 2015. Association for Computing Machinery (ACM)
2015 (English). In: Proceedings of the 2015 ACM International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2015. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present a dialog system that was exhibited at the Swedish National Museum of Science and Technology. Two visitors at a time could play a collaborative card sorting game together with the robot head Furhat, where the three players discuss the solution together. The cards are shown on a touch table between the players, thus constituting a target for joint attention. We describe how the system was implemented in order to manage turn-taking and attention to users and objects in the shared physical space. We also discuss how multi-modal redundancy (from speech, card movements and head pose) is exploited to maintain meaningful discussions, given that the system has to process conversational speech from both children and adults in a noisy environment. Finally, we present an analysis of 373 interactions, where we investigate the robustness of the system, to what extent the system's attention can shape the users' turn-taking behaviour, and how the system can produce multi-modal turn-taking signals (filled pauses, facial gestures, breath and gaze) to deal with processing delays in the system.

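One mechanism described above, producing turn-taking signals to cover processing delays, can be sketched as a simple polling loop. This is a hypothetical illustration, not the exhibited system's code: the behaviour names, thresholds and backend interface are assumptions.

```python
import random
import time

TURN_HOLDING_BEHAVIOURS = ["filled_pause", "facial_gesture", "breath", "gaze_aside"]

def respond_with_fillers(generate_response, max_silence_s=0.8, poll_s=0.05):
    """Poll a possibly slow response generator; whenever the silence grows longer
    than max_silence_s, emit a turn-holding behaviour so the floor is not lost."""
    behaviours, last_activity = [], time.monotonic()
    while True:
        response = generate_response()              # returns None until ready
        if response is not None:
            return behaviours, response
        if time.monotonic() - last_activity > max_silence_s:
            behaviours.append(random.choice(TURN_HOLDING_BEHAVIOURS))
            last_activity = time.monotonic()        # reset the silence timer
        time.sleep(poll_s)

# Hypothetical backend that needs about two seconds to produce an answer.
ready_at = time.monotonic() + 2.0
slow_backend = lambda: "Let's try the card on the left." if time.monotonic() >= ready_at else None
print(respond_with_fillers(slow_backend))           # a couple of fillers, then the utterance
```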
Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2015
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180422 (URN), 10.1145/2818346.2820749 (DOI), 000380609500012 (), 2-s2.0-84959259564 (Scopus ID)
Conference
ACM International Conference on Multimodal Interaction, ICMI 2015; Seattle, United States; 9-13 November 2015
House, D., Alexanderson, S. & Beskow, J. (2015). On the temporal domain of co-speech gestures: syllable, phrase or talk spurt? In: Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J. (Ed.), Proceedings of Fonetik 2015. Paper presented at Fonetik 2015, Lund (pp. 63-68).
2015 (English). In: Proceedings of Fonetik 2015 / [ed] Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J., 2015, p. 63-68. Conference paper, Published paper (Other academic)
Abstract [en]

This study explores the use of automatic methods to detect and extract hand gesture movement co-occurring with speech. Two spontaneous dyadic dialogues were analyzed using 3D motion-capture techniques to track hand movement. Automatic speech/non-speech detection was performed on the dialogues, resulting in a series of connected talk spurts for each speaker. Temporal synchrony of the onset and offset of gesture and speech was studied between the automatic hand gesture tracking and the talk spurts, and compared to an earlier study of head nods and syllable synchronization. The results indicated onset synchronization between head nods and the syllable in the short temporal domain, and between the onset of longer gesture units and the talk spurt in a more extended temporal domain.

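The onset-synchrony measure discussed in the abstract can be sketched as pairing each detected gesture unit with the nearest talk-spurt onset and reporting the signed time difference. This is a hypothetical illustration under an assumed interval format, not the study's analysis code.

```python
def onset_asynchronies(gesture_units, talk_spurts):
    """Signed offset in seconds between each gesture-unit onset and the nearest
    talk-spurt onset; negative values mean the gesture leads the speech."""
    offsets = []
    for gesture_start, _gesture_end in gesture_units:
        nearest_onset = min((start for start, _ in talk_spurts),
                            key=lambda start: abs(gesture_start - start))
        offsets.append(round(gesture_start - nearest_onset, 3))
    return offsets

# Hypothetical intervals (in seconds) for one speaker.
gestures = [(1.8, 3.4), (7.1, 9.0)]      # automatically tracked gesture units
spurts = [(2.0, 5.5), (7.0, 10.2)]       # talk spurts from speech/non-speech detection
print(onset_asynchronies(gestures, spurts))   # [-0.2, 0.1]
```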
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180407 (URN)
Conference
Fonetik 2015, Lund
Alexanderson, S. & Beskow, J. (2015). Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar. ACM Transactions on Accessible Computing, 7(2), 7:1-7:17
2015 (English). In: ACM Transactions on Accessible Computing, ISSN 1936-7228, Vol. 7, no 2, p. 7:1-7:17. Article in journal (Refereed), Published
Abstract [en]

Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2015
Keywords
Augmentative and alternative communication (AAC), Motion capture, Sign language, Virtual characters
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180427 (URN), 10.1145/2764918 (DOI), 000360070800004 (), 2-s2.0-84935145760 (Scopus ID)
Alexanderson, S. & Beskow, J. (2014). Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions. Computer speech & language (Print), 28(2), 607-618
2014 (English). In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 2, p. 607-618. Article in journal (Refereed), Published
Abstract [en]

In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.

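The "error minimization procedure" used to drive the animated talker can be pictured, in very reduced form, as a per-frame least-squares fit of blendshape weights to captured marker positions. This is an assumed formulation for illustration only, not the paper's actual retargeting method; the marker layout and blendshape basis are made up.

```python
import numpy as np

def fit_blendshape_weights(markers: np.ndarray, neutral: np.ndarray,
                           basis: np.ndarray) -> np.ndarray:
    """Least-squares fit of blendshape weights w so that neutral + basis @ w
    approximates the captured marker positions for one frame.

    markers, neutral: flattened (3 * n_markers,) vectors
    basis: (3 * n_markers, n_blendshapes) displacement matrix
    """
    target = markers - neutral
    weights, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return np.clip(weights, 0.0, 1.0)   # keep weights in a plausible activation range

# Hypothetical toy data: 4 markers (12 coordinates), 3 blendshapes.
rng = np.random.default_rng(1)
neutral = rng.normal(size=12)
basis = rng.normal(size=(12, 3))
true_w = np.array([0.8, 0.1, 0.4])
frame = neutral + basis @ true_w + rng.normal(scale=0.01, size=12)   # noisy capture
print(fit_blendshape_weights(frame, neutral, basis))   # close to [0.8, 0.1, 0.4]
```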
Keywords
Lombard effect, Motion capture, Speech-reading, Lip-reading, Facial animation, Audio-visual intelligibility
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-141052 (URN), 10.1016/j.csl.2013.02.005 (DOI), 000329415400017 (), 2-s2.0-84890567121 (Scopus ID)
Funder
Swedish Research Council, VR 2010-4646
Identifiers
ORCID iD: orcid.org/0000-0003-1399-6604
