Publications (10 of 113)
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). How to train your fillers: uh and um in spontaneous speech synthesis. Paper presented at The 10th ISCA Speech Synthesis Workshop.
How to train your fillers: uh and um in spontaneous speech synthesis
2019 (English). Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-261693 (URN)
Conference
The 10th ISCA Speech Synthesis Workshop
Note
QC 20191011
Available from: 2019-10-10 Created: 2019-10-10 Last updated: 2019-10-11. Bibliographically approved
Jonell, P., Kucherenko, T., Ekstedt, E. & Beskow, J. (2019). Learning Non-verbal Behavior for a Social Robot from YouTube Videos. Paper presented at ICDL-EpiRob Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, Oslo, Norway, August 19, 2019.
Learning Non-verbal Behavior for a Social Robot from YouTube Videos
2019 (English). Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

Non-verbal behavior is crucial for positive perception of humanoid robots. If modeled well, it can improve the interaction and leave the user with a positive experience; if modeled poorly, it may impede the interaction and become a source of distraction. Most existing work on modeling non-verbal behavior shows limited variability, because the models employed are deterministic and the generated motion can be perceived as repetitive and predictable. In this paper, we present a novel method for generating a limited set of facial expressions and head movements, based on a probabilistic generative deep-learning architecture called Glow. We have implemented a workflow that takes videos directly from YouTube, extracts relevant features, and trains a model that generates gestures that can be realized in a robot without any post-processing. A user study illustrated the importance of having some kind of non-verbal behavior; most differences between the ground truth, the proposed method, and a random control were not significant, but the differences that were significant were in favor of the proposed method.
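
Below is a minimal sketch, not the authors' implementation, of the kind of Glow-style building block the abstract refers to: a single affine-coupling flow layer trained by maximum likelihood on pose/expression feature vectors and then inverted to sample new motion. The feature dimensionality and the training data are placeholders.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Glow-style coupling layer: half the features predict an affine transform of the other half."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),          # outputs log-scale and shift, dim//2 each
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep scales bounded for stability
        yb = xb * torch.exp(log_s) + t
        return torch.cat([xa, yb], dim=-1), log_s.sum(dim=-1)

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(ya).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        return torch.cat([ya, (yb - t) * torch.exp(-log_s)], dim=-1)

# Train by maximising log-likelihood under a standard-normal base distribution.
dim = 8                                      # e.g. 3 head-rotation + 5 expression coefficients (assumed)
flow = AffineCoupling(dim)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
pose_features = torch.randn(256, dim)        # stand-in for features extracted from video
for _ in range(100):
    z, log_det = flow(pose_features)
    loss = (0.5 * (z ** 2).sum(dim=-1) - log_det).mean()   # negative log-likelihood up to a constant
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generate: sample the base distribution and run the flow in reverse.
samples = flow.inverse(torch.randn(16, dim))

A real Glow model stacks many such layers with permutations and conditions on context; this single unconditional layer only illustrates the training and sampling mechanics.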

Keywords
Facial expressions, non-verbal behavior, generative models, neural network, head movement, social robotics
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-261242 (URN)
Conference
ICDL-EpiRob Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, Oslo, Norway, August 19, 2019
Funder
Swedish Foundation for Strategic Research , RIT15-0107
Note
QC 20191007
Available from: 2019-10-03 Created: 2019-10-03 Last updated: 2019-10-07. Bibliographically approved
Stefanov, K., Salvi, G., Kontogiorgos, D., Kjellström, H. & Beskow, J. (2019). Modeling of Human Visual Attention in Multiparty Open-World Dialogues. ACM Transactions on Human-Robot Interaction, 8(2), Article ID UNSP 8.
Modeling of Human Visual Attention in Multiparty Open-World Dialogues
2019 (English). In: ACM Transactions on Human-Robot Interaction, ISSN 2573-9522, Vol. 8, no. 2, article id UNSP 8. Article in journal (Refereed), Published
Abstract [en]

This study proposes, develops, and evaluates methods for modeling the eye-gaze direction and head orientation of a person in multiparty open-world dialogues, as a function of low-level communicative signals generated by their interlocutors. These signals include speech activity, eye-gaze direction, and head orientation, all of which can be estimated in real time during the interaction. By utilizing these signals and novel data representations suitable for the task and context, the developed methods can generate plausible candidate gaze targets in real time. The methods are based on feedforward neural networks and long short-term memory networks. The proposed methods are developed using several hours of unrestricted interaction data, and their performance is compared with a heuristic baseline method. The study offers an extensive evaluation of the proposed methods, investigating the contribution of different predictors to the accurate generation of candidate gaze targets. The results show that the methods can accurately generate candidate gaze targets when the person being modeled is in a listening state; however, when the person being modeled is in a speaking state, the proposed methods yield significantly lower performance.
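
As an illustration, and not the paper's architecture, the sketch below shows an LSTM of the kind described above: per-frame interlocutor signals (speech activity, eye-gaze direction, head orientation) go in, and a categorical distribution over candidate gaze targets comes out. The feature layout and target set are assumptions.

import torch
import torch.nn as nn

class GazeTargetLSTM(nn.Module):
    """Predicts a gaze-target class per frame from interlocutor signals."""
    def __init__(self, n_features=14, n_targets=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):                    # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out)                # per-frame logits: (batch, time, n_targets)

# Assumed layout: 2 interlocutors x (1 speech-activity flag + 3 gaze angles + 3 head angles) = 14 features,
# 4 candidate targets (each interlocutor, a shared object, "elsewhere").
model = GazeTargetLSTM()
signals = torch.randn(8, 100, 14)            # stand-in for real-time features
targets = torch.randint(0, 4, (8, 100))      # stand-in gaze-target annotations
loss = nn.CrossEntropyLoss()(model(signals).reshape(-1, 4), targets.reshape(-1))
loss.backward()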

Place, publisher, year, edition, pages
Association for Computing Machinery, 2019
Keywords
Human-human interaction, open-world dialogue, eye-gaze direction, head orientation, multiparty
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-255203 (URN); 10.1145/3323231 (DOI); 000472066800003 ()
Note
QC 20190904
Available from: 2019-09-04 Created: 2019-09-04 Last updated: 2019-10-15. Bibliographically approved
Skantze, G., Gustafson, J. & Beskow, J. (2019). Multimodal Conversational Interaction with Robots. In: Sharon Oviatt, Björn Schuller, Philip R. Cohen, Daniel Sonntag, Gerasimos Potamianos, Antonio Krüger (Eds.), The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions. ACM Press
Multimodal Conversational Interaction with Robots
2019 (English). In: The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions / [ed] Sharon Oviatt, Björn Schuller, Philip R. Cohen, Daniel Sonntag, Gerasimos Potamianos, Antonio Krüger, ACM Press, 2019. Chapter in book (Refereed)
Place, publisher, year, edition, pages
ACM Press, 2019
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-254650 (URN); 9781970001723 (ISBN)
Note
QC 20190821
Available from: 2019-07-02 Created: 2019-07-02 Last updated: 2019-08-21. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). Off the cuff: Exploring extemporaneous speech delivery with TTS. Paper presented at Interspeech.
Off the cuff: Exploring extemporaneous speech delivery with TTS
2019 (English). Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-261691 (URN)
Conference
Interspeech
Note
QC 20191011
Available from: 2019-10-10 Created: 2019-10-10 Last updated: 2019-10-11. Bibliographically approved
Stefanov, K. (2019). Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition. IEEE Transactions on Cognitive and Developmental Systems
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
2019 (English). In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920. Article in journal (Refereed), Published
Abstract [en]

This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken-interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, in keeping with cognitive-developmental principles; instead, it uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting; however, in a speaker-independent setting the proposed method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
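
The sketch below illustrates the self-supervision idea described above under simplifying assumptions: a crude energy-threshold voice-activity signal from the audio channel provides per-frame pseudo-labels for training a purely visual "is this face speaking?" classifier, so no external annotations are required. The feature dimensions and the voice-activity detector are placeholders, not the paper's implementation.

import numpy as np
import torch
import torch.nn as nn

def pseudo_labels_from_audio(frame_energy, threshold=0.1):
    """Crude energy-based VAD: 1.0 = speech, 0.0 = silence (stand-in for a real detector)."""
    return (frame_energy > threshold).astype(np.float32)

# Visual classifier over per-frame face descriptors (assumed 128-dimensional embeddings).
classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

face_feats = torch.randn(500, 128)                   # stand-in visual features, one row per frame
audio_energy = np.abs(np.random.randn(500))          # stand-in per-frame audio energy
labels = torch.from_numpy(pseudo_labels_from_audio(audio_energy))

for _ in range(20):
    logits = classifier(face_feats).squeeze(-1)
    loss = bce(logits, labels)                       # the audio stream supervises the visual model
    opt.zero_grad()
    loss.backward()
    opt.step()

# At test time only the visual stream is needed.
speaking_prob = torch.sigmoid(classifier(face_feats)).squeeze(-1)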

National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-260126 (URN); 10.1109/TCDS.2019.2927941 (DOI); 2-s2.0-85069908129 (Scopus ID)
Note
QC 20191011
Available from: 2019-09-25 Created: 2019-09-25 Last updated: 2019-10-11. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). Spontaneous conversational speech synthesis from found data. Paper presented at Interspeech.
Spontaneous conversational speech synthesis from found data
2019 (English). Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-261689 (URN)
Conference
Interspeech
Note
QC 20191011
Available from: 2019-10-10 Created: 2019-10-10 Last updated: 2019-10-11. Bibliographically approved
Kontogiorgos, D., Avramova, V., Alexanderson, S., Jonell, P., Oertel, C., Beskow, J., . . . Gustafson, J. (2018). A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paper presented at International Conference on Language Resources and Evaluation (LREC 2018) (pp. 119-127). Paris
A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction
2018 (English). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, pp. 119-127. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a corpus of multiparty situated interaction in which participants collaborated on moving virtual objects on a large touch screen, with a moderator facilitating the discussion and directing the interaction. The corpus contains recordings of a variety of multimodal data: speech, eye gaze, and gesture, captured with a multisensory setup (wearable eye trackers, motion capture, and audio/video). In the description of the corpus, we investigate four types of social gaze: referential gaze, joint attention, mutual gaze, and gaze aversion, from the perspectives of both speaker and listener. We annotated the groups' object references during the object-manipulation tasks and analysed each group's proportional referential eye-gaze with regard to the referent object. When investigating the distributions of gaze during and before referring expressions, we could corroborate the differences in timing between speakers' and listeners' eye gaze found in earlier studies. This corpus is of particular interest to researchers studying social eye-gaze patterns in turn-taking and referring language in situated multiparty interaction.
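
For illustration only (this is not part of the released corpus tooling), proportional referential eye-gaze could be computed from per-frame gaze-target labels and annotated referring expressions roughly as follows; the data layout and window size are assumptions.

def proportional_referential_gaze(gaze_targets, referring_expressions, window=30):
    """gaze_targets: per-frame target labels (e.g. 'obj_3', 'partner', 'away').
    referring_expressions: (start_frame, end_frame, referent_label) tuples.
    Returns the mean proportion of frames spent on the referent in a window
    before and during each referring expression."""
    proportions = []
    for start, end, referent in referring_expressions:
        lo, hi = max(0, start - window), min(len(gaze_targets), end)
        span = gaze_targets[lo:hi]
        if span:
            proportions.append(sum(t == referent for t in span) / len(span))
    return sum(proportions) / len(proportions) if proportions else 0.0

# Example with made-up frame labels:
gaze = ['away'] * 10 + ['obj_3'] * 20 + ['partner'] * 10
print(proportional_referential_gaze(gaze, [(15, 30, 'obj_3')], window=10))   # 0.8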

Place, publisher, year, edition, pages
Paris, 2018
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-230238 (URN); 2-s2.0-85059891166 (Scopus ID); 979-10-95546-00-9 (ISBN)
Conference
International Conference on Language Resources and Evaluation (LREC 2018)
Note
QC 20180614
Available from: 2018-06-13 Created: 2018-06-13 Last updated: 2019-02-19. Bibliographically approved
Jonell, P., Oertel, C., Kontogiorgos, D., Beskow, J. & Gustafson, J. (2018). Crowdsourced Multimodal Corpora Collection Tool. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paper presented at The Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 728-734). Paris
Crowdsourced Multimodal Corpora Collection Tool
2018 (English). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, pp. 728-734. Conference paper, Published paper (Refereed)
Abstract [en]

In recent years, more and more multimodal corpora have been created. To our knowledge, there is no publicly available tool that allows for acquiring controlled multimodal data of people in a rapid and scalable fashion. We therefore propose (1) a novel tool that enables researchers to rapidly gather large amounts of multimodal data spanning a wide demographic range, and (2) an example of how we used this tool to collect our "Attentive listener" multimodal corpus. The code is released under an Apache License 2.0 and available as an open-source repository at https://github.com/kth-social-robotics/multimodal-crowdsourcing-tool. This tool allows researchers to quickly set up their own multimodal data collection system and create their own multimodal corpora. Finally, this paper provides a discussion of the advantages and disadvantages of a crowdsourced data collection tool, especially in comparison to lab-recorded corpora.

Place, publisher, year, edition, pages
Paris, 2018
National Category
Engineering and Technology
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-230236 (URN); 979-10-95546-00-9 (ISBN)
Conference
The Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Note
QC 20180618
Available from: 2018-06-13 Created: 2018-06-13 Last updated: 2018-11-13. Bibliographically approved
Vögel, H.-J., Süß, C., Hubregtsen, T., Ghaderi, V., Chadowitz, R., André, E., . . . Müller, S. (2018). Emotion-awareness for intelligent vehicle assistants: A research agenda. In: Proceedings - International Conference on Software Engineering. Paper presented at 1st ACM/IEEE International Workshop on Software Engineering for AI in Autonomous Systems, SEFAIAS 2018, Gothenburg, Sweden, 28 May 2018 (pp. 11-15). IEEE Computer Society
Emotion-awareness for intelligent vehicle assistants: A research agenda
2018 (English). In: Proceedings - International Conference on Software Engineering, IEEE Computer Society, 2018, pp. 11-15. Conference paper, Published paper (Refereed)
Abstract [en]

EVA describes a new class of emotion-aware autonomous systems delivering intelligent personal assistant functionalities. EVA requires a multi-disciplinary approach, combining a number of critical building blocks into a cybernetic systems/software architecture: emotion-aware systems and algorithms, multimodal interaction design, cognitive modelling, decision making and recommender systems, emotion sensing as feedback for learning, and distributed (edge) computing delivering cognitive services.

Place, publisher, year, edition, pages
IEEE Computer Society, 2018
Keywords
Cognitive models and proactive recommendations, Emotion awareness, Emotional state analysis, Intelligent assistants, IoT, Multi-modal interaction design, Neuromorphic emotion sensing, Privacy preserving machine learning
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-233715 (URN); 10.1145/3194085.3194094 (DOI); 000454722500003 (); 2-s2.0-85051127136 (Scopus ID); 9781450357395 (ISBN)
Conference
1st ACM/IEEE International Workshop on Software Engineering for AI in Autonomous Systems, SEFAIAS 2018, Gothenburg, Sweden, 28 May 2018
Note
QC 20180831
Available from: 2018-08-31 Created: 2018-08-31 Last updated: 2019-01-18. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-1399-6604