Publications
Alexanderson, S., Henter, G. E., Kucherenko, T. & Beskow, J. (2020). Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. Paper presented at EUROGRAPHICS 2020.
2020 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just as for humans, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and match the input speech well. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.
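
The abstract above describes a probabilistic model that can be steered with high-level style parameters (gesture level, speed, symmetry, spatial extent) and that yields different plausible gestures for the same speech. The following is a minimal illustrative sketch of what such a sampling interface could look like; the model loader, its sample method and the ordering of the style vector are assumptions for illustration, not the authors' released code.

    import numpy as np

    def sample_gesture_takes(model, speech_feats, style, n_takes=5):
        """Draw several alternative gesture 'takes' for one utterance.
        speech_feats: (n_frames, n_acoustic_features); style: (4,) clip-level vector."""
        # Repeat the clip-level style vector for every speech frame and concatenate,
        # so the network is conditioned on the desired style alongside the acoustics.
        cond = np.concatenate(
            [speech_feats, np.tile(style, (speech_feats.shape[0], 1))], axis=1)
        # Different random seeds give different but plausible pose sequences,
        # because the model defines a distribution over motion rather than one output.
        return [model.sample(cond, seed=k) for k in range(n_takes)]

    # Hypothetical usage: raised gesture level, neutral speed/symmetry, wide extent.
    style = np.array([0.8, 0.5, 0.5, 0.9])   # [level, speed, symmetry, radius] (assumed order)
    # model = load_pretrained_gesture_flow("checkpoint.pt")        # assumed helper
    # takes = sample_gesture_takes(model, mel_features, style)     # five distinct animations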

National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-268363 (URN)
Conference
EUROGRAPHICS 2020
Note

QCR 20200513

Available from: 2020-02-18 Created: 2020-02-18 Last updated: 2020-05-13. Bibliographically approved
Henter, G. E., Alexanderson, S. & Beskow, J. (2019). MoGlow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598
2019 (English) In: arXiv preprint arXiv:1905.06598. Article in journal (Other academic). Published
Abstract [en]

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive assumptions such as the motion being cyclic in nature. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method attains a motion quality close to recorded motion capture for both humans and animals.
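
To make the description above more concrete, here is a heavily simplified, hypothetical PyTorch sketch of the core idea: an autoregressive, causal model whose LSTM summarises past poses and control inputs, combined with an invertible (here, single affine) transform so the exact negative log-likelihood can be minimised directly. The actual MoGlow model uses a multi-step Glow architecture; all dimensions and names below are placeholders.

    import math
    import torch
    import torch.nn as nn

    class TinyConditionalFlow(nn.Module):
        """One affine flow step per frame, conditioned on an LSTM summary of
        past poses and control inputs (e.g. speech features). Placeholder only."""
        def __init__(self, pose_dim, ctrl_dim, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(pose_dim + ctrl_dim, hidden, batch_first=True)  # causal summary
            self.to_params = nn.Linear(hidden, 2 * pose_dim)                    # shift, log-scale

        def nll(self, poses, ctrl):
            # poses: (batch, T, pose_dim); ctrl: (batch, T, ctrl_dim)
            history = torch.cat([poses, ctrl], dim=-1)[:, :-1]    # frames 0..T-2 only
            h, _ = self.lstm(history)                             # no access to future frames
            shift, log_scale = self.to_params(h).chunk(2, dim=-1)
            x = poses[:, 1:]                                      # frames 1..T-1 to be explained
            z = (x - shift) * torch.exp(-log_scale)               # invertible affine map
            # Exact negative log-likelihood: standard-normal base density plus
            # the log-determinant of the Jacobian of the transform.
            nll = 0.5 * z.pow(2) + 0.5 * math.log(2 * math.pi) + log_scale
            return nll.sum(dim=(1, 2)).mean()

    model = TinyConditionalFlow(pose_dim=45, ctrl_dim=27)         # made-up dimensions
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    poses = torch.randn(8, 100, 45)                               # stand-in motion data
    ctrl = torch.randn(8, 100, 27)                                # stand-in control data
    optimiser.zero_grad()
    loss = model.nll(poses, ctrl)
    loss.backward()
    optimiser.step()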

National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-268348 (URN)
Note

QC 20200512

Available from: 2020-02-18 Created: 2020-02-18 Last updated: 2020-05-12. Bibliographically approved
Kontogiorgos, D., Avramova, V., Alexanderson, S., Jonell, P., Oertel, C., Beskow, J., . . . Gustafson, J. (2018). A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paper presented at the International Conference on Language Resources and Evaluation (LREC 2018) (pp. 119-127). Paris.
2018 (English) In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data: we captured speech, eye gaze and gesture data using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion, from the perspectives of both speaker and listener. We annotated the groups' object references during object manipulation tasks and analysed the groups' proportional referential eye-gaze with regard to the referent object. When investigating the distributions of gaze during and before referring expressions, we could corroborate the differences in time between speakers' and listeners' eye gaze found in earlier studies. This corpus is of particular interest to researchers interested in social eye-gaze patterns in turn-taking and referring language in situated multi-party interaction.
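
As a concrete illustration of the "proportional referential eye-gaze" measure mentioned above, the sketch below computes, for each annotated referring expression, the share of gaze samples that fall on the referent object. The frame-level gaze-target labels and the annotation tuple format are assumptions, not the corpus's actual file format.

    def referential_gaze_proportion(gaze_targets, referring_expressions):
        """gaze_targets: one object id (or None) per video frame.
        referring_expressions: (start_frame, end_frame, referent_id) tuples."""
        proportions = []
        for start, end, referent in referring_expressions:
            window = gaze_targets[start:end]
            hits = sum(1 for target in window if target == referent)
            proportions.append(hits / max(len(window), 1))
        return proportions

    # Toy example with two referring expressions over an 8-frame gaze track.
    gaze = ["cup", "cup", None, "box", "cup", "box", "box", "box"]
    refs = [(0, 4, "cup"), (3, 8, "box")]
    print(referential_gaze_proportion(gaze, refs))   # [0.5, 0.8]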

Place, publisher, year, edition, pages
Paris, 2018
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-230238 (URN), 2-s2.0-85059891166 (Scopus ID), 979-10-95546-00-9 (ISBN)
Conference
International Conference on Language Resources and Evaluation (LREC 2018)
Note

QC 20180614

Available from: 2018-06-13 Created: 2018-06-13 Last updated: 2019-02-19. Bibliographically approved
Frid, E., Bresin, R. & Alexanderson, S. (2018). Perception of Mechanical Sounds Inherent to Expressive Gestures of a NAO Robot - Implications for Movement Sonification of Humanoids. In: Anastasia Georgaki and Areti Andreopoulou (Ed.), Proceedings of the 15th Sound and Music Computing Conference. Paper presented at Sound and Music Computing. Limassol, Cyprus.
2018 (English) In: Proceedings of the 15th Sound and Music Computing Conference / [ed] Anastasia Georgaki and Areti Andreopoulou, Limassol, Cyprus, 2018. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we present a pilot study carried out within the project SONAO. The SONAO project aims to compensate for limitations in robot communicative channels with an increased clarity of Non-Verbal Communication (NVC) through expressive gestures and non-verbal sounds. More specifically, the purpose of the project is to use movement sonification of expressive robot gestures to improve Human-Robot Interaction (HRI). The pilot study described in this paper focuses on mechanical robot sounds, i.e. sounds that have not been specifically designed for HRI but are inherent to robot movement. Results indicated a low correspondence between perceptual ratings of mechanical robot sounds and emotions communicated through gestures. In general, the mechanical sounds themselves appeared not to carry much emotional information compared to video stimuli of expressive gestures. However, some mechanical sounds did communicate certain emotions, e.g. frustration. In general, the sounds appeared to communicate arousal more effectively than valence. We discuss potential issues and possibilities for the sonification of expressive robot gestures and the role of mechanical sounds in such a context. Emphasis is put on the need to mask or alter sounds inherent to robot movement, using for example blended sonification.

Place, publisher, year, edition, pages
Limassol, Cyprus, 2018
Keywords
sonification, interactive sonification, auditory feedback, sound perception, expressive gestures, NAO robot, humanoid, sound and music computing
National Category
Media and Communication Technology; Human Computer Interaction; Computer Vision and Robotics (Autonomous Systems); Other Computer and Information Science; Media Engineering
Research subject
Media Technology; Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-239167 (URN), 10.5281/zenodo.1422499 (DOI), 978-9963-697-30-4 (ISBN)
Conference
Sound and Music Computing
Projects
SONAO
Funder
Swedish Research Council, 2017-03979
Note

QC 20181211

Available from: 2018-11-19 Created: 2018-11-19 Last updated: 2019-06-26. Bibliographically approved
Vijayan, A. E., Alexanderson, S., Beskow, J. & Leite, I. (2018). Using Constrained Optimization for Real-Time Synchronization of Verbal and Nonverbal Robot Behavior. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). Paper presented at the IEEE International Conference on Robotics and Automation (ICRA), May 21-25, 2018, Brisbane, Australia (pp. 1955-1961). IEEE Computer Society.
2018 (English) In: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE Computer Society, 2018, p. 1955-1961. Conference paper, Published paper (Refereed)
Abstract [en]

Most motion re-targeting techniques are grounded in virtual character animation research, which means that they typically assume that the target embodiment has unconstrained joint angular velocities. However, because robots often do have such constraints, traditional re-targeting approaches can introduce irregular delays in the robot motion. With the goal of ensuring synchronization between verbal and nonverbal behavior, this paper proposes an optimization framework for processing re-targeted motion sequences that addresses constraints such as joint angle and angular velocity limits. The proposed framework was evaluated on a humanoid robot using both objective and subjective metrics. The analysis of the joint motion trajectories provides evidence that our framework successfully performs the desired modifications to ensure verbal and nonverbal behavior synchronization, and results from a perceptual study showed that participants found the robot motion generated by our method more natural, elegant and lifelike than a control condition.
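
The following sketch illustrates the kind of constrained problem the abstract refers to, for a single joint: stay as close as possible to the re-targeted trajectory while respecting a joint-angle range and an angular-velocity limit. It is an illustrative formulation using SciPy, not the paper's framework, and all limits and numbers are made up.

    import numpy as np
    from scipy.optimize import minimize

    target = np.deg2rad([0., 20., 60., 110., 90., 40., 0.])   # re-targeted joint angles
    dt = 0.1                                                   # seconds per frame
    angle_lo, angle_hi = np.deg2rad(-10.), np.deg2rad(100.)    # joint range (rad)
    vmax = np.deg2rad(300.)                                    # angular-velocity limit (rad/s)

    # Objective: deviate as little as possible from the desired trajectory.
    objective = lambda x: np.sum((x - target) ** 2)

    # Velocity limits as linear inequality constraints on consecutive frames.
    constraints = []
    for t in range(len(target) - 1):
        constraints.append({"type": "ineq", "fun": lambda x, t=t: vmax * dt - (x[t + 1] - x[t])})
        constraints.append({"type": "ineq", "fun": lambda x, t=t: vmax * dt + (x[t + 1] - x[t])})

    result = minimize(objective, x0=np.clip(target, angle_lo, angle_hi),
                      bounds=[(angle_lo, angle_hi)] * len(target),
                      constraints=constraints, method="SLSQP")
    feasible_trajectory = result.x    # respects both angle and velocity limits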

Place, publisher, year, edition, pages
IEEE Computer Society, 2018
Series
IEEE International Conference on Robotics and Automation ICRA, ISSN 1050-4729
National Category
Robotics
Identifiers
urn:nbn:se:kth:diva-237162 (URN), 10.1109/ICRA.2018.8462828 (DOI), 000446394501077 (), 2-s2.0-85063159854 (Scopus ID), 978-1-5386-3081-5 (ISBN)
Conference
IEEE International Conference on Robotics and Automation (ICRA), May 21-25, 2018, Brisbane, Australia
Note

QC 20181024

Available from: 2018-10-24 Created: 2018-10-24 Last updated: 2020-03-05. Bibliographically approved
Karipidou, K., Ahnlund, J., Friberg, A., Alexanderson, S. & Kjellström, H. (2017). Computer Analysis of Sentiment Interpretation in Musical Conducting. In: Proceedings - 12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017. Paper presented at the 12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017, Washington, United States, 30 May 2017 through 3 June 2017 (pp. 400-405). IEEE, Article ID 7961769.
2017 (English) In: Proceedings - 12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017, IEEE, 2017, p. 400-405, article id 7961769. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a unique dataset consisting of 20 recordings of the same musical piece, conducted with 4 different musical intentions in mind. The upper body and baton motion of a professional conductor was recorded, as well as the sound of each instrument in a professional string quartet following the conductor. The dataset is made available for benchmarking of motion recognition algorithms. An HMM-based emotion intent classification method is trained with subsets of the data, and classification of other subsets of the data shows, firstly, that the motion of the baton communicates energetic intention to a high degree; secondly, that the conductor's torso, head and other arm convey calm intention to a high degree; and thirdly, that positive vs. negative sentiments are communicated to a high degree through channels other than the body and baton motion, most probably through facial expression and muscle tension conveyed through articulated hand and finger motion. The long-term goal of this work is to develop a computer model of the entire conductor-orchestra communication process; the studies presented here indicate that computer modeling of the conductor-orchestra communication is feasible.
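
A rough sketch of the per-class HMM classification scheme described above, written with the hmmlearn library rather than the authors' code: one Gaussian HMM is fitted per intended sentiment, and a new motion sequence is assigned to the class whose model gives it the highest log-likelihood. The class labels and feature preparation are placeholders.

    import numpy as np
    from hmmlearn import hmm

    def train_per_class_hmms(sequences_by_class, n_states=5):
        """sequences_by_class: dict mapping a sentiment label to a list of
        (n_frames, feat_dim) arrays of conductor motion features."""
        models = {}
        for label, seqs in sequences_by_class.items():
            X = np.vstack(seqs)                          # stack all recordings
            lengths = [len(s) for s in seqs]             # frames per recording
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                n_iter=50, random_state=0)
            m.fit(X, lengths)
            models[label] = m
        return models

    def classify(models, sequence):
        # Assign the sentiment whose HMM gives the sequence the highest log-likelihood.
        return max(models, key=lambda label: models[label].score(sequence))

    # sequences_by_class = {"tender": [...], "angry": [...]}   # placeholder labels
    # models = train_per_class_hmms(sequences_by_class)
    # predicted = classify(models, held_out_sequence)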

Place, publisher, year, edition, pages
IEEE, 2017
National Category
Computer and Information Sciences
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-208886 (URN), 10.1109/FG.2017.57 (DOI), 000414287400054 (), 2-s2.0-85026288976 (Scopus ID), 9781509040230 (ISBN)
Conference
12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017, Washington, United States, 30 May 2017 through 3 June 2017
Note

QC 20170616

Available from: 2017-06-12 Created: 2017-06-12 Last updated: 2018-09-13. Bibliographically approved
Alexanderson, S., House, D. & Beskow, J. (2016). Automatic annotation of gestural units in spontaneous face-to-face interaction. In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction. Paper presented at the 2016 Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI 2016, 12 November 2016 through 16 November 2016 (pp. 15-19).
2016 (English) In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, 2016, p. 15-19. Conference paper, Published paper (Refereed)
Abstract [en]

Speech and gesture co-occur in spontaneous dialogue in a highly complex fashion. There is a large variability in the motion that people exhibit during a dialogue, and different kinds of motion occur during different states of the interaction. A wide range of multimodal interface applications, for example in the fields of virtual agents or social robots, can be envisioned where it is important to be able to automatically identify gestures that carry information and discriminate them from other types of motion. While it is easy for a human to distinguish and segment manual gestures from a flow of multimodal information, the same task is not trivial to perform for a machine. In this paper we present a method to automatically segment and label gestural units from a stream of 3D motion capture data. The gestural flow is modeled with a 2-level Hierarchical Hidden Markov Model (HHMM) where the sub-states correspond to gesture phases. The model is trained based on labels of complete gesture units and self-adaptive manipulators. The model is tested and validated on two datasets differing in genre and in method of capturing motion, and outperforms a state-of-the-art SVM classifier on a publicly available dataset.
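
The sketch below is a much-simplified stand-in for the segmentation idea above: a flat Gaussian HMM over frame-level motion features, decoded with Viterbi, where contiguous runs of non-rest states approximate gestural units. The paper's actual model is a two-level hierarchical HMM with gesture-phase sub-states, which hmmlearn does not provide; feature extraction is assumed to have been done elsewhere.

    import numpy as np
    from hmmlearn import hmm

    def segment_gesture_states(motion_feats, n_states=4):
        """motion_feats: (n_frames, feat_dim) array, e.g. wrist speed and position.
        Returns one hidden-state label per frame; contiguous runs of non-rest
        states give a rough approximation of gestural units."""
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                n_iter=30, random_state=0)
        model.fit(motion_feats)                  # unsupervised fit on one recording
        return model.predict(motion_feats)       # Viterbi decoding of the frame sequence

    # feats = np.load("mocap_features.npy")      # assumed pre-computed motion features
    # frame_labels = segment_gesture_states(feats)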

Keywords
Gesture recognition, Motion capture, Spontaneous dialogue, Hidden Markov models, Man machine systems, Markov processes, Online systems, 3D motion capture, Automatic annotation, Face-to-face interaction, Hierarchical hidden markov models, Multi-modal information, Multi-modal interfaces, Classification (of information)
National Category
Robotics
Identifiers
urn:nbn:se:kth:diva-202135 (URN), 10.1145/3011263.3011268 (DOI), 2-s2.0-85003571594 (Scopus ID), 9781450345620 (ISBN)
Conference
2016 Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI 2016, 12 November 2016 through 16 November 2016
Funder
Swedish Research Council, 2010-4646
Note

Funding text: The work reported here is carried out within the projects: "Timing of intonation and gestures in spoken communication," (P12-0634:1) funded by the Bank of Sweden Tercentenary Foundation, and "Large-scale massively multimodal modelling of non-verbal behaviour in spontaneous dialogue," (VR 2010-4646) funded by Swedish Research Council.

Available from: 2017-03-13 Created: 2017-03-13 Last updated: 2017-11-24. Bibliographically approved
Zellers, M., House, D. & Alexanderson, S. (2016). Prosody and hand gesture at turn boundaries in Swedish. In: Proceedings of the International Conference on Speech Prosody. Paper presented at the 8th Speech Prosody 2016, 31 May 2016 through 3 June 2016 (pp. 831-835). International Speech Communication Association.
2016 (English) In: Proceedings of the International Conference on Speech Prosody, International Speech Communication Association, 2016, p. 831-835. Conference paper, Published paper (Refereed)
Abstract [en]

In order to ensure smooth turn-taking between conversational participants, interlocutors must have ways of providing information to one another about whether they have finished speaking or intend to continue. The current work investigates Swedish speakers' use of hand gestures in conjunction with turn change or turn hold in unrestricted, spontaneous speech. As has been reported by other researchers, we find that speakers' gestures end before the end of speech in cases of turn change, while they may extend well beyond the end of a given speech chunk in the case of turn hold. We investigate the degree to which prosodic cues and gesture cues to turn transition in Swedish face-to-face conversation are complementary or instead function additively. The co-occurrence of acoustic prosodic features and gesture at potential turn boundaries gives strong support for considering hand gestures as part of the prosodic system, particularly in the context of discourse-level information such as maintaining smooth turn transition.
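
As an illustration of the timing pattern reported above (gestures ending before speech at turn changes, but extending past it at turn holds), the snippet below computes the mean lag between gesture end and speech end separately for the two transition types. The data layout is assumed for the example, not taken from the study.

    from statistics import mean

    def gesture_speech_lags(turns):
        """turns: dicts with 'speech_end' and 'gesture_end' in seconds, and
        'transition' set to either 'change' or 'hold'."""
        lags = {"change": [], "hold": []}
        for turn in turns:
            lags[turn["transition"]].append(turn["gesture_end"] - turn["speech_end"])
        return {kind: mean(values) for kind, values in lags.items() if values}

    turns = [
        {"speech_end": 5.2, "gesture_end": 4.9, "transition": "change"},
        {"speech_end": 9.0, "gesture_end": 8.6, "transition": "change"},
        {"speech_end": 14.3, "gesture_end": 15.1, "transition": "hold"},
    ]
    print(gesture_speech_lags(turns))   # negative lag: the gesture ended before the speech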

Place, publisher, year, edition, pages
International Speech Communication Association, 2016
Keywords
Gesture, Multimodal communication, Swedish, Turn transition, Co-occurrence, Face-to-face conversation, Multimodal communications, Prosodic features, Smooth turn-taking, Spontaneous speech, Swedishs, Speech
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-195492 (URN), 2-s2.0-84982980451 (Scopus ID)
Conference
8th Speech Prosody 2016, 31 May 2016 through 3 June 2016
Note

QC 20161125

Available from: 2016-11-25 Created: 2016-11-03 Last updated: 2018-01-13. Bibliographically approved
House, D., Alexanderson, S. & Beskow, J. (2015). On the temporal domain of co-speech gestures: syllable, phrase or talk spurt? In: Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J. (Ed.), Proceedings of Fonetik 2015. Paper presented at Fonetik 2015, Lund (pp. 63-68).
2015 (English) In: Proceedings of Fonetik 2015 / [ed] Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J., 2015, p. 63-68. Conference paper, Published paper (Other academic)
Abstract [en]

This study explores the use of automatic methods to detect and extract hand gesture movement co-occurring with speech. Two spontaneous dyadic dialogues were analyzed using 3D motion-capture techniques to track hand movement. Automatic speech/non-speech detection was performed on the dialogues, resulting in a series of connected talk spurts for each speaker. Temporal synchrony of onset and offset of gesture and speech was studied between the automatic hand gesture tracking and talk spurts, and compared to an earlier study of head nods and syllable synchronization. The results indicated onset synchronization between head nods and the syllable in the short temporal domain and between the onset of longer gesture units and the talk spurt in a more extended temporal domain.
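
Below is a minimal, energy-based sketch of the kind of automatic speech/non-speech detection that yields "talk spurts" (connected speech regions with short pauses bridged). The frame size, energy threshold and maximum bridged gap are illustrative values, not parameters from the study.

    import numpy as np

    def talk_spurts(samples, sr, frame_ms=20, energy_thresh=0.01, max_gap_s=0.3):
        """samples: 1-D numpy array of audio for one speaker; sr: sample rate.
        Returns (onset_s, offset_s) pairs of connected speech (talk spurts)."""
        frame = int(sr * frame_ms / 1000)
        n_frames = len(samples) // frame
        energy = np.array([np.mean(samples[i * frame:(i + 1) * frame] ** 2)
                           for i in range(n_frames)])
        voiced = np.flatnonzero(energy > energy_thresh)
        gap = int(max_gap_s * 1000 / frame_ms)       # pauses shorter than this are bridged
        spurts = []
        if voiced.size:
            start = prev = voiced[0]
            for i in voiced[1:]:
                if i - prev > gap:                   # pause too long: close the current spurt
                    spurts.append((start * frame / sr, (prev + 1) * frame / sr))
                    start = i
                prev = i
            spurts.append((start * frame / sr, (prev + 1) * frame / sr))
        return spurts

    # import soundfile as sf                         # assumed audio loading
    # audio, sr = sf.read("speaker1.wav")            # mono track for one speaker
    # print(talk_spurts(audio, sr))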

National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180407 (URN)
Conference
Fonetik 2015, Lund
Note

QC 20160216

Available from: 2016-01-13 Created: 2016-01-13 Last updated: 2018-01-10. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-7801-7617