Publications (10 of 35)
Deichler, A., Mehta, S., Alexanderson, S. & Beskow, J. (2023). Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation. In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023. Paper presented at the 25th International Conference on Multimodal Interaction (ICMI), October 9-13, 2023, Sorbonne University, Paris, France (pp. 755-762). Association for Computing Machinery (ACM)
Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
2023 (English). In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023, Association for Computing Machinery (ACM), 2023, pp. 755-762. Conference paper, Published paper (Refereed)
Abstract [en]

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of capturing the semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically aware co-speech gesture generation. Our entry achieved the highest human-likeness and highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach for achieving human-like co-speech gestures that carry semantic meaning in embodied agents.
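
As a rough sketch of the general pattern the abstract describes, the code below conditions a denoising diffusion model for pose sequences on a per-frame joint speech/text embedding (such as the output of a CSMP-like encoder). The module names, feature dimensions and GRU backbone are placeholder assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a pose sequence, given the
    diffusion step and a per-frame conditioning embedding (e.g. from a
    CSMP-like joint speech/text encoder)."""
    def __init__(self, pose_dim=165, cond_dim=512, hidden=512):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + cond_dim + 1, hidden)
        self.backbone = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.out_proj = nn.Linear(hidden, pose_dim)

    def forward(self, noisy_poses, cond, t):
        # noisy_poses: (B, T, pose_dim); cond: (B, T, cond_dim); t: (B,)
        t_feat = t.float()[:, None, None].expand(-1, noisy_poses.size(1), 1)
        h, _ = self.backbone(self.in_proj(torch.cat([noisy_poses, cond, t_feat], dim=-1)))
        return self.out_proj(h)  # predicted noise, same shape as the poses

def diffusion_training_step(model, poses, cond, alphas_cumprod):
    """One DDPM-style step: noise the poses at a random step t and regress
    the noise, conditioned on the speech/text embedding."""
    t = torch.randint(0, len(alphas_cumprod), (poses.size(0),), device=poses.device)
    a = alphas_cumprod[t][:, None, None]
    noise = torch.randn_like(poses)
    noisy = a.sqrt() * poses + (1 - a).sqrt() * noise
    return F.mse_loss(model(noisy, cond, t), noise)
```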

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
gesture generation, motion synthesis, diffusion models, contrastive pre-training, semantic gestures
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-343773 (URN), 10.1145/3577190.3616117 (DOI), 001147764700093 (), 2-s2.0-85170496681 (Scopus ID)
Conference
25th International Conference on Multimodal Interaction (ICMI), October 9-13, 2023, Sorbonne University, Paris, France
Note

Part of proceedings ISBN 979-8-4007-0055-2

QC 20240222

Available from: 2024-02-22 Created: 2024-02-22 Last updated: 2024-03-05. Bibliographically approved
Deichler, A., Wang, S., Alexanderson, S. & Beskow, J. (2023). Learning to generate pointing gestures in situated embodied conversational agents. Frontiers in Robotics and AI, 10, Article ID 1110534.
Learning to generate pointing gestures in situated embodied conversational agents
2023 (English). In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1110534. Article in journal (Refereed), Published
Abstract [en]

One of the main goals of robotics and intelligent-agent research is to enable agents to communicate with humans in physically situated settings. Human communication consists of both verbal and non-verbal modes. Recent studies on enabling communication for intelligent agents have focused on the verbal mode, i.e., language and speech. However, in a situated setting, the non-verbal mode is crucial for an agent to adapt flexible communication strategies. In this work, we focus on learning to generate non-verbal communicative expressions in situated embodied interactive agents. Specifically, we show that an agent can learn pointing gestures in a physically simulated environment through a combination of imitation and reinforcement learning that achieves high motion naturalness and high referential accuracy. We compared our proposed system against several baselines in both subjective and objective evaluations. The subjective evaluation was done in a virtual reality setting where an embodied referential game is played between the user and the agent in a shared 3D space, a setup that fully assesses the communicative capabilities of the generated gestures. The evaluations show that our model achieves a higher level of referential accuracy and motion naturalness compared to a state-of-the-art supervised-learning motion synthesis model, showing the promise of our proposed system, which combines imitation and reinforcement learning for generating communicative gestures. Additionally, our system is robust in a physically simulated environment and thus has the potential to be applied to robots.
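
The referential accuracy mentioned above can be thought of as how closely the generated arm posture indicates the intended target. One simple proxy, given here purely as an assumed illustration and not necessarily the paper's exact metric, is the angle between the pointing ray and the direction to the target:

```python
import numpy as np

def pointing_angular_error(wrist, fingertip, target):
    """Angle in degrees between the pointing ray (wrist -> fingertip) and the
    direction from the wrist to the referent; smaller is more accurate.
    Illustrative proxy only, not necessarily the metric used in the paper."""
    ray = np.asarray(fingertip) - np.asarray(wrist)
    to_target = np.asarray(target) - np.asarray(wrist)
    cos = np.dot(ray, to_target) / (np.linalg.norm(ray) * np.linalg.norm(to_target))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```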

Place, publisher, year, edition, pages
Frontiers Media SA, 2023
Keywords
reinforcement learning, imitation learning, non-verbal communication, embodied interactive agents, gesture generation, physics-aware machine learning
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-326625 (URN), 10.3389/frobt.2023.1110534 (DOI), 000970385800001 (), 37064574 (PubMedID), 2-s2.0-85153351800 (Scopus ID)
Note

QC 20230508

Available from: 2023-05-08 Created: 2023-05-08 Last updated: 2023-05-08. Bibliographically approved
Alexanderson, S., Nagy, R., Beskow, J. & Henter, G. E. (2023). Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Transactions on Graphics, 42(4), Article ID 44.
Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
2023 (English). In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 42, no. 4, article id 44. Article in journal (Refereed), Published
Abstract [en]

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.
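
The classifier-free guidance mentioned in the abstract blends conditional and unconditional noise predictions at sampling time. A minimal sketch follows; the denoiser interface below is an assumption for illustration, not the paper's API:

```python
def guided_noise_prediction(denoiser, noisy_poses, audio_cond, style_cond, t, guidance_scale):
    """Classifier-free guidance: extrapolate between the style-conditional and
    style-unconditional noise estimates. A scale > 1 strengthens the stylistic
    expression; a scale < 1 tones it down. Interface is illustrative only."""
    eps_cond = denoiser(noisy_poses, audio_cond, style_cond, t)
    eps_uncond = denoiser(noisy_poses, audio_cond, None, t)  # style input dropped
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```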

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
conformers, dance, diffusion models, ensemble models, generative models, gestures, guided interpolation, locomotion, machine learning, product of experts
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-335345 (URN), 10.1145/3592458 (DOI), 001044671300010 (), 2-s2.0-85166332883 (Scopus ID)
Note

QC 20230907

Available from: 2023-09-07 Created: 2023-09-07 Last updated: 2023-09-22. Bibliographically approved
Deichler, A., Wang, S., Alexanderson, S. & Beskow, J. (2022). Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation. Paper presented at Context-Awareness in Human-Robot Interaction: Approaches and Challenges, a workshop at the 2022 ACM/IEEE International Conference on Human-Robot Interaction.
Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation
2022 (English)Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

Pointing is an important mode of interaction with robots. While a large body of prior work focuses on the recognition of human pointing, there is a lack of investigation into generating context-aware, human-like pointing gestures, a shortcoming we hope to address. We first collect a rich dataset of human pointing gestures and corresponding pointing target locations with accurate motion capture. Analysis of the dataset shows that it contains various pointing styles, handedness, and well-distributed target positions in the surrounding 3D space, in both a single-target pointing scenario and two-target point-and-place. We then train reinforcement learning (RL) control policies in physically realistic simulation to imitate the pointing motion in the dataset while maximizing a pointing precision reward. We show that our RL motion imitation setup allows models to learn human-like pointing dynamics while maximizing task reward (pointing precision). This is promising for incorporating additional context in the form of task reward to enable flexible, context-aware pointing behaviors in a physically realistic environment while retaining human-likeness in pointing motion dynamics.
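
One common way to realise "imitate the dataset while maximizing a pointing precision reward" is to shape the RL reward as a weighted mix of a pose-imitation term and a task term. The weights and kernel widths below are placeholder assumptions for illustration, not the paper's tuned values:

```python
import numpy as np

def combined_reward(pose, ref_pose, pointing_error_deg,
                    w_imitate=0.7, w_task=0.3, sigma_pose=0.5, sigma_deg=10.0):
    """Illustrative reward: Gaussian-kernel imitation of the mocap reference
    pose plus a pointing-precision bonus that peaks when the angular error
    to the target is zero. Weights and kernels are placeholder assumptions."""
    imitation = np.exp(-np.sum((np.asarray(pose) - np.asarray(ref_pose)) ** 2)
                       / (2 * sigma_pose ** 2))
    precision = np.exp(-(pointing_error_deg ** 2) / (2 * sigma_deg ** 2))
    return w_imitate * imitation + w_task * precision
```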

Keywords
motion generation, reinforcement learning, referring actions, pointing gestures, human-robot interaction, motion capture
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-313480 (URN)
Conference
Context-Awareness in Human-Robot Interaction: Approaches and Challenges, workshop at 2022 ACM/IEEE International Conference on Human-Robot Interaction
Note

QC 20220607

Available from: 2022-06-03 Created: 2022-06-03 Last updated: 2022-06-25. Bibliographically approved
Wang, S., Alexanderson, S., Gustafsson, J., Beskow, J., Henter, G. E. & Székely, É. (2021). Integrated Speech and Gesture Synthesis. In: ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction: . Paper presented at ICMI '21: International Conference on Multimodal Interaction, Montréal, QC, Canada, October 18-22, 2021 (pp. 177-185). Association for Computing Machinery (ACM)
Integrated Speech and Gesture Synthesis
2021 (English). In: ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2021, pp. 177-185. Conference paper, Published paper (Refereed)
Abstract [en]

Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified from state-of-the-art neural speech-synthesis engines to achieve this goal. We evaluate the models in three carefully designed user studies, two of which evaluate the synthesized speech and gesture in isolation, plus a combined study that evaluates the models as they will be used in real-world applications: speech and gesture presented together. The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system we compare against, in all three tests. The model achieves this with a faster synthesis time and a greatly reduced parameter count compared to the pipeline system, illustrating some of the potential benefits of treating speech and gesture synthesis as a single, unified problem.
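
As a schematic of the ISG idea, one model with a shared text representation and two synchronized output streams, the sketch below uses a shared encoder with separate speech and gesture decoders. It omits attention and duration modelling, and all names and dimensions are assumptions rather than the authors' architecture:

```python
import torch
import torch.nn as nn

class IntegratedSpeechGestureModel(nn.Module):
    """Schematic ISG-style model: a shared text encoder feeds one decoder for
    acoustic features (e.g. mel-spectrogram frames) and one for pose features,
    so speech and gesture are generated jointly rather than by a pipeline."""
    def __init__(self, vocab_size=100, hidden=256, n_mels=80, pose_dim=45):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.speech_decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.gesture_decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, token_ids):
        # token_ids: (B, T) integer phoneme/character ids
        enc, _ = self.encoder(self.embed(token_ids))
        speech_h, _ = self.speech_decoder(enc)
        gesture_h, _ = self.gesture_decoder(enc)
        return self.to_mel(speech_h), self.to_pose(gesture_h)
```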

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021
Keywords
gesture generation, neural networks, speech synthesis, text-to-speech
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-313183 (URN), 10.1145/3462244.3479914 (DOI), 2-s2.0-85118992736 (Scopus ID)
Conference
ICMI '21: International Conference on Multimodal Interaction, Montréal, QC, Canada, October 18-22, 2021
Note

Part of proceedings ISBN 9781450384810

QC 20220602

Available from: 2022-06-02 Created: 2022-06-02 Last updated: 2022-06-25. Bibliographically approved
Valle-Perez, G., Henter, G. E., Beskow, J., Holzapfel, A., Oudeyer, P.-Y. & Alexanderson, S. (2021). Transflower: probabilistic autoregressive dance generation with multimodal attention. ACM Transactions on Graphics, 40(6), Article ID 195.
Transflower: probabilistic autoregressive dance generation with multimodal attention
2021 (English). In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 40, no. 6, article id 195. Article in journal (Refereed), Published
Abstract [en]

Dance requires skillful composition of complex movements that follow the rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as the music context, using a multimodal transformer encoder. Second, we introduce the largest 3D dance-motion dataset to date, obtained with a variety of motion-capture technologies and including both professional and casual dancers. Using this dataset, we compare our new model against two baselines via objective metrics and a user study, and show that both the ability to model a probability distribution and the ability to attend over a large motion and music context are necessary to produce interesting, diverse, and realistic dance that matches the music.
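
The core mechanism the abstract describes, a normalizing flow over the next pose conditioned on recent motion and music through a multimodal encoder, can be sketched roughly as below. The single coupling layer and the context-encoder interface are simplifying assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One affine-coupling flow step: half of the pose vector is rescaled and
    shifted using a network that sees the other half plus the motion/music
    context. Invertible, so exact likelihoods are available during training."""
    def __init__(self, pose_dim, ctx_dim, hidden=256):
        super().__init__()
        self.half = pose_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)))

    def forward(self, z, ctx):  # latent -> pose (sampling direction)
        z1, z2 = z[..., :self.half], z[..., self.half:]
        shift, log_scale = self.net(torch.cat([z1, ctx], dim=-1)).chunk(2, dim=-1)
        return torch.cat([z1, shift + torch.exp(log_scale) * z2], dim=-1)

def sample_dance(flow, ctx_encoder, seed_poses, music_feats, n_frames, history=16):
    """Autoregressive sampling: encode recent poses and music into a context
    vector (ctx_encoder is a hypothetical transformer-style module), then map
    Gaussian noise through the conditional flow to obtain the next pose."""
    poses = list(seed_poses)
    for t in range(n_frames):
        ctx = ctx_encoder(torch.stack(poses[-history:]), music_feats[t])
        z = torch.randn_like(poses[-1])
        poses.append(flow(z, ctx))
    return torch.stack(poses[len(seed_poses):])
```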

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021
Keywords
Generative models, machine learning, normalising flows, Glow, transformers, dance
National Category
Computer Vision and Robotics (Autonomous Systems); Computer Sciences; Signal Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-307028 (URN), 10.1145/3478513.3480570 (DOI), 000729846700001 (), 2-s2.0-85125127739 (Scopus ID)
Funder
Swedish Research Council, 2018-05409; Swedish Research Council, 2019-03694; Knut and Alice Wallenberg Foundation, WASP; Marianne and Marcus Wallenberg Foundation, 2020.0102
Note

QC 20220520

Available from: 2022-01-11 Created: 2022-01-11 Last updated: 2023-06-08. Bibliographically approved
Kammerlander, R. K., Abelho Pereira, A. T. & Alexanderson, S. (2021). Using Virtual Reality to Support Acting in Motion Capture with Differently Scaled Characters. In: 2021 IEEE Virtual Reality and 3D User Interfaces (VR). Paper presented at 2021 IEEE Virtual Reality and 3D User Interfaces (VR) (pp. 402-410). Institute of Electrical and Electronics Engineers (IEEE)
Using Virtual Reality to Support Acting in Motion Capture with Differently Scaled Characters
2021 (English). In: 2021 IEEE Virtual Reality and 3D User Interfaces (VR), Institute of Electrical and Electronics Engineers (IEEE), 2021, pp. 402-410. Conference paper, Published paper (Refereed)
Abstract [en]

Motion capture is a well-established technology for capturing actors' movements and performances within the entertainment industry. Many actors, however, experience the poor acting conditions associated with such recordings. Instead of detailed sets, costumes and props, they are forced to play in empty spaces wearing tight suits. Often, their co-actors are imaginary, replaced by placeholder props, or out of scale with their virtual counterparts. These problems do not only affect acting; they also cause an abundance of laborious post-processing clean-up work. To address these challenges, we propose using a combination of virtual reality and motion capture technology to bring differently proportioned virtual characters into a shared collaborative virtual environment. A within-subjects user study with trained actors showed that our proposed platform enhances their feelings of body ownership and immersion. This in turn changed the actors' performances, which narrowed the gap between virtual performances and the final intended animations.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
Collaborative virtual production, acting, motion capture, body ownership, presence
National Category
Human Computer Interaction; Other Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-300223 (URN), 10.1109/VR50410.2021.00063 (DOI), 000675593600044 (), 2-s2.0-85106491123 (Scopus ID)
Conference
2021 IEEE Virtual Reality and 3D User Interfaces (VR)
Note

QC 20210830

Available from: 2021-08-30 Created: 2021-08-30 Last updated: 2022-06-25. Bibliographically approved
Alexanderson, S., Székely, É., Henter, G. E., Kucherenko, T. & Beskow, J. (2020). Generating coherent spontaneous speech and gesture from text. In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020: . Paper presented at 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020, 20 October 2020 through 22 October 2020. Association for Computing Machinery (ACM)
Generating coherent spontaneous speech and gesture from text
2020 (English). In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020, Association for Computing Machinery (ACM), 2020. Conference paper, Published paper (Refereed)
Abstract [en]

Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: on the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material; on the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system, trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches to joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video.
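
In pipeline terms, the proof-of-concept amounts to driving a speech-driven gesture generator with the output of a spontaneous-speech TTS trained on the same speaker. The function and method names in the sketch below are hypothetical placeholders, not the authors' API:

```python
def speak_and_gesture(text, tts_model, gesture_model, fps=60):
    """Hypothetical wiring of the two components: synthesize spontaneous-
    sounding speech from text, then generate full-body gestures from the
    synthesized audio, so voice and body come from the same persona."""
    audio = tts_model.synthesize(text)               # waveform (placeholder API)
    poses = gesture_model.generate(audio, fps=fps)   # (n_frames, pose_dim)
    return audio, poses
```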

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
Keywords
Gesture synthesis, neural networks, text-to-speech, Audio acoustics, Audio systems, Intelligent virtual agents, Motion capture, Speech synthesis, Human communications, Motion capture data, Motion generation, Non-verbal information, Proof of concept, Source material, Spontaneous speech, Text-to-speech system, Speech communication
National Category
General Language Studies and Linguistics; Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-291568 (URN), 10.1145/3383652.3423874 (DOI), 000728153600016 (), 2-s2.0-85096981583 (Scopus ID)
Conference
20th ACM International Conference on Intelligent Virtual Agents, IVA 2020, 20 October 2020 through 22 October 2020
Note

QC 20210330

Part of proceedings ISBN 9781450375863

Available from: 2021-03-30 Created: 2021-03-30 Last updated: 2022-06-25. Bibliographically approved
Kucherenko, T., Jonell, P., van Waveren, S., Henter, G. E., Alexanderson, S., Leite, I. & Kjellström, H. (2020). Gesticulator: A framework for semantically-aware speech-driven gesture generation. In: ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction. Paper presented at ICMI '20: International Conference on Multimodal Interaction, Virtual Event, Netherlands, October 25-29, 2020. Association for Computing Machinery (ACM)
Gesticulator: A framework for semantically-aware speech-driven gesture generation
2020 (English). In: ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2020. Conference paper, Published paper (Refereed)
Abstract [en]

During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying “high”): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page svito-zar.github.io/gesticula
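
A minimal sketch of the input/output contract the abstract describes: per-frame acoustic features plus semantic (text) features in, joint-angle rotations out. The fusion network, feature dimensions and GRU decoder are assumptions, not the published Gesticulator architecture:

```python
import torch
import torch.nn as nn

class BeatAndSemanticGestureModel(nn.Module):
    """Schematic model: acoustic features and word-level semantic embeddings
    (upsampled to the audio frame rate) are fused per frame and decoded into
    a sequence of joint-angle rotations."""
    def __init__(self, audio_dim=26, text_dim=300, pose_dim=45, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(audio_dim + text_dim, hidden), nn.ReLU())
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T, audio_dim); text_feats: (B, T, text_dim)
        h, _ = self.decoder(self.fuse(torch.cat([audio_feats, text_feats], dim=-1)))
        return self.to_pose(h)  # (B, T, pose_dim) joint-angle rotations
```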

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
Keywords
Gesture generation; virtual agents; socially intelligent systems; co-speech gestures; multi-modal interaction; deep learning
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-286282 (URN), 10.1145/3382507.3418815 (DOI), 2-s2.0-85096710861 (Scopus ID)
Conference
ICMI '20: International Conference on Multimodal Interaction, Virtual Event, Netherlands, October 25-29, 2020
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

ICMI 2020 Best Paper Award

Part of Proceedings: ISBN 978-1-4503-7581-8

QC 20211109

Available from: 2020-11-24 Created: 2020-11-24 Last updated: 2022-06-25. Bibliographically approved
Henter, G. E., Alexanderson, S. & Beskow, J. (2020). MoGlow: Probabilistic and controllable motion synthesis using normalising flows. Paper presented at The 13th ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia, Online event, December 4–13, 2020. ACM Transactions on Graphics, 39(6), 1-14, Article ID 236.
MoGlow: Probabilistic and controllable motion synthesis using normalising flows
2020 (English). In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 39, no. 6, pp. 1-14, article id 236. Article in journal (Refereed), Published
Abstract [en]

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.
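
The exact maximum-likelihood training that distinguishes normalising flows from GANs and VAEs rests on the change-of-variables formula. In illustrative notation (not the paper's), with f the invertible flow mapping a pose x_t to a latent z_t given motion history h_t and control input c_t:

```latex
% Log-likelihood maximised during training (illustrative notation)
\log p(x_t \mid h_t, c_t)
  = \log p_Z\!\bigl(f(x_t; h_t, c_t)\bigr)
  + \log\left|\det \frac{\partial f(x_t; h_t, c_t)}{\partial x_t}\right|
```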

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2020
Keywords
Generative models, machine learning, normalising flows, Glow, footstep analysis, data dropout
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-282300 (URN), 10.1145/3414685.3417836 (DOI), 000595589100076 (), 2-s2.0-85096681707 (Scopus ID)
Conference
The 13th ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia, Online event, December 4–13, 2020
Projects
VR proj. 2018-05409 (StyleBot); SSF no. RIT15-0107 (EACare); Wallenberg AI, Autonomous Systems and Software Program (WASP)
Funder
Swedish Research Council, 2018-05409; Swedish Foundation for Strategic Research, RIT15-0107; Knut and Alice Wallenberg Foundation, WASP
Note

QC 20200929

Available from: 2020-09-29 Created: 2020-09-29 Last updated: 2024-02-19. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-7801-7617
