kth.se Publications
  • 1.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A robotic head using projected animated faces. 2011. In: Proceedings of the International Conference on Audio-Visual Speech Processing 2011 / [ed] Salvi, G.; Beskow, J.; Engwall, O.; Al Moubayed, S., Stockholm: KTH Royal Institute of Technology, 2011, p. 71-. Conference paper (Refereed)
    Abstract [en]

    This paper presents a setup which employs virtual animated agents for robotic heads. The system uses a laser projector to project animated faces onto a three-dimensional face mask. This approach of projecting animated faces onto a three-dimensional head surface, as an alternative to using flat, two-dimensional surfaces, eliminates several deteriorating effects and illusions that come with flat surfaces for interaction purposes, such as exclusive mutual gaze and situated and multi-partner dialogues. In addition, it provides robotic heads with a flexible solution for facial animation which takes advantage of the advancements of facial animation using computer graphics over mechanically controlled heads.

  • 2.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Performance, Processing and Perception of Communicative Motion for Avatars and Agents. 2017. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Artificial agents and avatars are designed with a large variety of face and body configurations. Some of these (such as virtual characters in films) may be highly realistic and human-like, while others (such as social robots) have considerably more limited expressive means. In both cases, human motion serves as the model and inspiration for the non-verbal behavior displayed. This thesis focuses on increasing the expressive capacities of artificial agents and avatars using two main strategies: 1) improving the automatic capturing of the most communicative areas for human communication, namely the face and the fingers, and 2) increasing communication clarity by proposing novel ways of eliciting clear and readable non-verbal behavior.

    The first part of the thesis covers automatic methods for capturing and processing motion data. In paper A, we propose a novel dual sensor method for capturing hands and fingers using optical motion capture in combination with low-cost instrumented gloves. The approach circumvents the main problems with marker-based systems and glove-based systems, and it is demonstrated and evaluated on a key-word signing avatar. In paper B, we propose a robust method for automatic labeling of sparse, non-rigid motion capture marker sets, and we evaluate it on a variety of marker configurations for finger and facial capture. In paper C, we propose an automatic method for annotating hand gestures using Hierarchical Hidden Markov Models (HHMMs).

    The second part of the thesis covers studies on creating and evaluating multimodal databases with clear and exaggerated motion. The main idea is that this type of motion is appropriate for agents under certain communicative situations (such as noisy environments) or for agents with reduced expressive degrees of freedom (such as humanoid robots). In paper D, we record motion capture data for a virtual talking head with variable articulation style (normal-to-over articulated). In paper E, we use techniques from mime acting to generate clear non-verbal expressions custom tailored for three agent embodiments (face-and-body, face-only and body-only).

  • 3.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions. 2014. In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 2, p. 607-618. Article in journal (Refereed)
    Abstract [en]

    In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition, we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.

  • 4.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Can Anybody Read Me? Motion Capture Recordings for an Adaptable Visual Speech Synthesizer. 2012. In: Proceedings of The Listening Talker, Edinburgh, UK, 2012, p. 52-52. Conference paper (Refereed)
  • 5.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar. 2015. In: ACM Transactions on Accessible Computing, ISSN 1936-7228, Vol. 7, no 2, p. 7:1-7:17. Article in journal (Refereed)
    Abstract [en]

    Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.

  • 6.
    Alexanderson, Simon
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Robust model training and generalisation with Studentising flows. 2020. In: Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models / [ed] Chin-Wei Huang, David Krueger, Rianne van den Berg, George Papamakarios, Chris Cremer, Ricky Chen, Danilo Rezende, 2020, Vol. 2, p. 25:1-25:9, article id 25. Conference paper (Refereed)
    Abstract [en]

    Normalising flows are tractable probabilistic models that leverage the power of deep learning to describe a wide parametric family of distributions, all while remaining trainable using maximum likelihood. We discuss how these methods can be further improved based on insights from robust (in particular, resistant) statistics. Specifically, we propose to endow flow-based models with fat-tailed latent distributions such as multivariate Student's t, as a simple drop-in replacement for the Gaussian distribution used by conventional normalising flows. While robustness brings many advantages, this paper explores two of them: 1) We describe how using fatter-tailed base distributions can give benefits similar to gradient clipping, but without compromising the asymptotic consistency of the method. 2) We also discuss how robust ideas lead to models with reduced generalisation gap and improved held-out data likelihood. Experiments on several different datasets confirm the efficacy of the proposed approach in both regards.
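
    The comparison to gradient clipping in the abstract above can be illustrated with a minimal sketch. It assumes a univariate latent variable and standard textbook densities (the paper itself considers multivariate Student's t), and the function names are hypothetical, not taken from the authors' code. The gradient of the Gaussian negative log-likelihood grows without bound in the latent value, whereas the Student's t gradient is bounded.

```python
import math

def gaussian_nll_grad(z):
    """Derivative w.r.t. z of the standard-normal negative log-likelihood,
    z**2 / 2 + const. It grows linearly with |z|."""
    return z

def student_t_nll_grad(z, nu):
    """Derivative w.r.t. z of the Student's t negative log-likelihood,
    (nu + 1) / 2 * log(1 + z**2 / nu) + const, with nu degrees of freedom."""
    return (nu + 1.0) * z / (nu + z * z)

def student_t_grad_bound(nu):
    """Largest magnitude of student_t_nll_grad, attained at |z| = sqrt(nu)."""
    return (nu + 1.0) / (2.0 * math.sqrt(nu))
```

    For an outlying latent value such as z = 100, the Gaussian gradient is 100, while the Student's t gradient with nu = 4 stays below the bound returned by `student_t_grad_bound(4.0)`, which is the sense in which a fat-tailed base distribution acts like a soft form of clipping.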

  • 7.
    Alexanderson, Simon
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Kucherenko, Taras
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. 2020. In: Computer graphics forum (Print), ISSN 0167-7055, E-ISSN 1467-8659, Vol. 39, no 2, p. 487-496. Article in journal (Refereed)
    Abstract [en]

    Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. As with human gesturing, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and match the input speech well. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.

  • 8.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Aspects of co-occurring syllables and head nods in spontaneous dialogue. 2013. In: Proceedings of 12th International Conference on Auditory-Visual Speech Processing (AVSP2013), The International Society for Computers and Their Applications (ISCA), 2013, p. 169-172. Conference paper (Refereed)
    Abstract [en]

    This paper reports on the extraction and analysis of head nods taken from motion capture data of spontaneous dialogue in Swedish. The head nods were extracted automatically and then manually classified in terms of gestures having a beat function or multifunctional gestures. Prosodic features were extracted from syllables co-occurring with the beat gestures. While the peak rotation of the nod is on average aligned with the stressed syllable, the results show considerable variation in fine temporal synchronization. The syllables co-occurring with the gestures generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. A functional analysis shows that the majority of the syllables belong to words bearing a focal accent.

  • 9.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Automatic annotation of gestural units in spontaneous face-to-face interaction. 2016. In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, 2016, p. 15-19. Conference paper (Refereed)
    Abstract [en]

    Speech and gesture co-occur in spontaneous dialogue in a highly complex fashion. There is a large variability in the motion that people exhibit during a dialogue, and different kinds of motion occur during different states of the interaction. A wide range of multimodal interface applications, for example in the fields of virtual agents or social robots, can be envisioned where it is important to be able to automatically identify gestures that carry information and discriminate them from other types of motion. While it is easy for a human to distinguish and segment manual gestures from a flow of multimodal information, the same task is not trivial to perform for a machine. In this paper we present a method to automatically segment and label gestural units from a stream of 3D motion capture data. The gestural flow is modeled with a 2-level Hierarchical Hidden Markov Model (HHMM) where the sub-states correspond to gesture phases. The model is trained based on labels of complete gesture units and self-adaptive manipulators. The model is tested and validated on two datasets differing in genre and in method of capturing motion, and outperforms a state-of-the-art SVM classifier on a publicly available dataset.

  • 10.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Extracting and analysing co-speech head gestures from motion-capture data. 2013. In: Proceedings of Fonetik 2013 / [ed] Eklund, Robert, Linköping University Electronic Press, 2013, p. 1-4. Conference paper (Refereed)
  • 11.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Extracting and analyzing head movements accompanying spontaneous dialogue. 2013. In: Conference Proceedings TiGeR 2013: Tilburg Gesture Research Meeting, 2013. Conference paper (Refereed)
    Abstract [en]

    This paper reports on a method developed for extracting and analyzing head gestures taken from motion capture data of spontaneous dialogue in Swedish. Candidate head gestures with beat function were extracted automatically and then manually classified using a 3D player which displays time-synced audio and 3D point data of the motion capture markers together with animated characters. Prosodic features were extracted from syllables co-occurring with a subset of the classified gestures. The beat gestures show considerable variation in temporal synchronization with the syllables, while the syllables generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. Additional features for further analysis and automatic classification of the head gestures are discussed.

  • 12.
    Alexanderson, Simon
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Motorica AB, Sweden.
    Nagy, Rajmund
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Motorica AB, Sweden.
    Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. 2023. In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 42, no 4, article id 44. Article in journal (Refereed)
    Abstract [en]

    Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.
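
    The style-strength control mentioned in the abstract above can be sketched with the commonly used classifier-free guidance rule, which extrapolates from an unconditional denoiser prediction towards a conditional one. This is a minimal illustration under that assumption, not the paper's exact parameterisation; the function name, argument names and plain-list representation are hypothetical.

```python
def guided_prediction(eps_uncond, eps_cond, guidance_scale):
    """Combine two denoiser outputs element-wise.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional (styled) prediction,
    and values above 1 exaggerate the difference between the two.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

    In such a scheme the scale is a single scalar chosen at synthesis time, so the same trained model can render a style more or less strongly without retraining.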

  • 13.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    O'Sullivan, C.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Robust online motion capture labeling of finger markers. 2016. In: Proceedings - Motion in Games 2016: 9th International Conference on Motion in Games, MIG 2016, ACM Digital Library, 2016, p. 7-13. Conference paper (Refereed)
    Abstract [en]

    Passive optical motion capture is one of the predominant technologies for capturing high fidelity human skeletal motion, and is a workhorse in a large number of areas such as bio-mechanics, film and video games. While most state-of-the-art systems can automatically identify and track markers on the larger parts of the human body, the markers attached to fingers provide unique challenges and usually require extensive manual cleanup. In this work we present a robust online method for identification and tracking of passive motion capture markers attached to the fingers of the hands. The method is especially suited for large capture volumes and sparse marker sets of 3 to 10 markers per hand. Once trained, our system can automatically initialize and track the markers, and the subject may exit and enter the capture volume at will. By using multiple assignment hypotheses and soft decisions, it can robustly recover from a difficult situation with many simultaneous occlusions and false observations (ghost markers). We evaluate the method on a collection of sparse marker sets commonly used in industry and in the research community. We also compare the results with two of the most widely used motion capture platforms: Motion Analysis Cortex and Vicon Blade. The results show that our method is better at attaining correct marker labels and is especially beneficial for real-time applications.

  • 14.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    O'Sullivan, Carol
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Real-time labeling of non-rigid motion capture marker sets. 2017. In: Computers & graphics, ISSN 0097-8493, E-ISSN 1873-7684, Vol. 69, no Supplement C, p. 59-67. Article in journal (Refereed)
    Abstract [en]

    Passive optical motion capture is one of the predominant technologies for capturing high fidelity human motion, and is a workhorse in a large number of areas such as bio-mechanics, film and video games. While most state-of-the-art systems can automatically identify and track markers on the larger parts of the human body, the markers attached to the fingers and face provide unique challenges and usually require extensive manual cleanup. In this work we present a robust online method for identification and tracking of passive motion capture markers attached to non-rigid structures. The method is especially suited for large capture volumes and sparse marker sets. Once trained, our system can automatically initialize and track the markers, and the subject may exit and enter the capture volume at will. By using multiple assignment hypotheses and soft decisions, it can robustly recover from a difficult situation with many simultaneous occlusions and false observations (ghost markers). In three experiments, we evaluate the method for labeling a variety of marker configurations for finger and facial capture. We also compare the results with two of the most widely used motion capture platforms: Motion Analysis Cortex and Vicon Blade. The results show that our method is better at attaining correct marker labels and is especially beneficial for real-time applications.

  • 15.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    O'Sullivan, Carol
    Neff, Michael
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Mimebot—Investigating the Expressibility of Non-Verbal Communication Across Agent Embodiments. 2017. In: ACM Transactions on Applied Perception, ISSN 1544-3558, E-ISSN 1544-3965, Vol. 14, no 4, article id 24. Article in journal (Refereed)
    Abstract [en]

    Unlike their human counterparts, artificial agents such as robots and game characters may be deployed with a large variety of face and body configurations. Some have articulated bodies but lack facial features, and others may be talking heads ending at the neck. Generally, they have many fewer degrees of freedom than humans through which they must express themselves, and there will inevitably be a filtering effect when mapping human motion onto the agent. In this article, we investigate filtering effects on three types of embodiments: (a) an agent with a body but no facial features, (b) an agent with a head only, and (c) an agent with a body and a face. We performed a full performance capture of a mime actor enacting short interactions varying the non-verbal expression along five dimensions (e.g., level of frustration and level of certainty) for each of the three embodiments. We performed a crowd-sourced evaluation experiment comparing the video of the actor to the video of an animated robot for the different embodiments and dimensions. Our findings suggest that the face is especially important to pinpoint emotional reactions but is also most volatile to filtering effects. The body motion, on the other hand, had more diverse interpretations but tended to preserve the interpretation after mapping and thus proved to be more resilient to filtering.

  • 16.
    Alexanderson, Simon
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Kucherenko, Taras
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Generating coherent spontaneous speech and gesture from text. 2020. In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020, Association for Computing Machinery (ACM), 2020. Conference paper (Refereed)
    Abstract [en]

    Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video.

  • 17.
    Ardal, Dui
    et al.
    KTH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Lempert, Mirko
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A Collaborative Previsualization Tool for Filmmaking in Virtual Reality. 2019. In: Proceedings - CVMP 2019: 16th ACM SIGGRAPH European Conference on Visual Media Production, ACM Digital Library, 2019. Conference paper (Refereed)
    Abstract [en]

    Previsualization is a process within pre-production of filmmaking where filmmakers can visually plan specific scenes with camera works, lighting, character movements, etc. The costs of computer graphics-based effects are substantial within film production. Using previsualization, these scenes can be planned in detail to reduce the amount of work put on effects in the later production phase. We develop and assess a prototype for previsualization in virtual reality for collaborative purposes where multiple filmmakers can be present in a virtual environment to share a creative work experience, remotely. By performing a within-group study on 20 filmmakers, our findings show that the use of virtual reality for distributed, collaborative previsualization processes is useful for real-life pre-production purposes.

  • 18.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. 2011. In: Proceedings of International Conference on Audio-Visual Speech Processing 2011 / [ed] Salvi, G.; Beskow, J.; Engwall, O.; Al Moubayed, S., Stockholm: KTH Royal Institute of Technology, 2011, p. 103-106. Conference paper (Refereed)
    Abstract [en]

    Spoken face-to-face interaction is a rich and complex form of communication that includes a wide array of phenomena that are not fully explored or understood. While there have been extensive studies on many aspects of face-to-face interaction, these are traditionally of a qualitative nature, relying on hand-annotated corpora, typically rather limited in extent, which is a natural consequence of the labour-intensive task of multimodal data annotation. In this paper we present a corpus of 60 hours of unrestricted Swedish face-to-face conversations recorded with audio, video and optical motion capture, and we describe a new project setting out to exploit primarily the kinetic data in this corpus in order to gain quantitative knowledge on human face-to-face interaction.

  • 19.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Stefanov, Kalin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Claesson, Britt
    Derbring, Sandra
    Fredriksson, Morgan
    The Tivoli System - A Sign-driven Game for Children with Communicative Disorders. 2013. Conference paper (Refereed)
  • 20.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Stefanov, Kalin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Claesson, Britt
    Derbring, Sandra
    Fredriksson, Morgan
    Starck, J.
    Axelsson, E.
    Tivoli - Learning Signs Through Games and Interaction for Children with Communicative Disorders. 2014. Conference paper (Refereed)
  • 21.
    Deichler, Anna
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Mehta, Shivam
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation. 2023. In: PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, Association for Computing Machinery (ACM), 2023, p. 755-762. Conference paper (Refereed)
    Abstract [en]

    This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim to learn a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved highest human-likeness and highest speech appropriateness rating among the submitted entries. This indicates that our system is a promising approach to achieve human-like co-speech gestures in agents that carry semantic meaning.

  • 22.
    Deichler, Anna
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Wang, Siyang
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Learning to generate pointing gestures in situated embodied conversational agents. 2023. In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1110534. Article in journal (Refereed)
    Abstract [en]

    One of the main goals of robotics and intelligent agent research is to enable robots and agents to communicate with humans in physically situated settings. Human communication consists of both verbal and non-verbal modes. Recent studies in enabling communication for intelligent agents have focused on verbal modes, i.e., language and speech. However, in a situated setting the non-verbal mode is crucial for an agent to adapt flexible communication strategies. In this work, we focus on learning to generate non-verbal communicative expressions in situated embodied interactive agents. Specifically, we show that an agent can learn pointing gestures in a physically simulated environment through a combination of imitation and reinforcement learning that achieves high motion naturalness and high referential accuracy. We compared our proposed system against several baselines in both subjective and objective evaluations. The subjective evaluation is done in a virtual reality setting where an embodied referential game is played between the user and the agent in a shared 3D space, a setup that fully assesses the communicative capabilities of the generated gestures. The evaluations show that our model achieves a higher level of referential accuracy and motion naturalness compared to a state-of-the-art supervised learning motion synthesis model, showing the promise of our proposed system that combines imitation and reinforcement learning for generating communicative gestures. Additionally, our system is robust in a physically simulated environment and thus has the potential of being applied to robots.

  • 23.
    Deichler, Anna
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Wang, Siyang
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation. 2022. Conference paper (Refereed)
    Abstract [en]

    Pointing is an important mode of interaction with robots. While large amounts of prior studies focus on recognition of human pointing, there is a lack of investigation into generating context-aware human-like pointing gestures, a shortcoming we hope to address. We first collect a rich dataset of human pointing gestures and corresponding pointing target locations with accurate motion capture. Analysis of the dataset shows that it contains various pointing styles, handedness, and well-distributed target positions in surrounding 3D space in both single-target pointing scenario and two-target point-and-place. We then train reinforcement learning (RL) control policies in physically realistic simulation to imitate the pointing motion in the dataset while maximizing pointing precision reward. We show that our RL motion imitation setup allows models to learn human-like pointing dynamics while maximizing task reward (pointing precision). This is promising for incorporating additional context in the form of task reward to enable flexible context-aware pointing behaviors in a physically realistic environment while retaining human-likeness in pointing motion dynamics.

  • 24.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustavsson, Lisa
    Heldner, Mattias
    (Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics).
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Kallionen, Petter
    Marklund, Ellen
    3rd party observer gaze as a continuous measure of dialogue flow. 2012. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey: European Language Resources Association, 2012, p. 1354-1358. Conference paper (Refereed)
    Abstract [en]

    We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency of speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication. In addition, the results also suggest that there might be differences in the distribution of 3rd party observer gaze depending on how information-rich an utterance is.

  • 25.
    Frid, Emma
    et al.
    KTH, School of Computer Science and Communication (CSC), Media Technology and Interaction Design, MID.
    Bresin, Roberto
    KTH, School of Computer Science and Communication (CSC), Media Technology and Interaction Design, MID.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Perception of Mechanical Sounds Inherent to Expressive Gestures of a NAO Robot - Implications for Movement Sonification of Humanoids. 2018. In: Proceedings of the 15th Sound and Music Computing Conference / [ed] Anastasia Georgaki and Areti Andreopoulou, Limassol, Cyprus, 2018. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a pilot study carried out within the project SONAO. The SONAO project aims to compensate for limitations in robot communicative channels with an increased clarity of Non-Verbal Communication (NVC) through expressive gestures and non-verbal sounds. More specifically, the purpose of the project is to use movement sonification of expressive robot gestures to improve Human-Robot Interaction (HRI). The pilot study described in this paper focuses on mechanical robot sounds, i.e. sounds that have not been specifically designed for HRI but are inherent to robot movement. Results indicated a low correspondence between perceptual ratings of mechanical robot sounds and emotions communicated through gestures. In general, the mechanical sounds themselves appeared not to carry much emotional information compared to video stimuli of expressive gestures. However, some mechanical sounds did communicate certain emotions, e.g. frustration. In general, the sounds appeared to communicate arousal more effectively than valence. We discuss potential issues and possibilities for the sonification of expressive robot gestures and the role of mechanical sounds in such a context. Emphasis is put on the need to mask or alter sounds inherent to robot movement, using for example blended sonification.

  • 26.
    Henter, Gustav Eje
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Moglow: Probabilistic and controllable motion synthesis using normalising flows. 2019. Manuscript (preprint) (Other academic)
    Abstract [en]

    Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive assumptions such as the motion being cyclic in nature. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method attains a motion quality close to recorded motion capture for both humans and animals.

  • 27.
    Henter, Gustav Eje
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    MoGlow: Probabilistic and controllable motion synthesis using normalising flows. 2020. In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 39, no 6, p. 1-14, article id 236. Article in journal (Refereed)
    Abstract [en]

    Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.

  • 28.
    House, David
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On the temporal domain of co-speech gestures: syllable, phrase or talk spurt? 2015. In: Proceedings of Fonetik 2015 / [ed] Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J., 2015, p. 63-68. Conference paper (Other academic)
    Abstract [en]

    This study explores the use of automatic methods to detect and extract hand gesture movement co-occurring with speech. Two spontaneous dyadic dialogues were analyzed using 3D motion-capture techniques to track hand movement. Automatic speech/non-speech detection was performed on the dialogues resulting in a series of connected talk spurts for each speaker. Temporal synchrony of onset and offset of gesture and speech was studied between the automatic hand gesture tracking and talk spurts, and compared to an earlier study of head nods and syllable synchronization. The results indicated onset synchronization between head nods and the syllable in the short temporal domain and between the onset of longer gesture units and the talk spurt in a more extended temporal domain.

  • 29.
    Kammerlander, Robin K.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Using Virtual Reality to Support Acting in Motion Capture with Differently Scaled Characters. 2021. In: 2021 IEEE VIRTUAL REALITY AND 3D USER INTERFACES (VR), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 402-410. Conference paper (Refereed)
    Abstract [en]

    Motion capture is a well-established technology for capturing actors' movements and performances within the entertainment industry. Many actors, however, witness the poor acting conditions associated with such recordings. Instead of detailed sets, costumes and props, they are forced to play in empty spaces wearing tight suits. Often, their co-actors will be imaginary, replaced by placeholder props, or they would be out of scale with their virtual counterparts. These problems do not only affect acting, they also cause an abundance of laborious post-processing clean-up work. To solve these challenges, we propose using a combination of virtual reality and motion capture technology to bring differently proportioned virtual characters into a shared collaborative virtual environment. A within-subjects user study with trained actors showed that our proposed platform enhances their feelings of body ownership and immersion. This in turn changed actors' performances which narrowed the gap between virtual performances and final intended animations.

  • 30.
    Karipidou, Kelly
    et al.
    KTH, School of Computer Science and Communication (CSC), Robotics, perception and learning, RPL.
    Ahnlund, Josefin
    KTH, School of Computer Science and Communication (CSC), Robotics, perception and learning, RPL.
    Friberg, Anders
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Robotics, perception and learning, RPL.
    Computer Analysis of Sentiment Interpretation in Musical Conducting. 2017. In: Proceedings - 12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017, IEEE, 2017, p. 400-405, article id 7961769. Conference paper (Refereed)
    Abstract [en]

    This paper presents a unique dataset consisting of 20 recordings of the same musical piece, conducted with 4 different musical intentions in mind. The upper body and baton motion of a professional conductor was recorded, as well as the sound of each instrument in a professional string quartet following the conductor. The dataset is made available for benchmarking of motion recognition algorithms. An HMM-based emotion intent classification method is trained with subsets of the data, and classification of other subsets of the data shows firstly that the motion of the baton communicates energetic intention to a high degree, secondly, that the conductor’s torso, head and other arm convey calm intention to a high degree, and thirdly, that positive vs negative sentiments are communicated to a high degree through other channels than the body and baton motion – most probably, through facial expression and muscle tension conveyed through articulated hand and finger motion. The long-term goal of this work is to develop a computer model of the entire conductor-orchestra communication process; the studies presented here indicate that computer modeling of the conductor-orchestra communication is feasible.

  • 31.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Avramova, Vanya
    KTH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data, in that we captured speech, eye gaze and gesture data using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion, from the perspectives of both a speaker and a listener. We annotated the groups’ object references during object manipulation tasks and analysed the group’s proportional referential eye-gaze with regards to the referent object. When investigating the distributions of gaze during and before referring expressions we could corroborate the differences in time between speakers’ and listeners’ eye gaze found in earlier studies. This corpus is of particular interest to researchers who are interested in social eye-gaze patterns in turn-taking and referring language in situated multi-party interaction.

  • 32.
    Kucherenko, Taras
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    van Waveren, Sanne
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Gesticulator: A framework for semantically-aware speech-driven gesture generation. 2020. In: ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2020. Conference paper (Refereed)
    Abstract [en]

    During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying “high”): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page svito-zar.github.io/gesticula

  • 33.
    Valle-Perez, Guillermo
    et al.
    Univ Bordeaux, Ensta ParisTech, Bordeaux, France..
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Holzapfel, Andre
    KTH, School of Electrical Engineering and Computer Science (EECS), Human Centered Technology, Media Technology and Interaction Design, MID.
    Oudeyer, Pierre-Yves
    Univ Bordeaux, Ensta ParisTech, Bordeaux, France..
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Transflower: probabilistic autoregressive dance generation with multimodal attention. 2021. In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 40, no 6, article id 195. Article in journal (Refereed)
    Abstract [en]

    Dance requires skillful composition of complex movements that follow rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context, using a multimodal transformer encoder. Second, we introduce the currently largest 3D dance-motion dataset, obtained with a variety of motion-capture technologies, and including both professional and casual dancers. Using this dataset, we compare our new model against two baselines, via objective metrics and a user study, and show that both the ability to model a probability distribution, as well as being able to attend over a large motion and music context are necessary to produce interesting, diverse, and realistic dance that matches the music.

  • 34.
    Vijayan, Aravind Elanjimattathil
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Using Constrained Optimization for Real-Time Synchronization of Verbal and Nonverbal Robot Behavior. 2018. In: 2018 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE Computer Society, 2018, p. 1955-1961. Conference paper (Refereed)
    Abstract [en]

    Most of the motion re-targeting techniques are grounded on virtual character animation research, which means that they typically assume that the target embodiment has unconstrained joint angular velocities. However, because robots often do have such constraints, traditional re-targeting approaches can originate irregular delays in the robot motion. With the goal of ensuring synchronization between verbal and nonverbal behavior, this paper proposes an optimization framework for processing re-targeted motion sequences that addresses constraints such as joint angle and angular velocities. The proposed framework was evaluated on a humanoid robot using both objective and subjective metrics. While the analysis of the joint motion trajectories provides evidence that our framework successfully performs the desired modifications to ensure verbal and nonverbal behavior synchronization, results from a perceptual study showed that participants found the robot motion generated by our method more natural, elegant and lifelike than a control condition.

  • 35.
    Wang, Siyang
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafsson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Integrated Speech and Gesture Synthesis. 2021. In: ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2021, p. 177-185. Conference paper (Refereed)
    Abstract [en]

    Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified from state-of-the-art neural speech-synthesis engines to achieve this goal. We evaluate the models in three carefully designed user studies: two that evaluate the synthesized speech and gesture in isolation, and a combined study that evaluates the models as they would be used in real-world applications, with speech and gesture presented together. The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system we compare against, in all three tests. The model achieves this with faster synthesis time and a greatly reduced parameter count compared to the pipeline system, illustrating some of the potential benefits of treating speech and gesture synthesis together as a single, unified problem.

  • 36.
    Zellers, M.
    et al.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Prosody and hand gesture at turn boundaries in Swedish. 2016. In: Proceedings of the International Conference on Speech Prosody, International Speech Communications Association, 2016, p. 831-835. Conference paper (Refereed)
    Abstract [en]

    To ensure smooth turn-taking between conversational participants, interlocutors must have ways of providing information to one another about whether they have finished speaking or intend to continue. The current work investigates Swedish speakers’ use of hand gestures in conjunction with turn change or turn hold in unrestricted, spontaneous speech. As has been reported by other researchers, we find that speakers’ gestures end before the end of speech in cases of turn change, while they may extend well beyond the end of a given speech chunk in the case of turn hold. We investigate the degree to which prosodic cues and gesture cues to turn transition in Swedish face-to-face conversation are complementary versus additive. The co-occurrence of acoustic prosodic features and gesture at potential turn boundaries gives strong support for considering hand gestures as part of the prosodic system, particularly in the context of discourse-level information such as maintaining smooth turn transitions.
