Publications (10 of 165)
Deichler, A., Mehta, S., Alexanderson, S. & Beskow, J. (2023). Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation. In: PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023: . Paper presented at 25th International Conference on Multimodal Interaction (ICMI), OCT 09-13, 2023, Sorbonne Univ, Paris, FRANCE (pp. 755-762). Association for Computing Machinery (ACM)
Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
2023 (English). In: PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, Association for Computing Machinery (ACM), 2023, p. 755-762. Conference paper, Published paper (Refereed)
Abstract [en]

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved the highest human-likeness and highest speech appropriateness ratings among the submitted entries. This indicates that our system is a promising approach for achieving human-like, semantically meaningful co-speech gestures in agents.
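
For readers unfamiliar with contrastive pretraining, the sketch below shows a symmetric, CLIP-style contrastive objective over paired speech and motion embeddings of the kind a CSMP-like module could optimise. It is an illustrative assumption, not the authors' implementation; the encoder outputs, batch pairing, and temperature value are all hypothetical.

```python
# Illustrative CLIP-style contrastive objective for paired speech/motion windows
# (a sketch of the general technique, not the authors' CSMP code).
import torch
import torch.nn.functional as F

def contrastive_speech_motion_loss(speech_emb, motion_emb, temperature=0.07):
    """speech_emb, motion_emb: (batch, dim) embeddings of time-aligned windows."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = speech_emb @ motion_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each speech window should match its own motion window and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```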

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
gesture generation, motion synthesis, diffusion models, contrastive pre-training, semantic gestures
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-343773 (URN), 10.1145/3577190.3616117 (DOI), 001147764700093, 2-s2.0-85170496681 (Scopus ID)
Conference
25th International Conference on Multimodal Interaction (ICMI), OCT 09-13, 2023, Sorbonne Univ, Paris, FRANCE
Note

Part of proceedings ISBN 979-8-4007-0055-2

QC 20240222

Available from: 2024-02-22 Created: 2024-02-22 Last updated: 2024-03-05. Bibliographically approved
Gustafsson, J., Székely, É. & Beskow, J. (2023). Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters. In: 23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023): . Paper presented at 23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023), Würzburg, Germany, Jan 5 2023 - Jan 8 2023. Institute of Electrical and Electronics Engineers (IEEE)
Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters
2023 (English). In: 23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Engaging embodied conversational agents need to generate expressive behavior in order to be believable in socializing interactions. We present a system that can generate spontaneous speech with supporting lip movements. The neural conversational TTS voice is trained on a multi-style speech corpus that has been prosodically tagged (pitch and speaking rate) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm where articulatory effort can be adjusted. The facial animation is driven by time-stamped phonemes and prominence estimates from the synthesised speech waveform to modulate the lip and jaw movements accordingly. In objective evaluations we show that the system is able to generate speech and facial animation that vary in articulation effort. In subjective evaluations we compare our conversational TTS system’s capability to deliver jokes with a commercial TTS. Both systems succeeded equally well.
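
As a rough illustration of how an adjustable articulatory-effort parameter and per-phoneme prominence might modulate lip and jaw movements, consider the sketch below; the data layout, function name, and scaling constants are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: scale per-phoneme jaw/lip opening targets by a global
# articulatory-effort parameter and local prominence (illustrative only).
def scale_viseme_targets(viseme_targets, prominence, effort=1.0):
    """viseme_targets: list of dicts with 'jaw' and 'lip' openings in [0, 1];
    prominence: per-phoneme prominence estimates in [0, 1];
    effort: global multiplier (e.g. 0.5 = hypo-articulated, 1.5 = hyper-articulated)."""
    scaled = []
    for target, prom in zip(viseme_targets, prominence):
        gain = effort * (0.8 + 0.4 * prom)   # prominent segments get larger movements
        scaled.append({"jaw": min(1.0, target["jaw"] * gain),
                       "lip": min(1.0, target["lip"] * gain)})
    return scaled
```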

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
National Category
Language Technology (Computational Linguistics); Robotics
Identifiers
urn:nbn:se:kth:diva-341039 (URN), 10.1145/3570945.3607289 (DOI), 2-s2.0-85183581153 (Scopus ID)
Conference
23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023), Würzburg, Germany, Jan 5 2023 - Jan 8 2023
Note

Part of ISBN 9798350345445

QC 20231124

Available from: 2023-12-19 Created: 2023-12-19 Last updated: 2024-02-09. Bibliographically approved
Miniotaitė, J., Wang, S., Beskow, J., Gustafson, J., Székely, É. & Abelho Pereira, A. T. (2023). Hi robot, it's not what you say, it's how you say it. In: 2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN: . Paper presented at 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), AUG 28-31, 2023, Busan, SOUTH KOREA (pp. 307-314). Institute of Electrical and Electronics Engineers (IEEE)
Hi robot, it's not what you say, it's how you say it
2023 (English). In: 2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 307-314. Conference paper, Published paper (Refereed)
Abstract [en]

Many robots use their voice to communicate with people in spoken language, but the voices commonly used for robots are often optimized for transactional interactions, rather than social ones. This can limit their ability to create engaging and natural interactions. To address this issue, we designed a spontaneous text-to-speech tool and used it to author natural and spontaneous robot speech. A crowdsourcing evaluation methodology is proposed to compare this type of speech to natural speech and state-of-the-art text-to-speech technology, both in disembodied and embodied form. We created speech samples in a naturalistic setting of people playing tabletop games and conducted a user study evaluating Naturalness, Intelligibility, Social Impression, Prosody, and Perceived Intelligence. The speech samples were chosen to represent three contexts that are common in tabletop games, and the contexts were introduced to the participants who evaluated the speech samples. The study results show that the proposed evaluation methodology allowed for a robust analysis that successfully compared the different conditions. Moreover, the spontaneous voice met our target design goal of being perceived as more natural than a leading commercial text-to-speech system.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Series
IEEE RO-MAN, ISSN 1944-9445
Keywords
speech synthesis, human-robot interaction, embodiment, spontaneous speech, intelligibility, naturalness
National Category
Interaction Technologies
Identifiers
urn:nbn:se:kth:diva-341972 (URN), 10.1109/RO-MAN57019.2023.10309427 (DOI), 001108678600044, 2-s2.0-85186982397 (Scopus ID)
Conference
32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), AUG 28-31, 2023, Busan, SOUTH KOREA
Note

Part of proceedings ISBN 979-8-3503-3670-2

Available from: 2024-01-09 Created: 2024-01-09 Last updated: 2024-03-22. Bibliographically approved
Deichler, A., Wang, S., Alexanderson, S. & Beskow, J. (2023). Learning to generate pointing gestures in situated embodied conversational agents. Frontiers in Robotics and AI, 10, Article ID 1110534.
Learning to generate pointing gestures in situated embodied conversational agents
2023 (English). In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1110534. Article in journal (Refereed), Published
Abstract [en]

One of the main goals of robotics and intelligent agent research is to enable robots and agents to communicate with humans in physically situated settings. Human communication consists of both verbal and non-verbal modes. Recent studies in enabling communication for intelligent agents have focused on verbal modes, i.e., language and speech. However, in a situated setting the non-verbal mode is crucial for an agent to adapt flexible communication strategies. In this work, we focus on learning to generate non-verbal communicative expressions in situated embodied interactive agents. Specifically, we show that an agent can learn pointing gestures in a physically simulated environment through a combination of imitation and reinforcement learning that achieves high motion naturalness and high referential accuracy. We compared our proposed system against several baselines in both subjective and objective evaluations. The subjective evaluation is done in a virtual reality setting where an embodied referential game is played between the user and the agent in a shared 3D space, a setup that fully assesses the communicative capabilities of the generated gestures. The evaluations show that our model achieves a higher level of referential accuracy and motion naturalness compared to a state-of-the-art supervised learning motion synthesis model, showing the promise of our proposed system that combines imitation and reinforcement learning for generating communicative gestures. Additionally, our system is robust in a physically simulated environment and thus has the potential to be applied to robots.
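
A common way to combine imitation and task objectives in this kind of setup is a weighted per-step reward with a pose-tracking term and a pointing-precision term; the sketch below illustrates that general idea with hypothetical weights and distance measures, and is not the authors' reward function.

```python
# Illustrative per-step reward mixing motion imitation and pointing precision
# (weights, scales and distance measures are assumptions, not the paper's values).
import numpy as np

def pointing_reward(agent_pose, ref_pose, fingertip_pos, target_pos,
                    w_imitate=0.7, w_point=0.3, sigma_pose=0.5, sigma_point=0.3):
    pose_err = np.linalg.norm(agent_pose - ref_pose)        # deviation from the mocap reference
    point_err = np.linalg.norm(fingertip_pos - target_pos)  # distance from fingertip to target
    r_imitate = np.exp(-(pose_err / sigma_pose) ** 2)
    r_point = np.exp(-(point_err / sigma_point) ** 2)
    return w_imitate * r_imitate + w_point * r_point
```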

Place, publisher, year, edition, pages
Frontiers Media SA, 2023
Keywords
reinforcement learning, imitation learning, non-verbal communication, embodied interactive agents, gesture generation, physics-aware machine learning
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-326625 (URN), 10.3389/frobt.2023.1110534 (DOI), 000970385800001, 37064574 (PubMedID), 2-s2.0-85153351800 (Scopus ID)
Note

QC 20230508

Available from: 2023-05-08 Created: 2023-05-08 Last updated: 2023-05-08. Bibliographically approved
Alexanderson, S., Nagy, R., Beskow, J. & Henter, G. E. (2023). Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Transactions on Graphics, 42(4), Article ID 44.
Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
2023 (English). In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 42, no 4, article id 44. Article in journal (Refereed), Published
Abstract [en]

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.
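
Classifier-free guidance, referenced in the abstract, mixes conditional and unconditional denoiser predictions so that the strength of the conditioning signal (for example, a style label) can be scaled at synthesis time. The generic sketch below shows the standard formulation; the `model` interface is an assumption, not the paper's code.

```python
# Generic classifier-free guidance step (standard formulation; illustrative interface).
def guided_prediction(model, x_t, t, cond, guidance_scale=1.0):
    """guidance_scale = 0 -> unconditional, 1 -> plain conditional,
    > 1 -> exaggerated conditioning (e.g. stronger stylistic expression)."""
    eps_uncond = model(x_t, t, cond=None)   # denoiser output without conditioning
    eps_cond = model(x_t, t, cond=cond)     # denoiser output with conditioning
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```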

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
conformers, dance, diffusion models, ensemble models, generative models, gestures, guided interpolation, locomotion, machine learning, product of experts
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-335345 (URN), 10.1145/3592458 (DOI), 001044671300010, 2-s2.0-85166332883 (Scopus ID)
Note

QC 20230907

Available from: 2023-09-07 Created: 2023-09-07 Last updated: 2023-09-22. Bibliographically approved
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É. & Henter, G. E. (2023). OverFlow: Putting flows on top of neural transducers for better TTS. In: Interspeech 2023: . Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023 (pp. 4279-4283). International Speech Communication Association
OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 4279-4283. Conference paper, Published paper (Refereed)
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
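
The exact maximum-likelihood training mentioned above follows from the change-of-variables formula: an invertible flow maps acoustic frames to latents that the neural HMM scores, and the log-determinant of the Jacobian is added to the HMM log-likelihood. The sketch below assumes hypothetical `flow` and `hmm` objects and is not the OverFlow implementation.

```python
# Illustrative change-of-variables log-likelihood for a flow on top of an HMM
# acoustic model (object interfaces are assumed, not OverFlow's API).
def flow_hmm_log_likelihood(x, flow, hmm):
    """flow.forward(x) -> (z, log_det_jacobian); hmm.log_prob(z) -> HMM log-likelihood of z."""
    z, log_det = flow.forward(x)       # invertible transform of the acoustic frames
    return hmm.log_prob(z) + log_det   # exact log p(x) by change of variables
```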

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, Probabilistic TTS
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-338584 (URN), 10.21437/Interspeech.2023-1996 (DOI), 2-s2.0-85167953412 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231107

Available from: 2023-11-07 Created: 2023-11-07 Last updated: 2023-11-07. Bibliographically approved
Cohn, M., Keaton, A., Beskow, J. & Zellou, G. (2023). Vocal accommodation to technology: the role of physical form. Language sciences (Oxford), 99, Article ID 101567.
Vocal accommodation to technology: the role of physical form
2023 (English). In: Language sciences (Oxford), ISSN 0388-0001, E-ISSN 1873-5746, Vol. 99, article id 101567. Article in journal (Refereed), Published
Abstract [en]

This study examines participants’ vocal accommodation toward text-to-speech (TTS) voices produced by three devices, varying in the extent to which they embody a human form. Thirty-eight speakers shadowed words produced by a male and female TTS voice presented across three physical forms: an Amazon Echo smart speaker (least human-like), Nao robot (slightly more human-like), and a Furhat robot (more human-like). Ninety-six independent raters completed a separate AXB perceptual similarity assessment, which provides a holistic evaluation of accommodation. Results show convergence to the voices across all physical forms; convergence is even stronger toward the female TTS voice when presented in the Echo smart speaker form, consistent with participants' higher rated likability and lower creepiness of the Echo. We interpret our findings through the lens of communication accommodation theory (CAT), providing support for accounts of speech communication and human–computer interaction frameworks.

Place, publisher, year, edition, pages
Elsevier Ltd, 2023
Keywords
Communication accommodation theory, Human-computer interaction, Physical form, Speech accommodation
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-334351 (URN), 10.1016/j.langsci.2023.101567 (DOI), 001046831500001, 2-s2.0-85164732638 (Scopus ID)
Note

QC 20230821

Available from: 2023-08-21 Created: 2023-08-21 Last updated: 2023-09-01. Bibliographically approved
Mehta, S., Székely, É., Beskow, J. & Henter, G. E. (2022). Neural HMMs are all you need (for high-quality attention-free TTS). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): . Paper presented at 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), MAY 23-27, 2022, Singapore, Singapore (pp. 7457-7461). IEEE Signal Processing Society
Neural HMMs are all you need (for high-quality attention-free TTS)
2022 (English). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, 2022, p. 7457-7461. Conference paper, Published paper (Refereed)
Abstract [en]

Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.
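
For illustration, the full-sequence likelihood of a left-right, no-skip HMM can be computed exactly with the standard forward recursion, in which each state either self-loops or advances by one. The generic log-domain sketch below is not the paper's model; its argument layout is an assumption.

```python
# Generic log-domain forward algorithm for a left-right, no-skip HMM
# (illustrates exact sequence-likelihood computation; not the paper's code).
import numpy as np

def forward_log_likelihood(log_obs, log_stay, log_move):
    """log_obs: (T, N) per-frame observation log-probs for N states;
    log_stay, log_move: (N,) log-probabilities of self-loop and advance transitions."""
    T, N = log_obs.shape
    alpha = np.full(N, -np.inf)
    alpha[0] = log_obs[0, 0]                         # alignment must start in the first state
    for t in range(1, T):
        new = np.full(N, -np.inf)
        for j in range(N):
            stay = alpha[j] + log_stay[j]
            move = alpha[j - 1] + log_move[j - 1] if j > 0 else -np.inf
            new[j] = np.logaddexp(stay, move) + log_obs[t, j]
        alpha = new
    return alpha[N - 1]                              # alignment must end in the last state
```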

Place, publisher, year, edition, pages
IEEE Signal Processing Society, 2022
Series
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ISSN 2379-190X
Keywords
seq2seq, attention, HMMs, duration modelling, acoustic modelling
National Category
Language Technology (Computational Linguistics); Probability Theory and Statistics; Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-312455 (URN), 10.1109/ICASSP43922.2022.9746686 (DOI), 000864187907152, 2-s2.0-85131260082 (Scopus ID)
Conference
47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), MAY 23-27, 2022, Singapore, Singapore
Funder
Knut and Alice Wallenberg Foundation, WASP
Note

Part of proceedings: ISBN 978-1-6654-0540-9

QC 20220601

Available from: 2022-05-18 Created: 2022-05-18 Last updated: 2023-01-12. Bibliographically approved
Moell, B., O'Regan, J., Mehta, S., Kirkland, A., Lameris, H., Gustafsson, J. & Beskow, J. (2022). Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge. In: Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser (Ed.), The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments. Paper presented at 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022 (pp. 62-70). Marseille, France
Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
2022 (English). In: The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments / [ed] Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser, Marseille, France, 2022, p. 62-70. Conference paper, Published paper (Refereed)
Abstract [en]

As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually-transcribed speech from non-aphasic speakers (TIMIT) improves performance when Room Impulse Response is used to augment the data. The best performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% compared to the baseline model on the primary outcome measurement. We show that data augmentation, larger model size, and additional non-aphasic data sources can be helpful in improving automatic phoneme recognition models for people with aphasia.
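
As an example of the kind of augmentation mentioned, a waveform can be pitch-shifted by a few semitones before fine-tuning; the librosa calls below are real, but the surrounding function and the 16 kHz assumption are illustrative rather than the authors' pipeline.

```python
# Hypothetical pitch-shift augmentation for speech prior to wav2vec 2.0 fine-tuning.
import librosa

def pitch_shift_augment(path, n_steps=2):
    """Load audio at 16 kHz (the rate wav2vec 2.0 expects) and shift pitch by n_steps semitones."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps), sr
```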

Place, publisher, year, edition, pages
Marseille, France: , 2022
Keywords
aphasia, data augmentation, phoneme transcription, phonemes, speech, speech data augmentation, wav2vec 2.0
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314262 (URN), 2-s2.0-85145876107 (Scopus ID)
Conference
4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022
Note

QC 20220815

Available from: 2022-06-17 Created: 2022-06-17 Last updated: 2023-08-14. Bibliographically approved
Deichler, A., Wang, S., Alexanderson, S. & Beskow, J. (2022). Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation. Paper presented at Context-Awareness in Human-Robot Interaction: Approaches and Challenges, workshop at 2022 ACM/IEEE International Conference on Human-Robot Interaction.
Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation
2022 (English). Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

Pointing is an important mode of interaction with robots. While a large body of prior work focuses on recognition of human pointing, there is a lack of investigation into generating context-aware human-like pointing gestures, a shortcoming we hope to address. We first collect a rich dataset of human pointing gestures and corresponding pointing target locations with accurate motion capture. Analysis of the dataset shows that it contains various pointing styles, handedness, and well-distributed target positions in surrounding 3D space in both the single-target pointing scenario and two-target point-and-place. We then train reinforcement learning (RL) control policies in physically realistic simulation to imitate the pointing motion in the dataset while maximizing pointing precision reward. We show that our RL motion imitation setup allows models to learn human-like pointing dynamics while maximizing task reward (pointing precision). This is promising for incorporating additional context in the form of task reward to enable flexible context-aware pointing behaviors in a physically realistic environment while retaining human-likeness in pointing motion dynamics.

Keywords
motion generation, reinforcement learning, referring actions, pointing gestures, human-robot interaction, motion capture
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-313480 (URN)
Conference
Context-Awareness in Human-Robot Interaction: Approaches and Challenges, workshop at 2022 ACM/IEEE International Conference on Human-Robot Interaction
Note

QC 20220607

Available from: 2022-06-03 Created: 2022-06-03 Last updated: 2022-06-25. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-1399-6604
