Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
Deichler, Anna (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH). ORCID iD: 0000-0003-3135-5683
Mehta, Shivam (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH). ORCID iD: 0000-0002-1886-681X
Alexanderson, Simon (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH). ORCID iD: 0000-0002-7801-7617
Beskow, Jonas (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH). ORCID iD: 0000-0003-1399-6604
2023 (English). In: Proceedings of the 25th International Conference on Multimodal Interaction (ICMI 2023), Association for Computing Machinery (ACM), 2023, p. 755-762. Conference paper, Published paper (Refereed).
Abstract [en]

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of capturing the semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach for generating human-like co-speech gestures that carry semantic meaning in embodied agents.
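
The CSMP module itself is not specified in this record. As a rough illustration only, the sketch below shows a generic CLIP-style symmetric contrastive (InfoNCE) objective over paired speech and motion embeddings, which is the standard formulation such a pretraining module typically builds on. The encoder outputs, feature dimension, and temperature value are assumptions, not the authors' settings.

```python
# Minimal sketch of a CLIP-style symmetric contrastive objective for
# paired speech and motion windows. This is NOT the authors' exact CSMP
# module; encoder architectures, dimensions, and the temperature below
# are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(speech_emb: torch.Tensor,
                     motion_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of aligned (speech, motion) pairs.

    speech_emb, motion_emb: (batch, dim) outputs of two modality encoders.
    Matching rows are positives; all other pairings act as negatives.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = speech_emb @ motion_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: speech->motion and motion->speech.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The symmetric form pulls matching speech and gesture segments together in a shared space while pushing mismatched pairs apart, which is what lets the resulting embedding encode a semantic coupling between the two modalities.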

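The abstract states that this embedding is then fed as a conditioning signal to a diffusion-based gesture model. The following is a minimal, hypothetical DDPM-style sketch of that idea: a placeholder MLP denoiser and a toy noise schedule stand in for the existing motion synthesis model the paper builds on; they do not reproduce its architecture.

```python
# Hedged sketch: conditioning a diffusion denoiser on a speech/motion
# joint embedding. The network and schedule are illustrative stand-ins,
# not the paper's actual model.
import torch
import torch.nn as nn


class ConditionalDenoiser(nn.Module):
    def __init__(self, motion_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, noisy_motion, t, cond):
        """Predict the noise added to a motion frame.

        noisy_motion: (batch, motion_dim) diffused pose features
        t:            (batch, 1) normalized diffusion timestep
        cond:         (batch, cond_dim) CSMP-style speech embedding
        """
        return self.net(torch.cat([noisy_motion, cond, t], dim=-1))


def training_step(model, motion, cond, num_steps: int = 1000):
    """Standard DDPM-style step: diffuse the motion at a random timestep
    and regress the noise, with the speech embedding as conditioning."""
    t = torch.randint(0, num_steps, (motion.size(0), 1))
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2  # toy schedule
    noise = torch.randn_like(motion)
    noisy = alpha_bar.sqrt() * motion + (1 - alpha_bar).sqrt() * noise
    pred = model(noisy, t.float() / num_steps, cond)
    return nn.functional.mse_loss(pred, noise)
```

At sampling time the same conditioning vector would steer iterative denoising from pure noise toward gestures that match the input speech, which is how a diffusion model achieves semantically aware generation.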
Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023. p. 755-762
Keywords [en]
gesture generation, motion synthesis, diffusion models, contrastive pre-training, semantic gestures
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:kth:diva-343773
DOI: 10.1145/3577190.3616117
ISI: 001147764700093
Scopus ID: 2-s2.0-85170496681
OAI: oai:DiVA.org:kth-343773
DiVA, id: diva2:1840233
Conference
25th International Conference on Multimodal Interaction (ICMI), October 9-13, 2023, Sorbonne University, Paris, France
Note

Part of proceedings ISBN 979-8-4007-0055-2

QC 20240222

Available from: 2024-02-22. Created: 2024-02-22. Last updated: 2024-03-05. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text | Scopus
