Grounded Gesture Generation: Language, Motion and Space
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing. ORCID iD: 0000-0003-3135-5683
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing. ORCID iD: 0000-0003-2598-6868
Sorbonne University.
KTH.
2025 (English) In: Article, review/survey (Other academic) Accepted
Abstract [en]

Human motion generation has advanced rapidly in recent years, yet the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation, such as locomotion and object interaction, or in isolated co-speech gesture synthesis aligned with utterance semantics. However, both lines of work often treat motion and environmental grounding separately, limiting advances toward embodied, communicative agents. To address this gap, our work introduces a multimodal dataset and framework for grounded gesture generation, combining two key resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset capturing two-party dialogues. Together, they provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Our framework further connects to a physics-based simulator, enabling synthetic data generation and situated evaluation. By bridging gesture modeling and spatial grounding, our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction.

Place, publisher, year, edition, pages
2025.
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-382213
OAI: oai:DiVA.org:kth-382213
DiVA, id: diva2:2062264
Conference
Workshop on Humanoid Agents (HSI), IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025, Nashville, USA, Jun 11, 2025
Note

QC 20260525

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-25 Bibliographically approved
In thesis
1. Spatially Grounded Communication in Embodied Agents: From Gesture Generation to Referential Understanding
2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

When a person says "put that over there" while pointing at a shelf, the meaning depends on the spatial relationship between speaker, listener, and shared physical scene. Embodied agents that participate in such interactions must both produce spatially grounded gestures and interpret multimodal references. Yet these capabilities have largely been studied in isolation, with separate data, methods, and evaluation paradigms.

This thesis argues that gesture generation and referential grounding are two sides of the same communicative process, and that studying them jointly reveals structure that neither subfield surfaces alone. The argument is developed across seven papers. On the production side, contrastive speech-motion pretraining enables semantically aware co-speech gesture generation, while reinforcement learning with adversarial motion priors produces pointing gestures that are both spatially accurate and motorically natural, outperforming supervised baselines in a human referential identification study. A flow-matching architecture further combines semantic and spatial conditioning within a single generative system through distinct pathways.
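
The contrastive speech-motion pretraining mentioned above can be illustrated with a symmetric InfoNCE objective, a common choice for aligning two modalities. The record does not state which loss the thesis uses, so this is only a minimal sketch under that assumption; the function names, the toy embeddings, and the temperature value of 0.07 are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(row, index):
    """Return log softmax of `row` evaluated at `index`, computed stably."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_sum


def symmetric_info_nce(speech_embs, motion_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Pair i (speech_embs[i], motion_embs[i]) is treated as the positive;
    every other pairing in the batch acts as a negative.
    Assumption: this is an illustrative loss, not the thesis's exact objective.
    """
    speech = [l2_normalize(v) for v in speech_embs]
    motion = [l2_normalize(v) for v in motion_embs]
    n = len(speech)
    # Similarity matrix: rows index speech clips, columns index motion clips.
    logits = [
        [sum(a * b for a, b in zip(s, m)) / temperature for m in motion]
        for s in speech
    ]
    # Speech-to-motion direction: classify the matching motion for each speech clip.
    s2m = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Motion-to-speech direction: classify the matching speech for each motion clip.
    m2s = -sum(
        log_softmax_at([logits[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (s2m + m2s)
```

With matching pairs the loss approaches zero, while swapping the motion embeddings so that each speech clip is paired with the wrong motion drives it up, which is the signal that pulls paired speech and motion together in the embedding space.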

On the comprehension side, the thesis introduces multimodal conversational datasets recorded in virtual reality and with wearable AR sensors, combining full-body motion, gaze, speech, and 3D scene context. Experiments show that state-of-the-art vision–language models fail on conversational references not for lack of perceptual capability, but because they cannot determine what is being referred to from underspecified language. A rewrite-based decoupling experiment isolates this bottleneck: once the referent is explicitly described, even simple detectors localize it accurately.

A central finding across both threads is that semantic reasoning, what is being communicated, and spatial reasoning, where it is directed, benefit from separate architectural treatment. On the production side, audio conditioning drives gesture timing while spatial targets determine direction; on the comprehension side, linguistic reasoning identifies the referent while visual perception localizes it. In both cases, architectures that maintain this separation outperform those that conflate heterogeneous signals into a shared representation. A shared data infrastructure, built incrementally across the papers, makes this parallel empirically testable: the same referential annotations that define conditioning targets for generation also define evaluation targets for grounding.
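
To make the spatial side of this separation concrete, one simple way to score whether a gesture is directed at its target is the angle between the pointing ray and the direction from the ray origin to the target. The record does not say which metric the papers use, so the sketch below is an assumption for illustration only; `angular_error_deg` and its arguments are hypothetical names.

```python
import math


def angular_error_deg(origin, direction, target):
    """Angle in degrees between a pointing ray and the origin-to-target direction.

    origin:    3D point where the ray starts (for example a fingertip or wrist)
    direction: 3D vector along which the gesture points (need not be unit length)
    target:    3D position of the referent in the same coordinate frame
    Assumption: an illustrative accuracy measure, not the thesis's stated metric.
    """
    to_target = [t - o for t, o in zip(target, origin)]
    dot = sum(d * t for d, t in zip(direction, to_target))
    norm_dir = math.sqrt(sum(d * d for d in direction))
    norm_tgt = math.sqrt(sum(t * t for t in to_target))
    if norm_dir == 0.0 or norm_tgt == 0.0:
        raise ValueError("direction and origin-to-target vector must be non-zero")
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_dir * norm_tgt)))
    return math.degrees(math.acos(cos_angle))
```

A measure like this depends only on where the gesture is directed, not on what is being said, which mirrors the split between spatial and semantic signals described above.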

The thesis contributes methods, datasets, benchmarks, and evaluation protocols that support a unified view of spatially grounded communication in embodied agents, where producing and interpreting meaning are coordinated processes grounded in language, body, and shared physical space.

Abstract [sv]

When a person says "put that over there" while pointing toward a shelf, the meaning depends on the spatial relationship between speaker, listener, and shared physical surroundings. Embodied agents that take part in such interactions must both produce spatially grounded gestures and interpret multimodal references. Despite this, these abilities have largely been studied in isolation, with separate data, methods, and evaluation paradigms. This thesis argues that gesture generation and referential grounding are two sides of the same communicative process, and that studying them together exposes structure that neither subfield captures on its own. The argument is developed through seven papers. On the production side, contrastive speech-motion pretraining enables semantically aware generation of co-speech gestures, while reinforcement learning with adversarial motion priors produces pointing gestures that are both spatially precise and motorically natural and that outperform supervised baselines in a perceptual identification study. A flow-matching architecture further combines semantic and spatial conditioning within a single generative system through distinct signal pathways. On the comprehension side, the thesis introduces multimodal conversational datasets recorded in virtual reality, which combine full-body motion, gaze direction, speech, and 3D scene context. Experiments show that leading vision-language models fail on conversational references not because of insufficient perceptual ability, but because they cannot determine what is being referred to from underspecified language. A rewrite-based decoupling experiment isolates this bottleneck: when the referent is described explicitly, even simple detectors localize it correctly. A central result running through both threads is that semantic reasoning, what is communicated, and spatial reasoning, where it is directed, benefit from separate architectural treatment.
On the production side, audio conditioning drives the timing of the gestures while spatial targets determine their direction; on the comprehension side, linguistic reasoning identifies the referent while visual perception localizes it. In both cases, architectures that maintain this separation outperform those that merge heterogeneous signals into a shared representation. A shared data infrastructure, built incrementally through the papers, makes this parallel empirically testable: the same referential annotations that define conditioning targets for generation also define evaluation targets for grounding. The thesis contributes methods, datasets, benchmarks, and evaluation protocols that support a unified view of spatially grounded communication in embodied agents, where producing and interpreting meaning are coordinated processes grounded in language, body, and shared physical space.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2026. p. xxi, 103
Series
TRITA-EECS-AVL ; 2026:60
Keywords
embodied agents, multimodal machine learning, diffusion models, multimodal communication, referential grounding
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-382200 (URN)
978-91-8106-641-8 (ISBN)
Public defence
2026-06-15, F3, Lindstedtvägen 26, Stockholm, 10:00 (English)
Opponent
Supervisors
Funder
Digital Futures
Note

QC 20260525

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-25 Bibliographically approved

Open Access in DiVA

fulltext (22812 kB), 26 downloads
File information
File name: FULLTEXT01.pdf
File size: 22812 kB
Checksum (SHA-512): 43e42b0303bba73c30b69c997d638cb1e37647c34ead91e3df824f68cdfa0cad7bd2a07a76d06e200bad03d0e0246abe62dae8d06a591a10b68a6ded590ee2ee
Type: fulltext
Mimetype: application/pdf

Other links

arXiv

Authority records

Deichler, Anna; O'Regan, Jim; Beskow, Jonas

Search in DiVA

By author/editor
Deichler, Anna; O'Regan, Jim; David, Johansson; Beskow, Jonas
By organisation
Speech, Music and Hearing; KTH
Computer and Information Sciences
