Spatially Grounded Communication in Embodied Agents: From Gesture Generation to Referential Understanding
Deichler, Anna
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing. ORCID iD: 0000-0003-3135-5683
2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

When a person says "put that over there" while pointing at a shelf, the meaning depends on the spatial relationship between speaker, listener, and shared physical scene. Embodied agents that participate in such interactions must both produce spatially grounded gestures and interpret multimodal references. Yet these capabilities have largely been studied in isolation, with separate data, methods, and evaluation paradigms.

This thesis argues that gesture generation and referential grounding are two sides of the same communicative process, and that studying them jointly reveals structure that neither subfield surfaces alone. The argument is developed across seven papers. On the production side, contrastive speech-motion pretraining enables semantically aware co-speech gesture generation, while reinforcement learning with adversarial motion priors produces pointing gestures that are both spatially accurate and motorically natural, outperforming supervised baselines in a human referential identification study. A flow-matching architecture further combines semantic and spatial conditioning within a single generative system through distinct pathways.

On the comprehension side, the thesis introduces multimodal conversational datasets recorded in virtual reality and with wearable AR sensors, combining full-body motion, gaze, speech, and 3D scene context. Experiments show that state-of-the-art vision–language models fail on conversational references not for lack of perceptual capability, but because they cannot determine what is being referred to from underspecified language. A rewrite-based decoupling experiment isolates this bottleneck: once the referent is explicitly described, even simple detectors localize it accurately.

A central finding across both threads is that semantic reasoning (what is being communicated) and spatial reasoning (where it is directed) benefit from separate architectural treatment. On the production side, audio conditioning drives gesture timing while spatial targets determine direction; on the comprehension side, linguistic reasoning identifies the referent while visual perception localizes it. In both cases, architectures that maintain this separation outperform those that conflate heterogeneous signals into a shared representation. A shared data infrastructure, built incrementally across the papers, makes this parallel empirically testable: the same referential annotations that define conditioning targets for generation also define evaluation targets for grounding.

The thesis contributes methods, datasets, benchmarks, and evaluation protocols that support a unified view of spatially grounded communication in embodied agents, where producing and interpreting meaning are coordinated processes grounded in language, body, and shared physical space.

Abstract [sv], English translation

When a person says "put that over there" while pointing at a shelf, the meaning depends on the spatial relationship between speaker, listener, and shared physical surroundings. Embodied agents that take part in such interactions must both produce spatially grounded gestures and interpret multimodal references. Despite this, these capabilities have largely been studied in isolation, with separate data, methods, and evaluation paradigms. This thesis argues that gesture generation and referential grounding are two sides of the same communicative process, and that studying them together exposes structure that neither subfield captures on its own. The argument is developed through seven papers. On the production side, contrastive speech-motion pretraining enables semantically aware generation of co-speech gestures, while reinforcement learning with adversarial motion priors produces pointing gestures that are both spatially precise and motorically natural and that outperform supervised baselines in a perceptual identification study. A flow-matching architecture further combines semantic and spatial conditioning within a single generative system through distinct signal pathways. On the comprehension side, the thesis introduces multimodal conversational datasets recorded in virtual reality, which combine full-body motion, gaze direction, speech, and 3D scene context. Experiments show that leading vision-language models fail on conversational references not because of a lack of perceptual capability, but because they cannot determine what is being referred to from underspecified language. A rewrite-based decoupling experiment isolates this bottleneck: when the referent is described explicitly, even simple detectors localize it correctly. A central result running through both threads is that semantic reasoning (what is being communicated) and spatial reasoning (where it is directed) benefit from separate architectural treatment. On the production side, audio conditioning governs gesture timing while spatial targets determine direction; on the comprehension side, linguistic reasoning identifies the referent while visual perception localizes it. In both cases, architectures that maintain this separation outperform those that merge heterogeneous signals into a shared representation. A common data infrastructure, built up incrementally through the papers, makes this parallel empirically testable: the same referential annotations that define conditioning targets for generation also define evaluation targets for grounding. The thesis contributes methods, datasets, benchmarks, and evaluation protocols that support a unified view of spatially grounded communication in embodied agents, in which producing and interpreting meaning are coordinated processes grounded in language, body, and shared physical space.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2026, p. xxi, 103
Series
TRITA-EECS-AVL ; 2026:60
Keywords [en]
embodied agents, multimodal machine learning, diffusion models, multimodal communication, referential grounding
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-382200; ISBN: 978-91-8106-641-8 (print); OAI: oai:DiVA.org:kth-382200; DiVA, id: diva2:2062127
Public defence
2026-06-15, F3, Lindstedtvägen 26, Stockholm, 10:00 (English)
Funder
Digital Futures
Note

QC 20260525

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-25 Bibliographically approved
List of papers
1. Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
2023 (English) In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023, Association for Computing Machinery (ACM), 2023, p. 755-762. Conference paper, Published paper (Refereed)
Abstract [en]

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach to producing human-like co-speech gestures that carry semantic meaning in agents.
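
To make the idea of a joint speech-gesture embedding concrete, the sketch below computes a symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. This is only an illustration under stated assumptions: the abstract does not specify the loss used by the CSMP module, so the CLIP-style formulation, the function names, and the temperature value are all hypothetical choices rather than the paper's method.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true pair."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def symmetric_contrastive_loss(speech, motion, temperature=0.07):
    """Symmetric InfoNCE loss; speech[i] and motion[i] form a matching pair.

    Assumption: a CLIP-style objective with a fixed temperature. The source
    abstract only states that a joint embedding is learned contrastively.
    """
    if len(speech) != len(motion) or not speech:
        raise ValueError("need equally sized, non-empty batches")
    s = [l2_normalize(v) for v in speech]
    g = [l2_normalize(v) for v in motion]
    # sim[i][j]: temperature-scaled cosine similarity of speech i and motion j
    sim = [[sum(a * b for a, b in zip(si, gj)) / temperature for gj in g]
           for si in s]
    n = len(s)
    # Speech-to-motion direction: each row should peak on its diagonal entry.
    speech_to_motion = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Motion-to-speech direction: the same check applied to each column.
    motion_to_speech = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (speech_to_motion + motion_to_speech)
```

With toy two-dimensional embeddings, matching pairs such as `[[1, 0], [0, 1]]` against `[[1, 0], [0, 1]]` give a loss near zero, while swapping the motion embeddings gives a large loss, which is the behaviour a contrastive objective is meant to reward.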

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
gesture generation, motion synthesis, diffusion models, contrastive pre-training, semantic gestures
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-343773 (URN); 10.1145/3577190.3616117 (DOI); 001147764700093 (); 2-s2.0-85170496681 (Scopus ID)
Conference
25th International Conference on Multimodal Interaction (ICMI), October 9-13, 2023, Sorbonne University, Paris, France
Note

Part of proceedings ISBN 979-8-4007-0055-2

QC 20240222

Available from: 2024-02-22 Created: 2024-02-22 Last updated: 2026-05-25 Bibliographically approved
2. Learning to generate pointing gestures in situated embodied conversational agents
2023 (English) In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1110534. Article in journal (Refereed) Published
Abstract [en]

One of the main goals of robotics and intelligent agent research is to enable robots and agents to communicate with humans in physically situated settings. Human communication consists of both verbal and non-verbal modes. Recent studies on enabling communication for intelligent agents have focused on verbal modes, i.e., language and speech. However, in a situated setting the non-verbal mode is crucial for an agent to adopt flexible communication strategies. In this work, we focus on learning to generate non-verbal communicative expressions in situated embodied interactive agents. Specifically, we show that an agent can learn pointing gestures in a physically simulated environment through a combination of imitation and reinforcement learning that achieves high motion naturalness and high referential accuracy. We compared our proposed system against several baselines in both subjective and objective evaluations. The subjective evaluation is done in a virtual reality setting where an embodied referential game is played between the user and the agent in a shared 3D space, a setup that fully assesses the communicative capabilities of the generated gestures. The evaluations show that our model achieves a higher level of referential accuracy and motion naturalness than a state-of-the-art supervised learning motion synthesis model, showing the promise of our proposed system, which combines imitation and reinforcement learning to generate communicative gestures. Additionally, our system is robust in a physically simulated environment and thus has the potential to be applied to robots.

Place, publisher, year, edition, pages
Frontiers Media SA, 2023
Keywords
reinforcement learning, imitation learning, non-verbal communication, embodied interactive agents, gesture generation, physics-aware machine learning
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-326625 (URN); 10.3389/frobt.2023.1110534 (DOI); 000970385800001 (); 37064574 (PubMedID); 2-s2.0-85153351800 (Scopus ID)
Note

QC 20230508

Available from: 2023-05-08 Created: 2023-05-08 Last updated: 2026-05-25 Bibliographically approved
3. Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents
2024 (English) In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, Association for Computing Machinery (ACM), 2024, article id 42. Conference paper, Published paper (Refereed)
Abstract [en]

This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents’ non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Co-speech gesture, Deictic gestures, Gesture generation, Situated virtual agents, Synthetic data
National Category
Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-359256 (URN); 10.1145/3652988.3673936 (DOI); 001441957400042 (); 2-s2.0-85215524347 (Scopus ID)
Conference
24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024
Note

Part of ISBN 9798400706257

QC 20250203

Available from: 2025-01-29 Created: 2025-01-29 Last updated: 2026-05-25 Bibliographically approved
4. Grounded Gesture Generation: Language, Motion and Space
2025 (English) Article, review/survey (Other academic) Accepted
Abstract [en]

Human motion generation has advanced rapidly in recent years, yet the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation, such as locomotion and object interaction, or in isolated co-speech gesture synthesis aligned with utterance semantics. However, both lines of work often treat motion and environmental grounding separately, limiting advances toward embodied, communicative agents. To address this gap, our work introduces a multimodal dataset and framework for grounded gesture generation, combining two key resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset capturing two-party dialogues. Together, they provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Our framework further connects to a physics-based simulator, enabling synthetic data generation and situated evaluation. By bridging gesture modeling and spatial grounding, our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction.

National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-382213 (URN)
Conference
Workshop on Humanoid Agents (HSI), IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025, Nashville, USA, Jun 11, 2025
Note

QC 20260525

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-25 Bibliographically approved
5. A Benchmark for Scene-Aware Referential Gesture Generation
2026 (English) Manuscript (preprint) (Other academic)
Abstract [en]

Referential gestures (pointing, indicating, and orienting the body toward objects in shared space) are fundamental to embodied communication. For virtual agents and physical robots operating in human environments, the ability to generate spatially grounded gestures is essential for disambiguation, instruction, and collaborative interaction. Yet research on communicative gesture generation has largely focused on co-speech beat and iconic gestures, trained on corpora in which spatial grounding is absent or incidental. This lack of active research on referential gestures can be attributed to three key factors: datasets that pair gestures with 3D scene context are scarce, referential gesture generation lacks a task formulation, and metrics for evaluating spatial grounding do not exist. In this work, we address all three gaps by introducing the MM-Conv Referential Gesture Generation Challenge. Specifically, the benchmark consists of three components: (i) a paired data release of 3,000 pointing-annotated clips from MM-Conv and SGS-HSI, with pointing-quality annotations and scene-disjoint splits; (ii) a task formulation that requires systems to produce spatially grounded reference gestures aligned with speech, without oracle apex timing or motion templates; (iii) a spatio-temporal evaluation protocol decomposing referential gesture quality into temporal alignment, spatial accuracy, and referent recall. We present a modular baseline based on OmniControl and position the benchmark as the foundation for the scene-aware gesture generation challenge at the 1st Workshop on Human–Scene Interaction at ECCV 2026. We envision this challenge as a testbed for the next generation of referential gesture synthesis work.
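
As a rough illustration of what spatial accuracy and referent recall could measure, the sketch below scores a pointing gesture by the angle between the pointing ray and the direction to the target, and counts a clip as a hit when the angularly closest object is the intended referent. The abstract does not define these metrics, so the definitions, the function names, and the clip dictionary layout (`origin`, `direction`, `objects`, `referent`) are assumptions made for this example only, not the benchmark's protocol.

```python
import math


def angular_error_deg(origin, direction, target):
    """Angle in degrees between a pointing ray and the line from origin to target."""
    to_target = [t - o for t, o in zip(target, origin)]
    dot = sum(d * t for d, t in zip(direction, to_target))
    norm = (math.sqrt(sum(d * d for d in direction))
            * math.sqrt(sum(t * t for t in to_target)))
    if norm == 0:
        raise ValueError("zero-length direction, or target located at origin")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def resolved_referent(origin, direction, objects):
    """Name of the object whose centre is angularly closest to the ray.

    `objects` maps a (hypothetical) object name to its 3D centre position.
    """
    return min(objects,
               key=lambda name: angular_error_deg(origin, direction, objects[name]))


def referent_recall(clips):
    """Fraction of clips whose resolved referent equals the intended one."""
    if not clips:
        raise ValueError("need at least one clip")
    hits = sum(
        resolved_referent(c["origin"], c["direction"], c["objects"]) == c["referent"]
        for c in clips
    )
    return hits / len(clips)
```

A temporal-alignment term, for example the time offset between the gesture apex and the referring word, would be a separate measurement and is left out of this sketch.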

Keywords
gesture generation, scene conditioning, referential gestures, deictic communication, flow matching, spatial grounding, embodied agents, benchmark
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-382215 (URN)
Note

Submitted to the Workshop on Human-Scene Interaction (HSI) at the European Conference on Computer Vision (ECCV) 2026, Sep 2026

QC 20260525

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-25 Bibliographically approved
6. MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
2026 (English) In: Language Resources and Evaluation Conference (LREC), ISSN 2522-2686, p. 9240-9253. Article, review/survey (Refereed) Published
Abstract [en]

Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing MM-Conv—speak, point, look—a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types, enabling systematic evaluation of multimodal reference resolution.

Place, publisher, year, edition, pages
Palma, Spain: ELDA (Evaluations and Language resources Distribution Agency), 2026
Keywords
vision-language models, referential communication, multimodal grounding, egocentric dialogue, embodied AI
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-382212 (URN)10.63317/37fzwjphsb9y (DOI)
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026), Palma, Mallorca, Spain, 11–16 May 2026
Note

QC 20260526

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-26 Bibliographically approved
7. Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
2025 (English) Article, review/survey (Other academic) Accepted
Abstract [en]

We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.

Place, publisher, year, edition, pages
San Diego, 2025
Keywords
egocentric vision, referential communication, multimodal dataset, gaze-speech synchrony, embodied agents, spatial grounding, augmented reality
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-382214 (URN)
Conference
NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE), San Diego Convention Center, San Diego, USA, Dec 7th 2025
Note

QC 20260525

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-25 Bibliographically approved

Authority records

Deichler, Anna
