A Benchmark for Scene-Aware Referential Gesture Generation
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing. ORCID iD: 0000-0003-3135-5683
Max Planck Institute for Informatics, Saarland Informatics Campus. ORCID iD: 0009-0004-1245-4146
KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning. ORCID iD: 0000-0002-1733-7019
Max Planck Institute for Informatics, Saarland Informatics Campus. ORCID iD: 0000-0001-5361-8806
2026 (English) Manuscript (preprint) (Other academic)
Abstract [en]

Referential gestures (pointing, indicating, and orienting the body toward objects in shared space) are fundamental to embodied communication. For virtual agents and physical robots operating in human environments, the ability to generate spatially grounded gestures is essential for disambiguation, instruction, and collaborative interaction. Yet research on communicative gesture generation has largely focused on co-speech beat and iconic gestures, trained on corpora in which spatial grounding is absent or incidental. This lack of active research on referential gestures can be attributed to three key factors: datasets that pair gestures with 3D scene context are scarce, referential gesture generation lacks a task formulation, and metrics for evaluating spatial grounding do not exist. In this work, we address all three gaps by introducing the MM-Conv Referential Gesture Generation Challenge. Specifically, the benchmark consists of three components: (i) a paired data release of 3,000 pointing-annotated clips from MM-Conv and SGS-HSI, with pointing-quality annotations and scene-disjoint splits; (ii) a task formulation that requires systems to produce spatially grounded reference gestures aligned with speech, without oracle apex timing or motion templates; (iii) a spatio-temporal evaluation protocol that decomposes referential gesture quality into temporal alignment, spatial accuracy, and referent recall. We present a modular baseline based on OmniControl and position the benchmark as the foundation for the scene-aware gesture generation challenge at the 1st Workshop on Human–Scene Interaction at ECCV 2026. We envision this challenge as a testbed for the next generation of referential gesture synthesis research.

Place, publisher, year, edition, pages
2026.
Keywords [en]
gesture generation, scene conditioning, referential gestures, deictic communication, flow matching, spatial grounding, embodied agents, benchmark
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-382215; OAI: oai:DiVA.org:kth-382215; DiVA, id: diva2:2062279
Note

Submitted to the Workshop on Human-Scene Interaction (HSI) at the European Conference on Computer Vision (ECCV) 2026, Sep 2026

QC 20260525

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-25 Bibliographically approved
In thesis
1. Spatially Grounded Communication in Embodied Agents: From Gesture Generation to Referential Understanding
2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

When a person says "put that over there" while pointing at a shelf, the meaning depends on the spatial relationship between speaker, listener, and shared physical scene. Embodied agents that participate in such interactions must both produce spatially grounded gestures and interpret multimodal references. Yet these capabilities have largely been studied in isolation, with separate data, methods, and evaluation paradigms.

This thesis argues that gesture generation and referential grounding are two sides of the same communicative process, and that studying them jointly reveals structure that neither subfield surfaces alone. The argument is developed across seven papers. On the production side, contrastive speech-motion pretraining enables semantically aware co-speech gesture generation, while reinforcement learning with adversarial motion priors produces pointing gestures that are both spatially accurate and motorically natural, outperforming supervised baselines in a human referential identification study. A flow-matching architecture further combines semantic and spatial conditioning within a single generative system through distinct pathways.

On the comprehension side, the thesis introduces multimodal conversational datasets recorded in virtual reality and with wearable AR sensors, combining full-body motion, gaze, speech, and 3D scene context. Experiments show that state-of-the-art vision–language models fail on conversational references not for lack of perceptual capability, but because they cannot determine what is being referred to from underspecified language. A rewrite-based decoupling experiment isolates this bottleneck: once the referent is explicitly described, even simple detectors localize it accurately.

A central finding across both threads is that semantic reasoning (what is being communicated) and spatial reasoning (where it is directed) benefit from separate architectural treatment. On the production side, audio conditioning drives gesture timing while spatial targets determine direction; on the comprehension side, linguistic reasoning identifies the referent while visual perception localizes it. In both cases, architectures that maintain this separation outperform those that conflate heterogeneous signals into a shared representation. A shared data infrastructure, built incrementally across the papers, makes this parallel empirically testable: the same referential annotations that define conditioning targets for generation also define evaluation targets for grounding.

The thesis contributes methods, datasets, benchmarks, and evaluation protocols that support a unified view of spatially grounded communication in embodied agents, where producing and interpreting meaning are coordinated processes grounded in language, body, and shared physical space.

Abstract [sv, English translation]

When a person says "put that over there" while pointing at a shelf, the meaning depends on the spatial relationship between speaker, listener, and shared physical surroundings. Embodied agents that take part in such interactions must both produce spatially grounded gestures and interpret multimodal references. Despite this, these capabilities have largely been studied in isolation, with separate data, methods, and evaluation paradigms.

This thesis argues that gesture generation and referential grounding are two sides of the same communicative process, and that studying them together exposes structure that neither subfield captures on its own. The argument is developed through seven papers. On the production side, contrastive speech-motion pretraining enables semantically aware generation of co-speech gestures, while reinforcement learning with adversarial motion priors produces pointing gestures that are both spatially precise and motorically natural, and that outperform supervised baselines in a perceptual identification study. A flow-matching architecture further combines semantic and spatial conditioning within a single generative system through distinct signal pathways.

On the comprehension side, the thesis introduces multimodal conversational datasets recorded in virtual reality, which combine full-body motion, gaze direction, speech, and 3D scene context. Experiments show that leading vision-language models fail on conversational references not because of insufficient perceptual capability, but because they cannot determine what is being referred to from underspecified language. A rewrite-based decoupling experiment isolates this bottleneck: when the referent is described explicitly, even simple detectors localize it correctly.

A central result running through both threads is that semantic reasoning (what is being communicated) and spatial reasoning (where it is directed) benefit from separate architectural treatment. On the production side, audio conditioning governs the timing of gestures while spatial targets determine their direction; on the comprehension side, linguistic reasoning identifies the referent while visual perception localizes it. In both cases, architectures that maintain this separation outperform those that merge heterogeneous signals into a shared representation. A common data infrastructure, built up incrementally through the papers, makes this parallel empirically testable: the same referential annotations that define conditioning targets for generation also define evaluation targets for grounding.

The thesis contributes methods, datasets, benchmarks, and evaluation protocols that support a unified view of spatially grounded communication in embodied agents, in which producing and interpreting meaning are coordinated processes grounded in language, body, and shared physical space.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2026. p. xxi, 103
Series
TRITA-EECS-AVL ; 2026:60
Keywords
embodied agents, multimodal machine learning, diffusion models, multimodal communication, referential grounding
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-382200 (URN); 978-91-8106-641-8 (ISBN)
Public defence
2026-06-15, F3, Lindstedtvägen 26, Stockholm, 10:00 (English)
Opponent
Supervisors
Funder
Digital Futures
Note

QC 20260525

Available from: 2026-05-25 Created: 2026-05-25 Last updated: 2026-05-25 Bibliographically approved

Open Access in DiVA

fulltext (464 kB), 32 downloads
File information
File name: FULLTEXT01.pdf
File size: 464 kB
Checksum (SHA-512): 8734e5a0a90839931b9a69f67ecf0f59a31caaad8948bf67472efb6edda55ca2afcb3877b20e9ce8df9d2efef908a5dd9078ce2f27abab733885291171fc35c7
Type: fulltext
Mimetype: application/pdf
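
The SHA-512 checksum listed above can be used to confirm that a downloaded copy of the full text is intact. The following is a minimal sketch, assuming the file has been saved locally as FULLTEXT01.pdf in the working directory; the function names `sha512_of_file` and `matches_record` and the local path are illustrative assumptions, not part of the record.

```python
import hashlib
from pathlib import Path

# SHA-512 checksum copied from the file information in this record.
EXPECTED_SHA512 = (
    "8734e5a0a90839931b9a69f67ecf0f59a31caaad8948bf67472efb6edda55ca2"
    "afcb3877b20e9ce8df9d2efef908a5dd9078ce2f27abab733885291171fc35c7"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_SHA512):
    """Compare a file's digest with the expected checksum (case-insensitive)."""
    return sha512_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical download location; adjust to wherever the PDF was saved.
    local_copy = Path("FULLTEXT01.pdf")
    if local_copy.exists():
        print("checksum OK" if matches_record(local_copy) else "checksum MISMATCH")
    else:
        print(f"{local_copy} not found; download the full text first")
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for a 464 kB PDF but makes the same helper usable for larger files.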

Authority records

Deichler, Anna; Dogan, Fethiye Irmak; Beskow, Jonas

Search in DiVA

By author/editor
Deichler, Anna; Dabral, Rishabh; Dogan, Fethiye Irmak; Ghosh, Anindita; Beskow, Jonas
By organisation
Speech, Music and Hearing; Robotics, Perception and Learning
Computer and Information Sciences
