Speech-to-Joy: Self-Supervised Features for Enjoyment Prediction in Human-Robot Conversation
Santana, Ricardo: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0009-0002-1152-6457
Irfan, Bahar: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-7983-079X
Lagerstedt, Erik: Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, Gothenburg, Sweden. ORCID iD: 0000-0002-8937-8063
Skantze, Gabriel: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-8579-1790
Show others and affiliations
2025 (English). In: Proceedings of the 27th International Conference on Multimodal Interaction, ICMI 2025 / [ed] Ram Subramanian, Yukiko I. Nakano, Tom Gedeon, Mohan Kankanhalli, Tanaya Guha, Jainendra Shukla, Gelareh Mohammadi, Oya Celiktutan. Association for Computing Machinery (ACM), 2025, p. 238-248. Conference paper, Published paper (Refereed)
Abstract [en]

Conversational systems that interact or collaborate with people must understand not only task success but also the quality of human experience. We present Speech-to-Joy, a lightweight framework that learns to predict users’ own post-interaction enjoyment ratings using latent embeddings from audio and text modalities. Evaluated on a corpus of human-robot dialogues, the model’s predicted enjoyment correlates strongly and significantly with user self-reports, outperforming both an experienced HRI annotator and heavier LLM-based uni- and multimodal baselines. Notably, even the unimodal audio branch, using only frozen speech embeddings, surpasses all baselines, and a late fusion of text and audio achieves the highest performance. Designed for real-time inference on resource-limited platforms, Speech-to-Joy replaces ad-hoc emotion heuristics with a direct and user-centered measure of enjoyment. This work paves the way for optimizing interactions with robots and other conversational systems through the lens that matters most: the user’s own experience.
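This record contains no code, but the architecture the abstract describes (small trainable heads over frozen audio and text embeddings, combined by late fusion into a scalar enjoyment prediction) can be sketched compactly. The following is an illustration only: the class name, embedding dimensions, hidden size, and the simple concatenation-plus-linear fusion head are assumptions for the sake of the sketch, not the authors' Speech-to-Joy implementation.

```python
# Minimal sketch of late fusion over frozen audio and text embeddings.
# NOTE: illustrative only. Dimensions, layer sizes, and names are
# hypothetical; this is not the published Speech-to-Joy model.
import torch
import torch.nn as nn


class LateFusionEnjoymentRegressor(nn.Module):
    def __init__(self, audio_dim: int = 768, text_dim: int = 768, hidden: int = 128):
        super().__init__()
        # One small head per modality. The pretrained encoders that produce
        # the embeddings stay frozen and live outside this module.
        self.audio_head = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.text_head = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Late fusion: concatenate the per-modality representations,
        # then regress a single scalar enjoyment score.
        self.fusion = nn.Linear(2 * hidden, 1)

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        a = self.audio_head(audio_emb)  # (batch, hidden)
        t = self.text_head(text_emb)    # (batch, hidden)
        return self.fusion(torch.cat([a, t], dim=-1)).squeeze(-1)


# Usage with placeholder embeddings. In practice these would come from
# frozen pretrained speech/text encoders, pooled over the interaction.
model = LateFusionEnjoymentRegressor()
audio_emb = torch.randn(4, 768)
text_emb = torch.randn(4, 768)
scores = model(audio_emb, text_emb)  # predicted enjoyment, shape (4,)
```

Because the encoders are frozen, only the small heads and the fusion layer are trained, which is consistent with the abstract's emphasis on lightweight, real-time inference on resource-limited platforms.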

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025. p. 238-248
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-374889
DOI: 10.1145/3716553.3750747
Scopus ID: 2-s2.0-105022238812
OAI: oai:DiVA.org:kth-374889
DiVA, id: diva2:2025314
Conference
The 27th International Conference on Multimodal Interaction, ICMI 2025, Canberra, Australia, October 13-17, 2025
Note

Part of ISBN 979-8-4007-1499-3

QC 20260107

Available from: 2026-01-06. Created: 2026-01-06. Last updated: 2026-01-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Santana, Ricardo; Irfan, Bahar; Skantze, Gabriel; Abelho Pereira, André Tiago

Search in DiVA

By author/editor
Santana, Ricardo; Irfan, Bahar; Lagerstedt, Erik; Skantze, Gabriel; Abelho Pereira, André Tiago
By organisation
Speech, Music and Hearing, TMH; Speech Communication and Technology
In the same subject
Computer and Information Sciences

Search outside of DiVA

Google
Google Scholar
