Conversational systems that interact or collaborate with people must understand not only task success but also the quality of human experience. We present Speech-to-Joy, a lightweight framework that learns to predict users’ own post-interaction enjoyment ratings using latent embeddings from audio and text modalities. Evaluated on a corpus of human-robot dialogues, the model’s predicted enjoyment correlates strongly and significantly with user self-reports, outperforming both an experienced HRI annotator and heavier LLM-based uni- and multimodal baselines. Notably, even the unimodal audio branch, which uses only frozen speech embeddings, surpasses all baselines, and late fusion of text and audio achieves the highest performance. Designed for real-time inference on resource-limited platforms, Speech-to-Joy replaces ad-hoc emotion heuristics with a direct and user-centered measure of enjoyment. This work paves the way for optimizing interactions with robots and other conversational systems through the lens that matters most: the user’s own experience.
Part of ISBN 979-8-4007-1499-3