kth.se Publications
1 - 50 of 1494
  • 1.
    Borg, Alexander
    et al.
    Karolinska Institutet, Stockholm, Sweden.
    Parodis, Ioannis
    Karolinska Institutet, Stockholm, Sweden.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Creating Virtual Patients using Robots and Large Language Models: A Preliminary Study with Medical Students (2024). In: HRI 2024 Companion - Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2024, p. 273-277. Conference paper (Refereed)
    Abstract [en]

    This paper presents a virtual patient (VP) platform for medical education, combining a social robot, Furhat, with large language models (LLMs). Aimed at enhancing clinical reasoning (CR) training, particularly in rheumatology, this approach introduces more interactive and realistic patient simulations. The use of LLMs both for driving the dialogue and for expressing emotions in the robot's face, as well as for automatic analysis and generation of feedback to the student, is discussed. The platform's effectiveness was tested in a pilot study with 15 medical students, comparing it against a traditional semi-linear VP platform. The evaluation indicates a preference for the robot platform in terms of authenticity and learning effect. We conclude that this novel integration of a social robot and LLMs in VP simulations shows potential in medical education, offering a more engaging learning experience.

  • 2.
    Sundberg, Johan
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Salomão, Gláucia Laís
    Stockholm University Brain Imaging Centre (SUBIC), Department of Linguistics, Stockholm University, Stockholm, Sweden.
    Scherer, Klaus R.
    Department of Psychology, University of Geneva, Geneva, Switzerland.
    Emotional expressivity in singing: Assessing physiological and acoustic indicators of two opera singers' voice characteristics (2024). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 155, no 1, p. 18-28. Article in journal (Refereed)
    Abstract [en]

    In an earlier study, we analyzed how audio signals obtained from three professional opera singers varied when they sang one octave wide eight-tone scales in ten different emotional colors. The results showed systematic variations in voice source and long-term-average spectrum (LTAS) parameters associated with major emotion “families”. For two of the singers, subglottal pressure (PSub) also was recorded, thus allowing analysis of an additional main physiological voice control parameter, glottal resistance (defined as the ratio between PSub and glottal flow), which is related to glottal adduction. In the present study, we analyze voice source and LTAS parameters derived from the audio signal and their correlation with PSub and glottal resistance. The measured parameters showed a systematic relationship with the four emotion families observed in our previous study. They also varied systematically with values of the ten emotions along the valence, power, and arousal dimensions; valence showed a significant correlation with the ratio between acoustic voice source energy and subglottal pressure, while power varied significantly with sound level and two measures related to the spectral dominance of the lowest spectrum partial, the fundamental.
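    Expressed as a formula (editorial notation, not the authors' symbols), the glottal resistance mentioned above is simply the ratio of subglottal pressure to mean glottal airflow; writing P_sub for subglottal pressure and U_g for glottal flow:

        R_g = \frac{P_\mathrm{sub}}{U_g}

    Higher values of R_g thus correspond to a more adducted, more resistive glottal configuration, in line with the abstract's note that glottal resistance is related to glottal adduction.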

  • 3.
    Wolfert, Pieter
    et al.
    Univ Ghent, IDLab Airo, imec, B-9052 Ghent, Belgium; Radboud Univ Nijmegen, Donders Inst Brain Cognit & Behav, NL-6500 HB Nijmegen, Netherlands.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Belpaeme, Tony
    Univ Ghent, IDLab Airo, imec, B-9052 Ghent, Belgium.
    Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour (2024). In: Applied Sciences, E-ISSN 2076-3417, Vol. 14, no 4, article id 1460. Article in journal (Refereed)
    Abstract [en]

    This paper compares three methods for evaluating computer-generated motion behaviour for animated characters: two commonly used direct rating methods and a newly designed questionnaire. The questionnaire is specifically designed to measure the human-likeness, appropriateness, and intelligibility of the generated motion. Furthermore, this study investigates the suitability of these evaluation tools for assessing subtle forms of human behaviour, such as the subdued motion cues shown when listening to someone. This paper reports six user studies, namely studies that directly rate the appropriateness and human-likeness of a computer character's motion, along with studies that instead rely on a questionnaire to measure the quality of the motion. As test data, we used the motion generated by two generative models and recorded human gestures, which served as a gold standard. Our findings indicate that when evaluating gesturing motion, the direct rating of human-likeness and appropriateness is to be preferred over a questionnaire. However, when assessing the subtle motion of a computer character, even the direct rating method yields less conclusive results. Despite demonstrating high internal consistency, our questionnaire proves to be less sensitive than directly rating the quality of the motion. The results provide insights into the evaluation of human motion behaviour and highlight the complexities involved in capturing subtle nuances in nonverbal communication. These findings have implications for the development and improvement of motion generation models and can guide researchers in selecting appropriate evaluation methodologies for specific aspects of human behaviour.

  • 4.
    Ashkenazi, Shaul
    et al.
    University of Glasgow, Glasgow, UK.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Stuart-Smith, Jane
    University of Glasgow, Glasgow, UK.
    Foster, Mary Ellen
    University of Glasgow, Glasgow, UK.
    Goes to the Heart: Speaking the User's Native Language (2024). In: HRI 2024 Companion - Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2024, p. 214-218. Conference paper (Refereed)
    Abstract [en]

    We are developing a social robot to work alongside human support workers who help new arrivals in a country to navigate the necessary bureaucratic processes in that country. The ultimate goal is to develop a robot that can support refugees and asylum seekers in the UK. As a first step, we are targeting a less vulnerable population with similar support needs: international students in the University of Glasgow. As the target users are in a new country and may be in a state of stress when they seek support, forcing them to communicate in a foreign language will only fuel their anxiety, so a crucial aspect of the robot design is that it should speak the users' native language if at all possible. We provide a technical description of the robot hardware and software, and describe the user study that will shortly be carried out. At the end, we explain how we are engaging with refugee support organisations to extend the robot into one that can also support refugees and asylum seekers.

  • 5.
    Irfan, Bahar
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Staffa, Mariacarla
    University of Naples Parthenope, Italy.
    Bobu, Andreea
    Boston Dynamics AI Institute, USA.
    Churamani, Nikhil
    University of Cambridge, UK.
    Lifelong Learning and Personalization in Long-Term Human-Robot Interaction (LEAP-HRI): Open-World Learning (2024). In: HRI 2024 Companion - Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2024, p. 1323-1325. Conference paper (Refereed)
    Abstract [en]

    The complex and largely unstructured nature of real-world situations makes it challenging for conventional closed-world robot learning solutions to adapt to such interaction dynamics. These challenges become particularly pronounced in long-term interactions where robots need to go beyond their past learning to continuously evolve with changing environment settings and personalize towards individual user behaviors. In contrast, open-world learning embraces the complexity and unpredictability of the real world, enabling robots to be “lifelong learners” that continuously acquire new knowledge and navigate novel challenges, making them more context-aware while intuitively engaging the users. Adopting the theme of “open-world learning”, the fourth edition of the “Lifelong Learning and Personalization in Long-Term Human-Robot Interaction (LEAP-HRI)” workshop seeks to bring together interdisciplinary perspectives on real-world applications in human-robot interaction (HRI), including education, rehabilitation, elderly care, service, and companionship. The goal of the workshop is to foster collaboration and understanding across diverse scientific communities through invited keynote presentations and in-depth discussions facilitated by contributed talks, a break-out session, and a debate.

  • 6.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Robots Beyond Borders: The Role of Social Robots in Spoken Second Language Practice (2024). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    This thesis investigates how social robots can support adult second language (L2) learners in improving conversational skills. It recognizes the challenges inherent in adult L2 learning, including increased cognitive demands and the unique motivations driving adult education. While social robots hold potential for natural interactions and language education, research into conversational skill practice with adult learners remains underexplored. Thus, the thesis contributes to understanding these conversational dynamics, enhancing speaking practice, and examining cultural perspectives in this context.

    To begin, this thesis investigates robot-led conversations with L2 learners, examining how learners respond to moments of uncertainty. The research reveals that when faced with uncertainty, learners frequently seek clarification, yet many remain unresponsive. As a result, effective strategies are required from robot conversational partners to address this challenge. These interactions are then used to evaluate the performance of off-the-shelf Automatic Speech Recognition (ASR) systems. The assessment highlights that speech recognition for L2 speakers is not as effective as for L1 speakers, with performance deteriorating for both groups during social conversations. Addressing these challenges is imperative for the successful integration of robots in conversational practice with L2 learners.

    The thesis then explores the potential advantages of employing social robots in collaborative learning environments with multi-party interactions. It delves into strategies for improving speaking practice, including the use of non-verbal behaviors to encourage learners to speak. For instance, a robot's adaptive gazing behavior is used to effectively balance speaking contributions between L1 and L2 pairs of participants. Moreover, an adaptive use of encouraging backchannels significantly increases the speaking time of L2 learners.

    Finally, the thesis highlights the importance of further research on cultural aspects in human-robot interactions. One study reveals distinct responses among various socio-cultural groups in interaction between L1 and L2 participants. For example, factors such as gender, age, extroversion, and familiarity with robots influence conversational engagement of L2 speakers. Additionally, another study investigates preconceptions related to the appearance and accents of nationality-encoded (virtual and physical) social robots. The results indicate that initial perceptions may lead to negative preconceptions, but that these perceptions diminish after actual interactions.

    Despite technical limitations, social robots provide distinct benefits in supporting educational endeavors. This thesis emphasizes the potential of social robots as effective facilitators of spoken language practice for adult learners, advocating for continued exploration at the intersection of language education, human-robot interaction, and technology.

  • 7.
    Axelsson, Agnes
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Vaddadi, Bhavana
    KTH, School of Industrial Engineering and Management (ITM), Centres, Integrated Transport Research Lab, ITRL.
    Bogdan, Cristian M
    KTH, School of Electrical Engineering and Computer Science (EECS), Human Centered Technology, Media Technology and Interaction Design, MID.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Robots in autonomous buses: Who hosts when no human is there? (2024). In: HRI 2024 Companion - Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2024, p. 1278-1280. Conference paper (Refereed)
    Abstract [en]

    In mid-2023, we performed an experiment in autonomous buses in Stockholm, Sweden, to evaluate the role that social robots might have in such settings, and their effects on passengers' feeling of safety and security, given the absence of human drivers or clerks. To address the situations that may occur in autonomous public transit (APT), we compared an embodied agent to a disembodied agent. In this video publication, we showcase some of the things that worked with the interactions we created, and some problematic issues that we had not anticipated.

  • 8.
    Mehta, Shivam
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Department of Industrial and Materials Science, Chalmers University of Technology, Rännvägen 2A, 41296 Göteborg, Sweden.
    Frisk, K.
    Department of Industrial and Materials Science, Chalmers University of Technology, Rännvägen 2A, 41296 Göteborg, Sweden.
    Nyborg, L.
    Department of Industrial and Materials Science, Chalmers University of Technology, Rännvägen 2A, 41296 Göteborg, Sweden.
    Role of Cr in Mn-rich precipitates for Al–Mn–Cr–Zr-based alloys tailored for additive manufacturing (2024). In: Calphad, ISSN 0364-5916, E-ISSN 1873-2984, Vol. 84, article id 102667. Article in journal (Refereed)
    Abstract [en]

    Novel alloy concepts enabled via additive manufacturing processes have opened up the possibility of tailoring properties beyond the scope of conventional casting and powder metallurgy processes. The authors have previously presented a novel Al–Mn–Cr–Zr-based alloy system containing three times the equilibrium amounts of Mn and Zr. The alloys were produced via a powder bed fusion-laser beam (PBF-LB) process taking advantage of rapid cooling and solidification characteristics of the process. This supersaturation can then be leveraged to provide high precipitation hardening via direct ageing heat treatments. The hardening is enabled with Zr-rich and Mn-rich precipitates. Literature study confirms that Mn-rich precipitates have a notable solubility of Cr, for example, the Al12Mn precipitate. This study aims to clarify the effect of Cr solubility in the thermodynamics and kinetics simulation and compare the precipitation simulations with samples subject to >1000 h isothermal heat treatment, thus creating an equilibrium-like state. The results show that Cr addition to the precipitates stabilizes the Al12Mn precipitate while slowing the precipitation kinetics thus producing a favourable hardening response. Such observations could be insightful while designing such alloys and optimising heat treatments of the current or even a future alloy system.

  • 9.
    Cumbal, Ronald
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Speaking Transparently: Social Robots in Educational Settings (2024). In: Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (HRI '24 Companion), March 11-14, 2024, Boulder, CO, USA, 2024. Conference paper (Refereed)
    Abstract [en]

    The recent surge in popularity of Large Language Models, known for their inherent opacity, has increased the interest in fostering transparency in technology designed for human interaction. This concern is equally prevalent in the development of Social Robots, particularly when these are designed to engage in critical areas of our society, such as education or healthcare. In this paper we propose an experiment to investigate how users can be made aware of the automated decision processes when interacting in a discussion with a social robot. Our main objective is to assess the effectiveness of verbal expressions in fostering transparency within groups of individuals as they engage with a robot. We describe the proposed interactive settings, system design, and our approach to enhance the transparency in a robot's decision-making process for multi-party interactions.

  • 10.
    Traum, David
    et al.
    University of Southern California, USA.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Nishizaki, Hiromitsu
    University of Yamanashi, Japan.
    Higashinaka, Ryuichiro
    Nagoya University, Japan.
    Minato, Takashi
    RIKEN/ATR, Japan.
    Nagai, Takayuki
    Osaka University, Japan.
    Special issue on multimodal processing and robotics for dialogue systems (Part II) (2024). In: Advanced Robotics, ISSN 0169-1864, E-ISSN 1568-5535, Vol. 38, no 4, p. 193-194. Article in journal (Other academic)
  • 11.
    Kamelabad, Alireza M.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    The Question Is Not Whether; It Is How! (2024). In: HRI 2024 Companion - Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2024, p. 112-114. Conference paper (Refereed)
    Abstract [en]

    This submission explores the implications of robot embodiment in language learning. Through various innovative studies, it investigates how factors tied to robot usage, such as personality characteristics and learning settings, influence learner outcomes. It incorporates advancements in artificial intelligence by utilizing large language models and further contributes to pivotal understanding through a planned longitudinal study in the migrant context. Lastly, an intensive speech analysis further examines the specifics of human-robot interaction.

  • 12.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Update 3.1 to FonaDyn: A system for real-time analysis of the electroglottogram, over the voice range (2024). In: SoftwareX, E-ISSN 2352-7110, Vol. 26. Article in journal (Refereed)
    Abstract [en]

    The human voice is notoriously variable, and conventional measurement paradigms are weak in terms of providing evidence for effects of treatment and/or training of voices. New methods are needed that can take into account the variability of metrics and types of phonation across the voice range. The “voice map” is a generalization of the Voice Range Profile (a.k.a. the phonetogram), with the potential to be used in many ways, for teaching, training, therapy and research. FonaDyn is intended as a proof-of-concept workbench for education and research on phonation, and for exploring and validating the analysis paradigm of voice-mapping. Version 3.1 of the FonaDyn system adds many new functions, including listening from maps; displaying multiple maps and difference maps to track effects of voice interventions; smoothing/interpolation of voice maps; clustering not only of EGG shapes but also of acoustic and EGG metrics into phonation types; extended multichannel acquisition; 24-bit recording with an optional maximum of 140 dB SPL; a built-in SPL calibration and signal diagnostics tool; EGG noise suppression; more Matlab integration; script control; the acoustic metrics Spectrum Balance, Cepstral Peak Prominence and Harmonic Richness Factor (of the EGG); and better window layout control. Stability and usability are further improved. Apple M-series processors are now supported natively.

  • 13.
    Wang, Siyang
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafsson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS (2023). In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper (Refereed)
    Abstract [en]

    Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
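    To make the SSL-as-representation setup above concrete, the following is a minimal sketch (assuming the Hugging Face transformers and torchaudio libraries; the checkpoint name and input file are illustrative stand-ins, not necessarily what the paper used) of extracting one transformer layer, e.g. the 9th, from a 12-layer wav2vec 2.0 model:

        import torch
        import torchaudio
        from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

        # Illustrative checkpoint: any 12-layer, ASR-finetuned wav2vec 2.0 fits the described setup.
        CKPT = "facebook/wav2vec2-base-960h"
        extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
        model = Wav2Vec2Model.from_pretrained(CKPT).eval()

        wav, sr = torchaudio.load("utterance.wav")                        # hypothetical input file
        wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz

        inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            out = model(inputs.input_values, output_hidden_states=True)

        # hidden_states[0] is the CNN feature projection; indices 1..12 are the transformer layers,
        # so index 9 corresponds to the 9th layer singled out in the abstract.
        layer9 = out.hidden_states[9]                                     # shape: (1, n_frames, 768)
        print(layer9.shape)

    In a two-stage TTS pipeline, frame sequences like layer9 would take the place of the mel-spectrogram as the acoustic model's prediction target, with a separate vocoder trained to map them back to waveforms.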

  • 14.
    Wang, Siyang
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafsson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A comparative study of self-supervised speech representations in read and spontaneous TTS (2023). Manuscript (preprint) (Other academic)
    Abstract [en]

    Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

  • 15.
    Nyatsanga, S.
    et al.
    University of California, Davis, USA.
    Kucherenko, T.
    SEED - Electronic Arts, Stockholm, Sweden.
    Ahuja, C.
    Meta AI, USA.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Neff, M.
    University of California, Davis, USA.
    A Comprehensive Review of Data-Driven Co-Speech Gesture Generation (2023). In: Computer graphics forum (Print), ISSN 0167-7055, E-ISSN 1467-8659, Vol. 42, no 2, p. 569-596. Article in journal (Refereed)
    Abstract [en]

    Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology for creating believable characters in film, games, and virtual social spaces, as well as for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. The field of gesture generation has seen surging interest in the last few years, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text and non-linguistic input. Concurrent with the exposition of deep learning approaches, we chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method (e.g., optical motion capture or pose estimation from video). Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.

  • 16.
    Amerotti, Marco
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Benford, Steve
    University of Nottingham.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Vear, Craig
    University of Nottingham.
    A Live Performance Rule System Informed by Irish Traditional Dance Music (2023). In: Proc. International Symposium on Computer Music Multidisciplinary Research, 2023. Conference paper (Refereed)
    Abstract [en]

    This paper describes ongoing work in programming a live performance system for interpreting melodies in ways that mimic Irish traditional dance music practice, and that allows plug and play human interaction. Existing performance systems are almost exclusively aimed at piano performance and classical music, and none are aimed specifically at traditional music. We develop a rule-based approach using expert knowledge that converts a melody into control parameters to synthesize an expressive MIDI performance, focusing on ornamentation, dynamics and subtle time deviation. Furthermore, we make the system controllable (e.g., via knobs or expression pedals) such that it can be controlled in real time by a musician. Our preliminary evaluations show the system can render expressive performances mimicking traditional practice, and allows for engaging with Irish traditional dance music in new ways. We provide several examples online.

  • 17.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A processing framework to access large quantities of whispered speech found in ASMR (2023). In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece: IEEE Signal Processing Society, 2023. Conference paper (Refereed)
    Abstract [en]

    Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with human-in-the-loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allows to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.

  • 18.
    Sturm, Bob
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Flexer, Arthur
    JKU Linz.
    A Review of Validity and its Relationship to Music Information Research (2023). In: Proc. Int. Symp. Music Information Retrieval, 2023. Conference paper (Refereed)
    Abstract [en]

    Validity is the truth of an inference made from evidence and is a central concern in scientific work. Given the maturity of the domain of music information research (MIR), validity in our opinion should be discussed and considered much more than it has been so far. Puzzling MIR phenomena like adversarial attacks, horses, and performance glass ceilings become less mysterious through the lens of validity. In this paper, we review the subject of validity as presented in a key reference of causal inference: Shadish et al., "Experimental and Quasi-experimental Designs for Generalised Causal Inference". We discuss the four types of validity and threats to each one. We consider them in relationship to MIR experiments grounded with a practical demonstration using a typical MIR experiment. 

  • 19.
    Peña, Paola Raquel
    et al.
    University College Dublin, Dublin 4, Ireland.
    Doyle, Philip R.
    University College Dublin, Dublin 4, Ireland.
    Ip, Emily Yj
    Trinity College Dublin, Dublin, Ireland.
    Di Liberto, Giovanni
    Trinity College Dublin, Dublin, Ireland.
    Higgins, Darragh
    Trinity College Dublin, Dublin, Ireland.
    McDonnell, Rachel
    Trinity College Dublin, Dublin, Ireland.
    Branigan, Holly
    University of Edinburgh, Edinburgh, United Kingdom.
    Gustafsson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    McMillan, Donald
    Stockholm University, Stockholm, Sweden.
    Moore, Robert J.
    IBM Research-Almaden Lab, San Jose, United States of America.
    Cowan, Benjamin R.
    University College Dublin, Dublin 4, Ireland.
    A Special Interest Group on Developing Theories of Language Use in Interaction with Conversational User Interfaces (2023). In: CHI 2023: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery (ACM), 2023, article id 509. Conference paper (Refereed)
  • 20.
    Adiban, Mohammad
    et al.
    NTNU, Dept Elect Syst, Trondheim, Norway; Monash Univ, Dept Human Centred Comp, Melbourne, Australia.
    Siniscalchi, Sabato Marco
    NTNU, Dept Elect Syst, Trondheim, Norway.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. NTNU, Dept Elect Syst, Trondheim, Norway.
    A step-by-step training method for multi generator GANs with application to anomaly detection and cybersecurity (2023). In: Neurocomputing, ISSN 0925-2312, E-ISSN 1872-8286, Vol. 537, p. 296-308. Article in journal (Refereed)
    Abstract [en]

    Cyber attacks and anomaly detection are problems where the data is often highly unbalanced towards normal observations. Furthermore, the anomalies observed in real applications may be significantly different from the ones contained in the training data. It is, therefore, desirable to study methods that are able to detect anomalies only based on the distribution of the normal data. To address this problem, we propose a novel objective function for generative adversarial networks (GANs), referred to as STEP-GAN. STEP-GAN simulates the distribution of possible anomalies by learning a modified version of the distribution of the task-specific normal data. It leverages multiple generators in a step-by-step interaction with a discriminator in order to capture different modes in the data distribution. The discriminator is optimized to distinguish not only between normal data and anomalies but also between the different generators, thus encouraging each generator to model a different mode in the distribution. This reduces the well-known mode collapse problem in GAN models considerably. We tested our method in the areas of power systems and network traffic control systems (NTCSs) using two publicly available highly imbalanced datasets, ICS (Industrial Control System) security dataset and UNSW-NB15, respectively. In both application domains, STEP-GAN outperforms the state-of-the-art systems as well as the two baseline systems we implemented as a comparison. In order to assess the generality of our model, additional experiments were carried out on seven real-world numerical datasets for anomaly detection in a variety of domains. In all datasets, the number of normal samples is significantly more than that of abnormal samples. Experimental results show that STEP-GAN outperforms several semi-supervised methods while being competitive with supervised methods.
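    As a rough, heavily simplified sketch of the multi-generator idea described above (editorial PyTorch code with invented dimensions; the paper's actual step-by-step training schedule and objectives are not reproduced here), the discriminator can be given one class for real normal data plus one class per generator, so that each generator is pushed toward a different mode:

        import torch
        import torch.nn as nn

        K, DIM, NOISE = 3, 32, 16          # number of generators, data dim, noise dim (assumed)

        # Discriminator classes: 0 = real normal data, 1..K = "produced by generator k".
        disc = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, K + 1))
        gens = nn.ModuleList(
            [nn.Sequential(nn.Linear(NOISE, 64), nn.ReLU(), nn.Linear(64, DIM)) for _ in range(K)]
        )
        opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
        opt_g = [torch.optim.Adam(g.parameters(), lr=1e-4) for g in gens]
        ce = nn.CrossEntropyLoss()

        def train_step(real):              # real: (batch, DIM) tensor of normal observations
            n = real.size(0)
            # Discriminator: real data -> class 0, samples from generator k -> class k.
            opt_d.zero_grad()
            loss_d = ce(disc(real), torch.zeros(n, dtype=torch.long))
            for k, g in enumerate(gens, start=1):
                fake = g(torch.randn(n, NOISE)).detach()
                loss_d = loss_d + ce(disc(fake), torch.full((n,), k, dtype=torch.long))
            loss_d.backward()
            opt_d.step()
            # Each generator tries to pass as real normal data (class 0).
            for k, g in enumerate(gens):
                opt_g[k].zero_grad()
                fake = g(torch.randn(n, NOISE))
                ce(disc(fake), torch.zeros(n, dtype=torch.long)).backward()
                opt_g[k].step()

    At test time, an anomaly score for a sample could be taken from the probability mass the discriminator assigns to the non-normal classes; this is only one plausible reading of the abstract, not the published algorithm.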

  • 21.
    Axelsson, Agnes
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Adaptive Robot Presenters: Modelling Grounding in Multimodal Interaction (2023). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    This thesis addresses the topic of grounding in human-robot interaction, that is, the process by which the human and robot can ensure mutual understanding. To explore this topic, the scenario of a robot holding a presentation to a human audience is used, where the robot has to process multimodal feedback from the human in order to adapt the presentation to the human's level of understanding.

    First, the use of behaviour trees to model real-time interactive processes of the presentation is addressed. A system based on the behaviour tree architecture is used in a semi-automated Wizard-of-Oz experiment, showing that audience members prefer an adaptive system to a non-adaptive alternative.

    Next, the thesis addresses the use of knowledge graphs to represent the content of the presentation given by the robot. By building a small, local knowledge graph containing properties (edges) that represent facts about the presentation, the system can iterate over that graph and consistently find ways to refer to entities by referring to previously grounded content. A system based on this architecture is implemented, and an evaluation using simulated users is presented. The results show that crowdworkers comparing different adaptation strategies are sensitive to the types of adaptation enabled by the knowledge graph approach.

    In a face-to-face presentation setting, feedback from the audience can potentially be expressed through various modalities, including speech, head movements, gaze, facial gestures and body pose. The thesis explores how such feedback can be automatically classified. A corpus of human-robot interactions is annotated, and models are trained to classify human feedback as positive, negative or neutral. A relatively high accuracy is achieved by training simple classifiers with signals found mainly in the speech and head movements.

    When knowledge graphs are used as the underlying representation of the system's presentation, some consistent way of generating text, which can be turned into speech, is required. This graph-to-text problem is explored by proposing several methods, both template-based methods and methods based on zero-shot generation using large language models (LLMs). A novel evaluation method using a combination of factual, counter-factual and fictional graphs is proposed.
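    As a toy illustration of the template-based end of this graph-to-text spectrum (an editorial example; the entities, properties and templates are invented and not taken from the thesis):

        # A miniature knowledge graph as (subject, property, object) triples.
        triples = [
            ("KTH", "located_in", "Stockholm"),
            ("KTH", "founded_in", "1827"),
        ]

        # One text template per property/edge type.
        templates = {
            "located_in": "{s} is located in {o}.",
            "founded_in": "{s} was founded in {o}.",
        }

        def verbalise(triples):
            grounded = set()                             # entities already introduced
            sentences = []
            for s, p, o in triples:
                subject = "It" if s in grounded else s   # refer back to grounded entities
                sentences.append(templates[p].format(s=subject, o=o))
                grounded.add(s)
            return " ".join(sentences)

        print(verbalise(triples))   # -> "KTH is located in Stockholm. It was founded in 1827."

    A zero-shot LLM variant would typically serialise the same triples into a prompt and ask the model to verbalise them, which is the kind of approach the thesis compares against template-based generation.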

    Finally, the thesis presents and evaluates a fully automated system using all of the components above. The results show that audience members prefer the adaptive system to a non-adaptive system, matching the results from the beginning of the thesis. However, we note that clear learning results are not found, which means that the entertainment aspects of the presentation are perhaps more prominent than the learning aspects.

  • 22.
    Wolfert, Pieter
    et al.
    IDLab-AIRO, Ghent University, Ghent, Belgium.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Belpaeme, Tony
    IDLab-AIRO, Ghent University, Ghent, Belgium.
    "Am I listening?", Evaluating the Quality of Generated Data-driven Listening Motion2023In: ICMI 2023 Companion: Companion Publication of the 25th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM) , 2023, p. 6-10Conference paper (Refereed)
    Abstract [en]

    This paper asks if recent models for generating co-speech gesticulation may also learn to exhibit listening behaviour. We consider two models from recent gesture-generation challenges and train them on a dataset of audio and 3D motion capture from dyadic conversations. One model is driven by information from both sides of the conversation, whereas the other only uses the character's own speech. Several user studies are performed to assess the motion generated when the character is speaking actively, versus when the character is the listener in the conversation. We find that participants are reliably able to discern motion associated with listening, whether from motion capture or generated by the models. Both models are thus able to produce distinctive listening behaviour, even though only one model is truly a listener, in the sense that it has access to information from the other party in the conversation. Additional experiments on both natural and model-generated motion find motion associated with listening to be rated as less human-like than motion associated with active speaking.

  • 23.
    Cao, Xinwei
    et al.
    Department of Electronic Systems, NTNU, Norway.
    Fan, Zijian
    Department of Electronic Systems, NTNU, Norway.
    Svendsen, Torbjørn
    Department of Electronic Systems, NTNU, Norway.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Department of Electronic Systems, NTNU, Norway.
    An Analysis of Goodness of Pronunciation for Child Speech (2023). In: Interspeech 2023, International Speech Communication Association, 2023, p. 4613-4617. Conference paper (Refereed)
    Abstract [en]

    In this paper, we study the use of goodness of pronunciation (GOP) on child speech. We first compare the distributions of GOP scores on several open datasets representing various dimensions of speech variability. We show that the GOP distribution over CMU Kids, corresponding to young age, has larger spread than those on datasets representing other dimensions, i.e., accent, dialect, spontaneity and environmental conditions. We hypothesize that the increased variability of pronunciation in young age may impair the use of traditional mispronunciation detection methods for children. To support this hypothesis, we perform simulated mispronunciation experiments both for children and adults using different variants of the GOP algorithm. We also compare the results to real-case mispronunciations for native children showing that GOP is less effective for child speech than for adult speech.
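    For reference, a minimal sketch of one common DNN-posterior formulation of GOP is given below (editorial code; the paper may use a different GOP variant): the score of an aligned phone segment is its mean frame log-posterior for the canonical phone, normalised by the most likely competing phone.

        import numpy as np

        def gop(posteriors: np.ndarray, canonical_phone: int) -> float:
            """Goodness of Pronunciation for one aligned phone segment.

            posteriors: (n_frames, n_phones) frame-level phone posteriors from an
                        acoustic model, restricted to the frames of this segment.
            canonical_phone: index of the phone the speaker was supposed to produce.
            """
            log_post = np.log(posteriors + 1e-10)
            target = log_post[:, canonical_phone].mean()   # mean log P(canonical | frame)
            best = log_post.max(axis=1).mean()             # mean log P(most likely phone | frame)
            return float(target - best)                    # <= 0; closer to 0 = better pronounced

        # Hypothetical example: 5 frames, 3 phones, segment aligned to phone index 1.
        post = np.random.default_rng(0).dirichlet(np.ones(3), size=5)
        print(gop(post, canonical_phone=1))

    Mispronunciation detection then typically thresholds such scores per phone; the paper's point is that the wider spread of GOP values for child speech makes choosing and trusting such thresholds harder.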

  • 24.
    Tånnander, Christina
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    House, David
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis (2023). In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023 / [ed] Radek Skarnitzl and Jan Volín, Prague, Czech Republic: GUARANT International, 2023, p. 3156-3160. Conference paper (Refereed)
    Abstract [en]

    Text-to-speech synthesis based on deep neural networks can generate highly humanlike speech, which revitalizes the potential for analysis-by-synthesis in speech research. We propose that neural synthesis can provide evidence that a specific distinction in its transcription system represents a robust acoustic/phonetic distinction in the speech used to train the model. We synthesized utterances with allophones in incorrect contexts and analyzed the results phonetically. Our assumption was that if we gained control over the allophonic variation in this way, it would provide strong evidence that the variation is governed robustly by the phonological context used to create the transcriptions. Of three allophonic variations investigated, the first, which was believed to be quite robust, gave us robust control over the variation, while the other two, which are less categorical, did not afford us such control. These findings are consistent with our hypothesis and support the notion that neural TTS can be a valuable analysis-by-synthesis tool for speech research.

  • 25.
    Kalpakchi, Dmytro
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Ask and distract: Data-driven methods for the automatic generation of multiple-choice reading comprehension questions from Swedish texts (2023). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Multiple choice questions (MCQs) are widely used for summative assessment in many different subjects. The tasks in this format are particularly appealing because they can be graded swiftly and automatically. However, the process of creating MCQs is far from swift or automatic and requires a lot of expertise both in the specific subject and also in test construction.

    This thesis focuses on exploring methods for automatic MCQ generation for assessing the reading comprehension abilities of second-language learners of Swedish. We lay the foundations for MCQ generation research for Swedish by collecting two datasets of reading comprehension MCQs, and by designing and developing methods for generating whole MCQs or their parts. An important contribution is the methods (which were designed and applied in practice) for the automatic and human evaluation of the generated MCQs.

    The best currently available method (as of June 2023) for generating MCQs for assessing reading comprehension in Swedish is ChatGPT (although still only around 60% of generated MCQs were judged acceptable). However, ChatGPT is neither open-source, nor free. The best open-source and free-to-use method is the fine-tuned version of SweCTRL-Mini, a foundational model developed as a part of this thesis. Nevertheless, all explored methods are far from being useful in practice but the reported results provide a good starting point for future research.

  • 26.
    Ekstedt, Erik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Wang, Siyang
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafsson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis (2023). In: Interspeech 2023, International Speech Communication Association, 2023, p. 5481-5485. Conference paper (Refereed)
    Abstract [en]

    Turn-taking is a fundamental aspect of human communication where speakers convey their intention to either hold, or yield, their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial, and two open-source, Text-To-Speech (TTS) systems to generate turn-taking cues over simulated turns. By varying the stimuli, or controlling the prosody, we analyze the models' performances. We show that while the commercial TTS systems largely provide appropriate cues, they often produce ambiguous signals, and that further improvements are possible. TTS systems trained on read or spontaneous speech produce strong turn-hold but weak turn-yield cues. We argue that this approach, which focuses on functional aspects of interaction, provides a useful addition to other important speech metrics, such as intelligibility and naturalness.

  • 27.
    Falk, Simon
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Ahlbäck, Sven
    DoReMIR Music Research AB.
    Automatic legato transcription based on onset detection (2023). In: SMC 2023: Proceedings of the Sound and Music Computing Conference 2023, Sound and Music Computing Network, 2023, p. 214-221. Conference paper (Refereed)
    Abstract [en]

    This paper focuses on the transcription of performance expression and, in particular, legato slurs for solo violin performance. This can be used to improve automatic music transcription and enrich the resulting notations with expression markings. We review past work in expression detection, and find that while legato detection has been explored, its transcription has not. We propose a method for demarcating the beginning and ending of slurs in a performance by combining pitch and onset information produced by ScoreCloud (a music notation software with transcription capabilities) with articulated onsets detected by a convolutional neural network. To train this system, we build a dataset of solo bowed violin performance featuring three different musicians playing several exercises and tunes. We test the resulting method on a small collection of recordings of the same excerpt of music performed by five different musicians. We find that this signal-based method works well in cases where the acoustic conditions do not interfere largely with the onset strengths. Further work will explore data augmentation for making the articulation detection more robust, as well as an end-to-end solution.

  • 28.
    Leijon, Arne
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    von Gablenz, Petra
    Institute of Hearing Technology and Audiology, Jade University of Applied Sciences, Oldenburg, Germany.
    Holube, Inga
    Institute of Hearing Technology and Audiology, Jade University of Applied Sciences, Oldenburg, Germany.
    Taghia, Jalil
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Smeds, Karolina
    ORCA Europe, WS Audiology, Stockholm, Sweden.
    Bayesian analysis of Ecological Momentary Assessment (EMA) data collected in adults before and after hearing rehabilitation (2023). In: Frontiers in Digital Health, E-ISSN 2673-253X, Vol. 5, article id 1100705. Article in journal (Refereed)
    Abstract [en]

    This paper presents a new Bayesian method for analyzing Ecological Momentary Assessment (EMA) data and applies this method in a re-analysis of data from a previous EMA study. The analysis method has been implemented as a freely available Python package EmaCalc, RRID:SCR_022943. The analysis model can use EMA input data including nominal categories in one or more situation dimensions, and ordinal ratings of several perceptual attributes. The analysis uses a variant of ordinal regression to estimate the statistical relation between these variables. The Bayesian method has no requirements related to the number of participants or the number of assessments by each participant. Instead, the method automatically includes measures of the statistical credibility of all analysis results, for the given amount of data. For the previously collected EMA data, the analysis results demonstrate how the new tool can handle heavily skewed, scarce, and clustered data that were collected on ordinal scales, and present results on interval scales. The new method revealed results for the population mean that were similar to those obtained in the previous analysis by an advanced regression model. The Bayesian approach automatically estimated the inter-individual variability in the population, based on the study sample, and could show some statistically credible intervention results also for an unseen random individual in the population. Such results may be interesting, for example, if the EMA methodology is used by a hearing-aid manufacturer in a study to predict the success of a new signal-processing method among future potential customers.

  • 29.
    Huang, Rujing
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Holzapfel, Andre
    KTH, School of Electrical Engineering and Computer Science (EECS), Human Centered Technology, Media Technology and Interaction Design, MID.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Kaila, Anna-Kaisa
    KTH, School of Electrical Engineering and Computer Science (EECS), Human Centered Technology, Media Technology and Interaction Design, MID.
    Beyond Diverse Datasets: Responsible MIR, Interdisciplinarity, and the Fractured Worlds of Music (2023). In: Transactions of the International Society for Music Information Retrieval, E-ISSN 2514-3298, Vol. 6, no 1, p. 43-59. Article in journal (Refereed)
    Abstract [en]

    Musical worlds, not unlike our lived realities, are fundamentally fragmented and diverse, a fact often seen as a challenge or even a threat to the validity of research in Music Information Research (MIR). In this article, we propose to treat this characteristic of our musical universe(s) as an opportunity to fundamentally enrich and re-orient MIR. We propose that the time has arrived for MIR to reflect on its ethical and cultural turns (if they have been initiated at all) and take them a step further, with the goal of profoundly diversifying the discipline beyond the diversification of datasets. Such diversification, we argue, is likely to remain superficial if it is not accompanied by a simultaneous auto-critique of the discipline’s raison d’être. Indeed, this move to diversify touches on the philosophical underpinnings of what MIR is and should become as a field of research: What is music (ontology)? What are the nature and limits of knowledge concerning music (epistemology)? How do we obtain such knowledge (methodology)? And what about music and our own research endeavor do we consider “good” and “valuable” (axiology)? This path involves sincere inter- and intra-disciplinary struggles that underlie MIR, and we point to “agonistic interdisciplinarity” — that we have practiced ourselves via the writing of this article — as a future worth reaching for. The two featured case studies, about possible philosophical re-orientations in approaching ethics of music AI and about responsible engineering when AI meets traditional music, indicate a glimpse of what is possible.

  • 30.
    Lameris, Harm
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafsson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beyond style: synthesizing speech with pragmatic functions2023In: Interspeech 2023, International Speech Communication Association , 2023, p. 3382-3386Conference paper (Refereed)
    Abstract [en]

    With recent advances in generative modelling, conversational systems are becoming more lifelike and capable of long, nuanced interactions. Text-to-Speech (TTS) is being tested in territories requiring natural-sounding speech that can mimic the complexities of human conversation. Hyper-realistic speech generation has been achieved, but a gap remains between the verbal behavior required for upscaled conversation, such as paralinguistic information and pragmatic functions, and comprehension of the acoustic prosodic correlates underlying these. Without this knowledge, reproducing these functions in speech has little value. We use prosodic correlates including spectral peaks, spectral tilt, and creak percentage for speech synthesis with the pragmatic functions of small talk, self-directed speech, advice, and instructions. We perform a MOS evaluation, and a suitability experiment in which our system outperforms a read-speech and conversational baseline.
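
    To make one of these correlates concrete, the sketch below estimates spectral tilt for a single synthetic voiced frame as the regression slope of harmonic peak levels against log-frequency. This is one common operationalisation, chosen here for illustration; it is not necessarily the exact feature extraction used in the paper.

        # Spectral tilt as the slope (dB/octave) of a line fitted to harmonic
        # peak levels; the synthetic frame stands in for real speech.
        import numpy as np

        sr, n_samples, f0 = 16000, 2048, 120.0
        t = np.arange(n_samples) / sr
        frame = sum((0.5 ** k) * np.sin(2 * np.pi * f0 * (k + 1) * t) for k in range(20))
        frame *= np.hanning(n_samples)

        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n_samples, d=1.0 / sr)
        harmonics = [int(round(f0 * (k + 1) * n_samples / sr)) for k in range(20)]

        tilt, _ = np.polyfit(np.log2(freqs[harmonics]),
                             20 * np.log10(spectrum[harmonics] + 1e-12), 1)
        print(f"spectral tilt = {tilt:.1f} dB/octave")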

  • 31.
    Déguernel, Ken
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. University of Lille.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Bias in Favour or Against Computational Creativity: A Survey and Reflection on the Importance of Socio-cultural Context in its Evaluation2023In: Proc. International Conference on Computational Creativity, 2023Conference paper (Refereed)
  • 32.
    D'Amario, Sara
    et al.
    Department of Music Acoustics, mdw – University of Music and Performing Arts Vienna, Vienna, Austria; RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion, University of Oslo, Oslo, Norway; Department of Musicology, University of Oslo, Oslo, Norway.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Music Acoustics.
    Goebl, Werner
    Department of Music Acoustics, mdw – University of Music and Performing Arts Vienna, Vienna, Austria.
    Bishop, Laura
    RITMO Centre for Interdisciplinary Studies in Rhythm, Time and Motion, University of Oslo, Oslo, Norway; Department of Musicology, University of Oslo, Oslo, Norway.
    Body motion of choral singers2023In: Frontiers in Psychology, E-ISSN 1664-1078, Vol. 14Article in journal (Refereed)
    Abstract [en]

    Recent investigations on music performances have shown the relevance of singers’ body motion for pedagogical as well as performance purposes. However, little is known about how the perception of voice-matching or task complexity affects choristers’ body motion during ensemble singing. This study focussed on the body motion of choral singers who perform in duo along with a pre-recorded tune presented over a loudspeaker. Specifically, we examined the effects of the perception of voice-matching, operationalized in terms of sound spectral envelope, and task complexity on choristers’ body motion. Fifteen singers with advanced choral experience first manipulated the spectral components of a pre-recorded short tune composed for the study, by choosing the settings they felt most and least together with. Then, they performed the tune in unison (i.e., singing the same melody simultaneously) and in canon (i.e., singing the same melody but at a temporal delay) with the chosen filter settings. Motion data of the choristers’ upper body and audio of the repeated performances were collected and analyzed. Results show that the settings perceived as least together relate to extreme differences between the spectral components of the sound. The singers’ wrists and torso motion was more periodic, their upper body posture was more open, and their bodies were more distant from the music stand when singing in unison than in canon. These findings suggest that unison singing promotes an expressive-periodic motion of the upper body.
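
    One simple way to operationalise "more periodic motion" is the autocorrelation of a wrist-motion signal, as in the hedged sketch below; the frame rate, synthetic data, and periodicity index are assumptions, not the authors' analysis pipeline.

        # Periodicity index from the first autocorrelation peak of a wrist signal.
        import numpy as np

        fs = 100                                   # assumed motion-capture frame rate (Hz)
        t = np.arange(0, 30, 1 / fs)
        rng = np.random.default_rng(1)
        wrist = np.sin(2 * np.pi * 0.5 * t) + 0.3 * rng.normal(size=t.size)  # synthetic sway

        x = wrist - wrist.mean()
        acf = np.correlate(x, x, mode="full")[x.size - 1:]
        acf /= acf[0]

        min_lag = int(0.5 * fs)                    # ignore lags shorter than 0.5 s
        peak = min_lag + int(np.argmax(acf[min_lag:]))
        print(f"dominant period = {peak / fs:.2f} s, periodicity index = {acf[peak]:.2f}")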

  • 33.
    Torre, Ilaria
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. Chalmers Univ Technol, Dept Comp Sci & Engn, Gothenburg, Sweden.
    Lagerstedt, Erik
    Univ Skövde, Sch Informat, Skövde, Sweden..
    Dennler, Nathaniel
    Univ Southern Calif, Dept Comp Sci, Los Angeles, CA 90007 USA..
    Seaborn, Katie
    Tokyo Inst Technol, Dept Ind Engn & Econ, Tokyo, Japan..
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Can a gender-ambiguous voice reduce gender stereotypes in human-robot interactions?2023In: 2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, Institute of Electrical and Electronics Engineers (IEEE) , 2023, p. 106-112Conference paper (Refereed)
    Abstract [en]

    When deploying robots, its physical characteristics, role, and tasks are often fixed. Such factors can also be associated with gender stereotypes among humans, which then transfer to the robots. One factor that can induce gendering but is comparatively easy to change is the robot's voice. Designing voice in a way that interferes with fixed factors might therefore be a way to reduce gender stereotypes in human-robot interaction contexts. To this end, we have conducted a video-based online study to investigate how factors that might inspire gendering of a robot interact. In particular, we investigated how giving the robot a gender-ambiguous voice can affect perception of the robot. We compared assessments (n=111) of videos in which a robot's body presentation and occupation mis/matched with human gender stereotypes. We found evidence that a gender-ambiguous voice can reduce gendering of a robot endowed with stereotypically feminine or masculine attributes. The results can inform more just robot design while opening new questions regarding the phenomenon of robot gendering.

  • 34.
    Figueroa, Carol
    et al.
    Furhat Robotics.
    Ochs, Magalie
    Aix-Marseille Université.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Classification of Feedback Functions in Spoken Dialog Using Large Language Models and Prosodic Features2023In: 27th Workshop on the Semantics and Pragmatics of Dialogue, Maribor: University of Maribor , 2023, p. 15-24Conference paper (Refereed)
    Abstract [en]

    Feedback utterances such as ‘yeah’, ‘mhm’, and ‘okay’ convey different communicative functions depending on their prosodic realizations, as well as the conversational context in which they are produced. In this paper, we investigate the performance of different models and features for classifying the communicative function of short feedback tokens in American English dialog. We experiment with a combination of lexical and prosodic features extracted from the feedback utterance, as well as context features from the preceding utterance of the interlocutor. Given the limited amount of training data, we explore the use of a pre-trained large language model (GPT-3) to encode contextual information, as well as SimCSE sentence embeddings. The results show that good performance can be achieved with only SimCSE and lexical features, while the best performance is achieved by solely fine-tuning GPT-3, even if it does not have access to any prosodic features.
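
    A hedged sketch of the simplest feature combination mentioned above (SimCSE context embeddings plus a lexical one-hot of the feedback token) is given below; the toy utterances, labels, and classifier choice are illustrative assumptions, not the paper's data or models.

        # Combine a SimCSE embedding of the preceding utterance with a one-hot of
        # the feedback token and train a simple classifier (toy data only).
        import numpy as np
        import torch
        from transformers import AutoModel, AutoTokenizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import OneHotEncoder

        name = "princeton-nlp/sup-simcse-bert-base-uncased"
        tok, enc = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

        def embed(sentences):
            batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
            with torch.no_grad():
                return enc(**batch).pooler_output.numpy()

        data = [  # (preceding utterance, feedback token, hypothetical function label)
            ("so I moved to Boston last year", "okay", "continuer"),
            ("and then the whole thing crashed", "yeah", "agreement"),
            ("you should try the other cable first", "mhm", "continuer"),
            ("that was the best show I have ever seen", "yeah", "agreement"),
        ]
        context = embed([c for c, _, _ in data])
        lexical = OneHotEncoder(sparse_output=False).fit_transform(  # scikit-learn >= 1.2
            [[t] for _, t, _ in data])
        X, y = np.hstack([context, lexical]), [lab for _, _, lab in data]

        clf = LogisticRegression(max_iter=1000).fit(X, y)
        print(clf.predict(X[:1]))                  # sanity check on the training data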

  • 35.
    Stenwig, Eline
    et al.
    Norwegian Univ Sci & Technol, Dept Circulat & Med Imaging, Trondheim, Norway..
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Norwegian Univ Sci & Technol, Dept Elect Syst, Trondheim, Norway..
    Rossi, Pierluigi Salvo
    Norwegian Univ Sci & Technol, Dept Elect Syst, Trondheim, Norway..
    Skjaervold, Nils Kristian
    Norwegian Univ Sci & Technol, Dept Circulat & Med Imaging, Trondheim, Norway.;St Olavs Univ Hosp, Clin Anaesthesia & Intens Care Med, Trondheim, Norway..
    Comparison of correctly and incorrectly classified patients for in-hospital mortality prediction in the intensive care unit2023In: BMC Medical Research Methodology, E-ISSN 1471-2288, Vol. 23, no 1, article id 102Article in journal (Refereed)
    Abstract [en]

    Background

    The use of machine learning is becoming increasingly popular in many disciplines, but there is still an implementation gap of machine learning models in clinical settings. Lack of trust in models is one of the issues that need to be addressed in an effort to close this gap. No models are perfect, and it is crucial to know in which use cases we can trust a model and for which cases it is less reliable.

    Methods

    Four different algorithms are trained on the eICU Collaborative Research Database using similar features as the APACHE IV severity-of-disease scoring system to predict hospital mortality in the ICU. The training and testing procedure is repeated 100 times on the same dataset to investigate whether predictions for single patients change with small changes in the models. Features are then analysed separately to investigate potential differences between patients consistently classified correctly and incorrectly.

    Results

    A total of 34 056 patients (58.4%) are classified as true negatives, 6 527 patients (11.3%) as false positives, 3 984 patients (6.8%) as true positives, and 546 patients (0.9%) as false negatives. The remaining 13 108 patients (22.5%) are inconsistently classified across models and rounds. Histograms and distributions of feature values are compared visually to investigate differences between groups.

    Conclusions

    It is impossible to distinguish the groups using single features alone. Considering a combination of features, the difference between the groups is clearer. Incorrectly classified patients have features more similar to patients with the same prediction than to patients with the same outcome.
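
    The core of the method, training the same kind of model repeatedly and tallying per-patient correctness, can be sketched as below; the synthetic data, single algorithm, and 20 rounds are placeholders for the eICU features, four algorithms, and 100 rounds used in the study.

        # Repeat training with different seeds and record which test patients are
        # consistently classified correctly, consistently misclassified, or neither.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, n_features=20,
                                   weights=[0.9, 0.1], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  stratify=y, random_state=0)

        n_rounds = 20
        correct = np.zeros((n_rounds, len(y_te)), dtype=bool)
        for r in range(n_rounds):
            model = RandomForestClassifier(n_estimators=100, random_state=r).fit(X_tr, y_tr)
            correct[r] = model.predict(X_te) == y_te

        rate = correct.mean(axis=0)
        print(f"always correct: {(rate == 1).sum()}, always wrong: {(rate == 0).sum()}, "
              f"inconsistent: {((rate > 0) & (rate < 1)).sum()}")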

  • 36.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Crowdsource-based validation of the audio cocktail as a sound browsing tool2023In: Interspeech 2023, International Speech Communication Association , 2023, p. 2178-2182Conference paper (Refereed)
    Abstract [en]

    We conduct two crowdsourcing experiments designed to examine the usefulness of audio cocktails to quickly find out information on the contents of large audio data. Several thousand crowd workers were engaged to listen to audio cocktails with systematically varied composition. They were then asked to state either which sound out of four categories (Children, Women, Men, Orchestra) they heard the most of, or if they heard anything of a specific category at all. The results show that their responses have high reliability and provide information as to whether a specific task can be performed using audio cocktails. We also propose that the combination of crowd workers and audio cocktails can be used directly as a tool to investigate the contents of large audio data.
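
    The sketch below shows one plausible way to mix such an audio cocktail, weighting category streams so that one category dominates; the file names, gains, and use of the soundfile library are assumptions for illustration, not the stimulus-generation code used in the experiments.

        # Mix equal-length excerpts of several category recordings into one cocktail.
        import numpy as np
        import soundfile as sf

        def audio_cocktail(paths, gains, seconds=10, sr=16000):
            n = seconds * sr
            mix = np.zeros(n)
            for path, gain in zip(paths, gains):
                y, file_sr = sf.read(path)
                if y.ndim > 1:                       # collapse stereo to mono
                    y = y.mean(axis=1)
                assert file_sr == sr, "resample inputs to a common rate first"
                y = np.pad(y, (0, max(0, n - len(y))))[:n]
                mix += gain * y
            return mix / np.max(np.abs(mix))         # normalise to avoid clipping

        # Hypothetical files: 70 % children, 10 % each of the other categories.
        mix = audio_cocktail(["children.wav", "women.wav", "men.wav", "orchestra.wav"],
                             gains=[0.7, 0.1, 0.1, 0.1])
        sf.write("cocktail.wav", mix, 16000)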

  • 37.
    Feindt, Kathrin
    et al.
    ISFAS, Kiel University, Germany.
    Rossi, Martina
    ISFAS, Kiel University, Germany.
    Esfandiari-Baiat, Ghazaleh
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Ekström, Axel G.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Zellers, Margaret
    ISFAS, Kiel University, Germany.
    Cues to next-speaker projection in conversational Swedish: Evidence from reaction times2023In: Interspeech 2023, International Speech Communication Association , 2023, p. 1040-1044Conference paper (Refereed)
    Abstract [en]

    We present first results of a study investigating the salience and typicality of prosodic markers in Swedish at turn ends for turn-yielding and turn-keeping purposes. We performed an experiment where participants (N=32) were presented with conversational chunks and, after the audio ended, were asked to determine which of two speakers would speak next by clicking a picture on a screen. Audio stimuli were manipulated by (i) raising and (ii) lowering f0 over the last 500 ms of a turn, (iii) speeding up or (iv) slowing down duration over the last 500 ms, and (v) raising and (vi) lowering the last pitch peak. In our data, out of all manipulations, increasing the speech rate was found to be the most disruptive (p < .005). Higher speech rate led to longer reaction times in turn-keeping, which were shorter in turn-yielding. Other manipulations did not significantly alter reaction times. The results presented here may be complemented with eye movement data, to further elucidate cognitive mechanisms underlying turn-taking behavior.
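
    Two of the six manipulations, raising f0 and speeding up the final 500 ms, can be roughly approximated as in the sketch below; the file name and the use of librosa's pitch-shift and time-stretch are stand-ins, not the resynthesis procedure used for the actual stimuli.

        # Approximate manipulations (i) raise f0 and (iii) speed up, applied to the
        # final 500 ms of a turn (hypothetical file).
        import numpy as np
        import librosa
        import soundfile as sf

        y, sr = librosa.load("turn_final.wav", sr=None)
        cut = int(0.5 * sr)
        head, tail = y[:-cut], y[-cut:]

        tail_raised = librosa.effects.pitch_shift(tail, sr=sr, n_steps=2)   # (i)
        tail_faster = librosa.effects.time_stretch(tail, rate=1.25)         # (iii)

        sf.write("raised_f0.wav", np.concatenate([head, tail_raised]), sr)
        sf.write("sped_up.wav", np.concatenate([head, tail_faster]), sr)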

  • 38.
    Getman, Yaroslav
    et al.
    Aalto Univ, Dept Informat & Commun Engn, Espoo 02150, Finland..
    Phan, Nhan
    Aalto Univ, Dept Informat & Commun Engn, Espoo 02150, Finland..
    Al-Ghezi, Ragheb
    Aalto Univ, Dept Informat & Commun Engn, Espoo 02150, Finland..
    Voskoboinik, Ekaterina
    Aalto Univ, Dept Informat & Commun Engn, Espoo 02150, Finland..
    Singh, Mittul
    Aalto Univ, Dept Informat & Commun Engn, Espoo 02150, Finland.;Silo AI, Helsinki 00180, Finland..
    Grosz, Tamas
    Aalto Univ, Dept Informat & Commun Engn, Espoo 02150, Finland..
    Kurimo, Mikko
    Aalto Univ, Dept Informat & Commun Engn, Espoo 02150, Finland..
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Norwegian Univ Sci & Technol, Dept Signal Proc, N-7034 Trondheim, Norway.;KTH Royal Inst Technol, EECS, S-11428 Stockholm, Sweden..
    Svendsen, Torbjorn
    Norwegian Univ Sci & Technol, Dept Signal Proc, N-7034 Trondheim, Norway..
    Strombergsson, Sofia
    Karolinska Inst, Dept Clin Sci Intervent & Technol, S-14152 Huddinge, Sweden..
    Smolander, Anna
    Tampere Univ, Fac Social Sci, Logoped, Welf Sci, Tampere 33100, Finland..
    Ylinen, Sari
    Tampere Univ, Fac Social Sci, Logoped, Welf Sci, Tampere 33100, Finland..
    Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children2023In: IEEE Access, E-ISSN 2169-3536, Vol. 11, p. 86025-86037Article in journal (Refereed)
    Abstract [en]

    Computer-assisted Language Learning (CALL) is a rapidly developing area accelerated by advancements in the field of AI. A well-designed and reliable CALL system allows students to practice language skills, like pronunciation, any time outside of the classroom. Furthermore, gamification via mobile applications has shown encouraging results on learning outcomes and motivates young users to practice more and perceive language learning as a positive experience. In this work, we adapt the latest speech recognition technology to be a part of an online pronunciation training system for small children. As part of our gamified mobile application, our models will assess the pronunciation quality of young Swedish children diagnosed with Speech Sound Disorder, and participating in speech therapy. Additionally, the models provide feedback to young non-native children learning to pronounce Swedish and Finnish words. Our experiments revealed that these new models fit into an online game as they function as speech recognizers and pronunciation evaluators simultaneously. To make our systems more trustworthy and explainable, we investigated whether the combination of modern input attribution algorithms and time-aligned transcripts can explain the decisions made by the models, give us insights into how the models work and provide a tool to develop more reliable solutions.
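
    As a hedged sketch of using one model both as a recognizer and as a crude pronunciation scorer, the snippet below decodes an attempt with a CTC wav2vec2 checkpoint and compares the transcript to the target word; the English checkpoint, file name, and string-similarity score are stand-ins for the Swedish and Finnish models and the scoring described in the paper.

        # Decode an attempt with wav2vec2 and score it against the target word.
        import torch
        import librosa
        from difflib import SequenceMatcher
        from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

        name = "facebook/wav2vec2-base-960h"               # English stand-in checkpoint
        processor = Wav2Vec2Processor.from_pretrained(name)
        model = Wav2Vec2ForCTC.from_pretrained(name)

        audio, _ = librosa.load("attempt.wav", sr=16000)   # hypothetical recording
        inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        decoded = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

        target = "BALLOON"
        score = SequenceMatcher(None, decoded, target).ratio()
        print(f"heard {decoded!r}, similarity to target: {score:.2f}")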

  • 39.
    Deichler, Anna
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Mehta, Shivam
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation2023In: PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, Association for Computing Machinery (ACM) , 2023, p. 755-762Conference paper (Refereed)
    Abstract [en]

    This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach for achieving human-like co-speech gestures that carry semantic meaning in embodied agents.
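
    The contrastive pretraining idea can be sketched as a CLIP-style objective between pooled speech and motion windows, as below; the encoder sizes, dummy tensors, and temperature are placeholders, and the real CSMP module operates on joint text-and-audio speech representations rather than raw features.

        # CLIP-style contrastive loss between speech and motion embeddings (dummy data).
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class Encoder(nn.Module):
            def __init__(self, in_dim, emb_dim=128):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                         nn.Linear(256, emb_dim))
            def forward(self, x):
                return F.normalize(self.net(x), dim=-1)

        speech_enc, motion_enc = Encoder(80), Encoder(165)      # assumed feature sizes
        opt = torch.optim.Adam(list(speech_enc.parameters()) +
                               list(motion_enc.parameters()), lr=1e-4)

        speech = torch.randn(32, 80)     # one pooled speech window per clip
        motion = torch.randn(32, 165)    # the temporally matching motion window

        s, m = speech_enc(speech), motion_enc(motion)
        logits = s @ m.t() / 0.07                                # temperature 0.07
        labels = torch.arange(len(s))
        loss = (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2
        opt.zero_grad(); loss.backward(); opt.step()
        print(float(loss))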

  • 40. Offrede, Tom
    et al.
    Mishra, Chinmaya
    Furhat Robotics.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Furhat Robotics.
    Fuchs, Susanne
    Leibniz-Zentrum Allgemeine Sprachwissenschaft.
    Mooshammer, Christine
    Do Humans Converge Phonetically When Talking to a Robot?2023In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023, 2023, p. 3507-3511Conference paper (Refereed)
  • 41.
    Axelsson, Agnes
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Do you follow?: A fully automated system for adaptive robot presenters2023In: HRI 2023: Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM) , 2023, p. 102-111Conference paper (Refereed)
    Abstract [en]

    An interesting application for social robots is to act as a presenter, for example as a museum guide. In this paper, we present a fully automated system architecture for building adaptive presentations for embodied agents. The presentation is generated from a knowledge graph, which is also used to track the grounding state of information, based on multimodal feedback from the user. We introduce a novel way to use large-scale language models (GPT-3 in our case) to lexicalise arbitrary knowledge graph triples, greatly simplifying the design of this aspect of the system. We also present an evaluation where 43 participants interacted with the system. The results show that users prefer the adaptive system and consider it more human-like and flexible than a static version of the same system, but only partial results are seen in their learning of the facts presented by the robot.
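
    The lexicalisation step can be sketched as simple prompt construction, as below; the prompt wording is hypothetical and the generate() function is a placeholder for a call to GPT-3 or another large language model, not the system's actual prompt or API usage.

        # Build a prompt that asks an LLM to lexicalise one knowledge-graph triple.
        def lexicalise_triple(subject: str, predicate: str, obj: str) -> str:
            return ("Rewrite the following knowledge-graph triple as one short, "
                    "natural spoken sentence.\n"
                    f"Triple: ({subject}, {predicate}, {obj})\n"
                    "Sentence:")

        def generate(prompt: str) -> str:
            """Placeholder for a call to GPT-3 or another large language model."""
            raise NotImplementedError

        prompt = lexicalise_triple("Mona Lisa", "painted by", "Leonardo da Vinci")
        print(prompt)    # feed this to the LLM and have the agent speak the reply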

  • 42.
    Mishra, Chinmaya
    et al.
    Furhat Robot AB, Stockholm, Sweden..
    Offrede, Tom
    Humboldt Univ, Berlin, Germany..
    Fuchs, Susanne
    Leibniz Ctr Gen Linguist ZAS, Berlin, Germany..
    Mooshammer, Christine
    Humboldt Univ, Berlin, Germany..
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Furhat Robot AB, Stockholm, Sweden..
    Does a robot's gaze aversion affect human gaze aversion?2023In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1127626Article in journal (Refereed)
    Abstract [en]

    Gaze cues serve an important role in facilitating human conversations and are generally considered to be one of the most important non-verbal cues. Gaze cues are used to manage turn-taking, coordinate joint attention, regulate intimacy, and signal cognitive effort. In particular, it is well established that gaze aversion is used in conversations to avoid prolonged periods of mutual gaze. Given the numerous functions of gaze cues, there has been extensive work on modelling these cues in social robots. Researchers have also tried to identify the impact of robot gaze on human participants. However, the influence of robot gaze behavior on human gaze behavior has been less explored. We conducted a within-subjects user study (N = 33) to verify if a robot's gaze aversion influenced human gaze aversion behavior. Our results show that participants tend to avert their gaze more when the robot keeps staring at them as compared to when the robot exhibits well-timed gaze aversions. We interpret our findings in terms of intimacy regulation: humans try to compensate for the robot's lack of gaze aversion.

  • 43.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Bandera Rubio, Juan Pedro
    Departemento de Tecnología Electrónica, University of Málaga, Málaga, Spain.
    Bensch, Suna
    Department of Computing Science, Umeå University, Umeå, Sweden.
    Haring, Kerstin Sophie
    Robots and Sensors for the Human Well-Being, Ritchie School of Engineering and Computer Science, University of Denver, Denver, United States.
    Kanda, Takayuki
    HRI Lab, Kyoto University, Kyoto, Japan.
    Núñez, Pedro
    Tecnología de los Computadores y las Comunicaciones Department, University of Extremadura, Badajoz, Spain.
    Rehm, Matthias
    The Technical Faculty of IT and Design, Aalborg University, Aalborg, Denmark.
    Sgorbissa, Antonio
    Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi, University of Genoa, Genoa, Italy.
    Editorial: Socially, culturally and contextually aware robots2023In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1232215Article in journal (Other academic)
  • 44.
    Ekström, Axel G.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Evolution of the human tongue and emergence of speech biomechanics2023In: Frontiers in Psychology, E-ISSN 1664-1078, Vol. 14, article id 1150778Article, review/survey (Refereed)
    Abstract [en]

    The tongue is one of the organs most central to human speech. Here, the evolution and species-unique properties of the human tongue are traced, via reference to the apparent articulatory behavior of extant non-human great apes and fossil findings from early hominids, from the point of view of articulatory phonetics, the science of human speech production. Increased lingual flexibility provided the possibility of mapping articulatory targets, possibly via exaptation of the manual-gestural mapping capacities evident in extant great apes. The emergence of the human-specific tongue, its properties, and its morphology were crucial to the evolution of human articulate speech.

  • 45.
    Sundberg, Johan
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Univ Coll Mus Educ Stockholm, Stockholm, Sweden..
    La, Filipa
    Natl Distance Educ Univ UNED, Fac Educ, Dept Didact Sch Org & Special Didact, Madrid 28040, Spain..
    Granqvist, Svante
    Karolinska Inst, Dept Clin Sci Intervent & Technol, Div Speech & Language Pathol, Stockholm, Sweden..
    Fundamental frequency disturbances in female and male singers' pitch glides through long tube with varied resistances2023In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 154, no 2, p. 801-807Article in journal (Refereed)
    Abstract [en]

    Source-filter interaction can disturb vocal fold vibration frequency. Resonance frequency/bandwidth ratios (Q-values) may affect such interaction. Occurrences of fundamental frequency (fo) disturbances were measured in ascending pitch glides produced by four female and five male singers phonating into a 70 cm long tube. Pitch glides were produced with varied resonance Q-values of the vocal tract + tube compound (VT + tube): (i) tube end open, (ii) tube end open with nasalization, and (iii) with a piece of cotton wool in the tube end (conditions Op, Ns, and Ct, respectively). Disturbances of fo were identified by calculating the derivative of the low-pass filtered fo curve. Resonance frequencies of the compound VT + tube system were determined from ringings and glottal aspiration noise observed in narrowband spectrograms. Disturbances of fo tended to occur when a partial was close to a resonance of the compound VT + tube system. The number of such disturbances was significantly lower when the resonance Q-values were reduced (conditions Ns and Ct), particularly for the males. In some participants, resonance Q-values seemed less influential, suggesting little effect of source-filter interaction. The study sheds light on factors affecting source-filter interaction and fo control and is, therefore, relevant to voice pedagogy and theory of voice production.
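
    The detection step described above, low-pass filtering the fo contour and flagging large values of its derivative, can be sketched as below; the frame rate, cut-off, threshold, and synthetic glide are assumptions for illustration.

        # Flag fo disturbances as outliers in the derivative of a low-pass
        # filtered fo contour (synthetic ascending glide with one disturbance).
        import numpy as np
        from scipy.signal import butter, filtfilt

        frame_rate = 100.0                              # fo estimates per second
        t = np.arange(0, 5, 1 / frame_rate)
        fo = np.linspace(150, 300, t.size)              # idealised pitch glide
        fo[250:260] += 20                               # artificial disturbance

        b, a = butter(2, 4.0 / (frame_rate / 2))        # 4 Hz low-pass
        dfo = np.gradient(filtfilt(b, a, fo)) * frame_rate   # Hz per second

        threshold = 3 * np.median(np.abs(dfo))
        flagged = np.where(np.abs(dfo) > threshold)[0]
        print(f"disturbance near t = {t[flagged[0]]:.2f} s" if flagged.size else "no disturbance")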

  • 46.
    Yoon, Youngwoo
    et al.
    ETRI, Daejeon, Republic of Korea.
    Kucherenko, Taras
    SEED, Electronic Arts (EA), Stockholm, Sweden.
    Woo, Jieyeon
    ISIR, Sorbonne University, Paris France.
    Wolfert, Pieter
    IDLab, Ghent University - imec, Ghent Belgium.
    Nagy, Rajmund
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    GENEA Workshop 2023: The 4th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents2023In: ICMI 2023: Proceedings of the 25th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM) , 2023, p. 822-823Conference paper (Refereed)
    Abstract [en]

    Non-verbal behavior is advantageous for embodied agents when interacting with humans. Despite many years of research on the generation of non-verbal behavior, there is no established benchmarking practice in the field. Most researchers do not compare their results to prior work, and if they do, they often do so in a manner that is not compatible with other approaches. The GENEA Workshop 2023 seeks to bring the community together to discuss the major challenges and solutions, and to identify the best ways to progress the field.

  • 47.
    Gustafsson, Joakim
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters2023In: 23rd ACM International Conference on Interlligent Virtual Agent (IVA 2023), Institute of Electrical and Electronics Engineers (IEEE) , 2023Conference paper (Refereed)
    Abstract [en]

    Engaging embodied conversational agents need to generate expressive behavior in order to be believable in socializing interactions. We present a system that can generate spontaneous speech with supporting lip movements. The neural conversational TTS voice is trained on a multi-style speech corpus that has been prosodically tagged (pitch and speaking rate) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm in which articulatory effort can be adjusted. The facial animation is driven by time-stamped phonemes and prominence estimates from the synthesised speech waveform to modulate the lip and jaw movements accordingly. In objective evaluations we show that the system is able to generate speech and facial animation that vary in articulation effort. In subjective evaluations we compare our conversational TTS system’s capability to deliver jokes with that of a commercial TTS. Both systems succeeded equally well.
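
    The prosodic tagging of the corpus can be sketched as deriving per-utterance pitch and speaking-rate classes and prepending them to the transcript, as below; the thresholds, tag names, pyin-based pitch estimate, and file name are assumptions rather than the paper's actual tagging scheme.

        # Derive pitch and speaking-rate tags for one utterance and prepend them
        # to its transcript (hypothetical thresholds and tag names).
        import numpy as np
        import librosa

        def prosody_tags(wav_path, transcript, low=110, high=160, slow=3.0, fast=5.0):
            y, sr = librosa.load(wav_path, sr=16000)
            f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
            median_f0 = np.nanmedian(f0[voiced]) if np.any(voiced) else float("nan")
            rate = len(transcript.split()) / (len(y) / sr)        # words per second

            pitch = "<pitch_low>" if median_f0 < low else (
                    "<pitch_high>" if median_f0 > high else "<pitch_mid>")
            tempo = "<rate_slow>" if rate < slow else (
                    "<rate_fast>" if rate > fast else "<rate_mid>")
            return f"{pitch} {tempo} {transcript}"

        print(prosody_tags("utt_0001.wav", "yeah so that was eh pretty funny actually"))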

  • 48.
    Wozniak, Maciej K.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Stower, Rebecca
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Jensfelt, Patric
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Happily Error After: Framework Development and User Study for Correcting Robot Perception Errors in Virtual Reality2023In: 2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, Institute of Electrical and Electronics Engineers (IEEE) , 2023, p. 1573-1580Conference paper (Refereed)
    Abstract [en]

    While we can see robots in more areas of our lives, they still make errors. One common cause of failure stems from the robot perception module when detecting objects. Allowing users to correct such errors can help improve the interaction and prevent the same errors in the future. Consequently, we investigate the effectiveness of a virtual reality (VR) framework for correcting perception errors of a Franka Panda robot. We conducted a user study with 56 participants who interacted with the robot using both VR and screen interfaces. Participants learned to collaborate with the robot faster in the VR interface compared to the screen interface. Additionally, participants found the VR interface more immersive, enjoyable, and expressed a preference for using it again. These findings suggest that VR interfaces may offer advantages over screen interfaces for human-robot interaction in erroneous environments.

  • 49.
    Miniotaitė, Jūra
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Human Centered Technology, Media Technology and Interaction Design, MID.
    Wang, Siyang
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Hi robot, it's not what you say, it's how you say it2023In: 2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, Institute of Electrical and Electronics Engineers (IEEE) , 2023, p. 307-314Conference paper (Refereed)
    Abstract [en]

    Many robots use their voice to communicate with people in spoken language, but the voices commonly used for robots are often optimized for transactional interactions rather than social ones. This can limit their ability to create engaging and natural interactions. To address this issue, we designed a spontaneous text-to-speech tool and used it to author natural and spontaneous robot speech. A crowdsourcing evaluation methodology is proposed to compare this type of speech to natural speech and state-of-the-art text-to-speech technology, both in disembodied and embodied form. We created speech samples in a naturalistic setting of people playing tabletop games and conducted a user study evaluating Naturalness, Intelligibility, Social Impression, Prosody, and Perceived Intelligence. The speech samples were chosen to represent three contexts that are common in tabletop games, and these contexts were introduced to the participants who evaluated the speech samples. The study results show that the proposed evaluation methodology allowed for a robust analysis that successfully compared the different conditions. Moreover, the spontaneous voice met our target design goal of being perceived as more natural than a leading commercial text-to-speech voice.

  • 50.
    McMillan, Donald
    et al.
    Stockholm University, Stockholm, Sweden.
    Jaber, Razan
    University College Dublin, Dublin, Ireland.
    Cowan, Benjamin R.
    University College Dublin, Dublin, Ireland.
    Fischer, Joel E.
    University of Nottingham, Nottingham, United Kingdom.
    Irfan, Bahar
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Zargham, Nima
    Digital Media Lab, University of Bremen, Germany.
    Lee, Minha
    Eindhoven University of Technology, Eindhoven, The Netherlands.
    Human-Robot Conversational Interaction (HRCI)2023In: HRI 2023: Companion of the ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM) , 2023, p. 923-925Conference paper (Refereed)
    Abstract [en]

    Conversation is one of the primary methods of interaction between humans and robots. It provides a natural way of communication with the robot, thereby reducing the obstacles that can be faced through other interfaces (e.g., text or touch) that may cause difficulties for certain populations, such as the elderly or those with disabilities, promoting inclusivity in Human-Robot Interaction (HRI). Work in HRI has contributed significantly to the design, understanding and evaluation of human-robot conversational interactions. Concurrently, the Conversational User Interfaces (CUI) community has developed with similar aims, though with a wider focus on conversational interactions across a range of devices and platforms. This workshop aims to bring together the CUI and HRI communities to outline key shared opportunities and challenges in developing conversational interactions with robots, resulting in collaborative publications targeted at the CUI 2023 provocations track.
