kth.se Publications
1 - 26 of 26
  • 1.
    Abelho Pereira, André Tiago
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    TU Delft, Delft, Netherlands.
    Fermoselle, Leonor
    TNO, Den Haag, Netherlands.
    Mendelson, Joe
    Furhat Robotics, Stockholm, Sweden.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Effects of Different Interaction Contexts when Evaluating Gaze Models in HRI (2020). Conference paper (Refereed)
    Abstract [en]

    We previously introduced a responsive joint attention system that uses multimodal information from users engaged in a spatial reasoning task with a robot and communicates joint attention via the robot's gaze behavior [25]. An initial evaluation of our system with adults showed it to improve users' perceptions of the robot's social presence. To investigate the repeatability of our prior findings across settings and populations, here we conducted two further studies employing the same gaze system with the same robot and task but in different contexts: evaluation of the system with external observers and evaluation with children. The external observer study suggests that third-person perspectives over videos of gaze manipulations can be used either as a manipulation check before committing to costly real-time experiments or to further establish previous findings. However, the replication of our original adults study with children in school did not confirm the effectiveness of our gaze manipulation, suggesting that different interaction contexts can affect the generalizability of results in human-robot interaction gaze studies.

    Download full text (pdf)
    fulltext
  • 2.
    Abelho Pereira, André Tiago
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    Computer-Human Interaction Lab for Learning & Instruction, École Polytechnique Fédérale de Lausanne, Switzerland.
    Fermoselle, Leonor
    Mendelson, Joe
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Responsive Joint Attention in Human-Robot Interaction (2019). In: Proceedings 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2019, Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 1080-1087. Conference paper (Refereed)
    Abstract [en]

    Joint attention has been shown to be crucial not only for human-human interaction but also for human-robot interaction. Joint attention can help to make cooperation more efficient, support disambiguation in instances of uncertainty and make interactions appear more natural and familiar. In this paper, we present an autonomous gaze system that uses multimodal perception capabilities to model responsive joint attention mechanisms. We investigate the effects of our system on people’s perception of a robot within a problem-solving task. Results from a user study suggest that responsive joint attention mechanisms evoke higher perceived feelings of social presence on scales that regard the direction of the robot’s perception.

    Download full text (pdf)
    fulltext
  • 3. Adiban, Mohammad
    et al.
    Siniscalchi, Marco
    Stefanov, Kalin
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology. Norwegian University of Science and Technology, Trondheim, Norway.
    Hierarchical Residual Learning Based Vector Quantized Variational Autoencoder for Image Reconstruction and Generation (2022). In: The 33rd British Machine Vision Conference Proceedings, 2022. Conference paper (Refereed)
    Abstract [en]

    We propose a multi-layer variational autoencoder method, which we call HR-VQVAE, that learns hierarchical discrete representations of the data. By utilizing a novel objective function, each layer in HR-VQVAE learns a discrete representation of the residual from previous layers through a vector quantized encoder. Furthermore, the representations at each layer are hierarchically linked to those at previous layers. We evaluate our method on the tasks of image reconstruction and generation. Experimental results demonstrate that the discrete representations learned by HR-VQVAE enable the decoder to reconstruct high-quality images with less distortion than the baseline methods, namely VQVAE and VQVAE-2. HR-VQVAE can also generate high-quality and diverse images that outperform state-of-the-art generative models, providing further verification of the efficiency of the learned representations. The hierarchical nature of HR-VQVAE i) reduces the decoding search time, making the method particularly suitable for high-load tasks, and ii) allows the codebook size to be increased without incurring the codebook collapse problem.

    Download full text (pdf)
    fulltext
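The abstract above describes the core mechanism of HR-VQVAE: each layer quantizes the residual that the previous layers failed to reconstruct. Below is a minimal NumPy sketch of that residual-quantization bookkeeping only, with random codebooks standing in for the learned encoder, decoder and codebooks; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(codebook, x):
    """Index of the codebook entry closest to x (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def residual_quantize(x, codebooks):
    """Quantize x layer by layer: each codebook encodes the residual
    left over by the reconstruction from all previous layers."""
    reconstruction = np.zeros_like(x)
    residual = x.copy()
    indices = []
    for cb in codebooks:
        idx = nearest_code(cb, residual)
        indices.append(idx)
        reconstruction += cb[idx]
        residual = x - reconstruction
    return indices, reconstruction

# Toy example: three layers of 8 codes each over 4-dimensional latents.
# In HR-VQVAE the codebooks are learned jointly with an encoder/decoder;
# here they are random, purely to show the residual bookkeeping.
codebooks = [rng.normal(scale=s, size=(8, 4)) for s in (1.0, 0.5, 0.25)]
x = rng.normal(size=4)
codes, x_hat = residual_quantize(x, codebooks)
print("codes per layer:", codes)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```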
  • 4. Arnela, Marc
    et al.
    Dabbaghchian, Saeed
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Guasch, Oriol
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    MRI-based vocal tract representations for the three-dimensional finite element synthesis of diphthongs (2019). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 27, no 12, p. 2173-2182. Article in journal (Refereed)
    Abstract [en]

    The synthesis of diphthongs in three dimensions (3D) involves the simulation of acoustic waves propagating through a complex 3D vocal tract geometry that deforms over time. Accurate 3D vocal tract geometries can be extracted from Magnetic Resonance Imaging (MRI), but due to long acquisition times, only static sounds can currently be studied with an adequate spatial resolution. In this work, 3D dynamic vocal tract representations are built to generate diphthongs, based on a set of cross-sections extracted from MRI-based vocal tract geometries of static vowel sounds. A diphthong can then be easily generated by interpolating the location, orientation and shape of these cross-sections, thus avoiding the interpolation of full 3D geometries. Two options are explored to extract the cross-sections. The first one is based on an adaptive grid (AG), which extracts the cross-sections perpendicular to the vocal tract midline, whereas the second one resorts to a semi-polar grid (SPG) strategy, which fixes the cross-section orientations. The finite element method (FEM) has been used to solve the mixed wave equation and synthesize the diphthongs [ɑi] and [ɑu] in the dynamic 3D vocal tracts. The outputs from a 1D acoustic model based on the Transfer Matrix Method have also been included for comparison. The results show that the SPG and AG provide very close solutions in 3D, whereas significant differences are observed when using them in 1D. The SPG dynamic vocal tract representation is recommended for 3D simulations because it helps to prevent the collision of adjacent cross-sections.

    Download full text (pdf)
    fulltext
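The entry above generates diphthongs by interpolating the location, orientation and shape of vocal-tract cross-sections between two static vowels. As a hedged illustration of the interpolation idea only, the sketch below linearly interpolates two one-dimensional area functions; the published method operates on full 3D cross-sections and solves the mixed wave equation with FEM, and the numbers used here are invented.

```python
import numpy as np

def interpolate_area_functions(area_a, area_b, n_frames=50):
    """Linearly interpolate two vocal-tract area functions (cm^2 per
    cross-section) to approximate the transition of a diphthong.
    The published method interpolates 3D cross-section location,
    orientation and shape; interpolating areas is a 1D simplification."""
    area_a, area_b = np.asarray(area_a), np.asarray(area_b)
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - t) * area_a + t * area_b

# Hypothetical 10-section area functions for two vowels (illustrative numbers).
area_first = [2.6, 3.1, 3.4, 2.8, 2.0, 1.4, 1.0, 1.2, 1.8, 2.4]
area_second = [3.0, 2.4, 1.6, 0.9, 0.6, 0.8, 1.5, 2.6, 3.2, 3.5]
frames = interpolate_area_functions(area_first, area_second, n_frames=5)
for i, frame in enumerate(frames):
    print(f"frame {i}:", np.round(frame, 2))
```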
  • 5.
    Cumbal, Ronald
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Speaking Transparently: Social Robots in Educational Settings (2024). In: Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (HRI '24 Companion), March 11-14, 2024, Boulder, CO, USA, 2024. Conference paper (Refereed)
    Abstract [en]

    The recent surge in popularity of Large Language Models, known for their inherent opacity, has increased the interest in fostering transparency in technology designed for human interaction. This concern is equally prevalent in the development of Social Robots, particularly when these are designed to engage in critical areas of our society, such as education or healthcare. In this paper we propose an experiment to investigate how users can be made aware of the automated decision processes when interacting in a discussion with a social robot. Our main objective is to assess the effectiveness of verbal expressions in fostering transparency within groups of individuals as they engage with a robot. We describe the proposed interactive settings, system design, and our approach to enhance the transparency in a robot's decision-making process for multi-party interactions.

    Download full text (pdf)
    fulltext
  • 6.
    Cumbal, Ronald
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Moell, Birger
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Águas Lopes, José David
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    “You don’t understand me!”: Comparing ASR Results for L1 and L2 Speakers of Swedish (2021). In: Proceedings Interspeech 2021, International Speech Communication Association, 2021, p. 96-100. Conference paper (Refereed)
    Abstract [en]

    The performance of Automatic Speech Recognition (ASR) systems has constantly increased in state-of-the-art development. However, performance tends to decrease considerably in more challenging conditions (e.g., background noise, multiple-speaker social conversations) and with more atypical speakers (e.g., children, non-native speakers or people with speech disorders), which signifies that general improvements do not necessarily transfer to applications that rely on ASR, e.g., educational software for younger students or language learners. In this study, we focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services. We compare the recognition results using Word Error Rate and analyze the linguistic factors that may generate the observed transcription errors.

    Download full text (pdf)
    fulltext
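The study above compares ASR services using Word Error Rate. The paper does not state which tooling was used; the sketch below is simply the standard Levenshtein-based WER definition, included to make the metric concrete. The example strings are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical comparison of a reference and an ASR hypothesis:
print(word_error_rate("jag vill gärna träffa dig", "ja vill träffa dej"))
```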
  • 7.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Tånnander, Christina
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology. Myndigheten för tillgängliga medier, MTM.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Audience response system-based assessment for analysis-by-synthesis (2015). In: Proc. of ICPhS 2015, ICPhS, 2015. Conference paper (Refereed)
  • 8.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Majlesi, Ali Reza
    Socio-cultural perception of robot backchannels (2023). In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10. Article in journal (Refereed)
    Abstract [en]

    Introduction: Backchannels, i.e., short interjections by an interlocutor to indicate attention, understanding or agreement regarding utterances by another conversation participant, are fundamental in human-human interaction. Lack of backchannels or if they have unexpected timing or formulation may influence the conversation negatively, as misinterpretations regarding attention, understanding or agreement may occur. However, several studies over the years have shown that there may be cultural differences in how backchannels are provided and perceived and that these differences may affect intercultural conversations. Culturally aware robots must hence be endowed with the capability to detect and adapt to the way these conversational markers are used across different cultures. Traditionally, culture has been defined in terms of nationality, but this is more and more considered to be a stereotypic simplification. We therefore investigate several socio-cultural factors, such as the participants’ gender, age, first language, extroversion and familiarity with robots, that may be relevant for the perception of backchannels.

    Methods: We first cover existing research on cultural influence on backchannel formulation and perception in human-human interaction and on backchannel implementation in Human-Robot Interaction. We then present an experiment on second language spoken practice, in which we investigate how backchannels from the social robot Furhat influence interaction (investigated through speaking time ratios and ethnomethodology and multimodal conversation analysis) and impression of the robot (measured by post-session ratings). The experiment, made in a triad word game setting, is focused on if activity-adaptive robot backchannels may redistribute the participants’ speaking time ratio, and/or if the participants’ assessment of the robot is influenced by the backchannel strategy. The goal is to explore how robot backchannels should be adapted to different language learners to encourage their participation while being perceived as socio-culturally appropriate.

    Results: We find that a strategy that displays more backchannels towards a less active speaker may substantially decrease the difference in speaking time between the two speakers, that different socio-cultural groups respond differently to the robot’s backchannel strategy and that they also perceive the robot differently after the session.

    Discussion: We conclude that the robot may need different backchanneling strategies towards speakers from different socio-cultural groups in order to encourage them to speak and have a positive perception of the robot.

     

    Download full text (pdf)
    fulltext
  • 9.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Lopes, J.
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Berndtson, Gustav
    KTH, School of Industrial Engineering and Management (ITM).
    Lindström, Ruben
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Ekman, Patrik
    KTH.
    Hartmanis, Eric
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Jin, Emelie
    KTH, School of Industrial Engineering and Management (ITM).
    Johnston, Ella
    KTH, School of Industrial Engineering and Management (ITM).
    Tahir, Gara
    KTH, School of Industrial Engineering and Management (ITM).
    Mekonnen, Michael
    KTH.
    Learner and teacher perspectives on robot-led L2 conversation practice (2022). In: ReCALL, ISSN 0958-3440, E-ISSN 1474-0109, Vol. 34, no 3, p. 344-359. Article in journal (Refereed)
    Abstract [en]

    This article focuses on designing and evaluating conversation practice in a second language (L2) with a robot that employs human spoken and non-verbal interaction strategies. Based on an analysis of previous work and semi-structured interviews with L2 learners and teachers, recommendations for robot-led conversation practice for adult learners at intermediate level are first defined, focused on language learning, on the social context, on the conversational structure and on verbal and visual aspects of the robot moderation. Guided by these recommendations, an experiment is set up, in which 12 pairs of L2 learners of Swedish interact with a robot in short social conversations. These robot-learner interactions are evaluated through post-session interviews with the learners, teachers' ratings of the robot's behaviour and analyses of the video-recorded conversations, resulting in a set of guidelines for robot-led conversation practice: (1) societal and personal topics increase the practice's meaningfulness for learners; (2) strategies and methods for providing corrective feedback during conversation practice need to be explored further; (3) learners should be encouraged to support each other if the robot has difficulties adapting to their linguistic level; (4) the robot should establish a social relationship by contributing with its own story, remembering the participants' input, and making use of non-verbal communication signals; and (5) improvements are required regarding naturalness and intelligibility of text-to-speech synthesis, in particular its speed, if it is to be used for conversations with L2 learners. 

  • 10.
    Engwall, Olov
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Águas Lopes, José David
    Cumbal, Ronald
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Is a Wizard-of-Oz Required for Robot-Led Conversation Practice in a Second Language? (2022). In: International Journal of Social Robotics, ISSN 1875-4791, E-ISSN 1875-4805. Article in journal (Refereed)
    Abstract [en]

    The large majority of previous work on human-robot conversations in a second language has been performed with a human wizard-of-Oz. The reasons are that automatic speech recognition of non-native conversational speech is considered to be unreliable and that the dialogue management task of selecting robot utterances that are adequate at a given turn is complex in social conversations. This study therefore investigates if robot-led conversation practice in a second language with pairs of adult learners could potentially be managed by an autonomous robot. We first investigate how correct and understandable transcriptions of second language learner utterances are when made by a state-of-the-art speech recogniser. We find both a relatively high word error rate (41%) and that a substantial share (42%) of the utterances are judged to be incomprehensible or only partially understandable by a human reader. We then evaluate how adequate the robot utterance selection is, when performed manually based on the speech recognition transcriptions or autonomously using (a) predefined sequences of robot utterances, (b) a general state-of-the-art language model that selects utterances based on learner input or the preceding robot utterance, or (c) a custom-made statistical method that is trained on observations of the wizard’s choices in previous conversations. It is shown that adequate or at least acceptable robot utterances are selected by the human wizard in most cases (96%), even though the ASR transcriptions have a high word error rate. Further, the custom-made statistical method performs as well as manual selection of robot utterances based on ASR transcriptions. It was also found that the interaction strategy that the robot employed, which differed regarding how much the robot maintained the initiative in the conversation and if the focus of the conversation was on the robot or the learners, had marginal effects on the word error rate and understandability of the transcriptions but larger effects on the adequateness of the utterance selection. Autonomous robot-led conversations may hence work better with some robot interaction strategies.

    Download full text (pdf)
    fulltext
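The entry above evaluates, among other options, a custom statistical method trained on a wizard's previous utterance choices. The authors' model is not specified here; as a loose illustration of the general idea, the sketch below trains a nearest-neighbour selector over TF-IDF features of ASR transcriptions. All training examples and utterance labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: ASR transcriptions of learner turns paired
# with the robot-utterance label a human wizard chose in earlier sessions.
learner_turns = [
    "jag kommer från chile",
    "jag vet inte vad det heter",
    "jag jobbar som ingenjör",
    "kan du säga det igen",
]
wizard_choices = [
    "ASK_FOLLOWUP_HOMECOUNTRY",
    "REPHRASE_QUESTION",
    "ASK_FOLLOWUP_WORK",
    "REPEAT_LAST_UTTERANCE",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(learner_turns)
selector = KNeighborsClassifier(n_neighbors=1).fit(X, wizard_choices)

def select_robot_utterance(asr_transcription: str) -> str:
    """Pick the utterance class whose training context is closest to the
    (possibly noisy) ASR transcription of the learner's latest turn."""
    return selector.predict(vectorizer.transform([asr_transcription]))[0]

print(select_robot_utterance("jag kommer från brasilien"))
```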
  • 11.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Found speech and humans in the loop: Ways to gain insight into large quantities of speech (2022). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Found data - data used for something other than the purpose for which it was originally collected - holds great value in many regards. It typically reflects high ecological validity, a strong cultural worth, and there are significant quantities at hand. However, it is noisy, hard to search through, and its contents are often largely unknown. This thesis explores ways to gain insight into such data collections, specifically with regard to speech and audio data.

    In recent years, deep learning approaches have shown unrivaled performance in many speech and language technology tasks. However, in addition to large datasets, many of these methods require vast quantities of high-quality labels, which are costly to produce. Moreover, while there are exceptions, machine learning models are typically trained for solving well-defined, narrow problems and perform inadequately in tasks of more general nature - such as providing a high-level description of the contents in a large audio file. This observation reveals a methodological gap that this thesis aims to fill.

    An ideal system for tackling these matters would combine humans' flexibility and general intelligence with machines' processing power and pattern-finding capabilities. With this idea in mind, the thesis explores the value of including the human-in-the-loop, specifically in the context of gaining insight into collections of found speech. The aim is to combine techniques from speech technology, machine learning paradigms, and human-in-the-loop approaches, with the overall goal of developing and evaluating novel methods for efficiently exploring large quantities of found speech data.

    One of the main contributions is Edyson, a tool for fast browsing, exploring, and annotating audio. It uses temporally disassembled audio, a technique that decouples the audio from the temporal dimension, in combination with feature extraction methods, dimensionality reduction algorithms, and a flexible listening function, which allows a user to get an informative overview of the contents.

    Furthermore, crowdsourcing is explored in the context of large-scale perception studies and speech & language data collection. Prior reports on the usefulness of crowd workers for such tasks show promise and are here corroborated.

    The thesis contributions suggest that the explored approaches are promising options for utilizing large quantities of found audio data and deserve further consideration in research and applied settings.

    Download full text (pdf)
    Kappa
  • 12.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edyson: rapid human-in-the-loop browsing, exploration and annotation of large speech and audio data. Manuscript (preprint) (Other academic)
    Abstract [en]

    The audio exploration tool Edyson integrates a variety of techniques to achieve the efficient exploration of otherwise prohibitively large collections of speech and other sounds. A main strength is that this combination of techniques allows us to place a human-in-the-loop in a coherent and operationalised manner. 

    The two most prominent techniques that we incorporate are temporally disassembled audio (TDA) and massively multi-component audio environments (MMAE). The first allows us to decouple input audio from the temporal dimension by segmenting it into sound snippets of short duration, akin to the frames used in signal processing. These snippets are organised and visualised in an interactive interface where the investigator can navigate through the snippets freely while providing labels and judgements that are not tied to the temporal context of the original audio. This, in turn, removes the real-time or near real-time requirement associated with temporally linear audio browsing.

    We further argue that a human-in-the-loop inclusion, as opposed to fully automated black-box approaches, is valuable and perhaps necessary to understand and fully exploit larger quantities of found speech. 

    We describe in this paper the details of the tool and its underlying methodologies, and provide a summary of results and findings that have come out of our efforts to validate and quantify the characteristics of this new type of audio browsing to date.

    Download full text (pdf)
    fulltext
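The manuscript above builds on temporally disassembled audio: the recording is cut into short snippets, features are extracted per snippet, and the snippets are projected into a browsable 2D space. A minimal sketch of that pipeline follows, using librosa MFCCs and PCA as stand-ins for Edyson's actual feature extraction and dimensionality-reduction choices; the file path is a placeholder.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

def disassemble(path, snippet_s=0.5):
    """Cut an audio file into fixed-length snippets, compute mean MFCC
    features per snippet, and project them to 2D for visual browsing."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    hop = int(snippet_s * sr)
    snippets = [y[i:i + hop] for i in range(0, len(y) - hop, hop)]
    feats = np.stack([
        librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
        for s in snippets
    ])
    coords = PCA(n_components=2).fit_transform(feats)
    return snippets, coords  # plot coords; play the snippets under the cursor

# Usage (the path is a placeholder, not a file from the paper):
# snippets, coords = disassemble("large_found_audio.wav")
```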
  • 13.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson (2021). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2021, p. 3685-3689. Conference paper (Refereed)
    Abstract [en]

    Edyson is a human-in-the-loop (HITL) tool for browsing and annotating large amounts of audio data quickly. It builds on temporally disassembled audio and massively multi-component audio environments to overcome the cumbersome time constraints that come with linear exploration of large audio data. This study adds the following contributions to Edyson: 1) We add the new use case of HITL binary classification by sample; 2) We explore the new domain of oceanic hydrophone recordings with whale song, along with speech activity detection in noisy audio; 3) We propose a repeatable method of analysing the efficiency of HITL in Edyson for binary classification, specifically designed to measure the return on human time spent in a given domain. We exemplify this method on two domains, and show that for a manageable initial cost in terms of HITL, it does differentiate between suitable and unsuitable domains for our new use case - a valuable insight when working with large collections of audio.

    Download full text (pdf)
    fulltext
  • 14.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The audio cocktail as a sound browsing tool - a crowdsourcing based validation. Manuscript (preprint) (Other academic)
    Abstract [en]

    We conduct two crowdsourcing experiments designed to examine the usefulness of audio cocktails to quickly find out information on the contents of large audio data. Several thousand crowd workers were engaged to listen to audio cocktails with systematically varied composition. They were then asked to state either which sound out of four categories (Children, Women, Men, Orchestra) they heard the most of, or if they heard anything of a specific category at all. The results show that their responses have high reliability and provide information as to whether a specific task can be performed using audio cocktails. We also propose that the combination of crowd workers and audio cocktails can be used directly as a tool to investigate the contents of large audio data.

    Download full text (pdf)
    fulltext
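The experiments above use audio cocktails with systematically varied composition of sound categories. The abstract does not describe how the stimuli were mixed; the sketch below shows one plausible way to build such a mix from labelled snippets, with the snippet counts and normalisation chosen arbitrarily for illustration.

```python
import numpy as np

def audio_cocktail(snippets_by_category, proportions, length, rng=None):
    """Sum randomly placed snippets so that each category contributes
    roughly its requested share of the mix. `snippets_by_category` maps a
    label to a list of equal-rate mono arrays; `proportions` maps the
    label to a weight."""
    rng = rng if rng is not None else np.random.default_rng()
    mix = np.zeros(length)
    for category, weight in proportions.items():
        pool = snippets_by_category[category]
        for _ in range(int(round(weight * 20))):   # 20 snippets per unit weight
            s = pool[rng.integers(len(pool))][:length]
            start = rng.integers(0, length - len(s) + 1)
            mix[start:start + len(s)] += s
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix

# e.g. audio_cocktail({"Children": kids, "Orchestra": orch},
#                     {"Children": 0.7, "Orchestra": 0.3}, length=16000 * 10)
```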
  • 15.
    Gustafsson, Joakim Körner
    et al.
    Karolinska Institutet.
    Södersten, Maria
    Karolinska Institutet.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Schalling, Ellika
    Karolinska Institutet.
    Voice use in daily life studied with a portable voice accumulator in individuals with Parkinson’s disease and matched healthy controls (2019). In: Journal of Speech, Language and Hearing Research, ISSN 1092-4388, E-ISSN 1558-9102, Vol. 62, no 12, p. 4324-4334. Article in journal (Refereed)
    Abstract [en]

    Purpose: The purpose of this work was to study how voice use in daily life is impacted by Parkinson’s disease (PD), specifically if there is a difference in voice sound level and phonation ratio during everyday activities for individuals with PD and matched healthy controls. A further aim was to study how variations in environmental noise impact voice use. Method: Long-term registration of voice use during 1 week in daily life was performed for 21 participants with PD (11 male, 10 female) and 21 matched healthy controls using the portable voice accumulator VoxLog. Voice use was assessed through registrations of spontaneous speech in different ranges of environmental noise in daily life and in a controlled studio recording setting. Results: Individuals with PD use their voice 50%-60% less than their matched healthy controls in daily life. The difference increases in high levels of environmental noise. Individuals with PD used an average voice sound level in daily life that was 8.11 dB (female) and 6.7 dB (male) lower than their matched healthy controls. Difference in mean voice sound level for individuals with PD and controls during spontaneous speech during a controlled studio registration was 3.0 dB for the female group and 4.1 dB for the male group. Conclusions: The observed difference in voice use in daily life between individuals with PD and matched healthy controls is a 1st step to objectively quantify the impact of PD on communicative participation. The variations in voice use in different levels of environmental noise and when comparing controlled and variable environments support the idea that the study of voice use should include methods to assess function in less controlled situations outside the clinical setting.

  • 16.
    Götze, Jana
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Boye, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    SPACEREF: a corpus of street-level geographic descriptions (2016). In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 2016, p. 3822-3827. Conference paper (Refereed)
    Abstract [en]

    This article describes SPACEREF, a corpus of street-level geographic descriptions. Pedestrians are walking a route in a (real) urban environment, describing their actions. Their position is automatically logged, their speech is manually transcribed, and their references to objects are manually annotated with respect to a crowdsourced geographic database. We describe how the data was collected and annotated, and how it has been used in the context of creating resources for an automatic pedestrian navigation system.

    Download full text (pdf)
    fulltext
  • 17.
    Kammerlander, Robin K.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Using Virtual Reality to Support Acting in Motion Capture with Differently Scaled Characters (2021). In: 2021 IEEE Virtual Reality and 3D User Interfaces (VR), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 402-410. Conference paper (Refereed)
    Abstract [en]

    Motion capture is a well-established technology for capturing actors' movements and performances within the entertainment industry. Many actors, however, attest to the poor acting conditions associated with such recordings. Instead of detailed sets, costumes and props, they are forced to play in empty spaces wearing tight suits. Often, their co-actors are imaginary, replaced by placeholder props, or out of scale with their virtual counterparts. These problems not only affect acting; they also cause an abundance of laborious post-processing clean-up work. To address these challenges, we propose using a combination of virtual reality and motion capture technology to bring differently proportioned virtual characters into a shared collaborative virtual environment. A within-subjects user study with trained actors showed that our proposed platform enhances their feelings of body ownership and immersion. This in turn changed actors' performances, which narrowed the gap between virtual performances and final intended animations.

  • 18.
    Kittimathaveenan, Kajornsak
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Music Acoustics.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Music Acoustics. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Localisation in virtual choirs: outcomes of simplified binaural rendering (2023). Conference paper (Refereed)
    Abstract [en]

    A virtual choir would find several uses in choral pedagogy and research, but it would need a relatively small computational footprint for wide uptake. On the premise that very accurate localisation might not be needed for virtual rendering of the character of the sound inside an ensemble of singers, a localisation test was conducted using binaural stimuli created using a simplified approach, with parametrically controlled delays and variable low-pass filters (historically known as a ‘shuffler’ circuit) instead of head-related impulse responses. The direct sound from a monophonic anechoic recording of a soprano was processed (1) by sending it to a reverb algorithm for making a room-acoustic diffuse field with unchanging properties, (2) with a second-order low-pass filter with a cut-off frequency descending to 3 kHz for sources from behind, (3) with second-order low-pass head-shading filters with an angle-dependent cut-off frequency for the left/right lateral shadings of the head, and (4) with the gain of the direct sound being inversely proportional to virtual distance. The recorded singer was modelled as always facing the listener; no frequency-dependent directivity was implemented. Binaural stimuli corresponding to 24 different singer positions (8 angles and 3 distances) were synthesized. 30 participants heard the stimuli in randomized order, and indicated the perceived location of the singer on polar plot response sheets, with categories to indicate the possible responses. The listeners’ discrimination of the distance categories 0.5, 1 and 2 meters (1 correct out of 3 possible) was good, at about 80% correct. Discrimination of the angle of incidence, in 45-degree categories (1 correct out of 8 possible), was fair, at 47% correct. Angle errors were mostly on the ‘cone of confusion’ (back-front symmetry), suggesting that the back-front cue was not very salient. The correct back-front responses (about 50%) dominated only somewhat over the incorrect ones (about 38%). In an ongoing follow-up study, multi-singer scenarios will be tested, and a more detailed yet still parametric filtering scheme will be explored.

    Download full text (pdf)
    fulltext
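The abstract above spells out the simplified 'shuffler'-style rendering: parametric interaural delay, low-pass head shadowing, distance-dependent gain, and a fixed diffuse reverb instead of HRIRs. The sketch below renders a single source under a much-reduced version of those assumptions; the cut-off frequencies, the ITD formula, and the omission of the back-front filter and reverb are illustrative simplifications, not the paper's parameter values.

```python
import numpy as np
from scipy.signal import butter, lfilter

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.0875    # m (approximate)

def lowpass(x, cutoff_hz, sr):
    """Second-order Butterworth low-pass, standing in for the
    'shuffler' head-shadow filter described in the abstract."""
    b, a = butter(2, cutoff_hz / (sr / 2), btype="low")
    return lfilter(b, a, x)

def render_source(mono, sr, azimuth_deg, distance_m):
    """Schematic binaural rendering of one singer: interaural time
    difference, low-passed far ear, and 1/distance gain for the direct
    sound. Cut-off values are illustrative only."""
    az = np.deg2rad(azimuth_deg)                 # 0 = front, +90 = right
    itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (np.sin(abs(az)) + abs(az))
    delay = int(round(itd * sr))

    near = mono / max(distance_m, 0.1)           # direct sound ~ 1/distance
    # Far ear: delayed and shadowed; more shadow for more lateral sources.
    cutoff = 6000.0 - 4000.0 * abs(np.sin(az))   # 6 kHz front -> 2 kHz lateral
    far = lowpass(np.concatenate([np.zeros(delay), near])[:len(near)], cutoff, sr)

    left, right = (near, far) if azimuth_deg <= 0 else (far, near)
    return np.stack([left, right], axis=1)

# Example with a synthetic tone instead of the anechoic soprano recording:
sr = 16000
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
stereo = render_source(tone, sr, azimuth_deg=45, distance_m=1.0)
print(stereo.shape)  # (16000, 2)
```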
  • 19.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Mutual Understanding in Situated Interactions with Conversational User Interfaces: Theory, Studies, and Computation (2022). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    This dissertation presents advances in HCI through a series of studies focusing on task-oriented interactions between humans and between humans and machines. The notion of mutual understanding is central, also known as grounding in psycholinguistics, in particular how people establish understanding in conversations and what interactional phenomena are present in that process. Addressing the gap in computational models of understanding, interactions in this dissertation are observed through multisensory input and evaluated with statistical and machine-learning models. As it becomes apparent, miscommunication is ordinary in human conversations and therefore embodied computer interfaces interacting with humans are subject to a large number of conversational failures. Investigating how these interfaces can evaluate human responses to distinguish whether spoken utterances are understood is one of the central contributions of this thesis.

    The first papers (Papers A and B) included in this dissertation describe studies on how humans establish understanding incrementally and how they co-produce utterances to resolve misunderstandings in joint-construction tasks. Utilising the same interaction paradigm from such human-human settings, the remaining papers describe collaborative interactions between humans and machines with two central manipulations: embodiment (Papers C, D, E, and F) and conversational failures (Papers D, E, F, and G). The methods used investigate whether embodiment affects grounding behaviours among speakers and what verbal and non-verbal channels are utilised in response and recovery to miscommunication. For application to robotics and conversational user interfaces, failure detection systems are developed predicting in real-time user uncertainty, paving the way for new multimodal computer interfaces that are aware of dialogue breakdown and system failures.

    Through the lens of Theory, Studies, and Computation, a comprehensive overview is presented on how mutual understanding has been observed in interactions with humans and between humans and machines. A summary of literature in mutual understanding from psycholinguistics and human-computer interaction perspectives is reported. An overview is also presented on how prior knowledge in mutual understanding has and can be observed through experimentation and empirical studies, along with perspectives of how knowledge acquired through observation is put into practice through the analysis and development of computational models. Derived from literature and empirical observations, the central thesis of this dissertation is that embodiment and mutual understanding are intertwined in task-oriented interactions, both in successful communication but also in situations of miscommunication.

    Download full text (pdf)
    fulltext
  • 20.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sahindal, Boran
    KTH.
    van Waveren, Sanne
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Behavioural Responses to Robot Conversational Failures (2020). In: HRI '20: Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, ACM Digital Library, 2020. Conference paper (Refereed)
    Abstract [en]

    Humans and robots will increasingly collaborate in domestic environments which will cause users to encounter more failures in interactions. Robots should be able to infer conversational failures by detecting human users’ behavioural and social signals. In this paper, we study and analyse these behavioural cues in response to robot conversational failures. Using a guided task corpus, where robot embodiment and time pressure are manipulated, we ask human annotators to estimate whether user affective states differ during various types of robot failures. We also train a random forest classifier to detect whether a robot failure has occurred and compare results to human annotator benchmarks. Our findings show that human-like robots augment users’ reactions to failures, as shown in users’ visual attention, in comparison to non-humanlike smart-speaker embodiments. The results further suggest that speech behaviours are utilised more in responses to failures when non-human-like designs are present. This is particularly important to robot failure detection mechanisms that may need to consider the robot’s physical design in its failure detection model.
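The paper above trains a random forest classifier on behavioural signals to detect robot conversational failures. A minimal scikit-learn sketch of that kind of setup follows; the features and labels are synthetic placeholders, not the annotated corpus used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical per-turn behavioural features, e.g. gaze-on-robot ratio,
# response delay (s), and speech activity ratio; the label marks whether
# a conversational failure occurred in that turn. Real data would come
# from the annotated guided-task corpus described in the abstract.
X = rng.random((200, 3))
y = (X[:, 0] < 0.4).astype(int)   # toy labels, for demonstration only

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy:", scores.mean().round(2))
```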

  • 21.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Berthelsen, Harald
    STTS – Södermalms talteknologiservice AB.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    PROMIS: a statistical-parametric speech synthesis system with prominence control via a prominence network (2019). In: Proceedings of SSW 10 - The 10th ISCA Speech Synthesis Workshop, Vienna, 2019. Conference paper (Refereed)
    Abstract [en]

    We implement an architecture with explicit prominence learning via a prominence network in Merlin, a statistical-parametric DNN-based text-to-speech system. We build on our previous results that successfully evaluated the inclusion of an automatically extracted, speech-based prominence feature into the training and its control at synthesis time. In this work, we expand the PROMIS system by implementing the prominence network that predicts prominence values from text. We test the network predictions as well as the effects of a prominence control module based on SSML-like tags. Listening tests for the complete PROMIS system, combining a prominence feature, a prominence network and prominence control, show that it effectively controls prominence in a diagnostic set of target words. The tests also show a minor negative impact on perceived naturalness, relative to baseline, exerted by the two prominence tagging methods implemented in the control module.

    Download full text (pdf)
    fulltext
  • 22.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Valentini-Botinhao, Cassia
    The Centre for Speech Technology, The University of Edinburgh, UK.
    Watts, Oliver
    The Centre for Speech Technology, The University of Edinburgh, UK.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The speech synthesis phoneticians need is both realistic and controllable (2019). In: Proceedings from FONETIK 2019, Stockholm, 2019. Conference paper (Refereed)
    Abstract [en]

    We discuss the circumstances that have led to a disjoint advancement of speech synthesis and phonetics in recent decades. The difficulties mainly rest on the pursuit of orthogonal goals by the two fields: realistic vs. controllable synthetic speech. We make a case for realising the promise of speech technologies in areas of speech sciences by developing control of neural speech synthesis and bringing the two areas into dialogue again.

    Download full text (pdf)
    fulltext
  • 23.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The visual prominence of whispered speech in Swedish (2019). In: Proceedings of 19th International Congress of Phonetic Sciences, 2019. Conference paper (Refereed)
    Abstract [en]

    This study presents a database of controlled speech material as well as spontaneous Swedish conversation produced in modal and whispered voice. The database includes facial expression and head movement features tracked by a non-invasive and unobtrusive system. We analyse differences between the voice conditions in the visual domain, paying particular attention to realisations of prosodic structure, namely, prominence patterns. Analysis results show that prominent vowels in whisper are expressed, relative to modal speech, with a) a statistically significantly larger jaw opening, b) stronger lip rounding and protrusion, c) higher eyebrow raising and d) higher pitch angle velocity of the head.

    Download full text (pdf)
    fulltext
  • 24.
    Nagy, Rajmund
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Kucherenko, Taras
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Moell, Birger
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Bernardet, Ulysses
    Aston University, Birmingham, UK.
    A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents (2021). Conference paper (Refereed)
    Abstract [en]

    Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users. Gesticulation – hand and arm movements accompanying speech – is an essential part of non-verbal behavior. Gesture generation models have been developed for several decades: starting with rule-based and ending with mainly data-driven methods. To date, recent end-to-end gesture generation methods have not been evaluated in a real-time interaction with users. We present a proof-of-concept framework, which is intended to facilitate evaluation of modern gesture generation models in interaction. We demonstrate an extensible open-source framework that contains three components: 1) a 3D interactive agent; 2) a chatbot back-end; 3) a gesticulating system. Each component can be replaced, making the proposed framework applicable for investigating the effect of different gesturing models in real-time interactions with different communication modalities, chatbot backends, or different agent appearances. The code and video are available at the project page https://nagyrajmund.github.io/project/gesturebot.
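The framework above is organised around three replaceable components: a 3D interactive agent, a chatbot back-end, and a gesticulating system. As a hedged sketch of that modular design (the component interfaces and names here are invented, not taken from the released code), the snippet below expresses the composition with Python protocols.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Protocol

class Chatbot(Protocol):
    def respond(self, user_utterance: str) -> str: ...

class GestureGenerator(Protocol):
    def generate(self, speech_text: str) -> list[float]: ...   # e.g. a motion clip

class Agent(Protocol):
    def perform(self, speech_text: str, motion: list[float]) -> None: ...

@dataclass
class ConversationLoop:
    """Composition of the three replaceable components named in the
    abstract; the real framework connects them over network interfaces."""
    chatbot: Chatbot
    gesturer: GestureGenerator
    agent: Agent

    def turn(self, user_utterance: str) -> None:
        reply = self.chatbot.respond(user_utterance)
        self.agent.perform(reply, self.gesturer.generate(reply))
```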

  • 25.
    Näslund, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Artificial Neural Networks in Swedish Speech Synthesis (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Text-to-speech (TTS) systems have entered our daily lives in the form of smart assistants and many other applications. Contemporary research applies machine learning and artificial neural networks (ANNs) to synthesize speech. It has been shown that these systems outperform the older concatenative and parametric methods.

    In this paper, ANN-based methods for speech synthesis are explored and one of the methods is implemented for the Swedish language. The implemented method is dubbed “Tacotron” and is a first step towards end-to-end ANN-based TTS which puts many different ANN techniques to work. The resulting system is compared to a parametric TTS through a strength-of-preference test that is carried out with 20 Swedish-speaking subjects. A statistically significant preference for the ANN-based TTS is found. Test subjects indicate that the ANN-based TTS performs better than the parametric TTS when it comes to audio quality and naturalness but sometimes lacks in intelligibility.

    Download full text (pdf)
    fulltext
  • 26.
    Tånnander, Christina
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Stress manipulation in text-to-speech synthesis using speaking rate categories (2021). In: Proceedings of Fonetik 2021, Centre for Languages and Literature, Lund University / [ed] Anna Hjortdal and Mikael Roll, Lund, 2021, Vol. 56, p. 17-22. Conference paper (Other academic)
    Abstract [en]

    The challenge of controlling prosody in text-to-speech systems (TTS) is as old as TTS itself. The problem is not just to know what the desired stress or intonation patterns are, nor is it limited to knowing how to control specific speech parameters (e.g. durations, amplitude and fundamental frequency). We also need to know the precise speech parameter settings that correspond to a certain stress or intonation pattern over entire utterances. We propose that the powerful TTS models afforded by deep neural networks (DNNs), combined with the fact that speech parameters often are correlated and vary in orchestration, allow us to solve at least some stress and intonation problems by influencing a single easy-to-control parameter, rather than requiring detailed control over many parameters. The paper presents a straightforward method of guiding word durations without recording training material especially for this purpose. The resulting TTS engine is used to produce sentences containing Swedish words that are unstressed in their most common function, but stressed in another common function. The sentences are designed so that it is clear to a listener that the second function is the intended one. In these cases, TTS engines often fail and produce an unstressed version. A group of 20 listeners compared samples that the TTS produced without guidance with samples where it was instructed to slow down the test words. The listeners almost unanimously preferred the latter version. This supports the notion that, due to the orchestrated variation of speech characteristics and the strength of modern DNN models, we can provide prosodic guidance to DNN-based TTS systems without having to control every characteristic in detail.
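The paper above guides word duration through a single speaking-rate control rather than detailed parameter control. For TTS engines that accept SSML, a loosely analogous manipulation can be written as a prosody rate tag around the target word, as in the sketch below; the paper's own method instead conditions the DNN voice on speaking-rate categories.

```python
def slow_down_word(sentence: str, target: str, rate: str = "slow") -> str:
    """Wrap one word in an SSML <prosody rate="..."> tag so a TTS engine
    lengthens it. The paper instead trains the voice with speaking-rate
    categories; this only illustrates guiding duration through a single,
    easy-to-control parameter."""
    marked = [
        f'<prosody rate="{rate}">{w}</prosody>' if w == target else w
        for w in sentence.split()
    ]
    return "<speak>" + " ".join(marked) + "</speak>"

# Hypothetical Swedish example where "på" should be stressed (verb particle):
print(slow_down_word("hon hälsade på sin moster", "på"))
```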
