1 - 50 of 50
  • 1. Abou-Zleikha, Mohamed
    et al.
    Székely, Eva
    University College Dublin, Ireland.
    Cahill, Peter
    Carson-Berndsen, Julie
    Multi-level exemplar-based duration generation for expressive speech synthesis (2012). In: Proceedings of Speech Prosody, 2012, Vol. 2012. Conference paper (Refereed)
    Abstract [en]

    The generation of duration of speech units from linguistic information, as one component of a prosody model, is considered to be a requirement for natural sounding speech synthesis. This paper investigates the use of a multi-level exemplar-based model for duration generation for the purposes of expressive speech synthesis. The multi-level exemplar-based model has been proposed in the literature as a cognitive model for the production of duration. The implementation of this model for duration generation for speech synthesis is not straightforward and requires a set of modifications to the model and that the linguistically related units and the context of the target units should be taken into consideration. The work presented in this paper implements this model and presents a solution to these issues through the use of prosodic-syntactic correlated data, full context information of the input example and corpus exemplars.

    Full text (pdf)
  • 2. Ahmed, Zeeshan
    et al.
    Steiner, Ingmar
    Székely, Éva
    CNGL, UCD.
    Carson-Berndsen, Julie
    A system for facial expression-based affective speech translation (2013). In: Proceedings of the companion publication of the 2013 international conference on Intelligent user interfaces companion, 2013, pp. 57-58. Conference paper (Refereed)
    Abstract [en]

    In the emerging field of speech-to-speech translation, emphasis is currently placed on the linguistic content, while the significance of paralinguistic information conveyed by facial expression or tone of voice is typically neglected. We present a prototype system for multimodal speech-to-speech translation that is able to automatically recognize and translate spoken utterances from one language into another, with the output rendered by a speech synthesis system. The novelty of our system lies in the technique of generating the synthetic speech output in one of several expressive styles that is automatically determined using a camera to analyze the user’s facial expression during speech.

    Full text (pdf)
  • 3.
    Alexanderson, Simon
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH. KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Kucherenko, Taras
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Generating coherent spontaneous speech and gesture from text (2020). In: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020, Association for Computing Machinery (ACM), 2020. Conference paper (Refereed)
    Abstract [en]

    Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video.

  • 4.
    Aylett, Matthew Peter
    et al.
    Heriot Watt University and CereProc Ltd. Edinburgh, UK.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    McMillan, Donald
    Stockholm University Stockholm, Sweden.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Romeo, Marta
    Heriot Watt University Edinburgh, UK.
    Fischer, Joel
    University of Nottingham Nottingham, UK.
    Reyes-Cruz, Gisela
    University of Nottingham Nottingham, UK.
    Why is my Agent so Slow? Deploying Human-Like Conversational Turn-Taking (2023). In: HAI 2023 - Proceedings of the 11th Conference on Human-Agent Interaction, Association for Computing Machinery (ACM), 2023, pp. 490-492. Conference paper (Refereed)
    Abstract [en]

    The emphasis on one-to-one speak/wait spoken conversational interaction with intelligent agents leads to long pauses between conversational turns, undermining the flow and naturalness of the interaction as well as the user experience. Despite groundbreaking advances in generating and understanding natural language with techniques such as LLMs, conversational interaction has remained relatively overlooked. In this workshop we will discuss and review the challenges, recent work and potential impact of improving conversational interaction with artificial systems. We hope to share experiences of poor human/system interaction and best practices with third-party tools, and to generate design guidance for the community.

  • 5. Betz, Simon
    et al.
    Zarrieß, Sina
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Wagner, Petra
    The greennn tree - lengthening position influences uncertainty perception (2019). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, The International Speech Communication Association (ISCA), 2019, pp. 3990-3994. Conference paper (Refereed)
    Abstract [en]

    Synthetic speech can be used to express uncertainty in dialogue systems by means of hesitation. If a phrase like “Next to the green tree” is uttered in a hesitant way, that is, containing lengthening, silences, and fillers, the listener can infer that the speaker is not certain about the concepts referred to. However, we do not know anything about the referential domain of the uncertainty; if only a particular word in this sentence were uttered hesitantly, e.g. “the greee:n tree”, the listener could infer that the uncertainty refers to the color in the statement, but not to the object. In this study, we show that the domain of the uncertainty is controllable. We conducted an experiment in which color words in sentences like “search for the green tree” were lengthened in two different positions: word onsets or final consonants, and participants were asked to rate the uncertainty regarding color and object. The results show that initial lengthening is predominantly associated with uncertainty about the word itself, whereas final lengthening is primarily associated with the following object. These findings enable dialogue system developers to finely control the attitudinal display of uncertainty, adding nuances beyond the lexical content to message delivery.

    Full text (pdf)
  • 6. Cabral, Joao P
    et al.
    Kane, Mark
    Ahmed, Zeeshan
    Abou-Zleikha, Mohamed
    Székely, Éva
    University College Dublin, Ireland.
    Zahra, Amalia
    Ogbureke, Kalu U
    Cahill, Peter
    Carson-Berndsen, Julie
    Schlögl, Stephan
    Rapidly Testing the Interaction Model of a Pronunciation Training System via Wizard-of-Oz (2012). In: Proceedings of the International Conference on Language Resources and Evaluation, 2012, pp. 4136-4142. Conference paper (Refereed)
    Abstract [en]

    This paper describes a prototype of a computer-assisted pronunciation training system called MySpeech. The interface of the MySpeech system is web-based and it currently enables users to practice pronunciation by listening to speech spoken by native speakers and tuning their speech production to correct any mispronunciations detected by the system. This practice exercise is facilitated in different topics and difficulty levels. An experiment was conducted in this work that combines the MySpeech service with the WebWOZ Wizard-of-Oz platform (http://www.webwoz.com), in order to improve the human-computer interaction (HCI) of the service and the feedback that it provides to the user. The employed Wizard-of-Oz method enables a human (who acts as a wizard) to give feedback to the practising user, while the user is not aware that there is another person involved in the communication. This experiment made it possible to quickly test an HCI model before its implementation in the MySpeech system. It also allowed input data to be collected from the wizard that can be used to improve the proposed model. Another outcome of the experiment was the preliminary evaluation of the pronunciation learning service in terms of user satisfaction, which would be difficult to conduct before integrating the HCI part.

    Full text (pdf)
  • 7. Cabral, Joao P
    et al.
    Kane, Mark
    Ahmed, Zeeshan
    Székely, Éva
    University College Dublin, Ireland.
    Zahra, Amalia
    Ogbureke, Kalu U
    Cahill, Peter
    Carson-Berndsen, Julie
    Schlögl, Stephan
    Using the Wizard-of-Oz Framework in a Pronunciation Training System for Providing User Feedback and Instructions (2012). Conference paper (Refereed)
    Full text (pdf)
  • 8. Cahill, Peter
    et al.
    Ogbureke, Udochukwu
    Cabral, Joao
    Székely, Éva
    University College Dublin, Ireland.
    Abou-Zleikha, Mohamed
    Ahmed, Zeeshan
    Carson-Berndsen, Julie
    UCD Blizzard Challenge 2011 entry (2011). In: Proceedings of the Blizzard Challenge Workshop, 2011. Conference paper (Refereed)
    Abstract [en]

    This paper gives an overview of the UCD Blizzard Challenge 2011 entry. The entry is a unit selection synthesiser that uses hidden Markov models for prosodic modelling. The evaluation consisted of synthesising 2213 sentences from a high quality 15 hour dataset provided by Lessac Technologies. Results are analysed within the context of other systems and the future work for the system is discussed. 

    Full text (pdf)
  • 9.
    Clark, Leigh
    et al.
    Univ Coll Dublin, Dublin, Ireland.
    Cowan, Benjamin R.
    Univ Coll Dublin, Dublin, Ireland.
    Edwards, Justin
    Univ Coll Dublin, Dublin, Ireland.
    Munteanu, Cosmin
    Univ Toronto, Mississauga, ON, Canada; Univ Toronto, Toronto, ON, Canada.
    Murad, Christine
    Univ Toronto, Mississauga, ON, Canada; Univ Toronto, Toronto, ON, Canada.
    Aylett, Matthew
    CereProc Ltd, Edinburgh, Midlothian, Scotland.
    Moore, Roger K.
    Univ Sheffield, Sheffield, S Yorkshire, England.
    Edlund, Jens
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Healey, Patrick
    Queen Mary Univ London, London, England.
    Harte, Naomi
    Trinity Coll Dublin, Dublin, Ireland.
    Torre, Ilaria
    Trinity Coll Dublin, Dublin, Ireland.
    Doyle, Philip
    Voysis Ltd, Dublin, Ireland.
    Mapping Theoretical and Methodological Perspectives for Understanding Speech Interface Interactions (2019). In: CHI EA '19 Extended Abstracts: Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, 2019. Conference paper (Refereed)
    Abstract [en]

    The use of speech as an interaction modality has grown considerably through the integration of Intelligent Personal Assistants (IPAs, e.g. Siri, Google Assistant) into smartphones and voice-based devices (e.g. Amazon Echo). However, there remain significant gaps in using theoretical frameworks to understand user behaviours and choices and how they may be applied to specific speech interface interactions. This part-day multidisciplinary workshop aims to critically map out and evaluate theoretical frameworks and methodological approaches across a number of disciplines and establish directions for new paradigms in understanding speech interface user behaviour. In doing so, we will bring together participants from HCI and other speech-related domains to establish a cohesive, diverse and collaborative community of researchers from academia and industry with an interest in exploring theoretical and methodological issues in the field.

  • 10.
    Cowan, Benjamin R.
    et al.
    School of Information & Communication Studies, University College Dublin, Belfield, Dublin 4, Ireland.
    Branigan, Holly
    Department of Psychology, University of Edinburgh, 7 George Square, Edinburgh, EH8 9JZ, United Kingdom.
    Begum, Habiba
    HCI Centre, University of Birmingham, Edgbaston Campus, B15 2TT, United Kingdom.
    McKenna, Lucy
    ADAPT Centre, Trinity College Dublin, Dublin 2, Ireland.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    They Know as Much as We Do: Knowledge Estimation and Partner Modelling of Artificial Partners (2017). In: CogSci 2017 - Proceedings of the 39th Annual Meeting of the Cognitive Science Society: Computational Foundations of Cognition, The Cognitive Science Society, 2017, pp. 1836-1841. Conference paper (Refereed)
    Abstract [en]

    Conversation partners' assumptions about each other's knowledge (their partner models) on a subject are important in spoken interaction. However, little is known about what influences our partner models in spoken interactions with artificial partners. In our experiment we asked people to name 15 British landmarks, and estimate their identifiability to a person as well as an automated conversational agent of either British or American origin. Our results show that people's assumptions about what an artificial partner knows are related to their estimates of what other people are likely to know - but they generally estimate artificial partners to have more knowledge in the task than human partners. These findings shed light on the way in which people build partner models of artificial partners. Importantly, they suggest that people use assumptions about what other humans know as a heuristic when assessing an artificial partner's knowledge.

  • 11.
    Ekstedt, Erik
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Wang, Siyang
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Skantze, Gabriel
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis (2023). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 5481-5485. Conference paper (Refereed)
    Abstract [en]

    Turn-taking is a fundamental aspect of human communication where speakers convey their intention to either hold, or yield, their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial and two open-source Text-To-Speech (TTS) systems to generate turn-taking cues over simulated turns. By varying the stimuli, or controlling the prosody, we analyze the models' performance. We show that while the commercial TTS systems largely provide appropriate cues, they often produce ambiguous signals, and that further improvements are possible. TTS systems trained on read or spontaneous speech produce strong turn-hold but weak turn-yield cues. We argue that this approach, which focuses on functional aspects of interaction, provides a useful addition to other important speech metrics, such as intelligibility and naturalness.
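    The evaluation idea in the abstract above can be illustrated with a small sketch: a projection model scores each frame near a synthesized turn ending, and the scores are aggregated into a turn-hold or turn-yield decision. The probabilities, window length and thresholds below are illustrative assumptions standing in for the actual Voice Activity Projection model.

```python
import numpy as np

def turn_shift_score(p_speaker_continues: np.ndarray, window: int = 50) -> float:
    """Aggregate frame-wise projection probabilities near the turn end.

    p_speaker_continues[t] is the (assumed, model-provided) probability that
    the current speaker keeps talking after frame t; we average the last
    `window` frames before the simulated turn boundary.
    """
    return float(p_speaker_continues[-window:].mean())

def classify_cue(score: float, hold_thr: float = 0.6, yield_thr: float = 0.4) -> str:
    """Map the aggregated score to a turn-taking cue label (thresholds assumed)."""
    if score >= hold_thr:
        return "turn-hold"
    if score <= yield_thr:
        return "turn-yield"
    return "ambiguous"

if __name__ == "__main__":
    # Toy trajectories standing in for model output on two synthesized endings.
    flat_ending = np.full(200, 0.5)              # cue neither held nor yielded
    yielding_ending = np.linspace(0.8, 0.1, 200)
    for name, traj in [("flat", flat_ending), ("yielding", yielding_ending)]:
        s = turn_shift_score(traj)
        print(f"{name}: score={s:.2f} -> {classify_cue(s)}")
```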

  • 12.
    Elmers, Mikey
    et al.
    Saarland University, Germany.
    O'Mahony, Johannah
    University of Edinburgh, United Kingdom.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Synthesis after a couple PINTs: Investigating the Role of Pause-Internal Phonetic Particles in Speech Synthesis and Perception (2023). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 4843-4847. Conference paper (Refereed)
    Abstract [en]

    Pause-internal phonetic particles (PINTs), such as breath noises, tongue clicks and hesitations, play an important role in speech perception but are rarely modeled in speech synthesis. We developed two text-to-speech (TTS) systems: one with and one without PINTs labels in the training data. Both models produced fewer PINTs and had a lower total PINTs duration than natural speech. The labeled model generated more PINTs and longer total PINTs durations than the model without labels. In a listening experiment based on the labeled model we evaluated the influence of various PINTs combinations on the perception of speaker certainty. We tested a condition without PINTs material and three conditions that included PINTs. The condition without PINTs was perceived as significantly more certain than the PINTs conditions, suggesting that we can modify how certain TTS is perceived by including PINTs.
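    As a minimal illustration of the two training conditions described above, the sketch below shows transcript preprocessing with and without pause-internal particle labels. The inline token notation ([breath], [click], [uh], [um]) is an assumption for illustration, not the labelling scheme used in the paper.

```python
import re

# Hypothetical inline PINTs labels; the paper's actual label set and notation may differ.
PINT_TOKENS = ["[breath]", "[click]", "[uh]", "[um]"]
PINT_PATTERN = re.compile(r"\s*(?:" + "|".join(map(re.escape, PINT_TOKENS)) + r")\s*")

def strip_pints(transcript: str) -> str:
    """Baseline condition: remove all pause-internal particle labels."""
    return PINT_PATTERN.sub(" ", transcript).strip()

def keep_pints(transcript: str) -> str:
    """Labelled condition: keep the tokens so the TTS can learn to generate them."""
    return transcript

sample = "so [breath] I think [um] we could [click] start tomorrow"
print("labelled:", keep_pints(sample))
print("baseline:", strip_pints(sample))
```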

  • 13.
    Gustafsson, Joakim
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters (2023). In: 23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper (Refereed)
    Abstract [en]

    Engaging embodied conversational agents need to generate expressive behavior in order to be believable in socializing interactions. We present a system that can generate spontaneous speech with supporting lip movements. The neural conversational TTS voice is trained on a multi-style speech corpus that has been prosodically tagged (pitch and speaking rate) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm where articulatory effort can be adjusted. The facial animation is driven by time-stamped phonemes and prominence estimates from the synthesised speech waveform to modulate the lip and jaw movements accordingly. In objective evaluations we show that the system is able to generate speech and facial animation that vary in articulation effort. In subjective evaluations we compare our conversational TTS system’s capability to deliver jokes with a commercial TTS. Both systems succeeded equally well.

    Full text (pdf)
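    A toy sketch of the adjustable-effort idea follows: per-phoneme mouth-opening targets are scaled by a global articulatory-effort parameter and by the prominence estimate of the phoneme. The base values and the combination rule are assumptions for illustration only, not the paper's animation algorithm.

```python
from dataclasses import dataclass

@dataclass
class Phone:
    symbol: str
    start: float       # seconds
    end: float         # seconds
    prominence: float  # 0..1, assumed to come from acoustic analysis of the TTS output

# Illustrative base jaw-opening targets per phone; a real viseme mapping is richer.
BASE_JAW = {"a": 0.9, "e": 0.6, "i": 0.3, "o": 0.7, "u": 0.4, "m": 0.05, "s": 0.1}

def jaw_target(phone: Phone, effort: float) -> float:
    """Scale the base jaw opening by articulatory effort and prominence (both 0..1)."""
    base = BASE_JAW.get(phone.symbol, 0.3)
    return min(1.0, base * (0.5 + 0.5 * effort) * (0.75 + 0.5 * phone.prominence))

phones = [Phone("m", 0.00, 0.08, 0.2), Phone("a", 0.08, 0.25, 0.8)]
for effort in (0.2, 0.9):
    targets = [round(jaw_target(p, effort), 2) for p in phones]
    print(f"effort={effort}: jaw targets {targets}")
```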
  • 14.
    Kirkland, Ambika
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Pardon my disfluency: The impact of disfluency effects on the perception of speaker competence and confidence (2023). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 5217-5221. Conference paper (Refereed)
    Abstract [en]

    Disfluencies are a hallmark of spontaneous speech and play an important role in conversation, yet have been shown to negatively impact judgments about speakers. We explored the role of disfluencies in the perception of competence, sincerity and confidence in public speaking contexts, using synthesized spontaneous speech. In one experiment, listeners rated 30-40-second clips which varied in terms of whether they contained filled pauses, as well as the number and types of repetition. Both the overall number of disfluencies and the repetition type had an impact on competence and confidence, and disfluent speech was also rated as less sincere. In the second experiment, the negative effects of repetition type on competence were attenuated when participants attributed disfluency to anxiety.

  • 15.
    Kirkland, Ambika
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Lameris, Harm
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Where's the uh, hesitation?: The interplay between filled pause location, speech rate and fundamental frequency in perception of confidence (2022). In: INTERSPEECH 2022, International Speech Communication Association, 2022, pp. 4990-4994. Conference paper (Refereed)
    Abstract [en]

    Much of the research investigating the perception of speaker certainty has relied on either attempting to elicit prosodic features in read speech, or artificial manipulation of recorded audio. Our novel method of controlling prosody in synthesized spontaneous speech provides a powerful tool for studying speech perception and can provide better insight into the interacting effects of prosodic features on perception while also paving the way for conversational systems which are more effectively able to engage in and respond to social behaviors. Here we have used this method to examine the combined impact of filled pause location, speech rate and f0 on the perception of speaker confidence. We found an additive effect of all three features. The most confident-sounding utterances had no filler, low f0 and high speech rate, while the least confident-sounding utterances had a medial filled pause, high f0 and low speech rate. Insertion of filled pauses had the strongest influence, but pitch and speaking rate could be used to more finely control the uncertainty cues in spontaneous speech synthesis.
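    The factorial design behind the study can be sketched as a stimulus grid; the factor levels below are assumed for illustration and need not match the ones used in the experiment.

```python
from itertools import product

# Assumed factor levels for illustration.
filler_position = ["none", "initial", "medial"]
speech_rate = ["slow", "baseline", "fast"]
f0_level = ["low", "baseline", "high"]

# Every combination becomes one synthesized stimulus condition.
stimuli = [
    {"filler": f, "rate": r, "f0": p}
    for f, r, p in product(filler_position, speech_rate, f0_level)
]
print(len(stimuli), "conditions")
print("most confident-sounding per the findings:", {"filler": "none", "rate": "fast", "f0": "low"})
print("least confident-sounding per the findings:", {"filler": "medial", "rate": "slow", "f0": "high"})
```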

  • 16.
    Kirkland, Ambika
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Włodarczak, Marcin
    Department of Linguistics, Stockholm University, Sweden.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Perception of smiling voice in spontaneous speech synthesis (2021). In: Proceedings of Speech Synthesis Workshop (SSW11), International Speech Communication Association, 2021, pp. 108-112. Conference paper (Refereed)
    Abstract [en]

    Smiling during speech production has been shown to result in perceptible acoustic differences compared to non-smiling speech. However, there is a scarcity of research on the perception of “smiling voice” in synthesized spontaneous speech. In this study, we used a sequence-to-sequence neural text-to-speech system built on conversational data to produce utterances with the characteristics of spontaneous speech. Segments of speech following laughter, and the same utterances not preceded by laughter, were compared in a perceptual experiment after removing laughter and/or breaths from the beginning of the utterance to determine whether participants perceive the utterances preceded by laughter as sounding as if they were produced while smiling. The results showed that participants identified the post-laughter speech as smiling at a rate significantly greater than chance. Furthermore, the effect of content (positive/neutral/negative) was investigated. These results show that laughter, a spontaneous, non-elicited phenomenon in our model’s training data, can be used to synthesize expressive speech with the perceptual characteristics of smiling.

    Full text (pdf)
  • 17.
    Lameris, Harm
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beyond style: synthesizing speech with pragmatic functions (2023). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 3382-3386. Conference paper (Refereed)
    Abstract [en]

    With recent advances in generative modelling, conversational systems are becoming more lifelike and capable of long, nuanced interactions. Text-to-Speech (TTS) is being tested in territories requiring natural-sounding speech that can mimic the complexities of human conversation. Hyper-realistic speech generation has been achieved, but a gap remains between the verbal behavior required for upscaled conversation, such as paralinguistic information and pragmatic functions, and comprehension of the acoustic prosodic correlates underlying these. Without this knowledge, reproducing these functions in speech has little value. We use prosodic correlates including spectral peaks, spectral tilt, and creak percentage for speech synthesis with the pragmatic functions of small talk, self-directed speech, advice, and instructions. We perform a MOS evaluation, and a suitability experiment in which our system outperforms a read-speech and conversational baseline.

  • 18.
    Lameris, Harm
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Mehta, Shivam
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH. KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Prosody-Controllable Spontaneous TTS with Neural HMMs (2023). In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper (Refereed)
    Abstract [en]

    Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system’s capability of synthesizing two types of creaky voice.

    Full text (pdf)
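    The utterance-level prosody control described above can be pictured as conditioning the text encoding on a small prosody vector. The PyTorch sketch below is a generic illustration under that assumption and is not the authors' architecture; the feature choice (z-scored mean f0 and speech rate) and dimensions are made up.

```python
import torch
import torch.nn as nn

class ProsodyConditionedEncoder(nn.Module):
    """Broadcast an utterance-level prosody vector over time and fuse it with text states."""

    def __init__(self, text_dim: int = 256, prosody_dim: int = 2, out_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(text_dim + prosody_dim, out_dim)

    def forward(self, text_enc: torch.Tensor, prosody: torch.Tensor) -> torch.Tensor:
        # text_enc: (batch, time, text_dim); prosody: (batch, prosody_dim)
        expanded = prosody.unsqueeze(1).expand(-1, text_enc.size(1), -1)
        return torch.tanh(self.proj(torch.cat([text_enc, expanded], dim=-1)))

encoder = ProsodyConditionedEncoder()
text_enc = torch.randn(2, 40, 256)                 # dummy phone-level encoder output
prosody = torch.tensor([[0.0, 0.0], [1.5, -0.5]])  # assumed: z-scored mean f0, speech rate
print(encoder(text_enc, prosody).shape)            # torch.Size([2, 40, 256])
```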
  • 19.
    Lameris, Harm
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Mehta, Shivam
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Kirkland, Ambika
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Moëll, Birger
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    O'Regan, Jim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Spontaneous Neural HMM TTS with Prosodic Feature Modification (2022). In: Proceedings of Fonetik 2022, 2022. Conference paper (Other academic)
    Abstract [en]

    Spontaneous speech synthesis is a complex enterprise, as the data has large variation, as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text to Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.

  • 20.
    Mehta, Shivam
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Kirkland, Ambika
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Lameris, Harm
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    OverFlow: Putting flows on top of neural transducers for better TTS (2023). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 4279-4283. Conference paper (Refereed)
    Abstract [en]

    Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.

  • 21.
    Mehta, Shivam
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Neural HMMs are all you need (for high-quality attention-free TTS) (2022). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, 2022, pp. 7457-7461. Conference paper (Refereed)
    Abstract [en]

    Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.
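    The key structural idea above, replacing attention with an autoregressive left-right no-skip HMM, can be illustrated with a toy Viterbi pass in which each frame may only stay in its state or advance by one. The probabilities below are random stand-ins, not a trained model.

```python
import numpy as np

def viterbi_left_right_no_skip(stay_logp: np.ndarray, emit_logp: np.ndarray) -> np.ndarray:
    """Most likely monotonic alignment under a left-right, no-skip HMM (toy sketch).

    emit_logp[t, s]: log-likelihood of frame t under state s.
    stay_logp[t, s]: log-probability of staying in state s at frame t; the only
    alternative is advancing to s + 1, so skips and repetitions are impossible
    by construction (unlike attention-based alignment).
    """
    T, S = emit_logp.shape
    advance_logp = np.log1p(-np.exp(stay_logp))
    delta = np.full((T, S), -np.inf)
    came_from_previous = np.zeros((T, S), dtype=bool)
    delta[0, 0] = emit_logp[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s] + stay_logp[t - 1, s]
            adv = delta[t - 1, s - 1] + advance_logp[t - 1, s - 1] if s > 0 else -np.inf
            came_from_previous[t, s] = adv > stay
            delta[t, s] = max(stay, adv) + emit_logp[t, s]
    path = [S - 1]                         # backtrace from the final state
    for t in range(T - 1, 0, -1):
        path.append(path[-1] - int(came_from_previous[t, path[-1]]))
    return np.asarray(path[::-1])

rng = np.random.default_rng(0)
T, S = 20, 5
alignment = viterbi_left_right_no_skip(np.log(np.full((T, S), 0.7)), rng.normal(size=(T, S)))
print(alignment)  # monotonically non-decreasing state indices, one per frame
```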

  • 22.
    Miniotaitė, Jūra
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Människocentrerad teknologi, Medieteknik och interaktionsdesign, MID.
    Wang, Siyang
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Abelho Pereira, André Tiago
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Hi robot, it's not what you say, it's how you say it (2023). In: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Institute of Electrical and Electronics Engineers (IEEE), 2023, pp. 307-314. Conference paper (Refereed)
    Abstract [en]

    Many robots use their voice to communicate with people in spoken language but the voices commonly used for robots are often optimized for transactional interactions, rather than social ones. This can limit their ability to create engaging and natural interactions. To address this issue, we designed a spontaneous text-to-speech tool and used it to author natural and spontaneous robot speech. A crowdsourcing evaluation methodology is proposed to compare this type of speech to natural speech and state-of-the-art text-to-speech technology, both in disembodied and embodied form. We created speech samples in a naturalistic setting of people playing tabletop games and conducted a user study evaluating Naturalness, Intelligibility, Social Impression, Prosody, and Perceived Intelligence. The speech samples were chosen to represent three contexts that are common in tabletop games and the contexts were introduced to the participants who evaluated the speech samples. The study results show that the proposed evaluation methodology allowed for a robust analysis that successfully compared the different conditions. Moreover, the spontaneous voice met our target design goal of being perceived as more natural than a leading commercial text-to-speech.

  • 23.
    Oertel, Catharine
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Jonell, Patrik
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Haddad, K. E.
    Szekely, Eva
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Using crowd-sourcing for the design of listening agents: Challenges and opportunities (2017). In: ISIAA 2017 - Proceedings of the 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents, Co-located with ICMI 2017, Association for Computing Machinery (ACM), 2017, pp. 37-38. Conference paper (Refereed)
    Abstract [en]

    In this paper we describe how audio-visual corpus recordings collected with crowd-sourcing techniques can be used for the audio-visual synthesis of attitudinal non-verbal feedback expressions for virtual agents. We discuss the limitations of this approach as well as where we see the opportunities for this technology.

  • 24.
    Szekely, Eva
    et al.
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Mendelson, Joseph
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för datavetenskap och kommunikation (CSC), Tal, musik och hörsel, TMH.
    Synthesising uncertainty: The interplay of vocal effort and hesitation disfluencies (2017). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2017, Vol. 2017, pp. 804-808. Conference paper (Refereed)
    Abstract [en]

    As synthetic voices become more flexible, and conversational systems gain more potential to adapt to the environmental and social situation, the question of how different modifications to the synthetic speech interact with each other, and how their specific combinations influence perception, needs to be examined. This work investigates how the vocal effort of the synthetic speech together with added disfluencies affects listeners' perception of the degree of uncertainty in an utterance. We introduce a DNN voice built entirely from spontaneous conversational speech data and capable of producing a continuum of vocal efforts, prolongations and filled pauses with a corpus-based method. Results of a listener evaluation indicate that decreased vocal effort, filled pauses and prolongation of function words increase the degree of perceived uncertainty of conversational utterances expressing the speaker's beliefs. We demonstrate that the effects of these three cues are not merely additive, but that interaction effects, in particular between the two types of disfluencies and between vocal effort and prolongations, need to be considered when aiming to communicate a specific level of uncertainty. The implications of these findings are relevant for adaptive and incremental conversational systems using expressive speech synthesis and aspiring to communicate the attitude of uncertainty.

  • 25. Székely, Éva
    et al.
    Ahmed, Zeeshan
    Cabral, Joao P
    Carson-Berndsen, Julie
    WinkTalk: a demonstration of a multimodal speech synthesis platform linking facial expressions to expressive synthetic voices (2012). In: Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies, Association for Computational Linguistics, 2012, pp. 5-8. Conference paper (Refereed)
    Abstract [en]

    This paper describes a demonstration of the WinkTalk system, which is a speech synthesis platform using expressive synthetic voices. With the help of a webcamera and facial expression analysis, the system allows the user to control the expressive features of the synthetic speech for a particular utterance with their facial expressions. Based on a personalised mapping between three expressive synthetic voices and the user's facial expressions, the system selects a voice that matches their face at the moment of sending a message. The WinkTalk system is an early research prototype that aims to demonstrate that facial expressions can be used as a more intuitive control over expressive speech synthesis than manual selection of voice types, thereby contributing to an improved communication experience for users of speech generating devices.

    Full text (pdf)
  • 26. Székely, Éva
    et al.
    Ahmed, Zeeshan
    Cabral, Joao P
    Carson-Berndsen, Julie
    WinkTalk: a multimodal speech synthesis interface linking facial expressions to expressive synthetic voices (2012). In: Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies, 2012. Conference paper (Refereed)
    Abstract [en]

    This paper describes a demonstration of the WinkTalk system, which is a speech synthesis platform using expressive synthetic voices. With the help of a webcamera and facial expression analysis, the system allows the user to control the expressive features of the synthetic speech for a particular utterance with their facial expressions. Based on a personalised mapping between three expressive synthetic voices and the user's facial expressions, the system selects a voice that matches their face at the moment of sending a message. The WinkTalk system is an early research prototype that aims to demonstrate that facial expressions can be used as a more intuitive control over expressive speech synthesis than manual selection of voice types, thereby contributing to an improved communication experience for users of speech generating devices.

    Full text (pdf)
  • 27. Székely, Éva
    et al.
    Ahmed, Zeeshan
    Hennig, Shannon
    Cabral, Joao P
    Carson-Berndsen, Julie
    Predicting synthetic voice style from facial expressions. An application for augmented conversations (2014). In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 57, pp. 63-75. Journal article (Refereed)
    Abstract [en]

    The ability to efficiently facilitate social interaction and emotional expression is an important, yet unmet requirement for speech generating devices aimed at individuals with speech impairment. Using gestures such as facial expressions to control aspects of expressive synthetic speech could contribute to an improved communication experience for both the user of the device and the conversation partner. For this purpose, a mapping model between facial expressions and speech is needed, that is high level (utterance-based), versatile and personalisable. In the mapping developed in this work, visual and auditory modalities are connected based on the intended emotional salience of a message: the intensity of facial expressions of the user to the emotional intensity of the synthetic speech. The mapping model has been implemented in a system called WinkTalk that uses estimated facial expression categories and their intensity values to automatically select between three expressive synthetic voices reflecting three degrees of emotional intensity. An evaluation is conducted through an interactive experiment using simulated augmented conversations. The results have shown that automatic control of synthetic speech through facial expressions is fast, non-intrusive, sufficiently accurate and supports the user to feel more involved in the conversation. It can be concluded that the system has the potential to facilitate a more efficient communication process between user and listener.
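    The utterance-level mapping described in the abstract, from estimated facial-expression intensity to one of three expressive voices, can be sketched as a simple threshold rule. The thresholds and voice names below are assumptions; the actual WinkTalk mapping is personalised per user.

```python
# Assumed voice inventory ordered by emotional intensity.
VOICES = ["low-intensity", "medium-intensity", "high-intensity"]

def select_voice(expression_intensity: float,
                 thresholds: tuple = (0.33, 0.66)) -> str:
    """Pick an expressive voice from the user's facial-expression intensity in [0, 1]."""
    low, high = thresholds
    if expression_intensity < low:
        return VOICES[0]
    if expression_intensity < high:
        return VOICES[1]
    return VOICES[2]

for intensity in (0.1, 0.5, 0.9):
    print(f"intensity {intensity} -> {select_voice(intensity)}")
```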

  • 28. Székely, Éva
    et al.
    Ahmed, Zeeshan
    Steiner, Ingmar
    Carson-Berndsen, Julie
    Facial expression as an input annotation modality for affective speech-to-speech translation (2012). Conference paper (Refereed)
    Abstract [en]

    One of the challenges of speech-to-speech translation is to accurately preserve the paralinguistic information in the speaker’s message. In this work we explore the use of automatic facial expression analysis as an input annotation modality to transfer paralinguistic information at a symbolic level from input to output in speech-to-speech translation. To evaluate the feasibility of this approach, a prototype system, FEAST (Facial Expression-based Affective Speech Translation) has been developed. FEAST classifies the emotional state of the user and uses it to render the translated output in an appropriate voice style, using expressive speech synthesis.

    Full text (pdf)
  • 29. Székely, Éva
    et al.
    Cabral, Joao P
    Abou-Zleikha, Mohamed
    Cahill, Peter
    Carson-Berndsen, Julie
    Evaluating expressive speech synthesis from audiobooks in conversational phrases (2012). Conference paper (Refereed)
    Abstract [en]

    Audiobooks are a rich resource of large quantities of natural sounding, highly expressive speech. In our previous research we have shown that it is possible to detect different expressive voice styles represented in a particular audiobook, using unsupervised clustering to group the speech corpus of the audiobook into smaller subsets representing the detected voice styles. These subsets of corpora of different voice styles reflect the various ways a speaker uses their voice to express involvement and affect, or imitate characters. This study is an evaluation of the detection of voice styles in an audiobook in the application of expressive speech synthesis. A further aim of this study is to investigate the usability of audiobooks as a language resource for expressive speech synthesis of utterances of conversational speech. Two evaluations have been carried out to assess the effect of the genre transfer: transmitting expressive speech from read aloud literature to conversational phrases with the application of speech synthesis. The first evaluation revealed that listeners have different voice style preferences for a particular conversational phrase. The second evaluation showed that it is possible for users of speech synthesis systems to learn the characteristics of a certain voice style well enough to make reliable predictions about what a certain utterance will sound like when synthesised using that voice style. 

    Full text (pdf)
  • 30. Székely, Éva
    et al.
    Cabral, Joao P
    Cahill, Peter
    Carson-Berndsen, Julie
    Clustering Expressive Speech Styles in Audiobooks Using Glottal Source Parameters (2011). In: 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), ISCA, 2011, pp. 2409-2412. Conference paper (Refereed)
    Abstract [en]

    A great challenge for text-to-speech synthesis is to produce expressive speech. The main problem is that it is difficult to synthesise high-quality speech using expressive corpora. With the increasing interest in audiobook corpora for speech synthesis, there is a demand to synthesise speech which is rich in prosody, emotions and voice styles. In this work, Self-Organising Feature Maps (SOFM) are used for clustering the speech data using voice quality parameters of the glottal source, in order to map out the variety of voice styles in the corpus. Subjective evaluation showed that this clustering method successfully separated the speech data into groups of utterances associated with different voice characteristics. This work can be applied in unit-selection synthesis by selecting appropriate data sets to synthesise utterances with specific voice styles. It can also be used in parametric speech synthesis to model different voice styles separately.

    Full text (pdf)
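    As a stand-in for the Self-Organising Feature Map used in the paper, the sketch below groups per-utterance glottal source parameters with k-means, purely to illustrate splitting an audiobook corpus into voice-style subsets. The feature names and values are fabricated placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows: per-utterance glottal source parameters (placeholder features, e.g.
# open quotient, return quotient, spectral tilt in dB); values are synthetic.
rng = np.random.default_rng(1)
features = np.vstack([
    rng.normal([0.40, 0.20, -12.0], 0.05, size=(50, 3)),
    rng.normal([0.60, 0.35, -6.0], 0.05, size=(50, 3)),
    rng.normal([0.50, 0.50, -18.0], 0.05, size=(50, 3)),
])

# The paper clusters with a SOFM; k-means is used here only as a simpler stand-in.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
for style in range(3):
    print(f"voice style {style}: {int(np.sum(labels == style))} utterances")
```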
  • 31. Székely, Éva
    et al.
    Csapo, Tamas Gabor
    Toth, Balint
    Mihajlik, Peter
    Carson-Berndsen, Julie
    Synthesizing expressive speech from amateur audiobook recordings (2012). In: Spoken Language Technology Workshop (SLT), 2012, pp. 297-302. Conference paper (Refereed)
    Abstract [en]

    Freely available audiobooks are a rich resource of expressive speech recordings that can be used for the purposes of speech synthesis. Natural sounding, expressive synthetic voices have previously been built from audiobooks that contained large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter and contain less expressive (less emphatic, less emotional, etc.) speech both in terms of quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges by applying a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments and build expressive voices for hidden Markov-model based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment, in order to show that the method is generally applicable to most typical audiobooks widely available online.

    Full text (pdf)
  • 32.
    Székely, Éva
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Edlund, Jens
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Augmented Prompt Selection for Evaluation of Spontaneous Speech Synthesis (2020). In: Proceedings of The 12th Language Resources and Evaluation Conference, European Language Resources Association, 2020, pp. 6368-6374. Conference paper (Refereed)
    Abstract [en]

    By definition, spontaneous speech is unscripted and created on the fly by the speaker. It is dramatically different from read speech, where the words are authored as text before they are spoken. Spontaneous speech is emergent and transient, whereas text read out loud is pre-planned. For this reason, it is unsuitable to evaluate the usability and appropriateness of spontaneous speech synthesis by having it read out written texts sampled from for example newspapers or books. Instead, we need to use transcriptions of speech as the target - something that is much less readily available. In this paper, we introduce Starmap, a tool allowing developers to select a varied, representative set of utterances from a spoken genre, to be used for evaluation of TTS for a given domain. The selection can be done from any speech recording, without the need for transcription. The tool uses interactive visualisation of prosodic features with t-SNE, along with a tree-based algorithm to guide the user through thousands of utterances and ensure coverage of a variety of prompts. A listening test has shown that with a selection of genre-specific utterances, it is possible to show significant differences across genres between two synthetic voices built from spontaneous speech.

    Full text (pdf)
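    The visualisation-plus-coverage idea behind the Starmap tool can be approximated in a few lines: embed per-utterance prosodic features with t-SNE, then pick one utterance near each cell of a coarse grid. The features are random stand-ins and the grid heuristic is an assumption; the tool itself uses an interactive view and a tree-based algorithm.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
features = rng.normal(size=(500, 4))   # stand-in prosodic features per utterance

# 2-D embedding for visual exploration of the prosodic space.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

def grid_select(points: np.ndarray, bins: int = 4) -> list:
    """Pick the utterance closest to each cell centre of a coarse grid (coverage heuristic)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    chosen = set()
    for i in range(bins):
        for j in range(bins):
            centre = lo + (np.array([i, j]) + 0.5) / bins * (hi - lo)
            chosen.add(int(np.argmin(np.linalg.norm(points - centre, axis=1))))
    return sorted(chosen)

print("selected prompt indices:", grid_select(embedding))
```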
  • 33.
    Székely, Éva
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafsson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Torre, Ilaria
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Prosody-controllable gender-ambiguous speech synthesis: a tool for investigating implicit bias in speech perception (2023). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 1234-1238. Conference paper (Refereed)
    Abstract [en]

    This paper proposes a novel method to develop gender-ambiguous TTS, which can be used to investigate hidden gender bias in speech perception. Our aim is to provide a tool for researchers to conduct experiments on language use associated with specific genders. Ambiguous voices can also be beneficial for virtual assistants, to help reduce stereotypes and increase acceptance. Our approach uses a multi-speaker embedding in a neural TTS engine, combining two corpora recorded by a male and a female speaker to achieve a gender-ambiguous timbre. We also propose speaker-disentangled prosody control to ensure that the timbre is robust across a range of prosodies and enable more expressive speech. We optimised the output using an SSL-based network trained on hundreds of speakers. We conducted perceptual evaluations on the settings that were judged most ambiguous by the network, which showed that listeners perceived the speech samples as gender-ambiguous, also in prosody-controlled conditions.
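    One ingredient of the approach, blending speaker embeddings from a male and a female corpus, can be sketched as a simple interpolation. The dimensionality, unit-norm convention and blend weights below are assumptions; the paper additionally tunes the blend and prosody with an SSL-based speaker network.

```python
import numpy as np

def blend_speaker_embeddings(e_male: np.ndarray, e_female: np.ndarray, alpha: float) -> np.ndarray:
    """Interpolate two speaker embeddings; alpha=0.5 is the naive midpoint."""
    emb = (1.0 - alpha) * e_male + alpha * e_female
    return emb / np.linalg.norm(emb)   # assumes the TTS expects unit-norm embeddings

rng = np.random.default_rng(3)
e_m, e_f = rng.normal(size=256), rng.normal(size=256)
for alpha in (0.0, 0.5, 1.0):
    e = blend_speaker_embeddings(e_m, e_f, alpha)
    sim_to_male = float(e @ (e_m / np.linalg.norm(e_m)))
    print(f"alpha={alpha}: cosine similarity to the male-voice embedding {sim_to_male:.2f}")
```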

  • 34.
    Székely, Éva
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    How to train your fillers: uh and um in spontaneous speech synthesis2019Konferansepaper (Fagfellevurdert)
    Fulltekst (pdf)
    fulltext
  • 35.
    Székely, Éva
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Off the cuff: Exploring extemporaneous speech delivery with TTS2019Inngår i: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association , 2019, s. 3687-3688Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Extemporaneous speech is a delivery type in public speaking which uses a structured outline but is otherwise delivered conversationally, off the cuff. This demo uses a natural-sounding spontaneous conversational speech synthesiser to simulate this delivery style. We resynthesised the beginnings of two Interspeech keynote speeches with TTS that produces multiple different versions of each utterance that vary in fluency and filled-pause placement. The platform allows the user to mark the samples according to any perceptual aspect of interest, such as certainty, authenticity, confidence, etc. During the speech delivery, they can decide on the fly which realisation to play, addressing their audience in a connected, conversational fashion. Our aim is to use this platform to explore speech synthesis evaluation options from a production perspective and in situational contexts.

    Fulltekst (pdf)
    fulltext
  • 36. Székely, Éva
    et al.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Spontaneous conversational speech synthesis from found data2019Inngår i: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISCA , 2019, s. 4435-4439Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Synthesising spontaneous speech is a difficult task due to disfluencies, high variability and syntactic conventions different from those of written language. Using found data, as opposed to lab-recorded conversations, for speech synthesis adds to these challenges because of overlapping speech and the lack of control over recording conditions. In this paper we address these challenges by using a speaker-dependent CNN-LSTM breath detector to separate continuous recordings into utterances, which we here apply to extract nine hours of clean single-speaker breath groups from a conversational podcast. The resulting corpus is transcribed automatically (both lexical items and filler tokens) and used to build several voices on a Tacotron 2 architecture. Listening tests show: i) pronunciation accuracy improved with phonetic input and transfer learning; ii) it is possible to create a more fluent conversational voice by training on data without filled pauses; and iii) the presence of filled pauses improved perceived speaker authenticity. Another listening test showed the found podcast voice to be more appropriate for prompts from both public speeches and casual conversations, compared to synthesis from found read speech and from a manually transcribed lab-recorded spontaneous conversation.

  • 37.
    Székely, Éva
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Breathing and Speech Planning in Spontaneous Speech Synthesis2020Inngår i: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, s. 7649-7653, artikkel-id 9054107Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Breathing and speech planning in spontaneous speech are coordinated processes, often exhibiting disfluent patterns. While synthetic speech is not subject to respiratory needs, integrating breath into synthesis has advantages for naturalness and recall. At the same time, a synthetic voice reproducing disfluent breathing patterns learned from the data can be problematic. To address this, we first propose training stochastic TTS on a corpus of overlapping breath-group bigrams, to take context into account. Next, we introduce an unsupervised automatic annotation of likely-disfluent breath events, through a product-of-experts model that combines the output of two breath-event predictors, each using complementary information and operating in opposite directions. This annotation enables creating an automatically-breathing spontaneous speech synthesiser with a more fluent breathing style. A subjective evaluation on two spoken genres (impromptu and rehearsed) found the proposed system to be preferred over the baseline approach treating all breath events the same.

    Fulltekst (pdf)
    fulltext
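
    The unsupervised annotation step combines two breath-event predictors in a product-of-experts fashion. Below is a minimal numerical sketch of such a combination, assuming each expert outputs the probability that a breath event is disfluent; the predictors themselves and the decision threshold are placeholders.

        import numpy as np

        def product_of_experts(p_forward, p_backward, threshold=0.5):
            """Renormalised product of two experts' 'disfluent' probabilities."""
            p_forward = np.asarray(p_forward, dtype=float)
            p_backward = np.asarray(p_backward, dtype=float)
            joint = (p_forward * p_backward) / (
                p_forward * p_backward + (1.0 - p_forward) * (1.0 - p_backward))
            return joint, joint >= threshold

        joint, is_disfluent = product_of_experts([0.9, 0.2, 0.6], [0.8, 0.1, 0.7])
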
  • 38.
    Székely, Éva
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Casting to Corpus: Segmenting and Selecting Spontaneous Dialogue for TTS with a CNN-LSTM Speaker-Dependent Breath Detector2019Inngår i: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE , 2019, s. 6925-6929Konferansepaper (Fagfellevurdert)
    Abstract [en]

    This paper considers utilising breaths to create improved spontaneous-speech corpora for conversational text-to-speech from found audio recordings such as dialogue podcasts. Breaths are of interest since they relate to prosody and speech planning and are independent of language and transcription. Specifically, we propose a semisupervised approach where a fraction of coarsely annotated data is used to train a convolutional and recurrent speaker-specific breath detector operating on spectrograms and zero-crossing rate. The classifier output is used to find target-speaker breath groups (audio segments delineated by breaths) and subsequently select those that constitute clean utterances appropriate for a synthesis corpus. An application to 11 hours of raw podcast audio extracts 1969 utterances (106 minutes), 87% of which are clean and correctly segmented. This outperforms a baseline that performs integrated VAD and speaker attribution without accounting for breaths.
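
    A compact PyTorch sketch of a convolutional-plus-recurrent frame classifier of the kind described above, operating on mel-spectrogram frames with zero-crossing rate appended. The layer sizes, the choice of mel features and the random test input are illustrative assumptions; the paper's detector is speaker-dependent and trained semi-supervised on coarsely annotated data.

        import torch
        import torch.nn as nn

        class BreathDetector(nn.Module):
            def __init__(self, n_mels=80, zcr_dim=1, hidden=64):
                super().__init__()
                self.conv = nn.Sequential(
                    nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
                    nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU())
                self.lstm = nn.LSTM(64 + zcr_dim, hidden, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden, 1)   # per-frame breath logit

            def forward(self, mel, zcr):
                # mel: (batch, time, n_mels), zcr: (batch, time, 1)
                h = self.conv(mel.transpose(1, 2)).transpose(1, 2)
                h, _ = self.lstm(torch.cat([h, zcr], dim=-1))
                return self.out(h).squeeze(-1)        # (batch, time) logits

        model = BreathDetector()
        frame_logits = model(torch.randn(2, 300, 80), torch.rand(2, 300, 1))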

  • 39.
    Székely, Éva
    et al.
    CNGL, School of Computer Science and Informatics, University College Dublin, Ireland.
    Kane, John
    Centre for Language and Communication Studies, Trinity College Dublin, Ireland.
    Scherer, Stefan
    Centre for Language and Communication Studies, Trinity College Dublin, Ireland.
    Gobl, Christer
    Centre for Language and Communication Studies, Trinity College Dublin, Ireland; Institute for Creative Technologies, University of Southern California, Los Angeles, USA.
    Carson-Berndsen, Julie
    CNGL, School of Computer Science and Informatics, University College Dublin, Ireland.
    Detecting a targeted voice style in an audiobook using voice quality features2012Inngår i: Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, Institute of Electrical and Electronics Engineers (IEEE) , 2012, s. 4593-4596Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Audiobooks are known to contain a variety of expressive speaking styles that occur as a result of the narrator mimicking a character in a story, or expressing affect. An accurate modeling of this variety is essential for the purposes of speech synthesis from an audiobook. Voice quality differences are important features characterizing these different speaking styles, which are realized on a gradient and are often difficult to predict from the text. The present study uses a parameter characterizing breathy to tense voice qualities using features of the wavelet transform, and a measure for identifying creaky segments in an utterance. Based on these features, a combination of supervised and unsupervised classification is used to detect the regions in an audiobook where the speaker changes his regular voice quality to a particular voice style. The target voice style candidates are selected based on the agreement of the supervised classifier ensemble output, and evaluated in a listening test.
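
    To picture the candidate-selection step, the sketch below flags utterances as target-style candidates only when a small supervised ensemble agrees unanimously. The two voice-quality features, the classifier choices and the synthetic data are stand-ins for the paper's wavelet-based breathy-tense parameter and creak measure.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X_train = rng.normal(size=(200, 2))     # [breathy-tense parameter, creak ratio]
        y_train = (X_train[:, 0] + X_train[:, 1] > 0.8).astype(int)
        X_book = rng.normal(size=(1000, 2))     # per-utterance features from an audiobook

        ensemble = [
            LogisticRegression().fit(X_train, y_train),
            RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train),
            SVC().fit(X_train, y_train),
        ]
        votes = np.stack([clf.predict(X_book) for clf in ensemble])
        candidate_idx = np.where(votes.sum(axis=0) == len(ensemble))[0]   # unanimous agreement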

  • 40.
    Székely, Éva
    et al.
    School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland.
    Keane, Mark T
    Carson-Berndsen, Julie
    The Effect of Soft, Modal and Loud Voice Levels on Entrainment in Noisy Conditions2015Inngår i: Sixteenth Annual Conference of the International Speech Communication Association, 2015Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Conversation partners have a tendency to adapt their vocal intensity to each other and to other social and environmental factors. A socially adequate vocal intensity level by a speech synthesiser that goes beyond mere volume adjustment is highly desirable for a rewarding and successful human-machine or machine-mediated human-human interaction. This paper examines the interaction of the Lombard effect and speaker entrainment in a controlled experiment conducted with a confederate interlocutor. The interlocutor was asked to maintain either a soft, a modal or a loud voice level during the dialogues. Through half of the trials, subjects were exposed to a cocktail party noise through headphones. The analytical results suggest that both the background noise and the interlocutor's voice level affect the dynamics of speaker entrainment. Speakers appear to still entrain to the voice level of their interlocutor in noisy conditions, though to a lesser extent, as strategies of ensuring intelligibility affect voice levels as well. These findings could be leveraged in spoken dialogue systems and speech generating devices to help choose a vocal effort level for the synthetic voice that is both intelligible and socially suited to a specific interaction.

    Fulltekst (pdf)
    fulltext
  • 41. Székely, Éva
    et al.
    Steiner, Ingmar
    Ahmed, Zeeshan
    Carson-Berndsen, Julie
    Facial expression-based affective speech translation2014Inngår i: Journal on Multimodal User Interfaces, ISSN 1783-7677, E-ISSN 1783-8738, Vol. 8, nr 1, s. 87-96Artikkel i tidsskrift (Fagfellevurdert)
    Abstract [en]

    One of the challenges of speech-to-speech translation is to accurately preserve the paralinguistic information in the speaker's message. Information about the affect and emotional intent of a speaker is often carried in more than one modality. For this reason, the possibility of multimodal interaction with the system and the conversation partner may greatly increase the likelihood of a successful and gratifying communication process. In this work we explore the use of automatic facial expression analysis as an input annotation modality to transfer paralinguistic information at a symbolic level from input to output in speech-to-speech translation. To evaluate the feasibility of this approach, a prototype system, FEAST (facial expression-based affective speech translation), has been developed. FEAST classifies the emotional state of the user and uses it to render the translated output in an appropriate voice style, using expressive speech synthesis.
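
    The symbolic transfer step can be pictured as a simple mapping from a recognised facial-expression class to an expressive voice style for the synthesised translation. The class labels, style names and request format below are hypothetical illustrations, not FEAST's actual interfaces.

        # Hypothetical emotion-to-style lookup for the synthesis back-end.
        EMOTION_TO_STYLE = {
            "happy": "cheerful",
            "sad": "subdued",
            "angry": "tense",
            "neutral": "neutral",
        }

        def styled_synthesis_request(translated_text: str, facial_emotion: str) -> dict:
            style = EMOTION_TO_STYLE.get(facial_emotion, "neutral")
            return {"text": translated_text, "voice_style": style}

        print(styled_synthesis_request("I am very glad about that.", "happy"))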

  • 42.
    Székely, Éva
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Wagner, Petra
    KTH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    THE WRYLIE-BOARD: MAPPING ACOUSTIC SPACE OF EXPRESSIVE FEEDBACK TO ATTITUDE MARKERS2018Inngår i: Proc. IEEE Spoken Language Technology conference, 2018Konferansepaper (Fagfellevurdert)
    Fulltekst (pdf)
    fulltext
  • 43.
    Székely, Éva
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Wang, Siyang
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    So-to-Speak: an exploratory platform for investigating the interplay between style and prosody in TTS2023Inngår i: Interspeech 2023, International Speech Communication Association , 2023, s. 2016-2017Konferansepaper (Fagfellevurdert)
    Abstract [en]

    In recent years, numerous speech synthesis systems have been proposed that feature multi-dimensional controllability, generating a level of variability that surpasses traditional TTS systems by orders of magnitude. However, it remains challenging for developers to comprehend and demonstrate the potential of these advanced systems. We introduce So-to-Speak, a customisable interface tailored for showcasing the capabilities of different controllable TTS systems. The interface allows for the generation, synthesis, and playback of hundreds of samples simultaneously, displayed on an interactive grid, with variation in both low-level prosodic features and high-level style controls. To offer insights into speech quality, automatic estimates of MOS scores are presented for each sample. So-to-Speak facilitates the audiovisual exploration of the interaction between various speech features, which can be useful in a range of applications in speech technology.
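
    A sketch of how such a sample grid could be populated: cross one low-level prosodic control with one high-level style control and issue one synthesis request per cell. The control names, values and prompt are assumptions; the actual interface exposes more dimensions and attaches a predicted MOS estimate to each sample.

        from itertools import product

        speaking_rates = [0.8, 0.9, 1.0, 1.1, 1.2]        # low-level prosody axis
        styles = ["read", "conversational", "energetic"]    # high-level style axis

        grid = [
            {"text": "Okay, so where do we start?", "rate": r, "style": s}
            for r, s in product(speaking_rates, styles)
        ]
        # Each request would be sent to the controllable TTS engine and the resulting
        # audio placed in the corresponding grid cell for audiovisual comparison.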

  • 44.
    Torre, Ilaria
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL. Chalmers Univ Technol, Dept Comp Sci & Engn, Gothenburg, Sweden.
    Lagerstedt, Erik
    Univ Skövde, Sch Informat, Skövde, Sweden..
    Dennler, Nathaniel
    Univ Southern Calif, Dept Comp Sci, Los Angeles, CA 90007 USA..
    Seaborn, Katie
    Tokyo Inst Technol, Dept Ind Engn & Econ, Tokyo, Japan..
    Leite, Iolanda
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Robotik, perception och lärande, RPL.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Can a gender-ambiguous voice reduce gender stereotypes in human-robot interactions?2023Inngår i: 2023 32ND IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, RO-MAN, Institute of Electrical and Electronics Engineers (IEEE) , 2023, s. 106-112Konferansepaper (Fagfellevurdert)
    Abstract [en]

    When deploying a robot, its physical characteristics, role, and tasks are often fixed. Such factors can also be associated with gender stereotypes among humans, which then transfer to the robots. One factor that can induce gendering but is comparatively easy to change is the robot's voice. Designing the voice in a way that interferes with fixed factors might therefore be a way to reduce gender stereotypes in human-robot interaction contexts. To this end, we have conducted a video-based online study to investigate how factors that might inspire gendering of a robot interact. In particular, we investigated how giving the robot a gender-ambiguous voice can affect perception of the robot. We compared assessments (n=111) of videos in which a robot's body presentation and occupation mis/matched with human gender stereotypes. We found evidence that a gender-ambiguous voice can reduce gendering of a robot endowed with stereotypically feminine or masculine attributes. The results can inform more just robot design while opening new questions regarding the phenomenon of robot gendering.

  • 45. Wagner, Petra
    et al.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Betz, Simon
    Edlund, Jens
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Le Maguer, Sébastien
    Malisz, Zofia
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Tånnander, Christina
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Speech Synthesis Evaluation: State-of-the-Art Assessment and Suggestion for a Novel Research Program2019Inngår i: Proceedings of the 10th Speech Synthesis Workshop (SSW10), 2019Konferansepaper (Fagfellevurdert)
  • 46.
    Wang, Siyang
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Alexanderson, Simon
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Beskow, Jonas
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Integrated Speech and Gesture Synthesis2021Inngår i: ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction, Association for Computing Machinery (ACM) , 2021, s. 177-185Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified from state-of-the-art neural speech-synthesis engines to achieve this goal. We evaluate the models in three carefully-designed user studies, two of which evaluate the synthesized speech and gesture in isolation, plus a combined study that evaluates the models as they will be used in real-world applications - speech and gesture presented together. The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system we compare against, in all three tests. The model is able to achieve this with faster synthesis time and greatly reduced parameter count compared to the pipeline system, illustrating some of the potential benefits of treating speech and gesture synthesis together as a single, unified problem.

    Fulltekst (pdf)
    fulltext
  • 47.
    Wang, Siyang
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Evaluating Sampling-based Filler Insertion with Spontaneous TTS2022Inngår i: LREC 2022: Thirteen International Conference On Language Resources And Evaluation / [ed] Calzolari, N Bechet, F Blache, P Choukri, K Cieri, C Declerck, T Goggi, S Isahara, H Maegaard, B Mazo, H Odijk, H Piperidis, S, European Language Resources Association (ELRA) , 2022, s. 1960-1969Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Inserting fillers (such as "um", "like") into clean speech text has a rich history of study. One major application is to make dialogue systems sound more spontaneous. The ambiguity of filler occurrence and inter-speaker differences make both modeling and evaluation difficult. In this paper, we study sampling-based filler insertion, a simple yet unexplored approach to inserting fillers. We propose an objective score called Filler Perplexity (FPP). We build three models trained on two single-speaker spontaneous corpora, and evaluate them with FPP and perceptual tests. We implement two innovations in the perceptual tests: (1) evaluating filler insertion on dialogue system output, (2) synthesizing speech with neural spontaneous TTS engines. FPP proves to be useful in analysis but does not correlate well with perceptual MOS. Perceptual results show little difference between the compared filler insertion models, including with ground truth, which may be due to the ambiguity of what constitutes good filler insertion and a strong neural spontaneous TTS that produces natural speech irrespective of input. Results also show a preference for filler-inserted speech synthesized with spontaneous TTS. The same test using TTS based on read speech obtains the opposite results, which shows the importance of using spontaneous TTS in evaluating filler insertions. Audio samples: www.speech.kth.se/tts-demos/LREC22
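
    Sampling-based insertion in its simplest form draws a filler at word boundaries according to probabilities estimated from a spontaneous corpus. The probabilities and filler inventory below are made up purely for illustration; a trained model would condition on context per speaker, which is what Filler Perplexity is designed to score.

        import random

        P_INSERT = 0.08                              # chance of a filler at a word boundary
        FILLERS = [("um", 0.4), ("uh", 0.35), ("like", 0.25)]

        def insert_fillers(text: str, seed: int = 0) -> str:
            rng = random.Random(seed)
            out = []
            for word in text.split():
                if rng.random() < P_INSERT:
                    out.append(rng.choices([f for f, _ in FILLERS],
                                           weights=[w for _, w in FILLERS])[0])
                out.append(word)
            return " ".join(out)

        print(insert_fillers("so I think we should book the earlier train"))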

  • 48.
    Wang, Siyang
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS2023Inngår i: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, Institute of Electrical and Electronics Engineers (IEEE) , 2023Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
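
    Extracting a specific wav2vec2.0 layer as the intermediate representation can be done with the Hugging Face transformers library roughly as below. The checkpoint name is an assumed stand-in for an ASR-finetuned 12-layer model; note that index 9 of hidden_states is the 9th transformer layer, since index 0 holds the convolutional feature projection.

        import torch
        from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

        name = "facebook/wav2vec2-base-960h"          # assumed ASR-finetuned 12-layer checkpoint
        extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
        model = Wav2Vec2Model.from_pretrained(name, output_hidden_states=True).eval()

        waveform = torch.zeros(16000).numpy()         # stand-in for 1 s of 16 kHz audio
        inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            hidden_states = model(**inputs).hidden_states   # tuple of 13 tensors
        layer9 = hidden_states[9]                     # (1, frames, 768) layer-9 features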

  • 49.
    Wang, Siyang
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Henter, Gustav Eje
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Gustafson, Joakim
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    A comparative study of self-supervised speech representations in read and spontaneous TTS2023Manuskript (preprint) (Annet vitenskapelig)
    Abstract [en]

    Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

    Fulltekst (pdf)
    fulltext
  • 50.
    Ward, Nigel
    et al.
    University of Texas at El Paso.
    Kirkland, Ambika
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Wlodarczak, Marcin
    Stockholm University.
    Székely, Éva
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Intelligenta system, Tal, musik och hörsel, TMH.
    Two Pragmatic Functions of Breathy Voice in American English Conversation2022Inngår i: Proceedings 11th International Conference on Speech Prosody / [ed] Sónia Frota, Marisa Cruz and Marina Vigário, International Speech Communication Association, 2022, s. 82-86Konferansepaper (Fagfellevurdert)
    Abstract [en]

    Although the paralinguistic and phonological significance of breathy voice is well known, its pragmatic roles have been little studied. We report a systematic exploration of the pragmatic functions of breathy voice in American English, using a small corpus of casual conversations, using the Cepstral Peak Prominence Smoothed measure as an indicator of breathy voice, and using a common workflow to find prosodic constructions and identify their meanings. We found two prosodic constructions involving breathy voice. The first involves a short region of breathy voice in the midst of a region of low pitch, functioning to mark self-directed speech. The second involves breathy voice over several seconds, combined with a moment of wider pitch range leading to a high pitch over about a second, functioning to mark an attempt to establish common ground. These interpretations were confirmed by a perception experiment.

    Fulltekst (pdf)
    fulltext
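
    A simplified frame-wise cepstral peak prominence can serve as a rough stand-in for the CPPS measure used above (lower prominence indicating breathier voice). The sketch computes plain CPP on the natural-log cepstrum without the smoothing steps of CPPS, and runs on a synthetic test signal; frame sizes and the F0 search range are assumptions.

        import numpy as np
        import librosa

        def cepstral_peak_prominence(y, sr, frame_len=1024, hop=256, fmin=60, fmax=330):
            frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop).T
            window = np.hanning(frame_len)
            lo, hi = int(sr / fmax), int(sr / fmin)      # quefrency search range for the F0 peak
            q = np.arange(frame_len) / sr
            cpp = []
            for frame in frames:
                spec = np.abs(np.fft.rfft(frame * window)) + 1e-10
                ceps = np.fft.irfft(np.log(spec))        # real cepstrum (natural log, dB scaling omitted)
                peak_idx = lo + int(np.argmax(ceps[lo:hi]))
                trend = np.polyfit(q[lo:hi], ceps[lo:hi], 1)
                cpp.append(ceps[peak_idx] - np.polyval(trend, q[peak_idx]))
            return np.array(cpp)

        sr = 16000
        t = np.arange(sr) / sr
        y = 0.6 * np.sign(np.sin(2 * np.pi * 110 * t))                 # 110 Hz "voiced" source
        y = y + 0.1 * np.random.default_rng(0).normal(size=sr)         # added aspiration noise
        print(cepstral_peak_prominence(y, sr).mean())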