Stress manipulation in text-to-speech synthesis using speaking rate categories
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH, Speech Communication. ORCID iD: 0000-0002-9659-1532
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-9327-9482
2021 (English). In: Proceedings of Fonetik 2021, Centre for Languages and Literature, Lund University / [ed] Anna Hjortdal and Mikael Roll, Lund, 2021, Vol. 56, pp. 17-22. Conference paper, Published paper (Other academic)
Abstract [en]

The challenge of controlling prosody in text-to-speech systems (TTS) is as old as TTS itself. The problem is not just to know what the desired stress or intonation patterns are, nor is it limited to knowing how to control specific speech parameters (e.g. durations, amplitude and fundamental frequency). We also need to know the precise speech parameter settings that correspond to a certain stress or intonation pattern over entire utterances.

We propose that the powerful TTS models afforded by deep neural networks (DNNs), combined with the fact that speech parameters often are correlated and vary in orchestration, allow us to solve at least some stress and intonation parts by influencing a single easy-to-control parameter, rather than detailed control over many parameters.

The paper presents a straightforward method of guiding word durations without recording training material especially for this purpose. The resulting TTS engine is used to produce sentences containing Swedish words that are unstressed in their most common function, but stressed in another common function. The sentences are designed so that it is clear to a listener that the second function is the intended one. In these cases, TTS engines often fail and produce an unstressed version.

A group of 20 listeners compared samples that the TTS produced without guidance with samples where it was instructed to slow down the test words. The listeners almost unanimously preferred the latter version. This supports the notion that due to the orchestrated variation of speech characteristics and the strength of modern DNN models, we can provide prosodic guidance to DNN-based TTS systems without having to control every characteristic in detail.

Place, publisher, year, edition, pages
Lund, 2021. Vol. 56, pp. 17-22
National subject category
Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-304363
OAI: oai:DiVA.org:kth-304363
DiVA, id: diva2:1608077
Conference
Fonetik 2021, 8-9 June 2021
Note

QC 20211216

Available from: 2021-11-02 Created: 2021-11-02 Last updated: 2025-02-10 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Published fulltext

Person

Tånnander, Christina; Edlund, Jens
