A hybrid harmonics-and-bursts modelling approach to speech synthesis
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
ORCID iD: 0000-0003-1399-6604
STTS Speech Technology Services, Stockholm, Sweden.
Number of Authors: 2
2016 (English)
In: Proceedings 9th ISCA Speech Synthesis Workshop, SSW 2016, The International Society for Computers and Their Applications (ISCA), 2016, p. 208-213
Conference paper, Published paper (Refereed)
Abstract [en]

Statistical speech synthesis systems rely on a parametric speech generation model, typically some sort of vocoder. Vocoders are great for voiced speech because they offer independent control over voice source (e.g. pitch) and vocal tract filter (e.g. vowel quality) through control parameters that typically vary smoothly in time and lend themselves well to statistical modelling. Voiceless sounds and transients such as plosives and fricatives, on the other hand, exhibit fundamentally different spectro-temporal behaviour. Here the benefits of the vocoder are not as clear. In this paper, we investigate a hybrid approach to modelling the speech signal, where speech is decomposed into a harmonic part and a noise burst part through spectrogram kernel filtering. The harmonic part is modelled using a vocoder and statistical parameter generation, while the burst part is modelled by concatenation. The two channels are then mixed together to form the final synthesized waveform. The proposed method was compared against a state-of-the-art statistical speech synthesis system (HTS 2.3) in a perceptual evaluation, which revealed that the harmonics-plus-bursts method was perceived as significantly more natural than the purely statistical variant.
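
The abstract does not specify which kernels are used for the spectrogram filtering, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a median-filter style separation on a magnitude spectrogram stored as `spec[f][t]`: a kernel stretched along time favours sustained harmonic partials, a kernel stretched along frequency favours short broadband bursts, and a soft mask splits each bin between the two channels. The function names, the kernel width and the `eps` constant are all hypothetical choices made for this example.

```python
from statistics import median


def median_along_time(spec, width):
    """Median-filter every frequency row along the time axis (spec[f][t])."""
    half = width // 2
    n_t = len(spec[0])
    return [[median(row[max(0, t - half):min(n_t, t + half + 1)])
             for t in range(n_t)] for row in spec]


def median_along_freq(spec, width):
    """Median-filter every time frame along the frequency axis."""
    half = width // 2
    n_f, n_t = len(spec), len(spec[0])
    return [[median([spec[k][t]
                     for k in range(max(0, f - half), min(n_f, f + half + 1))])
             for t in range(n_t)] for f in range(n_f)]


def split_harmonics_bursts(spec, width=5, eps=1e-12):
    """Split a magnitude spectrogram into harmonic and burst channels.

    Assumption: median kernels stand in for the unspecified kernels
    of the paper. Bins where both smoothed values are zero get a mask
    of zero, so any energy there goes to the burst channel.
    """
    smooth_t = median_along_time(spec, width)   # keeps horizontal ridges
    smooth_f = median_along_freq(spec, width)   # keeps vertical ridges
    harmonic, burst = [], []
    for row, h_row, b_row in zip(spec, smooth_t, smooth_f):
        # Soft (Wiener-like) mask: share of the bin assigned to harmonics.
        masks = [h * h / (h * h + b * b + eps) for h, b in zip(h_row, b_row)]
        harmonic.append([m * x for m, x in zip(masks, row)])
        burst.append([(1.0 - m) * x for m, x in zip(masks, row)])
    return harmonic, burst
```

Because the two masks sum to one in every bin, adding the harmonic and burst channels back together recovers the input spectrogram, which mirrors the final mixing step described in the abstract.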

Place, publisher, year, edition, pages
The International Society for Computers and Their Applications (ISCA), 2016, p. 208-213
Keywords [en]
concatenation, crowd source evaluation, hybrid speech synthesis, spectro-temporal filtering, statistical speech synthesis
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-332098
Scopus ID: 2-s2.0-85113587996
OAI: oai:DiVA.org:kth-332098
DiVA, id: diva2:1783398
Conference
9th ISCA Speech Synthesis Workshop, SSW 2016, Sunnyvale, United States of America, Sep 13 2016 - Sep 15 2016
Note

QC 20230720

Available from: 2023-07-20 Created: 2023-07-20 Last updated: 2023-07-20
Bibliographically approved

Open Access in DiVA

No full text in DiVA

Scopus

Authority records

Beskow, Jonas
