Evaluating Sampling-based Filler Insertion with Spontaneous TTS
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-0397-6442
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1175-840X
2022 (English) In: LREC 2022: Thirteen International Conference On Language Resources And Evaluation / [ed] Calzolari, N; Bechet, F; Blache, P; Choukri, K; Cieri, C; Declerck, T; Goggi, S; Isahara, H; Maegaard, B; Mazo, H; Odijk, H; Piperidis, S, European Language Resources Association (ELRA), 2022, p. 1960-1969
Conference paper, Published paper (Refereed)
Abstract [en]

Inserting fillers (such as "um" and "like") into clean speech text has a rich history of study. One major application is to make dialogue systems sound more spontaneous. The ambiguity of filler occurrence and inter-speaker differences make both modeling and evaluation difficult. In this paper, we study sampling-based filler insertion, a simple yet unexplored approach to inserting fillers. We propose an objective score called Filler Perplexity (FPP). We build three models trained on two single-speaker spontaneous corpora, and evaluate them with FPP and perceptual tests. We introduce two innovations in the perceptual tests: (1) evaluating filler insertion on dialogue system output, and (2) synthesizing speech with neural spontaneous TTS engines. FPP proves useful in analysis but does not correlate well with perceptual MOS. Perceptual results show little difference between the compared filler insertion models, including the ground truth, which may be due to the ambiguity of what constitutes good filler insertion and to a strong neural spontaneous TTS that produces natural speech irrespective of input. Results also show a preference for filler-inserted speech synthesized with spontaneous TTS. The same test using TTS based on read speech obtains the opposite result, which shows the importance of using spontaneous TTS when evaluating filler insertion. Audio samples: www.speech.kth.se/tts-demos/LREC22

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2022. p. 1960-1969
Keywords [en]
filler insertion, spontaneous text-to-speech, spoken dialogue system
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:kth:diva-324340
ISI: 000889371702007
Scopus ID: 2-s2.0-85144345531
OAI: oai:DiVA.org:kth-324340
DiVA, id: diva2:1740010
Conference
13th International Conference on Language Resources and Evaluation (LREC), June 20-25, 2022, Marseille, France
Note

QC 20230228

Available from: 2023-02-28. Created: 2023-02-28. Last updated: 2023-06-21. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Wang, Siyang; Gustafsson, Joakim; Székely, Éva
