kth.se Publications
Tånnander, Christina, Doktorand (ORCID iD: orcid.org/0000-0002-9659-1532)
Publications (10 of 26)
Edlund, J., Tånnander, C., Le Maguer, S. & Wagner, P. (2024). Assessing the impact of contextual framing on subjective TTS quality. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1205-1209). International Speech Communication Association
Assessing the impact of contextual framing on subjective TTS quality
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 1205-1209. Conference paper, Published paper (Refereed)
Abstract [en]

Text-To-Speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation-specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced MOS evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child's voice replacement, fiction audio books, and long and information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without the type of framing that was employed, if any.
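The MOS-by-framing comparison the abstract describes can be sketched as follows; the voices, framings, and ratings below are invented for illustration and are not the study's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical MOS ratings (1-5), one row per listener response:
# (voice, framing, rating). Real data would come from the crowdsourced study.
ratings = [
    ("voice_a", "audiobook", 4), ("voice_a", "audiobook", 5),
    ("voice_a", "robot", 2),     ("voice_a", "robot", 3),
]

def mos_by_framing(data):
    """Group ratings by (voice, framing) and return the mean opinion score
    for each group, so scores can be compared across framings."""
    groups = defaultdict(list)
    for voice, framing, score in data:
        groups[(voice, framing)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}

print(mos_by_framing(ratings))
```

With the toy data above, the same voice scores differently under the two framings, which is the kind of divergence the study reports.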

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation, framing, methodology, MOS
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358870 (URN)
10.21437/Interspeech.2024-781 (DOI)
2-s2.0-85214812427 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-27. Bibliographically approved
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2815-2819). International Speech Communication Association
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
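The continuous-feature conditioning scheme the abstract describes can be pictured with a toy sketch: each phoneme sits at a position in [0.0, 1.0] on every feature axis, and positions can be moved gradually. The feature names, values, and interpolation helper below are illustrative assumptions, not the paper's actual 11-feature US English inventory.

```python
# Toy feature axes; each phoneme has a specified position on every axis.
FEATURES = ["voicing", "nasality", "frontness"]

PHONEME_FEATURES = {
    "p": [0.0, 0.0, 1.0],   # voiceless, oral, front (bilabial)
    "b": [1.0, 0.0, 1.0],   # voiced counterpart of /p/
    "m": [1.0, 1.0, 1.0],   # voiced and nasal
}

def encode(phonemes):
    """Turn a phoneme sequence into the continuous feature matrix
    that would condition the TTS model."""
    return [PHONEME_FEATURES[p] for p in phonemes]

def interpolate(a, b, t):
    """Move gradually from phoneme a toward phoneme b along all feature
    axes, e.g. to probe categorical perception of one distinction."""
    va, vb = PHONEME_FEATURES[a], PHONEME_FEATURES[b]
    return [(1 - t) * x + t * y for x, y in zip(va, vb)]

print(encode("pb"))
print(interpolate("p", "b", 0.5))
```

The interpolation is what makes such inputs useful for speech-science probing: halfway between /p/ and /b/ only the voicing axis takes an intermediate value.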

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN)
10.21437/Interspeech.2024-1565 (DOI)
2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-28. Bibliographically approved
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
Prosodic characteristics of English-accented Swedish neural TTS
2024 (English) In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, p. 1035-1039. Conference paper, Published paper (Refereed)
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National Category
Humanities and the Arts; General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN)
10.21437/SpeechProsody.2024-209 (DOI)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Funder
Vinnova, 2018-02427
Note

QC 20240705

Available from: 2024-07-03 Created: 2024-07-03 Last updated: 2024-07-05. Bibliographically approved
Tånnander, C., Edlund, J. & Gustafsson, J. (2024). Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at the Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024 (pp. 14111-14121). European Language Resources Association (ELRA)
Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
2024 (English) In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 14111-14121. Conference paper, Published paper (Refereed)
Abstract [en]

In order to investigate the strengths and weaknesses of the Audience Response System (ARS) in text-to-speech synthesis (TTS) evaluations, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirm that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses of ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluation.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
audience response system, evaluation methodology, TTS evaluation
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-348784 (URN)
2-s2.0-85195897862 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20 2024 - May 25 2024
Note

Part of ISBN 9782493814104

QC 20240701

Available from: 2024-06-27 Created: 2024-06-27 Last updated: 2025-02-07. Bibliographically approved
Tånnander, C., House, D. & Edlund, J. (2023). Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis. In: Radek Skarnitzl and Jan Volín (Ed.), Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023. Paper presented at the 20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic (pp. 3156-3160). Prague, Czech Republic: GUARANT International
Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis
2023 (English) In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023 / [ed] Radek Skarnitzl and Jan Volín, Prague, Czech Republic: GUARANT International, 2023, p. 3156-3160. Conference paper, Published paper (Refereed)
Abstract [en]

Text-to-speech synthesis based on deep neural networks can generate highly humanlike speech, which revitalizes the potential for analysis-by-synthesis in speech research. We propose that neural synthesis can provide evidence that a specific distinction in its transcription system represents a robust acoustic/phonetic distinction in the speech used to train the model. We synthesized utterances with allophones in incorrect contexts and analyzed the results phonetically. Our assumption was that if we gained control over the allophonic variation in this way, it would provide strong evidence that the variation is governed robustly by the phonological context used to create the transcriptions. Of three allophonic variations investigated, the first, which was believed to be quite robust, gave us robust control over the variation, while the other two, which are less categorical, did not afford us such control. These findings are consistent with our hypothesis and support the notion that neural TTS can be a valuable analysis-by-synthesis tool for speech research.

Place, publisher, year, edition, pages
Prague, Czech Republic: GUARANT International, 2023
Keywords
analysis-by-synthesis, latent phonetic features, phonological variation, neural TTS
National Category
Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-336586 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic
Funder
Vinnova, 2018-02427
Note

Part of ISBN 978-80-908114-2-3

QC 20230915

Available from: 2023-09-14 Created: 2023-09-14 Last updated: 2025-02-10. Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Mapping specific characteristics of spoken text to listener ratings. In: Proceedings of Fonetik 2022. Paper presented at the Swedish Phonetics Meeting, Fonetik 2022, 13-15 June 2022, at Speech, Music and Hearing, KTH Royal Institute of Technology. Stockholm, Sweden
Mapping specific characteristics of spoken text to listener ratings
2022 (English) In: Proceedings of Fonetik 2022, Stockholm, Sweden, 2022. Conference paper, Published paper (Other academic)
Place, publisher, year, edition, pages
Stockholm, Sweden, 2022
National Category
Natural Language Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314986 (URN)
Conference
The Swedish Phonetics Meeting, Fonetik 2022, 13-15 June 2022, at Speech, Music and Hearing, KTH Royal Institute of Technology
Note

QCR 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2025-02-07. Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Sardin: speech-oriented text processing. In: Proceedings of Fonetik 2022. Paper presented at the Swedish Phonetics Meeting, Fonetik 2022, 13-15 June 2022, at Speech, Music and Hearing, KTH Royal Institute of Technology. Stockholm, Sweden
Sardin: speech-oriented text processing
2022 (English) In: Proceedings of Fonetik 2022, Stockholm, Sweden, 2022. Conference paper, Published paper (Other academic)
Place, publisher, year, edition, pages
Stockholm, Sweden, 2022
National Category
Natural Language Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314985 (URN)
Conference
The Swedish Phonetics Meeting, Fonetik 2022, 13-15 June 2022, at Speech, Music and Hearing, KTH Royal Institute of Technology
Note

QCR 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2025-02-07. Bibliographically approved
Tånnander, C., House, D. & Edlund, J. (2022). Syllable duration as a proxy to latent prosodic features. In: Proceedings of Speech Prosody 2022. Paper presented at Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal (pp. 220-224). Lisbon, Portugal: International Speech Communication Association
Syllable duration as a proxy to latent prosodic features
2022 (English) In: Proceedings of Speech Prosody 2022, Lisbon, Portugal: International Speech Communication Association, 2022, p. 220-224. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advances in deep-learning have pushed text-to-speech synthesis (TTS) very close to human speech. In deep-learning, latent features refer to features that are hidden from us; notwithstanding, we may meaningfully observe their effects. Analogously, latent prosodic features refer to the exact features that constitute e.g. prominence that are unknown to us, although we know (some of) the functions of prominence and (some of) its acoustic correlates. Deep-learned speech models capture prosody well, but leave us with little control and few insights. Previously, we explored average syllable duration on word level - a simple and accessible metric - as a proxy for prominence: in Swedish TTS, where verb particles and numerals tend to receive too little prominence, these were nudged towards lengthening while allowing the TTS models to otherwise operate freely. Listener panels overwhelmingly preferred the nudged versions to the unmodified TTS. In this paper, we analyse utterances from the modified TTS. The analysis shows that duration-nudging of relevant words changes the following features in an observable manner: duration is predictably lengthened, word-initial glottalization occurs, and the general intonation pattern changes. This supports the view of latent prosodic features that can be reflected in deep-learned models and accessed by proxy.
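The duration-nudging idea summarized above can be sketched roughly as follows. The part-of-speech tags, lengthening factor, and durations are hypothetical illustrations, not the authors' implementation: the point is only that words flagged as under-prominent (verb particles, numerals) are nudged toward longer average syllable duration while everything else is left untouched.

```python
NUDGE_FACTOR = 1.2  # assumed lengthening factor, for illustration only

def nudge_durations(words):
    """words: list of (token, pos_tag, avg_syllable_duration_sec).
    Return the list with particles and numerals lengthened by NUDGE_FACTOR,
    leaving other words unchanged."""
    nudged = []
    for token, pos, dur in words:
        if pos in ("PARTICLE", "NUMERAL"):
            dur *= NUDGE_FACTOR
        nudged.append((token, pos, round(dur, 3)))
    return nudged

# Toy Swedish sentence: "han gav upp tre" (verb particle "upp", numeral "tre").
sentence = [("han", "PRONOUN", 0.18), ("gav", "VERB", 0.20),
            ("upp", "PARTICLE", 0.15), ("tre", "NUMERAL", 0.16)]
print(nudge_durations(sentence))
```

In the paper's framing, such a nudge is only the proxy; the interesting finding is that the TTS model responds with coherent changes in glottalization and intonation as well.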

Place, publisher, year, edition, pages
Lisbon, Portugal: International Speech Communication Association, 2022
National Category
Other Humanities not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314984 (URN)
10.21437/SpeechProsody.2022-45 (DOI)
2-s2.0-85166333598 (Scopus ID)
Conference
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal
Note

QC 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2024-08-28. Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Towards a Swedish test set for speech-oriented text normalisation. Paper presented at the Swedish Language Technology Conference (SLTC), November 18-20, 2020, Göteborg. Göteborg: Göteborgs universitet
Towards a Swedish test set for speech-oriented text normalisation
2022 (English)Conference paper, Published paper (Other academic)
Abstract [en]

Text-to-speech synthesis (TTS) can be split into two steps: the preprocessor, which takes input text, including its encoding and formatting, and turns it into a representation that is accepted by the synthesizer, which in turn converts this representation into an acoustic waveform representing speech. TTS is commonly evaluated in terms of how intelligible or humanlike the speech is, where different synthesizers working on the same input representation are regularly compared, whereas the preprocessing is habitually ignored in TTS evaluation. Were we to evaluate preprocessing, we could evaluate it as a whole (e.g. compare its output for some input representation to a target phonemic representation) or as individual processes such as sentence detection, tokenisation, text normalisation (TN) and pronunciation generation. This paper focuses on the evaluation of speech-oriented text normalisation (STN), that is, the conversion of the input text into an expanded string of the words to be spoken, for example expansions of abbreviations and different types of numerals. It is a request for comments for the creation of a test set for the evaluation of Swedish STN, which can be used as a baseline for future STN models, and as part of the overall evaluation of Swedish speech-oriented preprocessing.
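A toy illustration of the STN step the abstract describes, expanding abbreviations and numerals into the words to be spoken; the lexicons and rules here are invented for illustration and are not the proposed test set or any real Swedish STN component.

```python
# Hypothetical expansion lexicons; a real STN system would be far richer
# and context-sensitive (e.g. inflecting numerals, resolving ambiguity).
ABBREVIATIONS = {"t.ex.": "till exempel", "kr": "kronor"}
NUMBERS = {"1": "en", "2": "två", "3": "tre"}

def normalise(text):
    """Expand each whitespace-separated token via the abbreviation and
    numeral lexicons, passing unknown tokens through unchanged."""
    out = []
    for token in text.split():
        out.append(ABBREVIATIONS.get(token, NUMBERS.get(token, token)))
    return " ".join(out)

print(normalise("det kostar 3 kr , t.ex."))
```

A test set of the kind the paper proposes would pair such raw inputs with reference expansions, so competing STN models can be scored against the same target strings.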

Place, publisher, year, edition, pages
Göteborg: Göteborgs universitet, 2022
Keywords
speech-oriented text processing, test set
National Category
Natural Language Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-323669 (URN)
Conference
Swedish Language Technology Conference (SLTC), November 18-20, 2020, Göteborg
Funder
Vinnova, 2018-02427
Note

QC 20230215

Available from: 2023-02-08 Created: 2023-02-08 Last updated: 2025-02-07. Bibliographically approved
Tånnander, C. & Edlund, J. (2021). Methods of slowing down speech. In: Proceedings of the 11th ISCA Speech Synthesis Workshop (SSW 11). Paper presented at the ISCA Speech Synthesis Workshop, August 26-28, 2021, Budapest (pp. 43-47).
Methods of slowing down speech
2021 (English) In: Proceedings of the 11th ISCA Speech Synthesis Workshop (SSW 11), 2021, p. 43-47. Conference paper, Published paper (Refereed)
Abstract [en]

A slower speaking rate of human or synthetic speech is often requested by, for example, language learners or people with aphasia or dementia. Slow speech produced by human speakers typically contains a larger number of pauses, and both pauses and speech have longer segment durations than in speech produced at a standard or fast speaking rate. This paper presents several methods of prolonging speech. Two speech chunks of about 30 seconds each, read by a professional voice talent at a very slow speaking rate, were used as reference. Seven pairs of stimuli containing the same word sequences were produced: one by the same professional, reading at her standard speaking rate, and six by a moderately slow synthetic voice trained on the same human voice. Different combinations of pause insertions and stretching were used to match the total length of the corresponding reference stimulus. Stretching was applied in different proportions to speech and non-speech, and pauses were inserted at punctuation, at certain phrase boundaries, between each word, or by copying the pause locations of the reference reading. A total of 128 crowdsourced listeners evaluated the 16 stimuli. The results show that all manipulated readings are less consistent with expectations of slow speech than the reference, but that the synthesised readings are comparable to stretched human speech. Key factors are the relation between speech and silence and the duration of talkspurts.
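One way to picture the length-matching manipulation described above: distribute the extra time needed to reach the reference duration between speech and pause segments in a chosen proportion. The function and numbers below are illustrative assumptions, not the paper's exact procedure.

```python
def stretch_plan(speech_sec, pause_sec, target_sec, speech_share=0.5):
    """Return new (speech, pause) total durations so the overall length
    matches target_sec, with speech_share of the added time going to
    speech segments and the rest to pauses."""
    extra = target_sec - (speech_sec + pause_sec)
    return (speech_sec + extra * speech_share,
            pause_sec + extra * (1 - speech_share))

# A 24 s standard-rate reading stretched to a 30 s reference, with most
# of the added time spent in pauses rather than stretched speech.
print(stretch_plan(20.0, 4.0, 30.0, speech_share=0.25))
```

Varying `speech_share` corresponds to the paper's different proportions of stretching applied to speech versus non-speech; pause *placement* (punctuation, phrase boundaries, between words) is a separate decision not modelled here.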

National Category
Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-304364 (URN)
10.21437/SSW.2021-8 (DOI)
Conference
ISCA Speech Synthesis Workshop, August 26-28, 2021, Budapest
Funder
Vinnova, 2018-02427
Note

QC 20211125

Available from: 2021-11-02 Created: 2021-11-02 Last updated: 2025-02-10. Bibliographically approved