Tånnander, Christina, PhD student. ORCID iD: orcid.org/0000-0002-9659-1532
Publications (10 of 22)
Tånnander, C., House, D. & Edlund, J. (2023). Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis. In: Radek Skarnitzl and Jan Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023. Paper presented at 20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic (pp. 3156-3160). Prague, Czech Republic: GUARANT International
2023 (English). In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023 / [ed] Radek Skarnitzl and Jan Volín, Prague, Czech Republic: GUARANT International, 2023, p. 3156-3160. Conference paper, Published paper (Refereed)
Abstract [en]

Text-to-speech synthesis based on deep neural networks can generate highly humanlike speech, which revitalizes the potential for analysis-by-synthesis in speech research. We propose that neural synthesis can provide evidence that a specific distinction in its transcription system represents a robust acoustic/phonetic distinction in the speech used to train the model. We synthesized utterances with allophones in incorrect contexts and analyzed the results phonetically. Our assumption was that if we gained control over the allophonic variation in this way, it would provide strong evidence that the variation is governed robustly by the phonological context used to create the transcriptions. Of three allophonic variations investigated, the first, which was believed to be quite robust, gave us robust control over the variation, while the other two, which are less categorical, did not afford us such control. These findings are consistent with our hypothesis and support the notion that neural TTS can be a valuable analysis-by-synthesis tool for speech research.

Place, publisher, year, edition, pages
Prague, Czech Republic: GUARANT International, 2023
Keywords
analysis-by-synthesis, latent phonetic features, phonological variation, neural TTS
National Category
Other Engineering and Technologies not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-336586 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic
Funder
Vinnova, 2018-02427
Note

Part of ISBN 978-80-908114-2-3

QC 20230915

Available from: 2023-09-14 Created: 2023-09-14 Last updated: 2023-09-15 Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Mapping specific characteristics of spoken text to listener ratings. In: Proceedings of Fonetik 2022. Paper presented at the phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel, Kungliga Tekniska Högskolan. Stockholm, Sweden
2022 (English). In: Proceedings of Fonetik 2022, Stockholm, Sweden, 2022. Conference paper, Published paper (Other academic)
Place, publisher, year, edition, pages
Stockholm, Sweden, 2022
National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314986 (URN)
Conference
The phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel, Kungliga Tekniska Högskolan
Note

QCR 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2022-06-28 Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Sardin: speech-oriented text processing. In: Proceedings of Fonetik 2022. Paper presented at the phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel, Kungliga Tekniska Högskolan. Stockholm, Sweden
2022 (English). In: Proceedings of Fonetik 2022, Stockholm, Sweden, 2022. Conference paper, Published paper (Other academic)
Place, publisher, year, edition, pages
Stockholm, Sweden, 2022
National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314985 (URN)
Conference
The phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel, Kungliga Tekniska Högskolan
Note

QCR 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2022-06-28 Bibliographically approved
Tånnander, C., House, D. & Edlund, J. (2022). Syllable duration as a proxy to latent prosodic features. In: Proceedings of Speech Prosody 2022. Paper presented at Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal (pp. 220-224). Lisbon, Portugal: International Speech Communication Association
2022 (English). In: Proceedings of Speech Prosody 2022, Lisbon, Portugal: International Speech Communication Association, 2022, p. 220-224. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advances in deep-learning have pushed text-to-speech synthesis (TTS) very close to human speech. In deep-learning, latent features refer to features that are hidden from us; notwithstanding, we may meaningfully observe their effects. Analogously, latent prosodic features refer to the exact features that constitute e.g. prominence that are unknown to us, although we know (some of) the functions of prominence and (some of) its acoustic correlates. Deep-learned speech models capture prosody well, but leave us with little control and few insights. Previously, we explored average syllable duration on word level - a simple and accessible metric - as a proxy for prominence: in Swedish TTS, where verb particles and numerals tend to receive too little prominence, these were nudged towards lengthening while allowing the TTS models to otherwise operate freely. Listener panels overwhelmingly preferred the nudged versions to the unmodified TTS. In this paper, we analyse utterances from the modified TTS. The analysis shows that duration-nudging of relevant words changes the following features in an observable manner: duration is predictably lengthened, word-initial glottalization occurs, and the general intonation pattern changes. This supports the view of latent prosodic features that can be reflected in deep-learned models and accessed by proxy.

Place, publisher, year, edition, pages
Lisbon, Portugal: International Speech Communication Association, 2022
National Category
Other Humanities not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314984 (URN)
10.21437/SpeechProsody.2022-45 (DOI)
Conference
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal
Note

QC 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2022-06-28 Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Towards a Swedish test set for speech-oriented text normalisation. Paper presented at Swedish Language Technology Conference (SLTC), November 18-20 2020, Göteborg. Göteborg: Göteborgs universitet
2022 (English). Conference paper, Published paper (Other academic)
Abstract [en]

Text-to-speech synthesis (TTS) can be split into two steps: the preprocessor, which takes input text, including its encoding and formatting, and turns it into a representation that is accepted by the synthesizer, which in turn converts this representation into an acoustic waveform representing speech. TTS is commonly evaluated in terms of how intelligible or humanlike the speech is, where different synthesizers working on the same input representation are regularly compared, whereas the preprocessing is habitually ignored in TTS evaluation. Were we to evaluate preprocessing, we could evaluate it as a whole (e.g. compare its output for some input representation to a target phonemic representation) or as individual processes such as sentence detection, tokenisation, text normalisation (TN) and pronunciation generation. This paper focuses on the evaluation of speech-oriented text normalisation (STN), that is, the conversion of the input text into an expanded string of the words to be spoken, for example expansions of abbreviations and different types of numerals. It is a request for comments for the creation of a test set for the evaluation of Swedish STN, which can be used as a baseline for future STN models, and as part of the overall evaluation of Swedish speech-oriented preprocessing.

Place, publisher, year, edition, pages
Göteborg: Göteborgs universitet, 2022
Keywords
speech-oriented text processing, test set
National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-323669 (URN)
Conference
Swedish Language Technology Conference (SLTC), November 18-20 2020, Göteborg
Funder
Vinnova, 2018-02427
Note

QC 20230215

Available from: 2023-02-08 Created: 2023-02-08 Last updated: 2023-11-14 Bibliographically approved
Tånnander, C. & Edlund, J. (2021). Methods of slowing down speech. In: Proceedings. 11th ISCA Speech Synthesis Workshop (SSW 11). Paper presented at ISCA Speech Synthesis Workshop, August 26-28 2021, Budapest (pp. 43-47).
2021 (English). In: Proceedings. 11th ISCA Speech Synthesis Workshop (SSW 11), 2021, p. 43-47. Conference paper, Published paper (Refereed)
Abstract [en]

A slower speaking rate of human or synthetic speech is often requested by, for example, language learners or people with aphasia or dementia. Slow speech produced by human speakers typically contains a larger number of pauses, and both pauses and speech have longer segment durations than speech produced at a standard or fast speaking rate. This paper presents several methods of prolonging speech. Two speech chunks of about 30 seconds each, read by a professional voice talent at a very slow speaking rate, were used as reference. Seven pairs of stimuli containing the same word sequences were produced, one by the same professional reading at her standard speaking rate and six by a moderately slow synthetic voice trained on the same human voice. Different combinations of pause insertions and stretching were used to match the total length of the corresponding reference stimulus. Stretching was applied in different proportions to speech and non-speech, and pauses were inserted at punctuation marks, at certain phrase boundaries, between each word, or by copying the pause locations of the reference reading. 128 crowdsourced listeners evaluated the 16 stimuli. The results show that all manipulated readings are less consistent with expectations of slow speech than the reference, but that the synthesised readings are comparable to stretched human speech. Key factors are the relation between speech and silence and the duration of talkspurts.

National Category
Other Engineering and Technologies not elsewhere specified
Identifiers
urn:nbn:se:kth:diva-304364 (URN)
10.21437/SSW.2021-8 (DOI)
Conference
ISCA Speech Synthesis Workshop, August 26-28 2021, Budapest
Funder
Vinnova, 2018-02427
Note

QC 20211125

Available from: 2021-11-02 Created: 2021-11-02 Last updated: 2022-06-25 Bibliographically approved
Tånnander, C. & Edlund, J. (2021). Self-perceived preferences of voice and speaking style characteristics in spoken text. Paper presented at Swedish Language Technology Conference (SLTC) 2021.
2021 (English). Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

119 respondents expressed their opinions in a survey on voice and speaking style characteristics in the context of listening experience as a step towards a better understanding of which characteristics make for a good voice. We found consensus on some characteristics (e.g. a soft voice is positive, and a forced voice is negative), but also noted that the text type seems to affect opinions (e.g. dramatic reading is preferred by some fiction listeners but disliked by university textbook listeners).

National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-304367 (URN)
Conference
Swedish Language Technology Conference (SLTC) 2021
Note

QC 20211103

Available from: 2021-11-02 Created: 2021-11-02 Last updated: 2024-03-18 Bibliographically approved
Tånnander, C. & Edlund, J. (2021). Stress manipulation in text-to-speech synthesis using speaking rate categories. In: Anna Hjortdal and Mikael Roll (Eds.), Proceedings of Fonetik 2021, Centre for Languages and Literature, Lund University. Paper presented at Fonetik 2021, 8-9 June 2021 (pp. 17-22). Lund, Vol. 56
2021 (English). In: Proceedings of Fonetik 2021, Centre for Languages and Literature, Lund University / [ed] Anna Hjortdal and Mikael Roll, Lund, 2021, Vol. 56, p. 17-22. Conference paper, Published paper (Other academic)
Abstract [en]

The challenge of controlling prosody in text-to-speech systems (TTS) is as old as TTS itself. The problem is not just to know what the desired stress or intonation patterns are, nor is it limited to knowing how to control specific speech parameters (e.g. durations, amplitude and fundamental frequency). We also need to know the precise speech parameter settings that correspond to a certain stress or intonation pattern over entire utterances. We propose that the powerful TTS models afforded by deep neural networks (DNNs), combined with the fact that speech parameters often are correlated and vary in orchestration, allow us to solve at least some stress and intonation parts by influencing a single easy-to-control parameter, rather than detailed control over many parameters. The paper presents a straightforward method of guiding word durations without recording training material especially for this purpose. The resulting TTS engine is used to produce sentences containing Swedish words that are unstressed in their most common function, but stressed in another common function. The sentences are designed so that it is clear to a listener that the second function is the intended one. In these cases, TTS engines often fail and produce an unstressed version. A group of 20 listeners compared samples that the TTS produced without guidance with samples where it was instructed to slow down the test words. The listeners almost unanimously preferred the latter version. This supports the notion that due to the orchestrated variation of speech characteristics and the strength of modern DNN models, we can provide prosodic guidance to DNN-based TTS systems without having to control every characteristic in detail.

Place, publisher, year, edition, pages
Lund, 2021
National Category
Other Engineering and Technologies not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-304363 (URN)
Conference
Fonetik 2021, 8-9 June 2021
Note

QC 20211216

Available from: 2021-11-02 Created: 2021-11-02 Last updated: 2022-06-25 Bibliographically approved
Tånnander, C. & Edlund, J. (2019). First steps towards text profiling for speech synthesis. In: CEUR Workshop Proceedings. Paper presented at 4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, Copenhagen, Denmark, 5-8 March 2019 (pp. 457-468). CEUR-WS
2019 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2019, p. 457-468. Conference paper, Published paper (Refereed)
Abstract [en]

We discuss an important yet under-studied domain of language and speech research: spoken text. Spoken text is language that was originally produced as text, then presented to recipients as speech. From a research perspective, this domain warrants special treatment, and we propose a classification that affords a structured approach based on a division of a linguistic message to be investigated into a primary (original) and secondary (studied) form. Secondly, we present the MTM Read Aloud corpus (MTM-RAC), a Swedish text and speech corpus built on in excess of 10,000 books. The corpus is closed access due to copyright restrictions on the material, but the methods developed and the results of our work on the corpus are available for use with similar corpora. MTM-RAC is designed with spoken text in mind and contains texts that have been read aloud in order to produce talking books, either by a human or using speech synthesis (i.e. text-to-speech) and the corresponding sound files. Finally, as the main purpose of the corpus is to explore and evaluate different aspects of text profiling for the purpose of reading aloud, we present first insights into this kind of profiling, based on experiments carried out on the corpus.

Place, publisher, year, edition, pages
CEUR-WS, 2019
Keywords
Read aloud text, Spoken text, Talking books, Text profiling, Classification (of information), Special treatments, Speech research, Structured approach, Text to speech, Speech synthesis
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-280379 (URN)
2-s2.0-85066028139 (Scopus ID)
Conference
4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, Copenhagen, Denmark, 5-8 March 2019
Note

QC 20200907

Available from: 2020-09-07 Created: 2020-09-07 Last updated: 2022-06-25 Bibliographically approved
Tånnander, C. & Edlund, J. (2019). Preliminary guidelines for the efficient management of OOV words for spoken text. In: Speech Synthesis Workshop (SSW). Paper presented at Interspeech (pp. 137-142). Vol. 10
2019 (English). In: Speech Synthesis Workshop (SSW), 2019, Vol. 10, p. 137-142. Conference paper, Published paper (Refereed)
Abstract [en]

We investigate the practical short-term and long-term effects of five different frequency ranks used for selecting which out-of-vocabulary (OOV) words to add to a pronunciation lexicon for text-to-speech (TTS) of university textbooks. The work is an empirical study on a corpus of 200 university textbooks selected for talking book production and it takes the extensive pronunciation lexicon of a commercial text-to-speech system as its baseline. The main take-home message is a short but succinct set of guidelines that promise to increase the efficiency of OOV management, at least for text-to-speech production of university textbooks.

National Category
Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-273020 (URN)
Conference
Interspeech
Note

QC 20200511

Available from: 2020-05-05 Created: 2020-05-05 Last updated: 2024-03-18 Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-9659-1532