Text-to-speech synthesis (TTS) can be split into two components: the preprocessor, which takes input text, including its encoding and formatting, and turns it into a representation accepted by the synthesizer, and the synthesizer, which converts this representation into an acoustic waveform representing speech. TTS is commonly evaluated in terms of how intelligible or humanlike the speech is, and different synthesizers working on the same input representation are regularly compared, whereas preprocessing is habitually ignored in TTS evaluation. Were we to evaluate preprocessing, we could evaluate it as a whole (e.g. compare its output for some input representation to a target phonemic representation) or as individual processes such as sentence detection, tokenisation, text normalisation (TN) and pronunciation generation. This paper focuses on the evaluation of speech-oriented text normalisation (STN), that is, the conversion of the input text into an expanded string of the words to be spoken, for example expansions of abbreviations and different types of numerals. It is a request for comments on the creation of a test set for the evaluation of Swedish STN, which can be used as a baseline for future STN models, and as part of the overall evaluation of Swedish speech-oriented preprocessing.
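As a rough illustration of what evaluating STN against a test set could look like, the sketch below scores a normaliser by sentence-level exact match against target expansions. Everything in it is an assumption made for illustration: the four test items, the lookup tables, and the names `toy_normalise` and `sentence_accuracy` are hypothetical and are not taken from the proposed test set or from any existing system.

```python
from typing import Callable, List, Tuple

# Hypothetical toy test set: (input text, target spoken-form expansion).
# These items are invented for illustration only.
TEST_SET: List[Tuple[str, str]] = [
    ("t.ex. 3 bilar", "till exempel tre bilar"),
    ("ca 2 timmar", "cirka två timmar"),
    ("bl.a. 5 böcker", "bland annat fem böcker"),
    ("1 hus", "ett hus"),  # needs neuter "ett", not common-gender "en"
]

# Hypothetical context-free lookup tables for a deliberately naive normaliser.
ABBREVIATIONS = {"t.ex.": "till exempel", "ca": "cirka", "bl.a.": "bland annat"}
NUMERALS = {"1": "en", "2": "två", "3": "tre", "5": "fem"}


def toy_normalise(text: str) -> str:
    """Expand each whitespace-separated token by table lookup, ignoring context."""
    tokens = text.split()
    expanded = [ABBREVIATIONS.get(tok, NUMERALS.get(tok, tok)) for tok in tokens]
    return " ".join(expanded)


def sentence_accuracy(
    normalise: Callable[[str], str], test_set: List[Tuple[str, str]]
) -> float:
    """Fraction of items whose normalised output equals the target exactly."""
    correct = sum(1 for source, target in test_set if normalise(source) == target)
    return correct / len(test_set)


if __name__ == "__main__":
    print(f"Sentence accuracy: {sentence_accuracy(toy_normalise, TEST_SET):.2f}")
```

The naive normaliser fails the last item because it expands "1" without regard to the grammatical gender of the following noun, which is the kind of context-dependent case a test set for Swedish STN would presumably need to cover. Exact match is only one possible metric; token-level or category-level scoring may be more informative.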
QC 20230215