Edlund, Jens, Docent/Associate Professor. ORCID iD: orcid.org/0000-0001-9327-9482
Publications (10 of 141)
Tånnander, C., House, D. & Edlund, J. (2023). Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis. In: Radek Skarnitzl and Jan Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023. Paper presented at 20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic (pp. 3156-3160). Prague, Czech Republic: GUARANT International
2023 (English). In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023 / [ed] Radek Skarnitzl and Jan Volín, Prague, Czech Republic: GUARANT International, 2023, p. 3156-3160. Conference paper, Published paper (Refereed)
Abstract [en]

Text-to-speech synthesis based on deep neural networks can generate highly humanlike speech, which revitalizes the potential for analysis-by-synthesis in speech research. We propose that neural synthesis can provide evidence that a specific distinction in its transcription system represents a robust acoustic/phonetic distinction in the speech used to train the model. We synthesized utterances with allophones in incorrect contexts and analyzed the results phonetically. Our assumption was that if we gained control over the allophonic variation in this way, it would provide strong evidence that the variation is governed robustly by the phonological context used to create the transcriptions. Of three allophonic variations investigated, the first, which was believed to be quite robust, gave us robust control over the variation, while the other two, which are less categorical, did not afford us such control. These findings are consistent with our hypothesis and support the notion that neural TTS can be a valuable analysis-by-synthesis tool for speech research.

Place, publisher, year, edition, pages
Prague, Czech Republic: GUARANT International, 2023
Keywords
analysis-by-synthesis, latent phonetic features, phonological variation, neural TTS
National Category
Other Engineering and Technologies not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-336586 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic
Funder
Vinnova, 2018-02427
Note

Part of ISBN 978-80-908 114-2-3

QC 20230915

Available from: 2023-09-14 Created: 2023-09-14 Last updated: 2023-09-15 Bibliographically approved
Fallgren, P. & Edlund, J. (2023). Crowdsource-based validation of the audio cocktail as a sound browsing tool. In: Interspeech 2023. Paper presented at 24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, Aug 20 2023 - Aug 24 2023 (pp. 2178-2182). International Speech Communication Association
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 2178-2182. Conference paper, Published paper (Refereed)
Abstract [en]

We conduct two crowdsourcing experiments designed to examine the usefulness of audio cocktails to quickly find out information on the contents of large audio data. Several thousand crowd workers were engaged to listen to audio cocktails with systematically varied composition. They were then asked to state either which sound out of four categories (Children, Women, Men, Orchestra) they heard the most of, or if they heard anything of a specific category at all. The results show that their responses have high reliability and provide information as to whether a specific task can be performed using audio cocktails. We also propose that the combination of crowd workers and audio cocktails can be used directly as a tool to investigate the contents of large audio data.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
annotation, exploration, found speech, hearing, human-in-the-loop
National Category
Language Technology (Computational Linguistics); Other Humanities not elsewhere specified
Identifiers
urn:nbn:se:kth:diva-337834 (URN); 10.21437/Interspeech.2023-2473 (DOI); 2-s2.0-85171584146 (Scopus ID)
Conference
24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231009

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2023-10-09 Bibliographically approved
Ekström, A. G. & Edlund, J. (2023). Evolution of the human tongue and emergence of speech biomechanics. Frontiers in Psychology, 14, Article ID 1150778.
2023 (English). In: Frontiers in Psychology, E-ISSN 1664-1078, Vol. 14, article id 1150778. Article, review/survey (Refereed). Published
Abstract [en]

The tongue is one of the organs most central to human speech. Here, the evolution and species-unique properties of the human tongue are traced, via reference to the apparent articulatory behavior of extant non-human great apes, and fossil findings from early hominids - from the point of view of articulatory phonetics, the science of human speech production. Increased lingual flexibility provided the possibility of mapping of articulatory targets, possibly via exaptation of manual-gestural mapping capacities evident in extant great apes. The emergence of the human-specific tongue, its properties, and morphology were crucial to the evolution of human articulate speech.

Place, publisher, year, edition, pages
Frontiers Media SA, 2023
Keywords
evolution of speech, speech articulation, human evolution, speech production, primatology, articulatory phonetics, coarticulation, speech motor control
National Category
Other Medical Sciences not elsewhere specified
Identifiers
urn:nbn:se:kth:diva-330517 (URN); 10.3389/fpsyg.2023.1150778 (DOI); 001004893900001 (); 37325743 (PubMedID); 2-s2.0-85162047256 (Scopus ID)
Note

QC 20230630

Available from: 2023-06-30 Created: 2023-06-30 Last updated: 2023-06-30 Bibliographically approved
Borin, L., Domeij, R., Edlund, J. & Forsberg, M. (2023). Language Report Swedish. In: Cognitive Technologies (pp. 219-222). Springer Nature, Part F280
2023 (English). In: Cognitive Technologies, Springer Nature, 2023, Vol. Part F280, p. 219-222. Chapter in book (Other academic)
Abstract [en]

Swedish speech and language technology (LT) research goes back over 70 years. This has paid off: there is a national research infrastructure, as well as significant research projects, and Swedish is well-endowed with language resources (LRs) and tools. However, there are gaps that need to be filled, especially the high-quality gold-standard LRs required by the most recent deep-learning methods. In the future, we would like to see closer collaborations and communication between the “traditional” LT research community and the burgeoning AI field, the establishment of dedicated academic LT training programmes, and national funding for LT research.

Place, publisher, year, edition, pages
Springer Nature, 2023
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-333012 (URN); 10.1007/978-3-031-28819-7_36 (DOI); 2-s2.0-85161882703 (Scopus ID)
Note

QC 20230725

Available from: 2023-07-25 Created: 2023-07-25 Last updated: 2023-09-05 Bibliographically approved
Pandey, A., Edlund, J., Le Maguer, S. & Harte, N. (2023). Listener sensitivity to deviating obstruents in WaveNet. In: Interspeech 2023. Paper presented at 24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, Aug 20 2023 - Aug 24 2023 (pp. 1080-1084). International Speech Communication Association
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 1080-1084. Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates the perceptual significance of the deviation in obstruents previously observed in WaveNet vocoders. The study involved presenting stimuli of varying lengths to 128 participants, who were asked to identify whether each stimulus was produced by a human or a machine. The participants' responses were captured using a 2-alternative forced choice task. The study found that while the length of the stimuli did not reliably affect participants' accuracy in the task, the concentration of obstruents did have a significant effect. Participants were consistently more accurate in identifying WaveNet stimuli as machine when the phrases were obstruent-rich. These findings show that the deviation in obstruents reported in WaveNet voices is perceivable by human listeners. The test protocol may be of wider utility in TTS.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
distortion, obstruents, perception, TTS evaluation, WaveNet
National Category
Psychology (excluding Applied Psychology)
Identifiers
urn:nbn:se:kth:diva-337831 (URN); 10.21437/Interspeech.2023-1843 (DOI); 2-s2.0-85171585188 (Scopus ID)
Conference
24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231009

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2023-10-09 Bibliographically approved
Edlund, J., Brodén, D., Fridlund, M., Lindhé, C., Olsson, L.-J., Ängsal, M. & Öhberg, P. (2022). A Multimodal Digital Humanities Study of Terrorism in Swedish Politics: An Interdisciplinary Mixed Methods Project on the Configuration of Terrorism in Parliamentary Debates, Legislation, and Policy Networks 1968–2018. In: Lecture Notes in Networks and Systems. Paper presented at Intelligent Systems Conference, IntelliSys 2021, Virtual, Online, 2 September 2021 to 3 September 2021 (pp. 435-449). Springer Nature, 295
2022 (English). In: Lecture Notes in Networks and Systems, Springer Nature, 2022, Vol. 295, p. 435-449. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents the design of one of Sweden’s largest digital humanities projects, SweTerror, that through an interdisciplinary multi-modal methodological approach develops an extensive speech-to-text digital HSS resource. SweTerror makes a major contribution to the study of terrorism in Sweden through a comprehensive mixed methods study of the political discourse on terrorism since the late 1960s. Drawing on artificial intelligence in the form of state-of-the-art language and speech technology, it systematically analyses all forms of relevant parliamentary utterances. It explores and curates an exhaustive but understudied multi-modal collection of primary sources of central relevance to Swedish democracy: the audio recordings of the Swedish Parliament’s debates. The project studies the framing of terrorism both as policy discourse and enacted politics, examining semantic and emotive components of the parliamentary discourse on terrorism as well as major actors and social networks involved. It covers political responses to a range of terrorism-related issues as well as factors influencing policy-makers’ engagement, including political affiliations and gender. SweTerror also develops an online research portal, featuring the complete research material and searchable audio made readily accessible for further exploration. Long-term, the project establishes a model for combining extraction technologies (speech recognition and analysis) for audiovisual parliamentary data with text mining and HSS interpretive methods and the portal is designed to serve as a prototype for other similar projects.

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
Multimodal digital humanities, Speech technology, Terrorism studies
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-311197 (URN); 10.1007/978-3-030-82196-8_32 (DOI); 2-s2.0-85113468147 (Scopus ID)
Conference
Intelligent Systems Conference, IntelliSys 2021, Virtual, Online, 2 September 2021 to 3 September 2021
Note

Part of proceedings: ISBN 978-3-030-82195-1

QC 20220425

Available from: 2022-04-25 Created: 2022-04-25 Last updated: 2023-01-16 Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Mapping specific characteristics of spoken text to listener ratings. In: Proceedings of Fonetik 2022. Paper presented at the phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel, Kungliga Tekniska Högskolan. Stockholm, Sweden
2022 (English). In: Proceedings of Fonetik 2022, Stockholm, Sweden, 2022. Conference paper, Published paper (Other academic)
Place, publisher, year, edition, pages
Stockholm, Sweden, 2022
National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314986 (URN)
Conference
The phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel, Kungliga Tekniska Högskolan
Note

QCR 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2022-06-28 Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Sardin: speech-oriented text processing. In: Proceedings of Fonetik 2022. Paper presented at the phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel, Kungliga Tekniska Högskolan. Stockholm, Sweden
2022 (English). In: Proceedings of Fonetik 2022, Stockholm, Sweden, 2022. Conference paper, Published paper (Other academic)
Place, publisher, year, edition, pages
Stockholm, Sweden, 2022
National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314985 (URN)
Conference
The phonetics meeting Fonetik 2022, 13-15 June 2022, at Tal, musik och hörsel, Kungliga Tekniska Högskolan
Note

QCR 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2022-06-28 Bibliographically approved
Tånnander, C., House, D. & Edlund, J. (2022). Syllable duration as a proxy to latent prosodic features. In: Proceedings of Speech Prosody 2022. Paper presented at Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal (pp. 220-224). Lisbon, Portugal: International Speech Communication Association
2022 (English). In: Proceedings of Speech Prosody 2022, Lisbon, Portugal: International Speech Communication Association, 2022, p. 220-224. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advances in deep-learning have pushed text-to-speech synthesis (TTS) very close to human speech. In deep-learning, latent features refer to features that are hidden from us; notwithstanding, we may meaningfully observe their effects. Analogously, latent prosodic features refer to the exact features that constitute e.g. prominence that are unknown to us, although we know (some of) the functions of prominence and (some of) its acoustic correlates. Deep-learned speech models capture prosody well, but leave us with little control and few insights. Previously, we explored average syllable duration on word level - a simple and accessible metric - as a proxy for prominence: in Swedish TTS, where verb particles and numerals tend to receive too little prominence, these were nudged towards lengthening while allowing the TTS models to otherwise operate freely. Listener panels overwhelmingly preferred the nudged versions to the unmodified TTS. In this paper, we analyse utterances from the modified TTS. The analysis shows that duration-nudging of relevant words changes the following features in an observable manner: duration is predictably lengthened, word-initial glottalization occurs, and the general intonation pattern changes. This supports the view of latent prosodic features that can be reflected in deep-learned models and accessed by proxy.

Place, publisher, year, edition, pages
Lisbon, Portugal: International Speech Communication Association, 2022
National Category
Other Humanities not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314984 (URN); 10.21437/SpeechProsody.2022-45 (DOI)
Conference
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal
Note

QC 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2022-06-28 Bibliographically approved
Tånnander, C. & Edlund, J. (2022). Towards a Swedish test set for speech-oriented text normalisation. Paper presented at Swedish Language Technology Conference (SLTC), November 18-20 2020, Göteborg. Göteborg: Göteborgs universitet
2022 (English). Conference paper, Published paper (Other academic)
Abstract [en]

Text-to-speech synthesis (TTS) can be split into two steps: the preprocessor, which takes input text, including its encoding and formatting, and turns it into a representation that is accepted by the synthesizer, which in turn converts this representation into an acoustic waveform representing speech. TTS is commonly evaluated in terms of how intelligible or humanlike the speech is, where different synthesizers working on the same input representation are regularly compared, whereas the preprocessing is habitually ignored in TTS evaluation. Were we to evaluate preprocessing, we could evaluate it as a whole (e.g. compare its output for some input representation to a target phonemic representation) or as individual processes such as sentence detection, tokenisation, text normalisation (TN) and pronunciation generation. This paper focuses on the evaluation of speech-oriented text normalisation (STN), that is, the conversion of the input text into an expanded string of the words to be spoken, for example expansions of abbreviations and different types of numerals. It is a request for comments for the creation of a test set for the evaluation of Swedish STN, which can be used as a baseline for future STN models, and as part of the overall evaluation of Swedish speech-oriented preprocessing.

Place, publisher, year, edition, pages
Göteborg: Göteborgs universitet, 2022
Keywords
speech-oriented text processing, test set
National Category
Language Technology (Computational Linguistics)
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-323669 (URN)
Conference
Swedish Language Technology Conference (SLTC), November 18-20 2020, Göteborg
Funder
Vinnova, 2018-02427
Note

QC 20230215

Available from: 2023-02-08 Created: 2023-02-08 Last updated: 2023-11-14 Bibliographically approved