Publications (10 of 96)
Tånnander, C., House, D., Beskow, J. & Edlund, J. (2025). Intrasentential English in Swedish TTS: perceived English-accentedness. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 1638-1642). International Speech Communication Association
Intrasentential English in Swedish TTS: perceived English-accentedness
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 1638-1642. Conference paper, published paper (peer reviewed)
Abstract [en]

English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech synthesis (TTS) capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA to a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and the PEA scale, and that listener preferences change with different insertions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
controllable TTS, mixed language, read speech
HSV category
Identifiers
urn:nbn:se:kth:diva-372797 (URN), 10.21437/Interspeech.2025-762 (DOI), 2-s2.0-105020040227 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-11-18. Bibliographically checked.
Zellers, M., Gorisch, J. & House, D. (2025). Temporal relationships between speech and hand gestures in the vicinity of potential turn boundaries in German and Swedish conversation. Language and Cognition, 17, Article ID e57.
Temporal relationships between speech and hand gestures in the vicinity of potential turn boundaries in German and Swedish conversation
2025 (English). In: Language and Cognition, ISSN 1866-9808, E-ISSN 1866-9859, Vol. 17, article id e57. Journal article (peer reviewed), published
Abstract [en]

Both gesture and talk are basic building blocks of face-to-face conversation. In this study, we address the temporal dynamics of hand gesture phases relative to places and types of turn transition. We annotated gesture features and measured temporal aspects of gesture related to speech in two languages, German and Swedish. We found variation in the temporal relationships of gesture types and alignment of gesture phases that relate to the management of turn-taking in conversation. Specifically, the frequency of different gesture phases accompanying the offset of speech differed depending on whether the same speaker held the floor or whether a new speaker took up a turn. In addition, we found that differences in temporal alignment of gesture phases can distinguish between the type of turn transition that is upcoming up to a second before the place of transition is reached. Our results emphasize the importance of the interaction of the verbal and the gestural modality to maintain the smooth flow of conversation.

Place, publisher, year, edition, pages
Cambridge University Press (CUP), 2025
Keywords
co-speech gesture, conversation, German, gesture phases, hand gestures, potential turn boundary, Swedish, temporal gesture alignment, turn transitions
HSV category
Identifiers
urn:nbn:se:kth:diva-369036 (URN), 10.1017/langcog.2025.10014 (DOI), 001531783300001 (), 2-s2.0-105011408133 (Scopus ID)
Note

QC 20250912

Available from: 2025-09-12. Created: 2025-09-12. Last updated: 2025-10-24. Bibliographically checked.
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
Prosodic characteristics of English-accented Swedish neural TTS
2024 (English). In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, pp. 1035-1039. Conference paper, published paper (peer reviewed)
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
HSV category
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN), 10.21437/SpeechProsody.2024-209 (DOI), 2-s2.0-105008058763 (Scopus ID)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Research funder
Vinnova, 2018-02427
Note

QC 20240705

Available from: 2024-07-03. Created: 2024-07-03. Last updated: 2025-07-01. Bibliographically checked.
Tånnander, C., House, D. & Edlund, J. (2023). Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis. In: Radek Skarnitzl and Jan Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023. Paper presented at 20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic (pp. 3156-3160). Prague, Czech Republic: GUARANT International
Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis
2023 (English). In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023 / [ed] Radek Skarnitzl and Jan Volín, Prague, Czech Republic: GUARANT International, 2023, pp. 3156-3160. Conference paper, published paper (peer reviewed)
Abstract [en]

Text-to-speech synthesis based on deep neural networks can generate highly humanlike speech, which revitalizes the potential for analysis-by-synthesis in speech research. We propose that neural synthesis can provide evidence that a specific distinction in its transcription system represents a robust acoustic/phonetic distinction in the speech used to train the model. We synthesized utterances with allophones in incorrect contexts and analyzed the results phonetically. Our assumption was that if we gained control over the allophonic variation in this way, it would provide strong evidence that the variation is governed robustly by the phonological context used to create the transcriptions. Of three allophonic variations investigated, the first, which was believed to be quite robust, gave us robust control over the variation, while the other two, which are less categorical, did not afford us such control. These findings are consistent with our hypothesis and support the notion that neural TTS can be a valuable analysis-by-synthesis tool for speech research.

Place, publisher, year, edition, pages
Prague, Czech Republic: GUARANT International, 2023
Keywords
analysis-by-synthesis, latent phonetic features, phonological variation, neural TTS
HSV category
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-336586 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic
Research funder
Vinnova, 2018-02427
Note

Part of ISBN 978-80-908114-2-3

QC 20230915

Available from: 2023-09-14. Created: 2023-09-14. Last updated: 2025-02-10. Bibliographically checked.
Ambrazaitis, G., Frid, J. & House, D. (2022). Auditory vs. audiovisual prominence ratings of speech involving spontaneously produced head movements. In: Proceedings of the 11th International Conference on Speech Prosody, Speech Prosody 2022. Paper presented at 11th International Conference on Speech Prosody, Speech Prosody 2022, Lisbon, Portugal, May 23-26, 2022 (pp. 352-356). International Speech Communication Association
Auditory vs. audiovisual prominence ratings of speech involving spontaneously produced head movements
2022 (English). In: Proceedings of the 11th International Conference on Speech Prosody, Speech Prosody 2022, International Speech Communication Association, 2022, pp. 352-356. Conference paper, published paper (peer reviewed)
Abstract [en]

Visual information can be integrated in prominence perception, but most available evidence stems from controlled experimental settings, often involving synthetic stimuli. The present study provides evidence from spontaneously produced head gestures that occurred in Swedish television news readings. Sixteen short clips (containing 218 words in total) were rated for word prominence by 85 adult volunteers in a between-subjects design (44 in an audio-visual vs. 41 in an audio-only condition) using a web-based rating task. As an initial test of overall rating behavior, average prominence across all 218 words was compared between the two conditions, revealing no significant difference. In a second step, we compared normalized prominence ratings between the two conditions for all 218 words individually. These results displayed significant (or near significant, p<.08) differences for 28 out of 218 words, with higher ratings in either the audiovisual (13 words) or the audio-only-condition (15 words). A detailed examination revealed that the presence of head movements (previously annotated) can boost prominence ratings in the audiovisual condition, while words with low prominence tend to be rated slightly higher in the audio-only condition. The study suggests that visual prominence signals are integrated in speech processing even in a relatively uncontrolled, naturalistic setting.

Place, publisher, year, edition, pages
International Speech Communication Association, 2022
Keywords
beat gesture, head movement, multimodality, pitch accent, prominence perception, visual prosody
HSV category
Identifiers
urn:nbn:se:kth:diva-335753 (URN), 10.21437/SpeechProsody.2022-72 (DOI), 2-s2.0-85147200055 (Scopus ID)
Conference
11th International Conference on Speech Prosody, Speech Prosody 2022, Lisbon, Portugal, May 23-26, 2022
Note

QC 20230911

Available from: 2023-09-11. Created: 2023-09-11. Last updated: 2023-09-11. Bibliographically checked.
Ambrazaitis, G. & House, D. (2022). Probing effects of lexical prosody on speech-gesture integration in prominence production by Swedish news presenters. Laboratory Phonology, 13(1), 1-35
Probing effects of lexical prosody on speech-gesture integration in prominence production by Swedish news presenters
2022 (English). In: Laboratory Phonology, ISSN 1868-6346, Vol. 13, no. 1, pp. 1-35. Journal article (peer reviewed), published
Abstract [en]

This study investigates the multimodal implementation of prosodic-phonological categories, asking whether the accentual fall and the following rise in the Swedish word accents (Accent 1, Accent 2) are varied as a function of accompanying head and eyebrow gestures. Our purpose is to evaluate the hypothesis that prominence production displays a cumulative relation between acoustic and kinematic dimensions of spoken language, especially focusing on the clustering of gestures (head, eyebrows), at the same time asking if lexical-prosodic features would interfere with this cumulative relation. Our materials comprise 12 minutes of speech from Swedish television news presentations. The results reveal a significant trend for larger fo rises when a head movement accompanies the accented word, and even larger when an additional eyebrow movement is present. This trend is observed for accentual rises that encode phrase-level prominence, but not for accentual falls that are primarily related to lexical prosody. Moreover, the trend is manifested differently in different lexical-prosodic categories (Accent 1 versus Accent 2 with one versus two lexical stresses). The study provides novel support for a cumulative-cue hypothesis and the assumption that prominence production is essentially multimodal, well in line with the idea of speech and gesture as an integrated system.

Place, publisher, year, edition, pages
Ubiquity Press Ltd, 2022
HSV category
Identifiers
urn:nbn:se:kth:diva-316723 (URN), 10.16995/labphon.6430 (DOI), 000837826900001 (), 2-s2.0-85137230741 (Scopus ID)
Note

QC 20220830

Available from: 2022-08-30. Created: 2022-08-30. Last updated: 2025-02-18. Bibliographically checked.
Tånnander, C., House, D. & Edlund, J. (2022). Syllable duration as a proxy to latent prosodic features. In: Proceedings of Speech Prosody 2022. Paper presented at Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal (pp. 220-224). Lisbon, Portugal: International Speech Communication Association
Syllable duration as a proxy to latent prosodic features
2022 (English). In: Proceedings of Speech Prosody 2022, Lisbon, Portugal: International Speech Communication Association, 2022, pp. 220-224. Conference paper, published paper (peer reviewed)
Abstract [en]

Recent advances in deep learning have pushed text-to-speech synthesis (TTS) very close to human speech. In deep learning, latent features refer to features that are hidden from us; notwithstanding, we may meaningfully observe their effects. Analogously, latent prosodic features refer to the exact features that constitute e.g. prominence that are unknown to us, although we know (some of) the functions of prominence and (some of) its acoustic correlates. Deep-learned speech models capture prosody well, but leave us with little control and few insights. Previously, we explored average syllable duration on the word level - a simple and accessible metric - as a proxy for prominence: in Swedish TTS, where verb particles and numerals tend to receive too little prominence, these were nudged towards lengthening while allowing the TTS models to otherwise operate freely. Listener panels overwhelmingly preferred the nudged versions to the unmodified TTS. In this paper, we analyse utterances from the modified TTS. The analysis shows that duration-nudging of relevant words changes the following features in an observable manner: duration is predictably lengthened, word-initial glottalization occurs, and the general intonation pattern changes. This supports the view of latent prosodic features that can be reflected in deep-learned models and accessed by proxy.

Place, publisher, year, edition, pages
Lisbon, Portugal: International Speech Communication Association, 2022
HSV category
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314984 (URN), 10.21437/SpeechProsody.2022-45 (DOI), 2-s2.0-85166333598 (Scopus ID)
Conference
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal
Note

QC 20220628

Available from: 2022-06-27. Created: 2022-06-27. Last updated: 2024-08-28. Bibliographically checked.
Domeij, R., Edlund, J., Eriksson, G., Fallgren, P., House, D., Lindström, E., . . . Öqvist, J. (2020). Exploring the archives for textual entry points to speech - Experiences of interdisciplinary collaboration in making cultural heritage accessible for research. In: CEUR Workshop Proceedings. Paper presented at 2020 Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020: Understanding and Facilitating Collaboration in Digital Humanities 2020, TwinTalks 2020, 20 October 2020 (pp. 45-55). CEUR-WS
Exploring the archives for textual entry points to speech - Experiences of interdisciplinary collaboration in making cultural heritage accessible for research
2020 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2020, pp. 45-55. Conference paper, published paper (peer reviewed)
Abstract [en]

Tilltal (Tillgängligt kulturarv för forskning i tal, 'Accessible cultural heritage for speech research') is a multidisciplinary and methodological project undertaken by the Institute of Language and Folklore, KTH Royal Institute of Technology, and The Swedish National Archives in cooperation with the National Language Bank and SWE-CLARIN [1]. It aims to provide researchers better access to archival audio recordings using methods from language technology. The project comprises three case studies and one activity and usage study. In the case studies, actual research agendas from three different fields (ethnology, sociolinguistics and interaction analysis) serve as a basis for identifying procedures that may be simplified with the aid of digital tools. In the activity and usage study, we are applying an activity-theoretical approach with the aim of involving researchers and investigating how they use - and would like to be able to use - the archival resources at ISOF. Involving researchers in participatory design ensures that digital solutions are suggested and evaluated in relation to the requirements expressed by researchers engaged in specific research tasks [2]. In this paper we focus on one of the case studies, which investigates the process by which personal experience narratives are transformed into cultural heritage [3], and account for our results in exploring how different types of text material from the archives can be used to find relevant sections of the audio recordings. Finally, we discuss what lessons can be learned, and what conclusions can be drawn, from our experiences of interdisciplinary collaboration in the project.

Place, publisher, year, edition, pages
CEUR-WS, 2020
Keywords
Archive speech, Found data, Interdisciplinary collaboration, Participatory design, Digital devices, Cultural heritages, Interaction analysis, Interdisciplinary collaborations, Language technology, Personal experience, Royal Institute of Technology, Theoretical approach, Audio recordings
HSV category
Identifiers
urn:nbn:se:kth:diva-290852 (URN), 2-s2.0-85095968481 (Scopus ID)
Conference
2020 Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020: Understanding and Facilitating Collaboration in Digital Humanities 2020, TwinTalks 2020, 20 October 2020
Note

QC 20210322

Available from: 2021-03-22. Created: 2021-03-22. Last updated: 2022-06-25. Bibliographically checked.
Ambrazaitis, G., Frid, J. & House, D. (2020). Word prominence ratings in Swedish television news readings: Effects of pitch accents and head movements. In: Proceedings of the International Conference on Speech Prosody. Paper presented at 10th International Conference on Speech Prosody 2020: Communicative and Interactive Prosody, Tokyo, Japan, 25-28 May 2020 (pp. 314-318). International Speech Communication Association, 2020
Word prominence ratings in Swedish television news readings: Effects of pitch accents and head movements
2020 (English). In: Proceedings of the International Conference on Speech Prosody, International Speech Communication Association, 2020, Vol. 2020, pp. 314-318. Conference paper, published paper (peer reviewed)
Abstract [en]

Prosodic prominence is a multimodal phenomenon where pitch accents are frequently aligned with visible movements by the hands, head, or eyebrows. However, little is known about how such movements function as visible prominence cues in multimodal speech perception with most previous studies being restricted to experimental settings. In this study, we are piloting the acquisition of multimodal prominence ratings for a corpus of natural speech (Swedish television news readings). Sixteen short video clips (218 words) of news readings were extracted from a larger corpus and rated by 44 native Swedish adult volunteers using a web-based set-up. The task was to rate each word in a clip as either non-prominent, moderately prominent or strongly prominent based on audiovisual cues. The corpus was previously annotated for pitch accents and head movements. We found that words realized with a pitch accent and head movement tended to receive higher prominence ratings than words with a pitch accent only. However, we also examined ratings for a number of carefully selected individual words, and these case studies suggest that ratings are affected by complex relations between the presence of a head movement and its type of alignment, the word's F0 profile, and semantic and pragmatic factors.

Place, publisher, year, edition, pages
International Speech Communication Association, 2020
Series
Proceedings of the International Conference on Speech Prosody, ISSN 2333-2042 ; 2020
Keywords
Audiovisual prosody, Multimodal prominence, Multimodal speech perception, Case-studies, Head movements, Multi-modal, Natural speech, Pitch accents, Speech perception, Video clips, Web based, Semantics
HSV category
Identifiers
urn:nbn:se:kth:diva-290421 (URN), 10.21437/SpeechProsody.2020-64 (DOI), 2-s2.0-85093884721 (Scopus ID)
Conference
10th International Conference on Speech Prosody 2020: Communicative and Interactive Prosody, Tokyo, Japan, 25-28 May 2020
Note

QC 20210222

Available from: 2021-02-22. Created: 2021-02-22. Last updated: 2022-06-25. Bibliographically checked.
Frid, J., Lundmark, M. S., Ambrazaitis, G., Schötz, S. & House, D. (2019). Investigating visual prosody using articulography. Paper presented at 4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, Copenhagen, Denmark, 5-8 March 2019. CEUR Workshop Proceedings, 2364
Investigating visual prosody using articulography
2019 (English). In: CEUR Workshop Proceedings, ISSN 1613-0073, Vol. 2364. Journal article (peer reviewed), published
Abstract [en]

In this paper we describe present work on multimodal prosody by means of simultaneous recordings of articulation and head movements. Earlier work has explored patterning, usage and machine-learning based detection of focal pitch accents, head beats and eyebrow beats through audiovisual recordings. Kinematic data obtained through articulography allows for more comparable and accurate measurements, as well as three-dimensional data. Therefore, our current approach involves examining speech and body movements concurrently, using electromagnetic articulography (EMA). We have recorded large amounts of this kind of data previously, but for other purposes. In this paper, we present results from a study on the interplay between head movements and phrasing and find tendencies for upward movements occurring before and downward movements occurring after prosodic boundaries.

Place, publisher, year, edition, pages
CEUR-WS, 2019
HSV category
Identifiers
urn:nbn:se:kth:diva-280448 (URN), 2-s2.0-85066040918 (Scopus ID)
Conference
4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, Copenhagen, Denmark, 5-8 March 2019
Note

QC 20200908

Available from: 2020-09-08. Created: 2020-09-08. Last updated: 2022-06-25. Bibliographically checked.
Organisations
Identifiers
ORCID iD: orcid.org/0000-0002-4628-3769