kth.se Publications
Publications (10 of 94)
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
2024 (English) In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, p. 1035-1039. Conference paper, Published paper (Refereed)
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.
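The ranking experiment in the abstract above derives an accentedness rank from pairwise listener judgments and correlates it with the targeted degrees. A minimal sketch of one common way to do this (win counting plus a hand-rolled Spearman rank correlation); the paper's actual ranking procedure is not given here, so both choices are illustrative assumptions:

```python
from collections import defaultdict

def rank_from_pairwise(comparisons):
    """Order stimuli by how often listeners judged them more accented.

    comparisons: list of (winner, loser) stimulus-id pairs -- a
    hypothetical encoding of the pairwise listening test."""
    wins = defaultdict(int)
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        items.update((winner, loser))
    # Most frequently preferred (most accented) stimulus first.
    return sorted(items, key=lambda s: -wins[s])

def spearman(xs, ys):
    """Spearman rank correlation (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

With hypothetical stimuli s0-s3 targeted at increasing accentedness and a fully consistent listener panel, rank_from_pairwise returns them from most to least accented, and correlating derived rank positions with the targets gives a Spearman coefficient of 1.0.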

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National Category
Humanities and the Arts General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN)10.21437/SpeechProsody.2024-209 (DOI)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Funder
Vinnova, 2018-02427
Note

QC 20240705

Available from: 2024-07-03 Created: 2024-07-03 Last updated: 2024-07-05. Bibliographically approved
Tånnander, C., House, D. & Edlund, J. (2023). Analysis-by-synthesis: phonetic-phonological variation in deep neural network-based text-to-speech synthesis. In: Radek Skarnitzl and Jan Volín (Ed.), Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023. Paper presented at 20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic (pp. 3156-3160). Prague, Czech Republic: GUARANT International
2023 (English) In: Proceedings of the 20th International Congress of Phonetic Sciences, Prague 2023 / [ed] Radek Skarnitzl and Jan Volín, Prague, Czech Republic: GUARANT International, 2023, p. 3156-3160. Conference paper, Published paper (Refereed)
Abstract [en]

Text-to-speech synthesis based on deep neural networks can generate highly humanlike speech, which revitalizes the potential for analysis-by-synthesis in speech research. We propose that neural synthesis can provide evidence that a specific distinction in its transcription system represents a robust acoustic/phonetic distinction in the speech used to train the model. We synthesized utterances with allophones in incorrect contexts and analyzed the results phonetically. Our assumption was that if we gained control over the allophonic variation in this way, it would provide strong evidence that the variation is governed robustly by the phonological context used to create the transcriptions. Of three allophonic variations investigated, the first, which was believed to be quite robust, gave us robust control over the variation, while the other two, which are less categorical, did not afford us such control. These findings are consistent with our hypothesis and support the notion that neural TTS can be a valuable analysis-by-synthesis tool for speech research.

Place, publisher, year, edition, pages
Prague, Czech Republic: GUARANT International, 2023
Keywords
analysis-by-synthesis, latent phonetic features, phonological variation, neural TTS
National Category
Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-336586 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS), August 7-11, 2023, Prague, Czech Republic
Funder
Vinnova, 2018-02427
Note

Part of ISBN 978-80-908114-2-3

QC 20230915

Available from: 2023-09-14 Created: 2023-09-14 Last updated: 2025-02-10. Bibliographically approved
Ambrazaitis, G., Frid, J. & House, D. (2022). Auditory vs. audiovisual prominence ratings of speech involving spontaneously produced head movements. In: Proceedings of the 11th International Conference on Speech Prosody, Speech Prosody 2022. Paper presented at 11th International Conference on Speech Prosody, Speech Prosody 2022, Lisbon, Portugal, May 23 2022 - May 26 2022 (pp. 352-356). International Speech Communication Association
2022 (English) In: Proceedings of the 11th International Conference on Speech Prosody, Speech Prosody 2022, International Speech Communication Association, 2022, p. 352-356. Conference paper, Published paper (Refereed)
Abstract [en]

Visual information can be integrated in prominence perception, but most available evidence stems from controlled experimental settings, often involving synthetic stimuli. The present study provides evidence from spontaneously produced head gestures that occurred in Swedish television news readings. Sixteen short clips (containing 218 words in total) were rated for word prominence by 85 adult volunteers in a between-subjects design (44 in an audio-visual vs. 41 in an audio-only condition) using a web-based rating task. As an initial test of overall rating behavior, average prominence across all 218 words was compared between the two conditions, revealing no significant difference. In a second step, we compared normalized prominence ratings between the two conditions for all 218 words individually. These results displayed significant (or near significant, p<.08) differences for 28 out of 218 words, with higher ratings in either the audiovisual (13 words) or the audio-only condition (15 words). A detailed examination revealed that the presence of head movements (previously annotated) can boost prominence ratings in the audiovisual condition, while words with low prominence tend to be rated slightly higher in the audio-only condition. The study suggests that visual prominence signals are integrated in speech processing even in a relatively uncontrolled, naturalistic setting.
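The per-word comparison described above relies on normalized prominence ratings. One standard normalization is a per-rater z-score followed by an across-rater mean per word; this is a hedged sketch of that idea, not the study's actual procedure:

```python
import statistics

def z_normalize(ratings):
    """Z-score one rater's ratings to remove individual scale bias."""
    mu = statistics.mean(ratings)
    sd = statistics.pstdev(ratings)
    return [(r - mu) / sd if sd else 0.0 for r in ratings]

def mean_rating_by_word(raters):
    """raters: one rating list per rater, aligned by word index.

    Returns the across-rater mean of z-normalized ratings per word,
    which can then be compared between conditions word by word."""
    normed = [z_normalize(r) for r in raters]
    n_words = len(raters[0])
    return [statistics.mean(z[i] for z in normed) for i in range(n_words)]
```

Two raters who use different spans of the rating scale but agree on relative prominence produce identical z-profiles, so their disagreement about scale no longer masks per-word effects.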

Place, publisher, year, edition, pages
International Speech Communication Association, 2022
Keywords
beat gesture, head movement, multimodality, pitch accent, prominence perception, visual prosody
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-335753 (URN)10.21437/SpeechProsody.2022-72 (DOI)2-s2.0-85147200055 (Scopus ID)
Conference
11th International Conference on Speech Prosody, Speech Prosody 2022, Lisbon, Portugal, May 23 2022 - May 26 2022
Note

QC 20230911

Available from: 2023-09-11 Created: 2023-09-11 Last updated: 2023-09-11. Bibliographically approved
Ambrazaitis, G. & House, D. (2022). Probing effects of lexical prosody on speech-gesture integration in prominence production by Swedish news presenters. Laboratory Phonology, 13(1), 1-35
2022 (English) In: Laboratory Phonology, ISSN 1868-6346, Vol. 13, no 1, p. 1-35. Article in journal (Refereed) Published
Abstract [en]

This study investigates the multimodal implementation of prosodic-phonological categories, asking whether the accentual fall and the following rise in the Swedish word accents (Accent 1, Accent 2) are varied as a function of accompanying head and eyebrow gestures. Our purpose is to evaluate the hypothesis that prominence production displays a cumulative relation between acoustic and kinematic dimensions of spoken language, especially focusing on the clustering of gestures (head, eyebrows), at the same time asking if lexical-prosodic features would interfere with this cumulative relation. Our materials comprise 12 minutes of speech from Swedish television news presentations. The results reveal a significant trend for larger F0 rises when a head movement accompanies the accented word, and even larger when an additional eyebrow movement is present. This trend is observed for accentual rises that encode phrase-level prominence, but not for accentual falls that are primarily related to lexical prosody. Moreover, the trend is manifested differently in different lexical-prosodic categories (Accent 1 versus Accent 2 with one versus two lexical stresses). The study provides novel support for a cumulative-cue hypothesis and the assumption that prominence production is essentially multimodal, well in line with the idea of speech and gesture as an integrated system.

Place, publisher, year, edition, pages
Ubiquity Press Ltd, 2022
National Category
General Language Studies and Linguistics Other Engineering and Technologies Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-316723 (URN)10.16995/labphon.6430 (DOI)000837826900001 ()2-s2.0-85137230741 (Scopus ID)
Note

QC 20220830

Available from: 2022-08-30 Created: 2022-08-30 Last updated: 2025-02-18. Bibliographically approved
Tånnander, C., House, D. & Edlund, J. (2022). Syllable duration as a proxy to latent prosodic features. In: Proceedings of Speech Prosody 2022. Paper presented at Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal (pp. 220-224). Lisbon, Portugal: International Speech Communication Association
2022 (English) In: Proceedings of Speech Prosody 2022, Lisbon, Portugal: International Speech Communication Association, 2022, p. 220-224. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advances in deep-learning have pushed text-to-speech synthesis (TTS) very close to human speech. In deep-learning, latent features refer to features that are hidden from us; notwithstanding, we may meaningfully observe their effects. Analogously, latent prosodic features refer to the exact features that constitute e.g. prominence that are unknown to us, although we know (some of) the functions of prominence and (some of) its acoustic correlates. Deep-learned speech models capture prosody well, but leave us with little control and few insights. Previously, we explored average syllable duration on word level - a simple and accessible metric - as a proxy for prominence: in Swedish TTS, where verb particles and numerals tend to receive too little prominence, these were nudged towards lengthening while allowing the TTS models to otherwise operate freely. Listener panels overwhelmingly preferred the nudged versions to the unmodified TTS. In this paper, we analyse utterances from the modified TTS. The analysis shows that duration-nudging of relevant words changes the following features in an observable manner: duration is predictably lengthened, word-initial glottalization occurs, and the general intonation pattern changes. This supports the view of latent prosodic features that can be reflected in deep-learned models and accessed by proxy.
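The proxy metric described above, average syllable duration on word level, is simple to compute. A sketch follows; the nudging criterion (the part-of-speech tags and the 0.18 s threshold) is purely illustrative and not taken from the paper:

```python
def avg_syllable_duration(word_duration_s, n_syllables):
    """Average syllable duration of a word, in seconds -- the
    simple, accessible word-level proxy for prominence."""
    return word_duration_s / n_syllables

def nudge_candidate(pos_tag, avg_dur_s, threshold_s=0.18):
    """Hypothetical rule: flag verb particles and numerals whose
    average syllable duration is short, as candidates for the
    lengthening nudge described in the abstract."""
    return pos_tag in {"PARTICLE", "NUMERAL"} and avg_dur_s < threshold_s
```

Flagged words would be nudged toward lengthening while the TTS model otherwise operates freely, as the abstract describes.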

Place, publisher, year, edition, pages
Lisbon, Portugal: International Speech Communication Association, 2022
National Category
Other Humanities not elsewhere specified
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314984 (URN)10.21437/SpeechProsody.2022-45 (DOI)2-s2.0-85166333598 (Scopus ID)
Conference
Speech Prosody 2022, 23-26 May 2022, Lisbon, Portugal
Note

QC 20220628

Available from: 2022-06-27 Created: 2022-06-27 Last updated: 2024-08-28. Bibliographically approved
Domeij, R., Edlund, J., Eriksson, G., Fallgren, P., House, D., Lindström, E., . . . Öqvist, J. (2020). Exploring the archives for textual entry points to speech - Experiences of interdisciplinary collaboration in making cultural heritage accessible for research. In: CEUR Workshop Proceedings: . Paper presented at 2020 Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020: Understanding and Facilitating Collaboration in Digital Humanities 2020, TwinTalks 2020, 20 October 2020 (pp. 45-55). CEUR-WS
2020 (English) In: CEUR Workshop Proceedings, CEUR-WS, 2020, p. 45-55. Conference paper, Published paper (Refereed)
Abstract [en]

Tilltal (Tillgängligt kulturarv för forskning i tal, 'Accessible cultural heritage for speech research') is a multidisciplinary and methodological project undertaken by the Institute of Language and Folklore, KTH Royal Institute of Technology, and The Swedish National Archives in cooperation with the National Language Bank and SWE-CLARIN [1]. It aims to provide researchers better access to archival audio recordings using methods from language technology. The project comprises three case studies and one activity and usage study. In the case studies, actual research agendas from three different fields (ethnology, sociolinguistics and interaction analysis) serve as a basis for identifying procedures that may be simplified with the aid of digital tools. In the activity and usage study, we are applying an activity-theoretical approach with the aim of involving researchers and investigating how they use - and would like to be able to use - the archival resources at ISOF. Involving researchers in participatory design ensures that digital solutions are suggested and evaluated in relation to the requirements expressed by researchers engaged in specific research tasks [2]. In this paper we focus on one of the case studies, which investigates the process by which personal experience narratives are transformed into cultural heritage [3], and account for our results in exploring how different types of text material from the archives can be used to find relevant sections of the audio recordings. Finally, we discuss what lessons can be learned, and what conclusions can be drawn, from our experiences of interdisciplinary collaboration in the project.

Place, publisher, year, edition, pages
CEUR-WS, 2020
Keywords
Archive speech, Found data, Interdisciplinary collaboration, Participatory design, Digital devices, Cultural heritages, Interaction analysis, Interdisciplinary collaborations, Language technology, Personal experience, Royal Institute of Technology, Theoretical approach, Audio recordings
National Category
Arts
Identifiers
urn:nbn:se:kth:diva-290852 (URN)2-s2.0-85095968481 (Scopus ID)
Conference
2020 Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020: Understanding and Facilitating Collaboration in Digital Humanities 2020, TwinTalks 2020, 20 October 2020
Note

QC 20210322

Available from: 2021-03-22 Created: 2021-03-22 Last updated: 2022-06-25. Bibliographically approved
Ambrazaitis, G., Frid, J. & House, D. (2020). Word prominence ratings in Swedish television news readings: Effects of pitch accents and head movements. In: Proceedings of the International Conference on Speech Prosody. Paper presented at 10th International Conference on Speech Prosody 2020: Communicative and Interactive Prosody, Tokyo, Japan, 25-28 May 2020 (pp. 314-318). International Speech Communication Association
2020 (English) In: Proceedings of the International Conference on Speech Prosody, International Speech Communication Association, 2020, Vol. 2020, p. 314-318. Conference paper, Published paper (Refereed)
Abstract [en]

Prosodic prominence is a multimodal phenomenon where pitch accents are frequently aligned with visible movements by the hands, head, or eyebrows. However, little is known about how such movements function as visible prominence cues in multimodal speech perception with most previous studies being restricted to experimental settings. In this study, we are piloting the acquisition of multimodal prominence ratings for a corpus of natural speech (Swedish television news readings). Sixteen short video clips (218 words) of news readings were extracted from a larger corpus and rated by 44 native Swedish adult volunteers using a web-based set-up. The task was to rate each word in a clip as either non-prominent, moderately prominent or strongly prominent based on audiovisual cues. The corpus was previously annotated for pitch accents and head movements. We found that words realized with a pitch accent and head movement tended to receive higher prominence ratings than words with a pitch accent only. However, we also examined ratings for a number of carefully selected individual words, and these case studies suggest that ratings are affected by complex relations between the presence of a head movement and its type of alignment, the word's F0 profile, and semantic and pragmatic factors.

Place, publisher, year, edition, pages
International Speech Communication Association, 2020
Series
Proceedings of the International Conference on Speech Prosody, ISSN 2333-2042 ; 2020
Keywords
Audiovisual prosody, Multimodal prominence, Multimodal speech perception, Case-studies, Head movements, Multi-modal, Natural speech, Pitch accents, Speech perception, Video clips, Web based, Semantics
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-290421 (URN)10.21437/SpeechProsody.2020-64 (DOI)2-s2.0-85093884721 (Scopus ID)
Conference
10th International Conference on Speech Prosody 2020: Communicative and Interactive Prosody, Tokyo, Japan, 25-28 May 2020
Note

QC 20210222

Available from: 2021-02-22 Created: 2021-02-22 Last updated: 2022-06-25. Bibliographically approved
Frid, J., Lundmark, M. S., Ambrazaitis, G., Schötz, S. & House, D. (2019). Investigating visual prosody using articulography. Paper presented at 4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, Copenhagen, Denmark, 5-8 March 2019. CEUR Workshop Proceedings, 2364
2019 (English) In: CEUR Workshop Proceedings, ISSN 1613-0073, Vol. 2364. Article in journal (Refereed) Published
Abstract [en]

In this paper we describe ongoing work on multimodal prosody by means of simultaneous recordings of articulation and head movements. Earlier work has explored patterning, usage and machine-learning based detection of focal pitch accents, head beats and eyebrow beats through audiovisual recordings. Kinematic data obtained through articulography allows for more comparable and accurate measurements, as well as three-dimensional data. Therefore, our current approach involves examining speech and body movements concurrently, using electromagnetic articulography (EMA). We have recorded large amounts of this kind of data previously, but for other purposes. In this paper, we present results from a study on the interplay between head movements and phrasing and find tendencies for upward movements occurring before and downward movements occurring after prosodic boundaries.

Place, publisher, year, edition, pages
CEUR-WS, 2019
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-280448 (URN)2-s2.0-85066040918 (Scopus ID)
Conference
4th Conference on Digital Humanities in the Nordic Countries, DHN 2019, Copenhagen, Denmark, 5-8 March 2019
Note

QC 20200908

Available from: 2020-09-08 Created: 2020-09-08 Last updated: 2022-06-25. Bibliographically approved
Hultén, M., Artman, H. & House, D. (2018). A model to analyse students’ cooperative idea generation in conceptual design. International Journal of Technology and Design Education, 28(2), 451-470
2018 (English) In: International Journal of Technology and Design Education, ISSN 0957-7572, E-ISSN 1573-1804, Vol. 28, no 2, p. 451-470. Article in journal (Refereed) Published
Abstract [en]

In this article we focus on the co-creation of ideas. Through the use of concepts from collaborative learning and communication theory we suggest a model that will enable the cooperative nature of creative design tasks to emerge. Four objectives of the model are stated and elaborated on in the paper: that the model should be anchored in previous research; that it should allow for collaborative aspects of creative design to be accounted for; that it should address the mechanisms by which new ideas are generated, embraced and cultivated during actual design; and that it should have a firm theoretical grounding. The model is also exemplified by two test sessions where two student pairs perform a time-constrained design task. We hope that the model can play a role both as an educational tool to be used by students and a teacher in design education, but primarily as a model to analyse students' cooperative idea generation in conceptual design.

Place, publisher, year, edition, pages
Springer, 2018
Keywords
Creativity, Collaborative Design, Model, Conceptual Design, Learning
National Category
Educational Sciences
Research subject
Art, Technology and Design
Identifiers
urn:nbn:se:kth:diva-194525 (URN)10.1007/s10798-016-9384-x (DOI)000432325800007 ()2-s2.0-84992740375 (Scopus ID)
Note

QC 20180531

Available from: 2016-10-31 Created: 2016-10-31 Last updated: 2024-03-18. Bibliographically approved
Ambrazaitis, G. & House, D. (2017). Acoustic features of multimodal prominences: Do visual beat gestures affect verbal pitch accent realization? In: Proceedings 14th International Conference on Auditory-Visual Speech Processing, AVSP 2017. Paper presented at 14th International Conference on Auditory-Visual Speech Processing, AVSP 2017, Stockholm, Sweden, Aug 25 2017 - Aug 26 2017 (pp. 89-94). International Speech Communication Association
2017 (English) In: Proceedings 14th International Conference on Auditory-Visual Speech Processing, AVSP 2017, International Speech Communication Association, 2017, p. 89-94. Conference paper, Published paper (Refereed)
Abstract [en]

The interplay of verbal and visual prominence cues has attracted recent attention, but previous findings are inconclusive as to whether and how the two modalities are integrated in the production and perception of prominence. In particular, we do not know whether the phonetic realization of pitch accents is influenced by co-speech beat gestures, and previous findings seem to generate different predictions. In this study, we investigate acoustic properties of prominent words as a function of visual beat gestures in a corpus of read news from Swedish television. The corpus was annotated for head and eyebrow beats as well as sentence-level pitch accents. Four types of prominence cues occurred particularly frequently in the corpus: (1) pitch accent only, (2) pitch accent plus head, (3) pitch accent plus head plus eyebrows, and (4) head only. The results show that (4) differs from (1-3) in terms of a smaller pitch excursion and shorter syllable duration. They also reveal significantly larger pitch excursions in (2) than in (1), suggesting that the realization of a pitch accent is to some extent influenced by the presence of visual prominence cues. Results are discussed in terms of the interaction between beat gestures and prosody with a potential functional difference between head and eyebrow beats.
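Pitch excursion, one of the acoustic measures compared across the four cue types above, is conventionally expressed in semitones. A minimal sketch using the standard semitone conversion (illustrative, not code from the study):

```python
import math

def excursion_semitones(f0_hz):
    """Pitch excursion over a word: the max-min f0 span in semitones,
    given a sequence of f0 samples in Hz."""
    lo, hi = min(f0_hz), max(f0_hz)
    return 12 * math.log2(hi / lo)
```

An octave span (e.g. 100 Hz to 200 Hz) comes out as 12 semitones and a flat contour as 0, so excursions are comparable across speakers with different pitch ranges.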

Place, publisher, year, edition, pages
International Speech Communication Association, 2017
Keywords
audio-visual prosody, co-speech gestures, multimodality, news speech, Swedish
National Category
General Language Studies and Linguistics Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-332100 (URN)10.21437/AVSP.2017-17 (DOI)2-s2.0-85050191964 (Scopus ID)
Conference
14th International Conference on Auditory-Visual Speech Processing, AVSP 2017, Stockholm, Sweden, Aug 25 2017 - Aug 26 2017
Note

QC 20230720

Available from: 2023-07-20 Created: 2023-07-20 Last updated: 2025-02-01. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-4628-3769
