kth.se Publications
1 - 38 of 38
  • 1.
    Beck, Gustavo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Wennberg, Ulme
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Wavebender GAN: An architecture for phonetically meaningful speech manipulation (2022). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE conference proceedings, 2022. Conference paper (Refereed)
    Abstract [en]

    Deep learning has revolutionised synthetic speech quality. However, it has thus far delivered little value to the speech science community. The new methods do not meet the controllability demands that practitioners in this area require, e.g., in listening tests with manipulated speech stimuli. Instead, control of different speech properties in such stimuli is achieved by using legacy signal-processing methods. This limits the range, accuracy, and speech quality of the manipulations. Also, audible artefacts have a negative impact on the methodological validity of results in speech perception studies. This work introduces a system capable of manipulating speech properties through learning rather than design. The architecture learns to control arbitrary speech properties and leverages progress in neural vocoders to obtain realistic output. Experiments with copy synthesis and manipulation of a small set of core speech features (pitch, formants, and voice quality measures) illustrate the promise of the approach for producing speech stimuli that have accurate control and high perceptual quality.

  • 2.
    Fallgren, P.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Z.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A tool for exploring large amounts of found audio data (2018). In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 499-503. Conference paper (Refereed)
    Abstract [en]

    We demonstrate a method and a set of open source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will contain versions of a set of functionalities in their first stages, and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

  • 3.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data (2019). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 2019, p. 4307-4311. Conference paper (Refereed)
    Abstract [en]

    We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so-called found data - data not recorded with the specific purpose of being analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers see when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method is a combination of audio browsing through massively multi-object sound environments and a well-known unsupervised dimensionality reduction algorithm (the self-organising map, SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.

  • 4.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    How to annotate 100 hours in 45 minutes (2019). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISCA, 2019, p. 341-345. Conference paper (Refereed)
    Abstract [en]

    Speech data found in the wild hold many advantages over artificially constructed speech corpora in terms of ecological validity and cultural worth. Perhaps most importantly, there is a lot of it. However, the combination of great quantity, noisiness and variation poses a challenge for its access and processing. Generally speaking, automatic approaches to tackle the problem require good labels for training, while manual approaches require time. In this study, we provide further evidence for a semi-supervised, human-in-the-loop framework that previously has shown promising results for browsing and annotating large quantities of found audio data quickly. The findings of this study show that a 100-hour long subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time it would take traditional annotation methods, without a loss in performance.

  • 5.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Towards fast browsing of found audio data: 11 presidents (2019). In: CEUR Workshop Proceedings, CEUR-WS, 2019, p. 133-142. Conference paper (Refereed)
    Abstract [en]

    Our aim is to rapidly explore prohibitively large audio collections by exploiting the insight that people are able to make fast judgments about lengthy recordings by listening to temporally disassembled audio (TDA) segments played simultaneously. We have previously shown the proof-of-concept; here we develop the method and corroborate its usefulness. We conduct an experiment with untrained human annotators, and show that they are able to place meaningful annotation on a completely unknown 8 hour corpus in a matter of minutes. The audio is temporally disassembled and spread out over a 2-dimensional map. Participants explore the resulting soundscape by hovering over different regions with a mouse. We used a collection of 11 State of the Union addresses given by 11 different US presidents, spread over half a century in time, as a corpus. The results confirm that (a) participants can distinguish between different regions and are able to describe the general contents of these regions; (b) the regions identified serve as labels describing the contents of the original audio collection; and (c) that the regions and labels can be used to segment the temporally reassembled audio into categories. We include an evaluation of the last step for completeness.

  • 6.
    Inden, Benjamin
    et al.
    Bielefeld University.
    Malisz, Zofia
    Bielefeld University.
    Wagner, Petra
    Bielefeld University.
    Wachsmuth, Ipke
    Bielefeld University.
    Micro-timing of backchannels in human-robot interaction (2014). Conference paper (Refereed)
  • 7. Inden, Benjamin
    et al.
    Malisz, Zofia
    Wagner, Petra
    Wachsmuth, Ipke
    Rapid entrainment to spontaneous speech: A comparison of oscillator models (2012). In: Proceedings of the 34th Annual Conference of the Cognitive Science Society, 2012, p. 1721-1726. Conference paper (Refereed)
  • 8. Inden, Benjamin
    et al.
    Malisz, Zofia
    Bielefeld University, Germany.
    Wagner, Petra
    Wachsmuth, Ipke
    Timing and entrainment of multimodal backchanneling behavior for an embodied conversational agent (2013). Conference paper (Refereed)
    Abstract [en]

    We report on an analysis of feedback behavior in an Active Listening Corpus as produced verbally, visually (head movement) and bimodally. The behavior is modeled in an embodied conversational agent and displayed in a conversation with a real human to human participants for perceptual evaluation. Five strategies for the timing of backchannels are compared: copying the timing of the original human listener, producing backchannels at randomly selected times, producing backchannels according to high-level timing distributions relative to the interlocutor's utterance and pauses, according to local entrainment to the interlocutor's vowels, or according to both. Human observers judge that models with global timing distributions miss fewer opportunities for backchanneling than random timing.

  • 9.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Bystedt, Mattias
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    David Aguas Lopes, José
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Mascarenhas, Samuel
    GAIPS INESC-ID, Lisbon, Portugal.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Eran, Raveh
    Multimodal Computing and Interaction, Saarland University, Germany.
    Shore, Todd
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    FARMI: A Framework for Recording Multi-Modal Interactions (2018). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris: European Language Resources Association, 2018, p. 3969-3974. Conference paper (Refereed)
    Abstract [en]

    In this paper we present (1) a processing architecture used to collect multi-modal sensor data, both for corpora collection and real-time processing, (2) an open-source implementation thereof and (3) a use-case where we deploy the architecture in a multi-party deception game, featuring six human players and one robot. The architecture is agnostic to the choice of hardware (e.g. microphones, cameras, etc.) and programming languages, although our implementation is mostly written in Python. In our use-case, different methods of capturing verbal and non-verbal cues from the participants were used. These were processed in real-time and used to inform the robot about the participants' deceptive behaviour. The framework is of particular interest for researchers interested in the collection of multi-party, richly recorded corpora and the design of conversational systems. Moreover, for researchers interested in human-robot interaction, the available modules offer the possibility to easily create both autonomous and Wizard-of-Oz interactions.

  • 10.
    Karpinski, Maciej
    et al.
    Adam Mickiewicz University.
    Jarmolowicz-Nowikow, Ewa
    Adam Mickiewicz University.
    Malisz, Zofia
    Adam Mickiewicz University.
    Juszczyk, Konrad
    Adam Mickiewicz University.
    Szczyszek, Michal
    Adam Mickiewicz University.
    Rejestracja, transkrypcja i tagowanie mowy oraz gestów w narracji dzieci i doroslych [The recording, transcription and annotation of speech and gesture by adults and children in a narration corpus] (2008). In: Investigationes Linguisticae, ISSN 1426-188X, E-ISSN 1733-1757, Vol. XVI, p. 83-98. Article in journal (Refereed)
  • 11. Kousidis, Spyros
    et al.
    Malisz, Zofia
    Wagner, Petra
    Schlangen, David
    Exploring annotation of head gesture forms in spontaneous human interaction (2013). In: TiGeR 2013, Tilburg Gesture Research Meeting, 2013. Conference paper (Refereed)
  • 12. Kousidis, Spyros
    et al.
    Pfeiffer, Thies
    Malisz, Zofia
    Wagner, Petra
    Schlangen, David
    Evaluating a minimally invasive laboratory architecture for recording multimodal conversational data (2012). In: Proc. of the Interdisciplinary Workshop on Feedback Behaviours in Dialogue, 2012. Conference paper (Refereed)
  • 13.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Berthelsen, H.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Controlling prominence realisation in parametric DNN-based speech synthesis (2017). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, International Speech Communication Association, 2017, Vol. 2017, p. 1079-1083. Conference paper (Refereed)
    Abstract [en]

    This work aims to improve text-to-speech synthesis for Wikipedia by advancing and implementing models of prosodic prominence. We propose a new system architecture with explicit prominence modeling and test the first component of the architecture. We automatically extract a phonetic feature related to prominence from the speech signal in the ARCTIC corpus. We then modify the label files and train an experimental TTS system based on the feature using Merlin, a statistical-parametric DNN-based engine. Test sentences with contrastive prominence on the word level are synthesised, and separate listening tests evaluating a) the level of prominence control in generated speech and b) naturalness are conducted. Our results show that the prominence feature-enhanced system successfully places prominence on the appropriate words and increases perceived naturalness relative to the baseline.

  • 14.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Berthelsen, Harald
    STTS – Södermalms talteknologiservice AB.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    PROMIS: a statistical-parametric speech synthesis system with prominence control via a prominence network (2019). In: Proceedings of SSW 10 - The 10th ISCA Speech Synthesis Workshop, Vienna, 2019. Conference paper (Refereed)
    Abstract [en]

    We implement an architecture with explicit prominence learning via a prominence network in Merlin, a statistical-parametric DNN-based text-to-speech system. We build on our previous results that successfully evaluated the inclusion of an automatically extracted, speech-based prominence feature into the training and its control at synthesis time. In this work, we expand the PROMIS system by implementing the prominence network that predicts prominence values from text. We test the network predictions as well as the effects of a prominence control module based on SSML-like tags. Listening tests for the complete PROMIS system, combining a prominence feature, a prominence network and prominence control, show that it effectively controls prominence in a diagnostic set of target words. The tests also show a minor negative impact on perceived naturalness, relative to baseline, exerted by the two prominence tagging methods implemented in the control module.

  • 15.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Brandt, E.
    Möbius, B.
    Oh, Y. M.
    Andreeva, B.
    Dimensions of Segmental Variability: Interaction of Prosody and Surprisal in Six Languages (2018). In: Frontiers in Communication, E-ISSN 2297-900X, Vol. 3, article id 00025. Article in journal (Refereed)
    Abstract [en]

    Contextual predictability variation affects phonological and phonetic structure. Reduction and expansion of acoustic-phonetic features is also characteristic of prosodic variability. In this study, we assess the impact of surprisal and prosodic structure on phonetic encoding, both independently of each other and in interaction. We model segmental duration, vowel space size and spectral characteristics of vowels and consonants as a function of surprisal as well as of syllable prominence, phrase boundary, and speech rate. Correlates of phonetic encoding density are extracted from a subset of the BonnTempo corpus for six languages: American English, Czech, Finnish, French, German, and Polish. Surprisal is estimated from segmental n-gram language models trained on large text corpora. Our findings are generally compatible with a weak version of Aylett and Turk's Smooth Signal Redundancy hypothesis, suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, this mediation is not perfect, as we found evidence for additional, direct effects of changes in surprisal on the phonetic structure of utterances. These effects appear to be stable across different speech rates.

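    The abstract above describes estimating surprisal from segmental n-gram language models. Purely as an illustration (this is not code from the paper; the bigram order, add-alpha smoothing and toy data are assumptions made here), a minimal sketch of per-segment surprisal, -log2 P(segment | context), computed from counts:

```python
import math
from collections import defaultdict

def train_bigram_counts(segment_sequences):
    """Count unigram (context) and bigram (biphone) occurrences over phone sequences."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    for seq in segment_sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams

def surprisal(prev, cur, unigrams, bigrams, vocab_size, alpha=1.0):
    """Surprisal (in bits) of segment `cur` given preceding segment `prev`,
    using add-alpha smoothed conditional probabilities."""
    p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return -math.log2(p)

# Toy phone sequences standing in for a large text corpus.
corpus = [["k", "o", "t"], ["k", "o", "s"], ["t", "o", "k"]]
uni, bi = train_bigram_counts(corpus)
vocab = len({seg for seq in corpus for seg in seq} | {"<s>"})
print(round(surprisal("k", "o", uni, bi, vocab), 2))  # ~1.22 bits: "o" often follows "k"
print(round(surprisal("o", "k", uni, bi, vocab), 2))  # ~2.00 bits: "k" rarely follows "o"
```

    In the study itself such models are trained per language on large text corpora; the sketch only shows the shape of the computation.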
  • 16. Malisz, Zofia
    et al.
    Henter, Gustav Eje
    Valentini-Botinhao, Cassia
    The University of Edinburgh, UK.
    Watts, Oliver
    The University of Edinburgh, UK.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    Modern speech synthesis for phonetic sciences: A discussion and an evaluation (2019). In: Proceedings of ICPhS, 2019. Conference paper (Refereed)
  • 17.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Valentini-Botinhao, Cassia
    The Centre for Speech Technology, The University of Edinburgh, UK.
    Watts, Oliver
    The Centre for Speech Technology, The University of Edinburgh, UK.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The speech synthesis phoneticians need is both realistic and controllable (2019). In: Proceedings from FONETIK 2019, Stockholm, 2019. Conference paper (Refereed)
    Abstract [en]

    We discuss the circumstances that have led to a disjoint advancement of speech synthesis and phonetics in recent decades. The difficulties mainly rest on the pursuit of orthogonal goals by the two fields: realistic vs. controllable synthetic speech. We make a case for realising the promise of speech technologies in areas of speech sciences by developing control of neural speech synthesis and bringing the two areas into dialogue again.

  • 18.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    The visual prominence of whispered speech in Swedish (2019). In: Proceedings of 19th International Congress of Phonetic Sciences, 2019. Conference paper (Refereed)
    Abstract [en]

    This study presents a database of controlled speech material as well as spontaneous Swedish conversation produced in modal and whispered voice. The database includes facial expression and head movement features tracked by a non-invasive and unobtrusive system. We analyse differences between the voice conditions in the visual domain, paying particular attention to realisations of prosodic structure, namely, prominence patterns. Analysis results show that prominent vowels in whisper are expressed with statistically significantly a) larger jaw opening, b) stronger lip rounding and protrusion, c) higher eyebrow raising and d) higher pitch angle velocity of the head, relative to modal speech.

  • 19.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    O'Dell, Michael
    University of Tampere.
    Nieminen, Tommi
    University of Eastern Finland.
    Wagner, Petra
    Bielefeld University.
    Perspectives on speech timing: coupled oscillator modeling of Polish and Finnish (2016). In: Phonetica, ISSN 0031-8388, E-ISSN 1423-0321, Vol. 73, no 3-4. Article in journal (Refereed)
    Abstract [en]

    We use an updated version of the Coupled Oscillator Model of speech timing and rhythm variability (O'Dell and Nieminen, 1999; 2009) to analyze empirical duration data for Polish spoken at different tempos. We use Bayesian inference on parameters relating to speech rate to investigate how tempo affects timing in Polish. The model parameters found are then compared to parameters obtained for equivalent material in Finnish to shed light on which of the effects represent general speech rate mechanisms and which are specific to Polish. We discuss the model and its predictions in the context of current perspectives on speech timing.

  • 20. Malisz, Zofia
    et al.
    Wagner, Petra
    Acoustic-phonetic realisation of Polish syllable prominence: A corpus study (2012). In: Speech and Language Technology, ISSN 1895-0434, Vol. 14/15, p. 105-114. Article in journal (Refereed)
    Abstract [en]

    Polish presents an interesting case for testing alternative phonetic implementations of prominence: It has fixed lexical stress on the penultimate, it has been difficult to classify within the classic ‘stress-timing’ vs. ‘syllable-timing’ dichotomy [1, 2, inter alia] and its stress is regarded as ‘weakly expressed’ [3]. We investigate acoustic correlates of Polish prominence patterns in a corpus of spontaneous, task-oriented dialogue. Results indicate clear differences to prior analyses of more controlled data, with intensity but also duration and pitch movement being main indicators of prominence.

  • 21. Malisz, Zofia
    et al.
    Włodarczak, Marcin
    Buschmeier, Hendrik
    Kopp, Stefan
    Wagner, Petra
    Prosodic characteristics of feedback expressions in distracted and non-distracted listeners (2012). In: Proceedings of The Listening Talker: An Interdisciplinary Workshop on Natural and Synthetic Modification of Speech in Response to Listening Conditions, Edinburgh, 2012, p. 36-39. Conference paper (Refereed)
    Abstract [en]

    In a previous study (Buschmeier et al., INTERSPEECH-2011) we investigated properties of communicative feedback produced by attentive and non-attentive listeners in dialogue. Distracted listeners were found to produce less feedback communicating understanding. Here, we assess the role of prosody in differentiating between feedback functions. We find significant differences across all studied prosodic dimensions as well as influences of lexical form and phonetic structure on feedback function categorisation. We also show that differences in prosodic features between attentiveness states exist, e.g., in overall intensity.

  • 22.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. Saarland University, Germany.
    Włodarczak, Marcin
    Stockholms Universitet.
    Buschmeier, Hendrik
    Bielefeld University.
    Skubisz, Joanna
    Universidade Nova de Lisboa.
    Kopp, Stefan
    Bielefeld University.
    Wagner, Petra
    Bielefeld University.
    The ALICO corpus: analysing the active listener (2016). In: Language resources and evaluation, ISSN 1574-020X, E-ISSN 1574-0218, Vol. 50, no 2, p. 411-442. Article in journal (Refereed)
    Abstract [en]

    The Active Listening Corpus (ALICO) is a multimodal data set of spontaneous dyadic conversations in German with diverse speech and gestural annotations of both dialogue partners. The annotations consist of short feedback expression transcriptions with corresponding communicative function interpretations as well as segmentations of interpausal units, words, rhythmic prominence intervals and vowel-to-vowel intervals. Additionally, ALICO contains head gesture annotations of both interlocutors. The corpus contributes to research on spontaneous human–human interaction, on functional relations between modalities, and timing variability in dialogue. It also provides data that differentiates between distracted and attentive listeners. We describe the main characteristics of the corpus and briefly present the most important results obtained from analyses in recent years.

  • 23.
    Malisz, Zofia
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Zygis, Marzena
    Special Issue: Slavic Perspectives on Prosody (2016). In: Phonetica, ISSN 0031-8388, E-ISSN 1423-0321, Vol. 73, no 3-4, p. 155-162. Article in journal (Refereed)
  • 24.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Żygis, M.
    Lexical stress in Polish: Evidence from focus and phrase-position differentiated production data (2018). In: Proceedings of the International Conference on Speech Prosody, International Speech Communication Association, 2018, p. 1008-1012. Conference paper (Refereed)
    Abstract [en]

    We examine acoustic patterns of word stress in Polish in data with carefully separated phrase- and word-level prominences. We aim to verify claims in the literature regarding the phonetic and phonological status of lexical stress (both primary and secondary) in Polish and to contribute to a better understanding of prosodic prominence and boundary interactions. Our results show significant effects of primary stress on acoustic parameters such as duration, f0 functionals and spectral emphasis expected for a stress language. We do not find clear and systematic acoustic evidence for secondary stress.

  • 25.
    Malisz, Zofia
    et al.
    Bielefeld University.
    Żygis, Marzena
    Humboldt-Universität zu Berlin and Leibniz-Zentrum Allgemeine Sprachwissenschaft.
    Voicing in Polish: interactions with lexical stress and focus (2015). In: Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, 2015. Conference paper (Refereed)
    Abstract [en]

    We examine the dynamics of VOT in Polish stops under lexical stress and focus. We elicit real Polish words containing voiced and voiceless stop+/a/ syllables in primary, secondary and unstressed, as well as focus positions. We also correlate VOT with speech rate estimated on the basis of equisyllabic word length. Our results show that the relationships between prosody and VOT are consistent with the status of Polish as a true voicing language.

  • 26.
    O'Dell, Michael
    et al.
    University of Tampere.
    Malisz, Zofia
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Perception of Geminates in Finnish and Polish (2016). In: Proceedings of Speech Prosody 2016 / [ed] Barnes, Jon and Brugos, Alejna and Shattuck-Hufnagel, Stefanie and Veilleux, Nanette, Boston, MA, 2016, p. 1109-1113. Conference paper (Refereed)
  • 27.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A processing framework to access large quantities of whispered speech found in ASMR (2023). In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece: IEEE Signal Processing Society, 2023. Conference paper (Refereed)
    Abstract [en]

    Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with human-in-the-loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.

  • 28.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Feature Selection for Labelling of Whispered Speech in ASMR Recordings using Edyson (2022). Conference paper (Refereed)
    Abstract [en]

    Whispered speech is a challenging area for traditional speech processing algorithms, as its properties differ from phonated speech and whispered data is not as easily available. A great amount of whispered speech recordings, however, can be found in the increasingly popular genre of ASMR on streaming platforms like Youtube or Twitch. Whispered speech is used in this genre as a trigger to cause a relaxing sensation in the listener. Accurately separating whispered speech segments from other auditory triggers would provide a wide variety of whispered data that could prove useful in improving the performance of data-driven speech processing methods. We use Edyson as a labelling tool, with which a user can rapidly assign labels to long segments of audio using an interactive graphical interface. In this paper, we propose features that can improve the performance of Edyson with whispered speech and we analyse parameter configurations for different types of sounds. We find Edyson a useful tool for initial labelling of audio data extracted from ASMR recordings that can then be used in more complex models. Our proposed modifications provide a better sensitivity for whispered speech, thus improving the performance of Edyson in the labelling of whispered segments.

  • 29.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Recovering implicit pitch contours from formants in whispered speech (2023). Conference paper (Refereed)
    Abstract [en]

    Whispered speech is characterised by a noise-like excitation that results in the lack of fundamental frequency. Considering that prosodic phenomena such as intonation are perceived through f0 variation, the perception of whispered prosody is relatively difficult. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is being transmitted, suggesting that intonation "survives" in whispered formant structure. In this paper, we aim to estimate the way in which formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.

  • 30.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Juvela, Lauri
    Department of Information and Communications Engineering, Aalto University, Finland.
    Speaker-independent neural formant synthesis (2023). In: Interspeech 2023, International Speech Communication Association, 2023, p. 5556-5560. Conference paper (Refereed)
    Abstract [en]

    We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spectrograms, which are rendered into waveforms by a pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that the method achieves our goals of accurate control over speech parameters combined with high perceptual audio quality. We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a.k.a. universal vocoding).

  • 31.
    Schulz, Erika
    et al.
    Saarland University.
    Oh, Yoon-mi
    Universite ́ de Lyon and CNRS.
    Malisz, Zofia
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Andreeva, Bistra
    Saarland University.
    Möbius, Bernd
    Saarland University.
    Impact of prosodic structure and information density on vowel space size (2016). In: Proceedings of Speech Prosody 2016, Boston, MA, 2016. Conference paper (Refereed)
    Abstract [en]

    We investigated the influence of prosodic structure and information density on vowel space size. Vowels were measured in five languages from the BonnTempo corpus, French, German, Finnish, Czech, and Polish, each with three female and three male speakers. Speakers read the text at normal, slow, and fast speech rate. The Euclidean distance between vowel space midpoint and formant values for each speaker was used as a measure for vowel distinctiveness. The prosodic model consisted of prominence and boundary. Information density was calculated for each language using the surprisal of the biphone Xn|Xn−1. On average, there is a positive relationship between vowel space expansion and information density. Detailed analysis revealed that this relationship did not hold for Finnish, and was only weak for Polish. When vowel distinctiveness was modeled as a function of prosodic factors and information density in linear mixed effects models (LMM), only prosodic factors were significant in explaining the variance in vowel space expansion. All prosodic factors, except word boundary, showed significant positive results in LMM. Vowels were more distinct in stressed syllables, before a prosodic boundary and at normal and slow speech rate compared to fast speech.

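    The distinctiveness measure in the abstract above (the Euclidean distance between each vowel token's formant values and the speaker's vowel space midpoint) is simple to reproduce. A minimal sketch, assuming per-speaker F1/F2 measurements in Hz (the values below are made-up placeholders, not data from the study):

```python
import numpy as np

def vowel_distinctiveness(f1: np.ndarray, f2: np.ndarray) -> np.ndarray:
    """Euclidean distance of each vowel token (F1, F2) from the speaker's
    vowel space midpoint (the mean formant values over that speaker's tokens)."""
    formants = np.column_stack([f1, f2])   # shape: (n_tokens, 2)
    midpoint = formants.mean(axis=0)       # speaker-specific vowel space midpoint
    return np.linalg.norm(formants - midpoint, axis=1)

# Placeholder F1/F2 values (Hz) for one speaker; larger distances mean more
# peripheral, i.e. more distinct, vowel realisations.
f1 = np.array([300.0, 700.0, 350.0, 650.0])
f2 = np.array([2300.0, 1200.0, 800.0, 1700.0])
print(vowel_distinctiveness(f1, f2).round(1))
```

    Such per-token distances can then serve as the response variable in the linear mixed effects models the abstract describes, with prominence, boundary, speech rate and surprisal as predictors.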
  • 32. Trouvain, Jürgen
    et al.
    Malisz, Zofia
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Inter-speech clicks in an Interspeech keynote (2016). In: Proceedings of INTERSPEECH 2016, San Francisco, CA: International Speech Communication Association, 2016, p. 1397-1401. Conference paper (Refereed)
    Abstract [en]

    Clicks are usually described as phoneme realisations in some African languages or as paralinguistic vocalisations, e.g. to signal disapproval or as sound imitation. A more recent discovery is that clicks are, presumably unintentionally, used as discourse markers indexing a new sequence in a conversation or before a word search. In this single-case study, we investigated more than 300 apical clicks of an experienced speaker during a keynote address at an Interspeech conference. The produced clicks occurred only in inter-speech intervals and were often combined with either hesitation particles like "uhm" or audible inhalation. Our observations suggest a link between click production and ingressive airflow as well as indicate that clicks are used as hesitation markers. The rather high frequency of clicks in the analysed sections from the 1-hour talk shows that in larger discourse, the time between articulatory phases consists of more than silence, audible inhalation and typical hesitation particles. The rather large variation in the intensity and duration and particularly the number of bursts of the observed clicks indicates that this prosodic discourse marker seems to be a rather acoustically inconsistent phonetic category.

  • 33. Wagner, Petra
    et al.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Betz, Simon
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Le Maguer, Sébastien
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Tånnander, Christina
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Speech Synthesis Evaluation: State-of-the-Art Assessment and Suggestion for a Novel Research Program (2019). In: Proceedings of the 10th Speech Synthesis Workshop (SSW10), 2019. Conference paper (Refereed)
  • 34. Wagner, Petra
    et al.
    Inden, Benjamin
    Malisz, Zofia
    Universität Bielefeld.
    Wachsmuth, Ipke
    'Ja, mhm, ich verstehe dich': Oszillator-basiertes Timing multimodaler Feedback-Signale in spontanen Dialogen ['Yes, mhm, I understand you': Oscillator-based timing of multimodal feedback signals in spontaneous dialogues] (2012). In: Elektronische Sprachsignalverarbeitung 2012 (Tagungsband ESSV): Studientexte zur Sprachkommunikation, 2012, Vol. 64, p. 179-187. Conference paper (Refereed)
  • 35. Wagner, Petra
    et al.
    Malisz, Zofia
    Universität Bielefeld.
    Inden, Benjamin
    Wachsmuth, Ipke
    Interaction phonology – a temporal co-ordination component enabling representational alignment within a model of communication (2013). In: Alignment in Communication: Towards a New Theory of Communication / [ed] Wachsmuth, Ipke, Ruiter, Jan de, Jaecks, Petra, Kopp, Stefan, John Benjamins Publishing Company, 2013, Vol. 6, p. 109-132. Chapter in book (Refereed)
    Abstract [en]

    This chapter contrasts mechanisms and models of temporal co-ordination with models of representational alignment. We argue that alignment of linguistic representations needs a logistic component explaining the underlying co-ordinative processes between interlocutors in time, yielding a more profound understanding of how information exchange is managed. The processes and structures subject to this logistic component – or Interaction Phonology – must rely on the rhythmic-phonological structure of individual languages. In this way, interlocutors are able to guide their attention to relevant phonetic detail and to attune to the fine-grained organization underlying the linguistic structure encoded in the incoming speech signal. It is furthermore argued that dynamically entraining oscillators provide testable formal models of such a temporal co-ordination between interlocutors’ speech productions and perceptions.

  • 36.
    White, Laurence
    et al.
    Newcastle University, United Kingdom.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Speech rhythm and timing (2020). In: Oxford Handbook of Language Prosody / [ed] Carlos Gussenhoven and Aoju Chen, Oxford University Press, 2020, p. 166-179. Chapter in book (Refereed)
    Abstract [en]

    Speech events do not typically exhibit the temporal regularity conspicuous in many musical rhythms. In the absence of such surface periodicity, hierarchical approaches to speech timing propose that nested prosodic domains, such as syllables and stress-delimited feet, can be modelled as coupled oscillators and that surface timing patterns reflect variation in the relative weights of oscillators. Localized approaches argue, by contrast, that speech timing is largely organized bottom-up, based on segmental identity and subsyllabic organization, with prosodic lengthening effects locally associated with domain heads and edges. This chapter weighs the claims of the two speech timing approaches against empirical data. It also reviews attempts to develop quantitative indices (‘rhythm metrics’) of cross-linguistic variations in surface timing, in particular in the degree of contrast between stronger and weaker syllables. It further reflects on the shortcomings of categorical ‘rhythm class’ typologies in the face of cross-linguistic evidence from speech production and speech perception.

  • 37.
    Zimmerer, Frank
    et al.
    Saarland University.
    Andreeva, Bistra
    Saarland University.
    Möbius, Bernd
    Saarland University.
    Malisz, Zofia
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Ferragne, Emmanuel
    CNRS Université Lyon 2.
    Pellegrino, François
    CNRS Université Lyon 2.
    Brandt, Erika
    Saarland University.
    Perzeption von Sprechgeschwindigkeit und der (nicht nachgewiesene) Einfluss von Surprisal [Perception of speech rate and the (undemonstrated) influence of surprisal] (2017). In: ESSV - 28. Konferenz Elektronische Sprachsignalverarbeitung 2017, Saarbrücken, 2017. Conference paper (Refereed)
    Abstract [en]

    In two perception experiments, the perception of speech rate was investigated. A factor of particular interest here is surprisal, an information-theoretic measure of the predictability of a linguistic unit in context. Taken together, the results of the experiments suggest that surprisal does not exert a significant influence on the perception of speech rate.

  • 38. Żygis, M.
    et al.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Jaskuła, M.
    Kojder, I.
    The involvement of the cerebellum in speech and non-speech motor timing tasks: A behavioural study of patients with cerebellar dysfunctions (2019). In: Approaches to the Study of Sound Structure and Speech: Interdisciplinary Work in Honour of Katarzyna Dziubalska-Kolaczyk, Informa UK Limited, 2019, p. 227-243. Chapter in book (Other academic)
    Abstract [en]

    This study investigates the role of the cerebellum in the production of motor timing tasks by eight individuals with cerebellar lesions and a control group consisting of eight healthy participants. More specifically, it investigates the ability to reproduce metrical patterns by replicating speech and non-speech stimuli (respectively syllables and taps) arranged in different types of prosodic feet (anapests, trochees, etc.). The results show a clear difference in task accomplishment depending on stimulus type but not metrical pattern type. While there were no significant differences between patients and the control group in the repetition of metrical speech stimuli, the non-speech stimuli could not be repeated by patients to match the correct rate of the control group. This suggests that the tapping task involves cerebellar processes while the metrical speech task does not. Our data also reveal a large interspeaker variation within the patient group. 
