Publications (10 of 37)
Pérez Zarazaga, P., Henter, G. E. & Malisz, Z. (2023). A processing framework to access large quantities of whispered speech found in ASMR. In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4-10 June 2023. Rhodes, Greece: IEEE Signal Processing Society
A processing framework to access large quantities of whispered speech found in ASMR
2023 (English). In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece: IEEE Signal Processing Society, 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with a human in the loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.
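
The pipeline is described only in prose here; as a rough illustration, a frame-level whisper-activity detector in this spirit could be prototyped as below. This is a minimal sketch assuming librosa and scikit-learn; the feature set and classifier are illustrative choices, not the authors' implementation.

```python
# Illustrative sketch only: a frame-level whisper-activity detector in the
# spirit of the paper's WAD step. Feature set and classifier are assumptions,
# not the authors' implementation.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def frame_features(path, sr=16000, frame=0.025, hop=0.010):
    """MFCCs plus spectral flatness per frame; flatness helps separate
    noise-like whisper from other ASMR triggers (tapping, crinkling)."""
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(frame * sr)
    hop_length = int(hop * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop_length)
    flat = librosa.feature.spectral_flatness(y=y, n_fft=n_fft,
                                             hop_length=hop_length)
    return np.vstack([mfcc, flat]).T  # (n_frames, 14)

# X_train, y_train: stacked frame features from labelled clips,
# 1 = whisper, 0 = other. In the paper the labels come from Edyson's
# human-in-the-loop annotation.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
# clf.fit(X_train, y_train)
# whisper_frames = clf.predict(frame_features("asmr_clip.wav"))
```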

Place, publisher, year, edition, pages
Rhodes, Greece: IEEE Signal Processing Society, 2023
Keywords
Whispered speech, WAD, human-in-the-loop, autonomous sensory meridian response
National Category
Signal Processing
Research subject
Information and Communication Technology; Human-computer Interaction; Computer Science; Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-328771 (URN), 10.1109/ICASSP49357.2023.10095965 (DOI), 2-s2.0-85177548955 (Scopus ID)
Conference
ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4-10 June 2023
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861; Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-11-29. Bibliographically approved
Pérez Zarazaga, P. & Malisz, Z. (2023). Recovering implicit pitch contours from formants in whispered speech. Paper presented at the 20th International Congress of Phonetic Sciences (ICPhS 2023), 7-11 August 2023, Prague, Czech Republic.
Recovering implicit pitch contours from formants in whispered speech
2023 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Whispered speech is characterised by a noise-like excitation that results in the lack of fundamental frequency. Considering that prosodic phenomena such as intonation are perceived through f0 variation, the perception of whispered prosody is relatively difficult. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is being transmitted, suggesting that intonation "survives" in whispered formant structure. In this paper, we aim to estimate the way in which formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
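
A minimal sketch of the two-step idea follows; assumptions throughout: the denoising autoencoder is stood in for by a plain MLP regressor for brevity, and the arrays are placeholders, not the paper's corpus.

```python
# Illustrative sketch of the paper's two-step method, not the authors' code:
# (1) map whispered formant tracks to phonated-equivalent formants,
# (2) predict a pitch contour from those formants.
from sklearn.neural_network import MLPRegressor

# Parallel corpus, per frame (placeholder arrays):
# W:  (n_frames, 3) whispered formants F1-F3
# P:  (n_frames, 3) phonated formants F1-F3
# f0: (n_frames,)   phonated pitch in Hz

step1 = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
# step1.fit(W, P)        # whisper -> "phonated" formants

step2 = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
# step2.fit(P, f0)       # formants -> implicit pitch contour

# At test time, chain the two models on whispered input:
# f0_hat = step2.predict(step1.predict(W_test))
```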

Keywords
Whispered speech, Formant contours, Pitch contours, Intonation, Machine learning
National Category
Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-330304 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS 2023), 7-11 August 2023, Prague, Czech Republic
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-06-30. Bibliographically approved
Pérez Zarazaga, P., Malisz, Z., Henter, G. E. & Juvela, L. (2023). Speaker-independent neural formant synthesis. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, 20-24 August 2023 (pp. 5556-5560). International Speech Communication Association
Speaker-independent neural formant synthesis
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 5556-5560. Conference paper, Published paper (Refereed)
Abstract [en]

We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spectrograms, which are rendered into waveforms by a pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that the method achieves our goals of accurate control over speech parameters combined with high perceptual audio quality. We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a.k.a. universal vocoding).
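A sketch of the pipeline shape described above (control parameters to mel-spectrogram to a pre-trained vocoder), assuming a simple feed-forward predictor; layer sizes and the vocoder handle are placeholders, not the authors' architecture.

```python
# Minimal sketch, assumptions throughout: phonetically meaningful parameters
# -> predicted mel-spectrogram -> pre-trained neural vocoder.
import torch
import torch.nn as nn

class FormantsToMel(nn.Module):
    """Map per-frame speech parameters (e.g. f0, F1-F4, energy)
    to 80-bin mel-spectrogram frames."""
    def __init__(self, n_params=6, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_mels),
        )

    def forward(self, params):          # (batch, frames, n_params)
        return self.net(params)         # (batch, frames, n_mels)

model = FormantsToMel()
params = torch.randn(1, 200, 6)         # 200 frames of control parameters
mel = model(params)
# waveform = pretrained_vocoder(mel)    # e.g. HiFi-GAN, loaded separately
```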

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
speech synthesis, formant synthesis, neural vocoding
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-329602 (URN), 10.21437/Interspeech.2023-1622 (DOI)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, 20-24 August 2023
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861; Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20230825

Available from: 2023-06-28 Created: 2023-06-28 Last updated: 2023-10-10. Bibliographically approved
Pérez Zarazaga, P. & Malisz, Z. (2022). Feature Selection for Labelling of Whispered Speech in ASMR Recordings using Edyson. Paper presented at the 33rd Swedish Phonetics Meeting (Fonetik 2022), Stockholm, Sweden, 13-15 June 2022.
Feature Selection for Labelling of Whispered Speech in ASMR Recordings using Edyson
2022 (English). Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

Whispered speech is a challenging area for traditional speech processing algorithms, as its properties differ from phonated speech and whispered data is not as easily available. A great amount of whispered speech recordings, however, can be found in the increasingly popular ASMR genre on streaming platforms like YouTube or Twitch. Whispered speech is used in this genre as a trigger to cause a relaxing sensation in the listener. Accurately separating whispered speech segments from other auditory triggers would provide a wide variety of whispered data that could prove useful in improving the performance of data-driven speech processing methods. We use Edyson as a labelling tool, with which a user can rapidly assign labels to long segments of audio using an interactive graphical interface. In this paper, we propose features that can improve the performance of Edyson with whispered speech and we analyse parameter configurations for different types of sounds. We find Edyson a useful tool for initial labelling of audio data extracted from ASMR recordings that can then be used in more complex models. Our proposed modifications provide a better sensitivity for whispered speech, thus improving the performance of Edyson in the labelling of whispered segments.
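
A sketch of the feature-extraction side of such a workflow (the exact feature set and Edyson's internals are assumptions here): per-frame features are embedded in two dimensions so a human can bulk-label whisper versus other ASMR triggers by listening.

```python
# Illustrative sketch, not Edyson's actual internals.
import librosa
import numpy as np
from sklearn.manifold import TSNE

y, sr = librosa.load("asmr_clip.wav", sr=16000)
hop = 160  # 10 ms at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)
zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)
feats = np.vstack([mfcc, centroid, zcr]).T   # (n_frames, 15)

# 2-D layout feeding the interactive listening/labelling interface.
xy = TSNE(n_components=2, perplexity=30).fit_transform(feats)
```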

National Category
Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-329430 (URN)
Conference
33rd Swedish Phonetics Meeting (Fonetik 2022), Stockholm, Sweden, 13-15 June 2022
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-06-30. Bibliographically approved
Beck, G., Wennberg, U., Malisz, Z. & Henter, G. E. (2022). Wavebender GAN: An architecture for phonetically meaningful speech manipulation. In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at the 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 22-27 May 2022, Singapore. IEEE conference proceedings
Wavebender GAN: An architecture for phonetically meaningful speech manipulation
2022 (English). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE conference proceedings, 2022. Conference paper, Published paper (Refereed)
Abstract [en]

Deep learning has revolutionised synthetic speech quality. However, it has thus far delivered little value to the speech science community. The new methods do not meet the controllability demands that practitioners in this area require, e.g. in listening tests with manipulated speech stimuli. Instead, control of different speech properties in such stimuli is achieved by using legacy signal-processing methods. This limits the range, accuracy, and speech quality of the manipulations. Also, audible artefacts have a negative impact on the methodological validity of results in speech perception studies. This work introduces a system capable of manipulating speech properties through learning rather than design. The architecture learns to control arbitrary speech properties and leverages progress in neural vocoders to obtain realistic output. Experiments with copy synthesis and manipulation of a small set of core speech features (pitch, formants, and voice quality measures) illustrate the promise of the approach for producing speech stimuli that have accurate control and high perceptual quality.
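
For illustration, the kind of core control feature such a system learns to manipulate can be extracted with standard tools; the sketch below pulls a pitch contour with librosa's pyin and flattens it, a typical listening-test manipulation. The resynthesis call is hypothetical, not the Wavebender GAN code.

```python
# Illustrative sketch: extract a pitch control track and edit it.
import librosa
import numpy as np

y, sr = librosa.load("stimulus.wav", sr=22050)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)

# A listening-test manipulation expressed as an edit to the control track,
# e.g. flattening the pitch contour before resynthesis:
f0_flat = np.where(np.isnan(f0), np.nan, np.nanmedian(f0))
# resynth = wavebender_like_model(f0_flat, other_features)  # hypothetical
```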

Place, publisher, year, edition, pages
IEEE conference proceedings, 2022
Series
International Conference on Acoustics, Speech and Signal Processing (ICASSP), ISSN 1520-6149
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-313455 (URN), 10.1109/ICASSP43922.2022.9747442 (DOI), 000864187906095 (ISI), 2-s2.0-85131238464 (Scopus ID)
Conference
47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 22-27 May 2022, Singapore
Note

Part of proceedings: ISBN 978-1-6654-0540-9

QC 20220607

Available from: 2022-06-03 Created: 2022-06-03 Last updated: 2024-03-15. Bibliographically approved
White, L. & Malisz, Z. (2020). Speech rhythm and timing. In: Carlos Gussenhoven & Aoju Chen (Eds.), Oxford Handbook of Language Prosody (pp. 166-179). Oxford University Press
Speech rhythm and timing
2020 (English). In: Oxford Handbook of Language Prosody / [ed] Carlos Gussenhoven and Aoju Chen, Oxford University Press, 2020, pp. 166-179. Chapter in book (Refereed)
Abstract [en]

Speech events do not typically exhibit the temporal regularity conspicuous in many musical rhythms. In the absence of such surface periodicity, hierarchical approaches to speech timing propose that nested prosodic domains, such as syllables and stress-delimited feet, can be modelled as coupled oscillators and that surface timing patterns reflect variation in the relative weights of oscillators. Localized approaches argue, by contrast, that speech timing is largely organized bottom-up, based on segmental identity and subsyllabic organization, with prosodic lengthening effects locally associated with domain heads and edges. This chapter weighs the claims of the two speech timing approaches against empirical data. It also reviews attempts to develop quantitative indices (‘rhythm metrics’) of cross-linguistic variations in surface timing, in particular in the degree of contrast between stronger and weaker syllables. It further reflects on the shortcomings of categorical ‘rhythm class’ typologies in the face of cross-linguistic evidence from speech production and speech perception.
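
For readers unfamiliar with the rhythm metrics the chapter reviews, one standard example is the normalised Pairwise Variability Index (nPVI) over successive vocalic interval durations; a minimal implementation (the example durations are invented):

```python
# nPVI = 100/(m-1) * sum_k |d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2),
# for m interval durations d_1..d_m.
def npvi(durations):
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)

# e.g. vowel durations in seconds from a hand-segmented utterance:
print(npvi([0.08, 0.15, 0.07, 0.12, 0.09]))
```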

Place, publisher, year, edition, pages
Oxford University Press, 2020
Keywords
contrast; speech events; speech perception; speech production; speech rhythm; speech timing; surface timing; typology
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-283998 (URN), 10.1093/oxfordhb/9780198832232.013.10 (DOI), 2-s2.0-85136883474 (Scopus ID)
Note

QC 20230619

Available from: 2020-10-13 Created: 2020-10-13 Last updated: 2023-07-14. Bibliographically approved
Fallgren, P., Malisz, Z. & Edlund, J. (2019). Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paper presented at the 11th International Conference on Language Resources and Evaluation (LREC 2018), Phoenix Seagaia Conference Center, Miyazaki, Japan, 7-12 May 2018 (pp. 4307-4311). European Language Resources Association (ELRA)
Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data
2019 (English). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 2019, pp. 4307-4311. Conference paper, Published paper (Refereed)
Abstract [en]

We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so-called found data: data not recorded with the specific purpose of being analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers see when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method is a combination of audio browsing through massively multi-object sound environments and a well-known unsupervised dimensionality reduction algorithm, the self-organising map (SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.
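
A minimal sketch of the SOM embedding step (assuming the third-party minisom package; the paper's own toolchain may differ): short audio snippets are summarised as feature vectors and arranged on a self-organising map for browsing.

```python
# Illustrative sketch of SOM-based audio layout, not the paper's software.
import librosa
import numpy as np
from minisom import MiniSom

def snippet_vector(path):
    """One mean-MFCC vector per audio snippet."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

# vectors = np.stack([snippet_vector(p) for p in snippet_paths])
som = MiniSom(10, 10, input_len=13, sigma=1.0, learning_rate=0.5)
# som.train_random(vectors, num_iteration=1000)
# grid_position = som.winner(vectors[0])   # map cell where a snippet lands
```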

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2019
Keywords
Data visualisation, Found data, Speech archives
National Category
Media Engineering
Identifiers
urn:nbn:se:kth:diva-241799 (URN), 000725545004063 (ISI), 2-s2.0-85059880464 (Scopus ID)
Conference
11th International Conference on Language Resources and Evaluation (LREC 2018), Phoenix Seagaia Conference Center, Miyazaki, Japan, 7-12 May 2018
Note

Part of proceedings: ISBN 979-10-95546-00-9

QC 20230206

Available from: 2019-01-25 Created: 2019-01-25 Last updated: 2023-02-06. Bibliographically approved
Fallgren, P., Malisz, Z. & Edlund, J. (2019). How to annotate 100 hours in 45 minutes. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Paper presented at Interspeech 2019, 15-19 September 2019, Graz, Austria (pp. 341-345). ISCA
How to annotate 100 hours in 45 minutes
2019 (English). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISCA, 2019, pp. 341-345. Conference paper, Published paper (Refereed)
Abstract [en]

Speech data found in the wild hold many advantages over artificially constructed speech corpora in terms of ecological validity and cultural worth. Perhaps most importantly, there is a lot of it. However, the combination of great quantity, noisiness and variation poses a challenge for its access and processing. Generally speaking, automatic approaches to tackle the problem require good labels for training, while manual approaches require time. In this study, we provide further evidence for a semi-supervised, human-in-the-loop framework that previously has shown promising results for browsing and annotating large quantities of found audio data quickly. The findings of this study show that a 100-hour long subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time it would take traditional annotation methods, without a loss in performance.
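
A toy sketch of why cluster-then-label annotation scales this way (details assumed, not taken from the paper): frames are clustered, the annotator labels each cluster by ear, and the label propagates to every member frame.

```python
# Toy sketch of bulk annotation by label propagation.
import numpy as np
from sklearn.cluster import KMeans

# feats: (n_frames, n_dims) acoustic features over the 100-hour corpus.
feats = np.random.rand(100_000, 14)          # placeholder data
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(feats)

# A human listens to a few examples per cluster and makes ~20 decisions
# instead of millions of per-frame ones:
cluster_label = {c: "speech" if c % 2 else "non-speech" for c in range(20)}
frame_labels = np.array([cluster_label[c] for c in km.labels_])
```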

Place, publisher, year, edition, pages
ISCA, 2019
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-268304 (URN), 10.21437/Interspeech.2019-1648 (DOI), 000831796400069 (ISI), 2-s2.0-85074718085 (Scopus ID)
Conference
Interspeech 2019, 15-19 September 2019, Graz, Austria
Note

QC 20200310

Available from: 2020-03-10 Created: 2020-03-10 Last updated: 2022-09-23. Bibliographically approved
Malisz, Z., Henter, G. E., Valentini-Botinhao, C., Watts, O., Beskow, J. & Gustafson, J. (2019). Modern speech synthesis for phonetic sciences: A discussion and an evaluation. In: Proceedings of ICPhS. Paper presented at the International Congress of Phonetic Sciences (ICPhS 2019), 5-9 August 2019, Melbourne Convention and Exhibition Centre, Melbourne, Australia.
Modern speech synthesis for phonetic sciences: A discussion and an evaluation
2019 (English). In: Proceedings of ICPhS, 2019. Conference paper, Published paper (Refereed)
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-260956 (URN)
Conference
International Congress of Phonetic Sciences (ICPhS 2019), 5-9 August 2019, Melbourne Convention and Exhibition Centre, Melbourne, Australia
Funder
Swedish Research Council, 2017-02861
Note

QC 20191112

Available from: 2019-09-30 Created: 2019-09-30 Last updated: 2024-03-15. Bibliographically approved
Malisz, Z., Berthelsen, H., Beskow, J. & Gustafson, J. (2019). PROMIS: a statistical-parametric speech synthesis system with prominence control via a prominence network. In: Proceedings of SSW 10 - The 10th ISCA Speech Synthesis Workshop. Paper presented at SSW 10 - The 10th ISCA Speech Synthesis Workshop, Vienna, Austria.
PROMIS: a statistical-parametric speech synthesis system with prominence control via a prominence network
2019 (English). In: Proceedings of SSW 10 - The 10th ISCA Speech Synthesis Workshop, Vienna, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

We implement an architecture with explicit prominence learning via a prominence network in Merlin, a statistical-parametric DNN-based text-to-speech system. We build on our previous results that successfully evaluated the inclusion of an automatically extracted, speech-based prominence feature into the training and its control at synthesis time. In this work, we expand the PROMIS system by implementing the prominence network that predicts prominence values from text. We test the network predictions as well as the effects of a prominence control module based on SSML-like tags. Listening tests for the complete PROMIS system, combining a prominence feature, a prominence network and prominence control, show that it effectively controls prominence in a diagnostic set of target words. The tests also show a minor negative impact on perceived naturalness, relative to baseline, exerted by the two prominence tagging methods implemented in the control module.
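
The abstract mentions SSML-like prominence tags; as a sketch of the control side, a hypothetical tag format and parser (the actual tag syntax used in PROMIS is not given here), turning marked-up text into a per-word prominence value for the synthesiser:

```python
# Hypothetical SSML-like prominence markup and parser (illustrative only).
import re

def parse_prominence(text, default=1.0):
    """'a <prom level="2.0">big</prom> dog'
       -> [('a', 1.0), ('big', 2.0), ('dog', 1.0)]"""
    out = []
    pattern = re.compile(r'<prom level="([\d.]+)">(.*?)</prom>|(\S+)')
    for m in pattern.finditer(text):
        if m.group(3):                       # untagged word
            out.append((m.group(3), default))
        else:                                # tagged span, may span words
            for word in m.group(2).split():
                out.append((word, float(m.group(1))))
    return out

print(parse_prominence('a <prom level="2.0">big</prom> dog'))
```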

Place, publisher, year, edition, pages
Vienna, 2019
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-283137 (URN)
Conference
SSW 10 - The 10th ISCA Speech Synthesis Workshop
Funder
Swedish Research Council, 2017-02861
Note

QC 20201020

Available from: 2020-10-05 Created: 2020-10-05 Last updated: 2022-06-25. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-5953-7310
