Publications (10 of 38)
Malisz, Z., Foremski, J. & Kul, M. (2024). PRODIS - a speech database and a phoneme-based language model for the study of predictability effects in Polish. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024 (pp. 13068-13073). European Language Resources Association (ELRA)
PRODIS - a speech database and a phoneme-based language model for the study of predictability effects in Polish
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 13068-13073. Conference paper, Published paper (Refereed)
Abstract [en]

We present a speech database and a phoneme-level language model of Polish. The database and model are designed for the analysis of prosodic and discourse factors and their impact on acoustic parameters in interaction with predictability effects. The database is also the first large, publicly available Polish speech corpus of excellent acoustic quality that can be used for phonetic analysis and for training multi-speaker speech technology systems. The speech in the database is processed by a pipeline that is 90% automated and incorporates state-of-the-art, freely available tools, enabling database expansion or adaptation to additional languages.
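
As a minimal illustration of the quantity such a phoneme-based model provides, the sketch below trains an add-alpha-smoothed trigram model over phoneme strings and computes per-phoneme surprisal in bits. The function names, the smoothing choice, and the toy data are illustrative assumptions, not the PRODIS implementation.

```python
# Hedged sketch: phoneme-level surprisal from a trigram model.
# Everything here is illustrative; PRODIS's actual model is in the paper.
import math
from collections import defaultdict

def train_trigram(phoneme_sequences):
    """Count phoneme trigrams and their bigram contexts over a corpus."""
    trigrams = defaultdict(int)
    contexts = defaultdict(int)
    for seq in phoneme_sequences:
        padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
        for i in range(2, len(padded)):
            ctx = (padded[i - 2], padded[i - 1])
            trigrams[(ctx, padded[i])] += 1
            contexts[ctx] += 1
    return trigrams, contexts

def surprisal(phoneme, ctx, trigrams, contexts, vocab_size, alpha=0.1):
    """Surprisal in bits: -log2 P(phoneme | context), add-alpha smoothed."""
    num = trigrams[(ctx, phoneme)] + alpha
    den = contexts[ctx] + alpha * vocab_size
    return -math.log2(num / den)

# Toy usage with made-up phoneme strings:
corpus = [["p", "r", "o", "d", "i", "s"], ["d", "o", "m"]]
tri, ctxs = train_trigram(corpus)
print(surprisal("o", ("<s>", "d"), tri, ctxs, vocab_size=40))
```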

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
database, language model, Polish, probabilistic effects, surprisal
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-348780 (URN); 2-s2.0-85195946964 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024
Note

Part of ISBN 9782493814104

QC 20240701

Available from: 2024-06-27 Created: 2024-06-27 Last updated: 2025-02-01. Bibliographically approved
Pérez Zarazaga, P., Henter, G. E. & Malisz, Z. (2023). A processing framework to access large quantities of whispered speech found in ASMR. In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4-10 June 2023. Rhodes, Greece: IEEE Signal Processing Society
A processing framework to access large quantities of whispered speech found in ASMR
2023 (English). In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece: IEEE Signal Processing Society, 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with human-in-the-loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.
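
The paper's WAD method is not reproduced here; as a hedged sketch of the general idea, the snippet below flags whisper-like frames using spectral flatness (whisper lacks periodic excitation, so its spectrum is noise-like) combined with an energy floor to exclude silence. The thresholds, filename, and feature choices are assumptions.

```python
# Minimal frame-level whispered activity detection (WAD) sketch.
import librosa

def simple_wad(path, flatness_thresh=0.3, energy_thresh=0.01):
    """Mark frames as whisper-like: noise-like spectrum, above the noise floor."""
    y, sr = librosa.load(path, sr=16000)
    flatness = librosa.feature.spectral_flatness(y=y)[0]  # 1.0 = white noise
    rms = librosa.feature.rms(y=y)[0]
    # Whisper has no periodic excitation, so its spectrum is flatter than
    # phonated speech, while RMS energy separates it from silence.
    return (flatness > flatness_thresh) & (rms > energy_thresh)

mask = simple_wad("asmr_clip.wav")  # hypothetical input file
print(f"{mask.mean():.1%} of frames flagged as whisper-like")
```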

Place, publisher, year, edition, pages
Rhodes, Greece: IEEE Signal Processing Society, 2023
Keywords
Whispered speech, WAD, human-in-the-loop, autonomous sensory meridian response
National Category
Signal Processing
Research subject
Information and Communication Technology; Human-computer Interaction; Computer Science; Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-328771 (URN); 10.1109/ICASSP49357.2023.10095965 (DOI); 2-s2.0-85177548955 (Scopus ID)
Conference
ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4-10 June 2023
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861; Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-11-29. Bibliographically approved
Pérez Zarazaga, P. & Malisz, Z. (2023). Recovering implicit pitch contours from formants in whispered speech. Paper presented at the 20th International Congress of Phonetic Sciences (ICPhS 2023), 7-11 August 2023, Prague, Czech Republic.
Recovering implicit pitch contours from formants in whispered speech
2023 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Whispered speech is characterised by a noise-like excitation that results in the lack of fundamental frequency. Considering that prosodic phenomena such as intonation are perceived through f0 variation, the perception of whispered prosody is relatively difficult. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is being transmitted, suggesting that intonation "survives" in whispered formant structure. In this paper, we aim to estimate the way in which formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
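
A minimal PyTorch sketch of step one of the two-step method, a denoising autoencoder mapping whispered formant frames to phonated equivalents, might look as follows; the layer sizes, number of formants, and training data here are placeholders, not the paper's configuration.

```python
# Hedged sketch of a denoising autoencoder over formant frames.
import torch
import torch.nn as nn

class FormantDAE(nn.Module):
    def __init__(self, n_formants=4, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_formants, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden // 2), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(hidden // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, n_formants))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train on a parallel corpus of (whispered, phonated) formant frames;
# random tensors below stand in for real, time-aligned data.
model = FormantDAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
whispered = torch.randn(256, 4)  # placeholder whispered F1..F4 frames
phonated = torch.randn(256, 4)   # placeholder parallel phonated frames
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(whispered), phonated)
    loss.backward()
    opt.step()
```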

Keywords
Whispered speech, Formant contours, Pitch contours, Intonation, Machine learning
National Category
Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-330304 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS 2023), 7-11 August 2023, Prague, Czech Republic
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-06-30. Bibliographically approved
Pérez Zarazaga, P., Malisz, Z., Henter, G. E. & Juvela, L. (2023). Speaker-independent neural formant synthesis. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland (pp. 5556-5560). International Speech Communication Association
Speaker-independent neural formant synthesis
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 5556-5560. Conference paper, Published paper (Refereed)
Abstract [en]

We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spectrograms, which are rendered into waveforms by a pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that the method achieves our goals of accurate control over speech parameters combined with high perceptual audio quality. We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a.k.a. universal vocoding).
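
The abstract implies a control interface of roughly the following shape: per-frame phonetic parameters in, mel-spectrogram out, with a pre-trained neural vocoder rendering the waveform. The sketch below is an assumed skeleton; the parameter set, dimensions, and architecture are illustrative, not the published model.

```python
# Hedged sketch of a parameters-to-mel predictor for neural formant synthesis.
import torch
import torch.nn as nn

N_PARAMS = 7  # e.g. F1-F4, f0, energy, spectral tilt (assumed feature set)
N_MELS = 80

class ParamsToMel(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_PARAMS, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, N_MELS, kernel_size=5, padding=2))

    def forward(self, params):      # params: (batch, N_PARAMS, frames)
        return self.net(params)     # mel:    (batch, N_MELS, frames)

params = torch.randn(1, N_PARAMS, 200)  # 200 frames of control trajectories
mel = ParamsToMel()(params)
# waveform = vocoder(mel)  # hand off to a pre-trained HiFi-GAN / WaveNet
```

Editing a single control trajectory (e.g. scaling the f0 row of `params`) and re-rendering is what makes such a design useful for stimulus creation.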

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
speech synthesis, formant synthesis, neural vocoding
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-329602 (URN); 10.21437/Interspeech.2023-1622 (DOI); 001186650305148 (ISI); 2-s2.0-85171540562 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861; Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20241011

Available from: 2023-06-28 Created: 2023-06-28 Last updated: 2024-10-11. Bibliographically approved
Pérez Zarazaga, P. & Malisz, Z. (2022). Feature Selection for Labelling of Whispered Speech in ASMR Recordings using Edyson. Paper presented at the 33rd Swedish Phonetics Meeting, Fonetik 2022, Stockholm, Sweden, 13-15 June 2022.
Feature Selection for Labelling of Whispered Speech in ASMR Recordings using Edyson
2022 (English). Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

Whispered speech is a challenging area for traditional speech processing algorithms, as its properties differ from phonated speech and whispered data is not as easily available. A great amount of whispered speech recordings, however, can be found in the increasingly popular ASMR genre on streaming platforms like YouTube or Twitch. Whispered speech is used in this genre as a trigger to cause a relaxing sensation in the listener. Accurately separating whispered speech segments from other auditory triggers would provide a wide variety of whispered data that could prove useful in improving the performance of data-driven speech processing methods. We use Edyson as a labelling tool, with which a user can rapidly assign labels to long segments of audio using an interactive graphical interface. In this paper, we propose features that can improve the performance of Edyson with whispered speech and we analyse parameter configurations for different types of sounds. We find Edyson a useful tool for initial labelling of audio data extracted from ASMR recordings that can then be used in more complex models. Our proposed modifications provide better sensitivity to whispered speech, thus improving the performance of Edyson in the labelling of whispered segments.
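
As a hedged illustration of the kind of feature comparison the abstract describes, the snippet below extracts two alternative frame-level feature sets and projects each to 2-D, where the separation of whispered material can be judged visually. Edyson's actual features, its interface, and the filename here are assumptions, not taken from the paper.

```python
# Compare candidate frame-level features for bulk audio labelling.
import numpy as np
import librosa
from sklearn.decomposition import PCA

def frame_features(y, sr, kind="spectral"):
    if kind == "mfcc":
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    # Spectral-shape features, plausibly better suited to whisper:
    return np.vstack([
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
    ]).T

y, sr = librosa.load("asmr_clip.wav", sr=16000)  # hypothetical input
for kind in ("mfcc", "spectral"):
    emb = PCA(n_components=2).fit_transform(frame_features(y, sr, kind))
    print(kind, emb.shape)  # plot each embedding to judge whisper separation
```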

National Category
Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-329430 (URN)
Conference
33rd Swedish Phonetics Meeting, Fonetik 2022, Stockholm, Sweden, 13-15 June 2022
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-06-30. Bibliographically approved
Beck, G., Wennberg, U., Malisz, Z. & Henter, G. E. (2022). Wavebender GAN: An architecture for phonetically meaningful speech manipulation. In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at the 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 22-27, 2022, Singapore. IEEE conference proceedings
Wavebender GAN: An architecture for phonetically meaningful speech manipulation
2022 (English). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE conference proceedings, 2022. Conference paper, Published paper (Refereed)
Abstract [en]

Deep learning has revolutionised synthetic speech quality. However, it has thus far delivered little value to the speech science community. The new methods do not meet the controllability demands that practitioners in this area require, e.g. in listening tests with manipulated speech stimuli. Instead, control of different speech properties in such stimuli is achieved by using legacy signal-processing methods. This limits the range, accuracy, and speech quality of the manipulations. Also, audible artefacts have a negative impact on the methodological validity of results in speech perception studies. This work introduces a system capable of manipulating speech properties through learning rather than design. The architecture learns to control arbitrary speech properties and leverages progress in neural vocoders to obtain realistic output. Experiments with copy synthesis and manipulation of a small set of core speech features (pitch, formants, and voice quality measures) illustrate the promise of the approach for producing speech stimuli that have accurate control and high perceptual quality.
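
As an illustrative sketch (not the paper's extraction pipeline), the core control features the abstract names can be measured with Praat via the parselmouth package; the analysis settings and the choice of harmonics-to-noise ratio as the voice-quality proxy are assumptions.

```python
# Extract pitch, formants, and a voice-quality measure with Praat/parselmouth.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("utterance.wav")  # hypothetical input file
pitch = snd.to_pitch(time_step=0.01)
formants = snd.to_formant_burg(time_step=0.01)

t = 0.50  # seconds; query one analysis frame as an example
f0 = pitch.get_value_at_time(t)        # Hz, nan if unvoiced
f1 = formants.get_value_at_time(1, t)  # first formant, Hz
f2 = formants.get_value_at_time(2, t)
# Voice quality via harmonics-to-noise ratio, using a raw Praat command:
harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
hnr = call(harmonicity, "Get value at time", t, "cubic")
print(f0, f1, f2, hnr)
```

Trajectories of such features, extracted per frame, are the kind of control inputs a learned manipulation system can be conditioned on.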

Place, publisher, year, edition, pages
IEEE conference proceedings, 2022
Series
International Conference on Acoustics, Speech and Signal Processing (ICASSP), ISSN 1520-6149
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-313455 (URN); 10.1109/ICASSP43922.2022.9747442 (DOI); 000864187906095 (ISI); 2-s2.0-85131238464 (Scopus ID)
Conference
47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 22-27, 2022, Singapore
Note

Part of proceedings: ISBN 978-1-6654-0540-9

QC 20220607

Available from: 2022-06-03 Created: 2022-06-03 Last updated: 2024-03-15. Bibliographically approved
White, L. & Malisz, Z. (2020). Speech rhythm and timing. In: Carlos Gussenhoven and Aoju Chen (Eds.), Oxford Handbook of Language Prosody (pp. 166-179). Oxford University Press
Speech rhythm and timing
2020 (English). In: Oxford Handbook of Language Prosody / [ed] Carlos Gussenhoven and Aoju Chen, Oxford University Press, 2020, p. 166-179. Chapter in book (Refereed)
Abstract [en]

Speech events do not typically exhibit the temporal regularity conspicuous in many musical rhythms. In the absence of such surface periodicity, hierarchical approaches to speech timing propose that nested prosodic domains, such as syllables and stress-delimited feet, can be modelled as coupled oscillators and that surface timing patterns reflect variation in the relative weights of oscillators. Localized approaches argue, by contrast, that speech timing is largely organized bottom-up, based on segmental identity and subsyllabic organization, with prosodic lengthening effects locally associated with domain heads and edges. This chapter weighs the claims of the two speech timing approaches against empirical data. It also reviews attempts to develop quantitative indices (‘rhythm metrics’) of cross-linguistic variations in surface timing, in particular in the degree of contrast between stronger and weaker syllables. It further reflects on the shortcomings of categorical ‘rhythm class’ typologies in the face of cross-linguistic evidence from speech production and speech perception.
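
The rhythm metrics mentioned here can be made concrete: one widely used index, the normalised Pairwise Variability Index (nPVI; Grabe & Low 2002), quantifies the durational contrast between successive intervals. A small worked example, with invented durations:

```python
# nPVI over successive interval durations (e.g. vocalic intervals).
def npvi(durations):
    """nPVI = 100 * mean(|d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2))."""
    pairs = zip(durations, durations[1:])
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)

# Toy vocalic durations in ms:
print(npvi([120, 60, 140, 55, 130, 65]))  # alternating long/short: high nPVI
print(npvi([90, 95, 88, 92, 91, 94]))     # near-even timing: low nPVI
```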

Place, publisher, year, edition, pages
Oxford University Press, 2020
Keywords
contrast; speech events; speech perception; speech production; speech rhythm; speech timing; surface timing; typology
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-283998 (URN); 10.1093/oxfordhb/9780198832232.013.10 (DOI); 2-s2.0-85136883474 (Scopus ID)
Note

QC 20230619

Available from: 2020-10-13 Created: 2020-10-13 Last updated: 2023-07-14. Bibliographically approved
Fallgren, P., Malisz, Z. & Edlund, J. (2019). Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Paper presented at the 11th International Conference on Language Resources and Evaluation, LREC 2018, Phoenix Seagaia Conference Center, Miyazaki, Japan, 7-12 May 2018 (pp. 4307-4311). European Language Resources Association (ELRA)
Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data
2019 (English). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), 2019, p. 4307-4311. Conference paper, Published paper (Refereed)
Abstract [en]

We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so-called found data: data not recorded with the specific purpose of being analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers face when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method combines audio browsing through massively multi-object sound environments with a well-known unsupervised dimensionality reduction algorithm, the self-organising map (SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.
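
A hedged sketch of that process chain, using MFCC frame features and the third-party minisom package for the SOM step, is given below; the paper's own prototype software is separate and may be organised quite differently.

```python
# Frame features -> SOM grid; frames sharing a grid cell can be auditioned
# together, enabling non-sequential browsing of an unknown archive.
import numpy as np
import librosa
from minisom import MiniSom

y, sr = librosa.load("archive_chunk.wav", sr=16000)  # hypothetical input
feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)
feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)

som = MiniSom(8, 8, input_len=13, sigma=1.5, learning_rate=0.5)
som.train_random(feats, num_iteration=5000)

# Map each frame to its winning cell; a cell's frames form one browsable group.
cells = np.array([som.winner(f) for f in feats])
print(np.unique(cells, axis=0).shape[0], "occupied cells")
```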

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2019
Keywords
Data visualisation, Found data, Speech archives
National Category
Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-241799 (URN); 000725545004063 (ISI); 2-s2.0-85059880464 (Scopus ID)
Conference
11th International Conference on Language Resources and Evaluation, LREC 2018, Phoenix Seagaia Conference Center, Miyazaki, Japan, 7-12 May 2018
Note

Part of proceedings: ISBN 979-10-95546-00-9

QC 20230206

Available from: 2019-01-25 Created: 2019-01-25 Last updated: 2025-02-18. Bibliographically approved
Fallgren, P., Malisz, Z. & Edlund, J. (2019). How to annotate 100 hours in 45 minutes. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Paper presented at Interspeech 2019, 15-19 September 2019, Graz (pp. 341-345). ISCA
How to annotate 100 hours in 45 minutes
2019 (English). In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISCA, 2019, p. 341-345. Conference paper, Published paper (Refereed)
Abstract [en]

Speech data found in the wild hold many advantages over artificially constructed speech corpora in terms of ecological validity and cultural worth. Perhaps most importantly, there is a lot of it. However, the combination of great quantity, noisiness and variation poses a challenge for its access and processing. Generally speaking, automatic approaches to tackle the problem require good labels for training, while manual approaches require time. In this study, we provide further evidence for a semi-supervised, human-in-the-loop framework that has previously shown promising results for browsing and annotating large quantities of found audio data quickly. The findings of this study show that a 100-hour long subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time it would take traditional annotation methods, without a loss in performance.
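
A minimal sketch of the cluster-then-label idea underlying such speed-ups: cluster frame-level features, audition a few examples per cluster, and propagate one speech/non-speech label to the whole cluster. Everything below (features, cluster count, the label mapping) is an illustrative assumption, not the paper's framework.

```python
# Cluster-then-label: one human decision covers many frames at once.
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load("fearless_steps_chunk.wav", sr=8000)  # hypothetical file
feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

km = KMeans(n_clusters=10, n_init=10).fit(feats)
# A human listens to a handful of frames drawn from each cluster and labels
# the cluster as a whole; this mapping is a placeholder for that judgment.
cluster_label = {i: "speech" if i % 2 else "nonspeech" for i in range(10)}
frame_labels = [cluster_label[c] for c in km.labels_]
print(frame_labels[:20])
```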

Place, publisher, year, edition, pages
ISCA, 2019
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-268304 (URN)10.21437/Interspeech.2019-1648 (DOI)000831796400069 ()2-s2.0-85074718085 (Scopus ID)
Conference
Interspeech 2019, 15-19 September 2019, Graz
Note

QC 20200310

Available from: 2020-03-10 Created: 2020-03-10 Last updated: 2025-02-07. Bibliographically approved
Malisz, Z., Henter, G. E., Valentini-Botinhao, C., Watts, O., Beskow, J. & Gustafson, J. (2019). Modern speech synthesis for phonetic sciences: A discussion and an evaluation. In: Proceedings of ICPhS. Paper presented at the International Congress of Phonetic Sciences (ICPhS 2019), 5-9 August 2019, Melbourne Convention and Exhibition Centre, Melbourne, Australia.
Modern speech synthesis for phonetic sciences: A discussion and an evaluation
2019 (English). In: Proceedings of ICPhS, 2019. Conference paper, Published paper (Refereed)
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-260956 (URN)
Conference
International Congress of Phonetic Sciences (ICPhS 2019), 5-9 August 2019, Melbourne Convention and Exhibition Centre, Melbourne, Australia
Funder
Swedish Research Council, 2017-02861
Note

QC 20191112

Available from: 2019-09-30 Created: 2019-09-30 Last updated: 2025-02-07. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-5953-7310