Pérez Zarazaga, Pablo (ORCID iD: orcid.org/0000-0002-6166-9061)
Publications (4 of 4)
Pérez Zarazaga, P., Henter, G. E. & Malisz, Z. (2023). A processing framework to access large quantities of whispered speech found in ASMR. In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4-10 June 2023. Rhodes, Greece: IEEE Signal Processing Society
A processing framework to access large quantities of whispered speech found in ASMR
2023 (English). In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes, Greece: IEEE Signal Processing Society, 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with human-in-the-loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.
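As an illustration of the whispered-activity-detection idea described in the abstract, here is a minimal Python sketch. It is not the paper's WAD pipeline; the feature choices (frame energy, spectral flatness, absence of pYIN voicing) and all threshold values are assumptions made for illustration only.

```python
# Minimal sketch of a frame-level whisper-candidate detector.
# NOT the paper's WAD method: it only illustrates the kind of cues
# (audible energy, noise-like spectrum, no pitch) that separate
# whisper from phonated speech and from other ASMR triggers.
import numpy as np
import librosa

def whisper_candidate_frames(path, sr=16000, flatness_thr=0.3, rms_thr=0.01):
    y, sr = librosa.load(path, sr=sr)
    rms = librosa.feature.rms(y=y)[0]                      # frame energy
    flatness = librosa.feature.spectral_flatness(y=y)[0]   # noise-likeness
    # pYIN voicing decision: whisper should come out unvoiced.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"), sr=sr)
    n = min(len(rms), len(flatness), len(voiced_flag))
    # Candidate whisper frames: energetic, spectrally flat, unvoiced.
    return (rms[:n] > rms_thr) & (flatness[:n] > flatness_thr) & ~voiced_flag[:n]
```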

Place, publisher, year, edition, pages
Rhodes, Greece: IEEE Signal Processing Society, 2023
Keywords
Whispered speech, WAD, human-in-the-loop, autonomous sensory meridian response
National Category
Signal Processing
Research subject
Information and Communication Technology; Human-computer Interaction; Computer Science; Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-328771 (URN)
10.1109/ICASSP49357.2023.10095965 (DOI)
2-s2.0-85177548955 (Scopus ID)
Conference
ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4-10 June 2023
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-11-29 Bibliographically approved
Pérez Zarazaga, P. & Malisz, Z. (2023). Recovering implicit pitch contours from formants in whispered speech. Paper presented at the 20th International Congress of Phonetic Sciences (ICPhS 2023), 7-11 August 2023, Prague, Czech Republic.
Recovering implicit pitch contours from formants in whispered speech
2023 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Whispered speech is characterised by a noise-like excitation that results in the lack of fundamental frequency. Considering that prosodic phenomena such as intonation are perceived through f0 variation, the perception of whispered prosody is relatively difficult. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is being transmitted, suggesting that intonation "survives" in whispered formant structure.

In this paper, we aim to estimate the way in which formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
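The first step of the two-step method can be sketched briefly. Below is a minimal PyTorch denoising autoencoder mapping whispered formant frames to their phonated equivalents on a parallel corpus; the four-formant input, layer sizes, and training loop are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch of step 1: a denoising autoencoder that maps
# whispered formant frames to phonated equivalents. Dimensions and
# layer sizes are assumptions, not the authors' exact model.
import torch
import torch.nn as nn

class FormantDAE(nn.Module):
    def __init__(self, n_formants=4, hidden=64, bottleneck=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_formants, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, n_formants))

    def forward(self, whispered_formants):
        return self.decoder(self.encoder(whispered_formants))

# Training against time-aligned phonated frames (toy stand-in data):
model = FormantDAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
whispered = torch.randn(32, 4)   # stand-in whispered formant frames
phonated = torch.randn(32, 4)    # stand-in aligned phonated frames
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(whispered), phonated)
    loss.backward()
    opt.step()
```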

Keywords
Whispered speech, Formant contours, Pitch contours, Intonation, Machine learning
National Category
Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-330304 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS 2023), 7-11 August 2023, Prague, Czech Republic
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-06-30 Bibliographically approved
Pérez Zarazaga, P., Malisz, Z., Henter, G. E. & Juvela, L. (2023). Speaker-independent neural formant synthesis. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, 20-24 August 2023 (pp. 5556-5560). International Speech Communication Association
Speaker-independent neural formant synthesis
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 5556-5560. Conference paper, Published paper (Refereed)
Abstract [en]

We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spectrograms, which are rendered into waveforms by a pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that the method achieves our goals of accurate control over speech parameters combined with high perceptual audio quality. We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a.k.a. universal vocoding).
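The described control-to-waveform pipeline can be sketched as follows. This is an illustrative assumption, not the authors' implementation: a small recurrent network maps per-frame speech parameters to mel-spectrogram frames, which a separately loaded pre-trained vocoder (e.g. HiFi-GAN, as used in the paper) would render to audio. The vocoder call is shown only as a commented placeholder, not a real API.

```python
# Sketch of the pipeline: phonetically meaningful per-frame parameters
# (e.g. F1-F4, f0, energy) -> predicted mel-spectrogram -> waveform via
# a pre-trained neural vocoder. Network shape is an assumption.
import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    def __init__(self, n_params=6, n_mels=80, hidden=256):
        super().__init__()
        # Per-frame speech parameters in, per-frame mel bins out.
        self.rnn = nn.GRU(n_params, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, params):           # (batch, frames, n_params)
        h, _ = self.rnn(params)
        return self.proj(h)              # (batch, frames, n_mels)

params = torch.randn(1, 200, 6)           # stand-in control trajectories
mel = MelPredictor()(params)
# waveform = hifigan_generator(mel.transpose(1, 2))
#   ^ hypothetical call to a separately loaded pre-trained vocoder.
```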

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
speech synthesis, formant synthesis, neural vocoding
National Category
Signal Processing
Identifiers
urn:nbn:se:kth:diva-329602 (URN)
10.21437/Interspeech.2023-1622 (DOI)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, 20-24 August 2023
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20230825

Available from: 2023-06-28 Created: 2023-06-28 Last updated: 2023-10-10 Bibliographically approved
Pérez Zarazaga, P. & Malisz, Z. (2022). Feature Selection for Labelling of Whispered Speech in ASMR Recordings using Edyson. Paper presented at the 33rd Swedish Phonetics Meeting (Fonetik 2022), Stockholm, Sweden, 13-15 June 2022.
Feature Selection for Labelling of Whispered Speech in ASMR Recordings using Edyson
2022 (English). Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

Whispered speech is a challenging area for traditional speech processing algorithms, as its properties differ from phonated speech and whispered data is not as easily available. A great amount of whispered speech recordings, however, can be found in the increasingly popular genre of ASMR on streaming platforms like YouTube or Twitch. Whispered speech is used in this genre as a trigger to cause a relaxing sensation in the listener. Accurately separating whispered speech segments from other auditory triggers would provide a wide variety of whispered data that could prove useful in improving the performance of data-driven speech processing methods. We use Edyson as a labelling tool, with which a user can rapidly assign labels to long segments of audio using an interactive graphical interface. In this paper, we propose features that can improve the performance of Edyson with whispered speech and we analyse parameter configurations for different types of sounds. We find Edyson a useful tool for initial labelling of audio data extracted from ASMR recordings that can then be used in more complex models. Our proposed modifications provide a better sensitivity for whispered speech, thus improving the performance of Edyson in the labelling of whispered segments.
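The Edyson-style bulk-labelling workflow, extracting per-segment features and projecting them to 2D so an annotator can listen to and label clusters, can be sketched as below. The specific features (mean MFCCs plus spectral flatness), the segment length, and the input file name are illustrative assumptions, not necessarily the configuration proposed in the paper.

```python
# Sketch of an Edyson-style workflow: per-segment features -> 2D
# embedding -> interactive cluster labelling by a human annotator.
import numpy as np
import librosa
from sklearn.manifold import TSNE

def segment_features(y, sr, seg_len=1.0):
    """Mean MFCCs + spectral flatness for fixed-length segments."""
    hop = int(seg_len * sr)
    feats = []
    for start in range(0, len(y) - hop, hop):
        seg = y[start:start + hop]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1)
        flat = librosa.feature.spectral_flatness(y=seg).mean()
        feats.append(np.concatenate([mfcc, [flat]]))
    return np.array(feats)

# "asmr_clip.wav" is a hypothetical input recording; t-SNE needs
# more segments than its perplexity, so use a reasonably long clip.
y, sr = librosa.load("asmr_clip.wav", sr=16000)
feats = segment_features(y, sr)
coords = TSNE(n_components=2, perplexity=5).fit_transform(feats)
# `coords` places each segment in 2D; the annotator sweeps over
# clusters and assigns labels (whisper / trigger / other).
```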

National Category
Signal Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-329430 (URN)
Conference
33rd Swedish Phonetics Meeting (Fonetik 2022), Stockholm, Sweden, 13-15 June 2022
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861
Note

QC 20230630

Available from: 2023-06-29 Created: 2023-06-29 Last updated: 2023-06-30 Bibliographically approved