kth.se Publications
1 - 4 of 4
  • 1.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    A processing framework to access large quantities of whispered speech found in ASMR (2023). In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece: IEEE Signal Processing Society, 2023. Conference paper (Refereed)
    Abstract [en]

    Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with a human in the loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.

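    The abstract above describes a pipeline built around whispered activity detection (WAD), bulk labelling with Edyson, and a segment-extraction classifier. As a rough, illustrative stand-in for the detection step only, the Python sketch below flags candidate whisper frames from generic acoustic cues (spectral flatness and frame energy) and merges them into segments. The function names, thresholds, and feature choices are assumptions made for illustration, not the authors' WAD model.

    # Illustrative whisper-activity heuristic; NOT the paper's WAD model.
    import numpy as np
    import librosa

    def naive_whisper_activity(wav_path, sr=16000, flatness_thresh=0.3, energy_thresh=0.01):
        """Return per-frame booleans marking candidate whispered-speech frames."""
        y, _ = librosa.load(wav_path, sr=sr)
        # Whisper has a noise-like excitation: relatively high spectral flatness
        # together with non-negligible energy roughly separates it from silence
        # and from clearly phonated speech in this crude heuristic.
        flatness = librosa.feature.spectral_flatness(y=y)[0]
        rms = librosa.feature.rms(y=y)[0]
        return (flatness > flatness_thresh) & (rms > energy_thresh)

    def frames_to_segments(mask, hop_length=512, sr=16000, min_dur=0.3):
        """Merge consecutive active frames into (start, end) times in seconds."""
        segments, start = [], None
        for i, active in enumerate(np.append(mask, False)):
            if active and start is None:
                start = i
            elif not active and start is not None:
                t0, t1 = start * hop_length / sr, i * hop_length / sr
                if t1 - t0 >= min_dur:
                    segments.append((t0, t1))
                start = None
        return segments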
  • 2.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Feature Selection for Labelling of Whispered Speech in ASMR Recordings using Edyson (2022). Conference paper (Refereed)
    Abstract [en]

    Whispered speech is a challenging area for traditional speech processing algorithms, as its properties differ from phonated speech and whispered data is not as easily available. A large amount of whispered speech recordings, however, can be found in the increasingly popular ASMR genre on streaming platforms like YouTube or Twitch. Whispered speech is used in this genre as a trigger to cause a relaxing sensation in the listener. Accurately separating whispered speech segments from other auditory triggers would provide a wide variety of whispered data, which could prove useful in improving the performance of data-driven speech processing methods. We use Edyson as a labelling tool, with which a user can rapidly assign labels to long segments of audio using an interactive graphical interface. In this paper, we propose features that can improve the performance of Edyson with whispered speech and we analyse parameter configurations for different types of sounds. We find Edyson a useful tool for initial labelling of audio data extracted from ASMR recordings that can then be used in more complex models. Our proposed modifications provide better sensitivity to whispered speech, thus improving the performance of Edyson in the labelling of whispered segments.
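    To give a rough sense of the kind of frame-level front end that bulk labelling of long ASMR recordings relies on, the sketch below computes a generic per-frame feature set and projects it to two dimensions for interactive browsing. It does not reproduce Edyson's interface or the features proposed in the paper; the feature choices and the PCA projection are assumptions made purely for illustration.

    # Generic frame-level features plus a 2-D projection for bulk labelling;
    # the feature set is an assumption, not the paper's proposed selection.
    import numpy as np
    import librosa
    from sklearn.decomposition import PCA

    def frame_features(y, sr):
        """Stack per-frame descriptors that plausibly separate whisper from
        other ASMR triggers (tapping, mouth sounds, ambient noise, ...)."""
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        flatness = librosa.feature.spectral_flatness(y=y)
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
        zcr = librosa.feature.zero_crossing_rate(y)
        return np.vstack([mfcc, flatness, centroid, zcr]).T   # (frames, dims)

    def embed_2d(features):
        """Project frames to 2-D so a human can browse and label clusters in bulk."""
        return PCA(n_components=2).fit_transform(features)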

  • 3.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Recovering implicit pitch contours from formants in whispered speech (2023). Conference paper (Refereed)
    Abstract [en]

    Whispered speech is characterised by a noise-like excitation that results in the lack of fundamental frequency. Considering that prosodic phenomena such as intonation are perceived through f0 variation, the perception of whispered prosody is relatively difficult. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is being transmitted, suggesting that intonation "survives" in whispered formant structure. In this paper, we aim to estimate the way in which formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
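    As a conceptual illustration of the two-step method summarised above, the sketch below pairs a small denoising autoencoder (whispered formant frames to phonated-style formant frames) with a per-frame pitch regressor. The layer sizes, the number of formants, and the training details are placeholders and do not reflect the paper's actual configuration.

    # Schematic two-step model; sizes and training setup are placeholders.
    import torch
    import torch.nn as nn

    class FormantDenoiser(nn.Module):
        """Step 1: map whispered formant frames to phonated-equivalent frames."""
        def __init__(self, n_formants=4, hidden=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_formants, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden // 2), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(hidden // 2, hidden), nn.ReLU(),
                                         nn.Linear(hidden, n_formants))

        def forward(self, whispered_formants):            # (batch, n_formants)
            return self.decoder(self.encoder(whispered_formants))

    class PitchRegressor(nn.Module):
        """Step 2: predict an 'implicit' f0 value per frame from formants."""
        def __init__(self, n_formants=4, hidden=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_formants, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

        def forward(self, formants):
            return self.net(formants).squeeze(-1)

    # Training would use a parallel whispered/phonated corpus, e.g. minimising
    # MSE between predicted and measured phonated formants and f0 (omitted here).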

  • 4.
    Pérez Zarazaga, Pablo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Juvela, Lauri
    Department of Information and Communications Engineering, Aalto University, Finland.
    Speaker-independent neural formant synthesis (2023). In: Interspeech 2023, International Speech Communication Association, 2023, p. 5556-5560. Conference paper (Refereed)
    Abstract [en]

    We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spectrograms, which are rendered into waveforms by a pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that the method achieves our goals of accurate control over speech parameters combined with high perceptual audio quality. We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a.k.a. universal vocoding).
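    The control-to-waveform path described above (speech parameters to predicted mel-spectrogram, rendered to audio by a pre-trained neural vocoder) can be sketched schematically as below. The recurrent parameter-to-mel network and the vocoder object are placeholders; the paper's actual architecture and the WaveNet/HiFi-GAN interfaces are not reproduced here.

    # Schematic parameters -> mel -> waveform pipeline with placeholder modules.
    import torch
    import torch.nn as nn

    class ParamsToMel(nn.Module):
        """Map per-frame control parameters (e.g. formants, f0, energy) to mel frames."""
        def __init__(self, n_params=6, n_mels=80, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_params, hidden, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, n_mels)

        def forward(self, params):                   # (batch, frames, n_params)
            hidden_states, _ = self.rnn(params)
            return self.proj(hidden_states)          # (batch, frames, n_mels)

    def synthesise(params, mel_model, vocoder):
        """Assumes `vocoder` maps (batch, n_mels, frames) mel-spectrograms to
        waveforms, e.g. a separately loaded HiFi-GAN generator."""
        with torch.no_grad():
            mel = mel_model(params).transpose(1, 2)  # most vocoders expect (B, n_mels, T)
            return vocoder(mel)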
