Search results: 1 - 50 of 1160
  • 1.
    Pabon, Peter
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Music Acoustics. Royal Conservatoire, The Hague, Netherlands.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Music Acoustics.
    Feature maps of the acoustic spectrum of the voice. 2020. In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588, Vol. 34, no 1, p. 161.e1-161.e26. Article in journal (Refereed)
    Abstract [en]

    The change in the spectrum of sustained /a/ vowels was mapped over the voice range from low to high fundamental frequency and low to high sound pressure level (SPL), in the form of the so-called voice range profile (VRP). In each interval of one semitone and one decibel, narrowband spectra were averaged both within and across subjects. The subjects were groups of 7 male and 12 female singing students, as well as a group of 16 untrained female voices. For each individual and also for each group, pairs of VRP recordings were made, with stringent separation of the modal/chest and falsetto/head registers. Maps are presented of eight scalar metrics, each of which was chosen to quantify a particular feature of the voice spectrum, over fundamental frequency and SPL. Metrics 1 and 2 chart the role of the fundamental in relation to the rest of the spectrum. Metrics 3 and 4 are used to explore the role of resonances in relation to SPL. Metrics 5 and 6 address the distribution of high frequency energy, while metrics 7 and 8 seek to describe the distribution of energy at the low end of the voice spectrum.

    Several examples are observed of phenomena that are difficult to predict from linear source-filter theory, and of the voice source being less uniform over the voice range than is conventionally assumed. These include a high-frequency band-limiting at high SPL and an unexpected persistence of the second harmonic at low SPL. The two voice registers give rise to clearly different maps. Only a few effects of training were observed, in the low frequency end below 2 kHz. The results are of potential interest in voice analysis, voice synthesis and for new insights into the voice production mechanism.
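
    The mapping described above amounts to averaging narrowband spectra in cells of one semitone by one decibel over the fundamental-frequency/SPL plane. A minimal sketch of that bookkeeping is given below; the array names and reference frequency are chosen for illustration only, not taken from the authors' code:

```python
import numpy as np

def vrp_spectrum_map(fo_hz, spl_db, spectra, fo_ref=55.0):
    """Average narrowband spectra in 1 semitone x 1 dB cells over the fo/SPL plane.

    fo_hz, spl_db : per-frame fundamental frequency (Hz) and SPL (dB)
    spectra       : (n_frames, n_bins) magnitude spectra, one row per frame
    Returns a dict mapping (semitone, dB) cells to the mean spectrum of that cell.
    """
    semitones = np.round(12.0 * np.log2(np.asarray(fo_hz) / fo_ref)).astype(int)
    decibels = np.round(np.asarray(spl_db)).astype(int)
    cells = {}
    for st, db, spec in zip(semitones, decibels, np.asarray(spectra)):
        acc, n = cells.get((st, db), (np.zeros(len(spec)), 0))
        cells[(st, db)] = (acc + spec, n + 1)
    return {cell: acc / n for cell, (acc, n) in cells.items()}
```

    Each cell's averaged spectrum can then be reduced to scalar metrics and colour-coded over the voice range profile.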

  • 2.
    Bisesi, Erica
    et al.
    Centre for Systematic Musicology, University of Graz, Graz, Austria.
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Parncutt, Richard
    Centre for Systematic Musicology, University of Graz, Graz, Austria.
    A Computational Model of Immanent Accent Salience in Tonal Music. 2019. In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 10, no 317, p. 1-19. Article in journal (Refereed)
    Abstract [en]

    Accents are local musical events that attract the attention of the listener, and can be either immanent (evident from the score) or performed (added by the performer). Immanent accents involve temporal grouping (phrasing), meter, melody, and harmony; performed accents involve changes in timing, dynamics, articulation, and timbre. In the past, grouping, metrical and melodic accents were investigated in the context of expressive music performance. We present a novel computational model of immanent accent salience in tonal music that automatically predicts the positions and saliences of metrical, melodic and harmonic accents. The model extends previous research by improving on preliminary formulations of metrical and melodic accents and introducing a new model for harmonic accents that combines harmonic dissonance and harmonic surprise. In an analysis-by-synthesis approach, model predictions were compared with data from two experiments, involving 239 and 638 sonorities, respectively, and 16 musicians and 5 experts in music theory. Average pair-wise correlations between raters were lower for metrical (0.27) and melodic accents (0.37) than for harmonic accents (0.49). In both experiments, when combining all the raters into a single measure expressing their consensus, correlations between ratings and model predictions ranged from 0.43 to 0.62. When the different accent categories were combined, correlations were higher than for separate categories (r = 0.66). This suggests that raters might use strategies different from individual metrical, melodic or harmonic accent models to mark the musical events.

  • 3.
    Ternström, Sten
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Pabon, Peter
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Royal Conservatoire, The Hague, NL.
    Accounting for variability over the voice range. 2019. In: Proceedings of the ICA 2019 and EAA Euroregio / [ed] Martin Ochmann, Michael Vorländer, Janina Fels, Aachen, DE: Deutsche Gesellschaft für Akustik (DEGA e.V.), 2019, p. 7775-7780. Conference paper (Refereed)
    Abstract [en]

    Researchers from the natural sciences interested in the performing arts often seek quantitative findings with explanatory power and practical relevance to performers and educators. However, the complexity of singing voice production continues to challenge us. On their own, entities that are readily measurable in the domain of physics are rarely of direct relevance to excellence in the domain of performance, because information on one level of representation (e.g., acoustic) is artistically meaningful mostly when interpreted in a context at a higher level of representation (e.g., emotional or semantic). Also, practically any acoustic or physiologic metric derived from the sound of a voice, or from other signals or images, will exhibit considerable variation both across individuals and across the voice range, from soft to loud or from low to high pitch. Here, we review some recent research based on the sampling paradigm of the voice field, also known as the voice range profile. Despite large inter-subject variation, localizing measurements by fo and SPL in the voice field makes the recorded values highly reproducible within subjects. We demonstrate some technical possibilities, and argue for the importance of making physical measurements that provide a more encompassing and individual-centric view of singing voice production.

  • 4.
    Zhang, Cheng
    et al.
    Microsoft Res, Cambridge, England.
    Oztireli, Cengiz
    Disney Res, Zurich, Switzerland.
    Mandt, Stephan
    Univ Calif Irvine, Los Angeles, CA, USA.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Active Mini-Batch Sampling Using Repulsive Point Processes. 2019. Conference paper (Refereed)
    Abstract [en]

    The convergence speed of stochastic gradient descent (SGD) can be improved by actively selecting mini-batches. We explore sampling schemes where similar data points are less likely to be selected in the same mini-batch. In particular, we prove that such repulsive sampling schemes lower the variance of the gradient estimator. This generalizes recent work on using Determinantal Point Processes (DPPs) for mini-batch diversification (Zhang et al., 2017) to the broader class of repulsive point processes. We first show that the phenomenon of variance reduction by diversified sampling generalizes in particular to non-stationary point processes. We then show that other point processes may be computationally much more efficient than DPPs. In particular, we propose and investigate Poisson Disk sampling, frequently encountered in the computer graphics community, for this task. We show empirically that our approach improves over standard SGD both in terms of convergence speed as well as final model performance.
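
    As a rough illustration of the sampling idea (not the authors' implementation), a Poisson-disk-style mini-batch can be drawn by greedily rejecting candidates that lie within a minimum distance of already-selected points, measured in some feature space:

```python
import numpy as np

def poisson_disk_minibatch(features, batch_size, radius, rng=None):
    """Greedy Poisson-disk-style selection: points closer than `radius` to an
    already-selected point are skipped, so similar data points rarely share a batch."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(features))
    chosen = []
    for i in order:
        if all(np.linalg.norm(features[i] - features[j]) >= radius for j in chosen):
            chosen.append(i)
        if len(chosen) == batch_size:
            break
    return np.array(chosen)  # indices into the training set for this mini-batch
```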

  • 5.
    Sundberg, Johan
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Salomão, G. L.
    Scherer, K. R.
    Analyzing Emotion Expression in Singing via Flow Glottograms, Long-Term-Average Spectra, and Expert Listener Evaluation. 2019. In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588. Article in journal (Refereed)
    Abstract [en]

    Background: Acoustic aspects of emotional expressivity in speech have been analyzed extensively during recent decades. Emotional coloring is an important if not the most important property of sung performance, and is therefore strictly controlled. Hence, emotional expressivity in singing may promote a deeper insight into vocal signaling of emotions. Furthermore, physiological voice source parameters can be assumed to facilitate the understanding of acoustical characteristics. Method: Three highly experienced professional male singers sang scales on the vowel /ae/ or /a/ in 10 emotional colors (Neutral, Sadness, Tender, Calm, Joy, Contempt, Fear, Pride, Love, Arousal, and Anger). Sixteen voice experts classified the scales in a forced-choice listening test, and the result was compared with long-term-average spectrum (LTAS) parameters and with voice source parameters, derived from flow glottograms (FLOGG) that were obtained from inverse filtering the audio signal. Results: On the basis of component analysis, the emotions could be grouped into four “families”: Anger-Contempt, Joy-Love-Pride, Calm-Tender-Neutral and Sad-Fear. Recognition of the intended emotion families by listeners reached accuracy levels far beyond chance level. For the LTAS and FLOGG parameters, vocal loudness had a paramount influence on all of them. Even after partialing out this factor, some significant correlations were found between FLOGG and LTAS parameters. These parameters could be sorted into groups that were associated with the emotion families. Conclusions: (i) Both LTAS and FLOGG parameters varied significantly with the enactment intentions of the singers. (ii) Some aspects of the voice source are reflected in LTAS parameters. (iii) LTAS parameters affect listener judgment of the enacted emotions and the accuracy of the intended emotional coloring.

  • 6.
    Sundberg, Johan
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. University College of Music Education Stockholm.
    Laís Salomão, Gláucia
    Scherer, Klaus R.
    Analyzing Emotion Expression in Singing via Flow Glottograms, Long-Term-Average Spectra, and Expert Listener Evaluation. 2019. In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588. Article in journal (Refereed)
    Abstract [en]

    Acoustic aspects of emotional expressivity in speech have been analyzed extensively during recent decades. Emotional coloring is an important if not the most important property of sung performance, and is therefore strictly controlled. Hence, emotional expressivity in singing may promote a deeper insight into vocal signaling of emotions. Furthermore, physiological voice source parameters can be assumed to facilitate the understanding of acoustical characteristics.

    Method

    Three highly experienced professional male singers sang scales on the vowel /ae/ or /a/ in 10 emotional colors (Neutral, Sadness, Tender, Calm, Joy, Contempt, Fear, Pride, Love, Arousal, and Anger). Sixteen voice experts classified the scales in a forced-choice listening test, and the result was compared with long-term-average spectrum (LTAS) parameters and with voice source parameters, derived from flow glottograms (FLOGG) that were obtained from inverse filtering the audio signal.

    Results

    On the basis of component analysis, the emotions could be grouped into four “families”: Anger-Contempt, Joy-Love-Pride, Calm-Tender-Neutral and Sad-Fear. Recognition of the intended emotion families by listeners reached accuracy levels far beyond chance level. For the LTAS and FLOGG parameters, vocal loudness had a paramount influence on all of them. Even after partialing out this factor, some significant correlations were found between FLOGG and LTAS parameters. These parameters could be sorted into groups that were associated with the emotion families.

    Conclusions

    (i) Both LTAS and FLOGG parameters varied significantly with the enactment intentions of the singers. (ii) Some aspects of the voice source are reflected in LTAS parameters. (iii) LTAS parameters affect listener judgment of the enacted emotions and the accuracy of the intended emotional coloring.
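
    For readers unfamiliar with the measure, a long-term-average spectrum (LTAS) is simply the power spectrum averaged over a whole recording. A minimal sketch with SciPy, illustrative only and not the analysis pipeline used in the study:

```python
import numpy as np
from scipy.signal import welch

def ltas_db(audio, fs, nperseg=4096):
    """Long-term-average spectrum: Welch-averaged power spectral density, in dB."""
    freqs, psd = welch(audio, fs=fs, nperseg=nperseg)
    return freqs, 10.0 * np.log10(psd + 1e-12)
```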

  • 7.
    Kucherenko, Taras
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Hasegawa, Dai
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kaneko, Naoshi
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Analyzing Input and Output Representations for Speech-Driven Gesture Generation. 2019. In: 19th ACM International Conference on Intelligent Virtual Agents, New York, NY, USA: ACM Publications, 2019. Conference paper (Refereed)
    Abstract [en]

    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.

    Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.

    We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
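
    A minimal sketch of the two-step structure described above, a denoising autoencoder for motion plus a speech encoder trained to map into the learned representation, is given below in PyTorch; the layer sizes and feature dimensions are placeholder assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    """Step 1: MotionE/MotionD learn a low-dimensional representation of pose frames."""
    def __init__(self, pose_dim=45, latent_dim=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decode = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, pose_dim))

    def forward(self, noisy_pose):          # trained to reconstruct the clean pose
        return self.decode(self.encode(noisy_pose))

class SpeechEncoder(nn.Module):
    """Step 2: SpeechE maps per-frame speech features (e.g. MFCCs) to motion representations."""
    def __init__(self, speech_dim=26, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))

    def forward(self, speech_feats):
        return self.net(speech_feats)

# At test time the two are chained: speech -> latent motion representation -> pose.
# predicted_pose = motion_ae.decode(speech_encoder(mfcc_frames))
```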

  • 8.
    Sturm, Bob
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Iglesias, Maria
    Joint Research Centre, European Commission.
    Ben-Tal, Oded
    Kingston University.
    Miron, Marius
    Joint Research Centre, European Commission.
    Gómez, Emilia
    Joint Research Centre, European Commission.
    Artificial Intelligence and Music: Open Questions of Copyright Law and Engineering Praxis. 2019. In: MDPI Arts, ISSN 2076-0752, Vol. 8, no 3, article id 115. Article in journal (Refereed)
    Abstract [en]

    The application of artificial intelligence (AI) to music stretches back many decades, and presents numerous unique opportunities for a variety of uses, such as the recommendation of recorded music from massive commercial archives, or the (semi-)automated creation of music. Due to unparalleled access to music data and effective learning algorithms running on high-powered computational hardware, AI is now producing surprising outcomes in a domain fully entrenched in human creativity—not to mention a revenue source around the globe. These developments call for a close inspection of what is occurring, and consideration of how it is changing and can change our relationship with music for better and for worse. This article looks at AI applied to music from two perspectives: copyright law and engineering praxis. It grounds its discussion in the development and use of a specific application of AI in music creation, which raises further and unanticipated questions. Most of the questions collected in this article are open as their answers are not yet clear at this time, but they are nonetheless important to consider as AI technologies develop and are applied more widely to music, not to mention other domains centred on human creativity.

  • 9. Saponaro, Giovanni
    et al.
    Jamone, Lorenzo
    Bernardino, Alexandre
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beyond the Self: Using Grounded Affordances to Interpret and Describe Others’ Actions. 2019. In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920. Article in journal (Refereed)
    Abstract [en]

    We propose a developmental approach that allows a robot to interpret and describe the actions of human agents by reusing previous experience. The robot first learns the association between words and object affordances by manipulating the objects in its environment. It then uses this information to learn a mapping between its own actions and those performed by a human in a shared environment. It finally fuses the information from these two models to interpret and describe human actions in light of its own experience. In our experiments, we show that the model can be used flexibly to do inference on different aspects of the scene. We can predict the effects of an action on the basis of object properties. We can revise the belief that a certain action occurred, given the observed effects of the human action. In an early action recognition fashion, we can anticipate the effects when the action has only been partially observed. By estimating the probability of words given the evidence and feeding them into a pre-defined grammar, we can generate relevant descriptions of the scene. We believe that this is a step towards providing robots with the fundamental skills to engage in social collaboration with humans.

  • 10.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data. 2019. In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA), 2019, p. 4307-4311. Conference paper (Refereed)
    Abstract [en]

    We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so-called found data: data not recorded with the specific purpose of being analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers see when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method is a combination of audio browsing through massively multi-object sound environments and a well-known unsupervised dimensionality reduction algorithm (the self-organizing map, SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.
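
    The dimensionality-reduction step is a standard self-organizing map. A compact from-scratch sketch is shown below, illustrative only and not the prototype software mentioned in the abstract; each sound's feature vector is mapped to its best-matching grid unit for spatial browsing:

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small self-organizing map; returns unit weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    w = rng.normal(size=(rows, cols, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            t = step / n_steps
            lr, sigma = lr0 * (1 - t), sigma0 * (1 - t) + 0.5
            # best-matching unit for this sample
            bmu = np.unravel_index(np.argmin(((w - x) ** 2).sum(-1)), (rows, cols))
            # pull units near the best-matching unit towards the sample
            dist2 = ((coords - np.array(bmu)) ** 2).sum(-1)
            h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
            w += lr * h * (x - w)
            step += 1
    return w
```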

  • 11.
    Székely, Éva
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Casting to corpus: Segmenting and selecting spontaneous dialogue for TTS with a CNN-LSTM speaker-dependent breath detector. 2019. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, p. 6925-6929. Conference paper (Refereed)
    Abstract [en]

    This paper considers utilising breaths to create improved spontaneous-speech corpora for conversational text-to-speech from found audio recordings such as dialogue podcasts. Breaths are of interest since they relate to prosody and speech planning and are independent of language and transcription. Specifically, we propose a semisupervised approach where a fraction of coarsely annotated data is used to train a convolutional and recurrent speaker-specific breath detector operating on spectrograms and zero-crossing rate. The classifier output is used to find target-speaker breath groups (audio segments delineated by breaths) and subsequently select those that constitute clean utterances appropriate for a synthesis corpus. An application to 11 hours of raw podcast audio extracts 1969 utterances (106 minutes), 87% of which are clean and correctly segmented. This outperforms a baseline that performs integrated VAD and speaker attribution without accounting for breaths.
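
    A minimal sketch in the spirit of the detector described above, convolutional layers over per-frame spectrogram-plus-ZCR features followed by a recurrent layer and a per-frame sigmoid; the layer sizes are placeholders, not the paper's configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_breath_detector(n_features=129):
    """Frame-wise breath probability from a sequence of spectrogram frames (+ ZCR)."""
    inputs = layers.Input(shape=(None, n_features))            # (time, spectral bins + 1 ZCR value)
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)          # breath vs. non-breath per frame
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

    Runs of frames classified as breath then delimit candidate breath groups for the synthesis corpus.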

  • 12.
    Rodríguez-Algarra, Francisco
    et al.
    Queen Mary University of London.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Dixon, Simon
    Queen Mary University of London.
    Characterising Confounding Effects in Music Classification Experiments through Interventions. 2019. In: Transactions of the International Society for Music Information Retrieval, p. 52-66. Article in journal (Refereed)
    Abstract [en]

    We address the problem of confounding in the design of music classification experiments, that is, the inability to distinguish the effects of multiple potential influencing variables in the measurements. Confounding affects the validity of conclusions at many levels, and so must be properly accounted for. We propose a procedure for characterising effects of confounding in the results of music classification experiments by creating regulated test conditions through interventions in the experimental pipeline, including a novel resampling strategy. We demonstrate this procedure on the GTZAN genre collection, which is known to give rise to confounding effects.

  • 13.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Lopes, J.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Wennberg, Ulme
    KTH.
    Doğan, Fethiye Irmak
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Crowdsourcing a self-evolving dialog graph. 2019. In: CUI '19: Proceedings of the 1st International Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2019, article id 14. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a crowdsourcing-based approach for collecting dialog data for a social chat dialog system, which gradually builds a dialog graph from actual user responses and crowd-sourced system answers, conditioned by a given persona and other instructions. This approach was tested during the second instalment of the Amazon Alexa Prize 2018 (AP2018), both for the data collection and to feed a simple dialog system which would use the graph to provide answers. As users interacted with the system, a graph which maintained the structure of the dialogs was built, identifying parts where more coverage was needed. In an offline evaluation, we have compared the corpus collected during the competition with other potential corpora for training chatbots, including movie subtitles, online chat forums and conversational data. The results show that the proposed methodology creates data that is more representative of actual user utterances, and leads to more coherent and engaging answers from the agent. An implementation of the proposed method is available as open-source code.
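
    A toy sketch of the kind of data structure implied above, with system utterances as nodes, clustered user responses as edge labels, and missing edges marking where more crowd-sourced coverage is needed; the class and method names are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

class DialogGraph:
    """System utterances are nodes; clusters of user responses label edges to follow-ups."""
    def __init__(self):
        self.prompts = {}                  # node_id -> system utterance
        self.edges = defaultdict(dict)     # node_id -> {user_response_cluster: next_node_id}

    def add_node(self, node_id, prompt):
        self.prompts[node_id] = prompt

    def add_edge(self, node_id, user_response, next_node_id):
        self.edges[node_id][user_response] = next_node_id

    def next_node(self, node_id, user_response):
        """Follow-up node for a user response, or None if that response lacks coverage."""
        return self.edges[node_id].get(user_response)

    def uncovered(self, node_id, observed_responses):
        """User responses at this node that still need a crowd-sourced system answer."""
        return [r for r in observed_responses if r not in self.edges[node_id]]
```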

  • 14.
    Gulz, Torbjörn
    et al.
    Royal College of Music in Stockholm, Sweden.
    Holzapfel, Andre
    KTH, School of Electrical Engineering and Computer Science (EECS), Media Technology and Interaction Design, MID.
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Developing a Method for Identifying Improvisation Strategies in Jazz Duos. 2019. In: Proc. of the 14th International Symposium on CMMR / [ed] M. Aramaki, O. Derrien, R. Kronland-Martinet, S. Ystad, Marseille Cedex, 2019, p. 482-489. Conference paper (Refereed)
    Abstract [en]

    The primary purpose of this paper is to describe a method to investigate the communication process between musicians performing improvisation in jazz. This method was applied in a first case study. The paper contributes to jazz improvisation theory by embracing more of the artistic expressions and choices made in real-life musical situations. In jazz, applied improvisation theory usually consists of scale and harmony studies within quantized rhythmic patterns. The ensembles in the study were duos consisting of the author at the piano and a horn player (trumpet, alto saxophone, clarinet or trombone). Recording sessions with the ensembles were conducted. The recordings were transcribed using software, and the resulting score, together with the audio recording, was used in in-depth interviews to identify the horn players’ underlying musical strategies. The strategies were coded according to previous research.

  • 15.
    Selamtzis, Andreas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Castellana, Antonella
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Carullo, Alessio
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Astolfi, Arianna
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Effect of vowel context in cepstral and entropy analysis of pathological voices. 2019. In: Biomedical Signal Processing and Control, ISSN 1746-8094, E-ISSN 1746-8108, Vol. 47, p. 350-357. Article in journal (Refereed)
    Abstract [en]

    This study investigates the effect of vowel context (excerpted from speech versus sustained) on two voice quality measures: the cepstral peak prominence smoothed (CPPS) and sample entropy (SampEn). Thirty-one dysphonic subjects with different types of organic dysphonia and thirty-one controls read a phonetically balanced text and phonated sustained [a:] vowels in comfortable pitch and loudness. All the [a:] vowels of the read text were excerpted by automatic speech recognition and phonetic (forced) alignment. CPPS and SampEn were calculated for all excerpted vowels of each subject, forming one distribution of CPPS and SampEn values per subject. The sustained vowels were analyzed using a 41 ms window, forming another distribution of CPPS and SampEn values per subject. Two speech-language pathologists performed a perceptual evaluation of the dysphonic subjects’ voice quality from the recorded text. The power of discriminating the dysphonic group from the controls for SampEn and CPPS was assessed for the excerpted and sustained vowels with the Receiver Operating Characteristic (ROC) analysis. The best discrimination in terms of Area Under Curve (AUC) for CPPS occurred using the mean of the excerpted vowel distributions (AUC=0.85) and for SampEn using the 95th percentile of the sustained vowel distributions (AUC=0.84). CPPS and SampEn were found to be negatively correlated, and the largest correlation was found between the corresponding 95th percentiles of their distributions (Pearson, r=−0.83, p < 10^-3). A strong correlation was also found between the 95th percentile of SampEn distributions and the perceptual quality of breathiness (Pearson, r=0.83, p < 10^-3). The results suggest that depending on the acoustic voice quality measure, sustained vowels can be more effective than excerpted vowels for detecting dysphonia. Additionally, when using CPPS or SampEn there is an advantage of using the measures’ distributions rather than their average values.
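
    The discrimination analysis reduces each subject's CPPS or SampEn distribution to a summary statistic (mean or 95th percentile) and computes the ROC AUC between groups. A minimal sketch with scikit-learn, using hypothetical variable names:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def group_auc(dysphonic, controls, summary=np.mean):
    """AUC for separating dysphonic subjects from controls.

    Each argument is a list of per-subject arrays (one CPPS or SampEn distribution
    per subject); `summary` reduces each distribution to a single score, e.g. the
    mean or the 95th percentile. If lower values indicate pathology (as for CPPS),
    negate the scores first so that higher = more pathological.
    """
    scores = [summary(v) for v in dysphonic] + [summary(v) for v in controls]
    labels = [1] * len(dysphonic) + [0] * len(controls)
    return roc_auc_score(labels, scores)

# e.g. the 95th-percentile variant used for SampEn:
# group_auc(dysphonic, controls, summary=lambda v: np.percentile(v, 95))
```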

  • 16.
    Patel, Rita
    et al.
    University of Indiana.
    Ternström, Sten
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Electroglottographic voice maps of untrained vocally healthy adults with gender differences and gradients. 2019. In: Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA): Proc of 11th Int’l Workshop / [ed] Manfredi, C., Firenze, Italy: Firenze University Press, 2019, p. 107-110. Conference paper (Refereed)
    Abstract [en]

    Baseline data from adult speakers are presented for time-domain parameters of EGG waveforms. 26 vocally healthy adults (13 males and 13 females) were recruited for the study. Four dependent variables were computed: mean contact quotient, mean peak rate of change of the contact area, index of contacting, and the audio crest factor. Small regions around the speech range distribution modes on the fo/SPL plane were used to define means and gradients. Males and females differed considerably in the audio crest factor of their speaking voice, and somewhat in their EGG contact quotient when measured at the mode point of the individual speech range profile. In males, contacting tended to increase somewhat with fo and SPL around the mode point, while in females it tended to decrease.

  • 17. Chettri, B.
    et al.
    Stoller, D.
    Morfi, V.
    Martínez Ramírez, M. A.
    Benetos, E.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Ensemble models for spoofing detection in automatic speaker verification. 2019. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, International Speech Communication Association, 2019, p. 1018-1022. Conference paper (Refereed)
    Abstract [en]

    Detecting spoofing attempts of automatic speaker verification (ASV) systems is challenging, especially when using only one modelling approach. For robustness, we use both deep neural networks and traditional machine learning models and combine them as ensemble models through logistic regression. They are trained to detect logical access (LA) and physical access (PA) attacks on the dataset released as part of the ASV Spoofing and Countermeasures Challenge 2019. We propose dataset partitions that ensure different attack types are present during training and validation to improve system robustness. Our ensemble model outperforms all our single models and the baselines from the challenge for both attack types. We investigate why some models on the PA dataset strongly outperform others and find that spoofed recordings in the dataset tend to have longer silences at the end than genuine ones. By removing them, the PA task becomes much more challenging, with the tandem detection cost function (t-DCF) of our best single model rising from 0.1672 to 0.5018 and equal error rate (EER) increasing from 5.98% to 19.8% on the development set.
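
    The fusion step is score-level stacking: each subsystem's scores become features for a logistic regression, and the fused scores are evaluated with the equal error rate. A minimal sketch, illustrative rather than the challenge submission:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def fuse_scores(train_scores, train_labels, test_scores):
    """Score-level fusion: per-system scores (n_trials, n_systems) -> one fused score."""
    fusion = LogisticRegression().fit(train_scores, train_labels)
    return fusion.predict_proba(test_scores)[:, 1]

def equal_error_rate(labels, scores):
    """EER: the operating point where false-acceptance and false-rejection rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0
```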

  • 18.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Pereira, Andre
    Gustafson, Joakim
    Estimating Uncertainty in Task Oriented Dialogue. 2019. In: ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction, 2019. Conference paper (Refereed)
    Abstract [en]

    Situated multimodal systems that instruct humans need to handle user uncertainties, as expressed in behaviour, and plan their actions accordingly. Speakers’ decision to reformulate or repair previous utterances depends greatly on the listeners’ signals of uncertainty. In this paper, we estimate uncertainty in a situated guided task, as conveyed in non-verbal cues expressed by the listener, and predict that the speaker will reformulate their utterance. We use a corpus where people instruct how to assemble furniture, and extract their multimodal features. While uncertainty is in some cases verbally expressed, most instances are expressed non-verbally, which indicates the importance of multimodal approaches. In this work, we present a model for uncertainty estimation. Our findings indicate that uncertainty estimation from non-verbal cues works well, and can exceed human annotator performance when verbal features cannot be perceived.

  • 19.
    Lã, Filipa M.B.
    et al.
    University of Distance-Learning, MADRID, Spain.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Music Acoustics.
    Flow ball-assisted training: immediate effects on vocal fold contacting. 2019. In: Pan-European Voice Conference 2019 / [ed] Jenny Iwarsson, Stine Løvind Thorsen, University of Copenhagen, 2019, p. 50-51. Conference paper (Refereed)
    Abstract [en]

    Background: The flow ball is a device that creates a static backpressure in the vocal tract while providing real-time visual feedback of airflow. A ball height of 0 to 10 cm corresponds to airflows of 0.2 to 0.4 L/s. These high airflows with low transglottal pressure correspond to low flow resistances, similar to the ones obtained when phonating into straws of 3.7 mm diameter and 2.8 cm length. Objectives: To investigate whether there are immediate effects of flow ball-assisted training on vocal fold contact. Methods: Ten singers (five males and five females) performed a messa di voce at different pitches over one octave in three different conditions: before, during and after phonating with a flow ball. For all conditions, both audio and electrolaryngographic (ELG) signals were simultaneously recorded using a Laryngograph microprocessor. The vocal fold contact quotient Qci (the area under the normalized EGG cycle) and dEGGmaxN (the normalized maximum rate of change of vocal fold contact area) were obtained for all EGG cycles, using the FonaDyn system. We also introduce a compound metric Ic, the ‘index of contact’ [Qci × log10(dEGGmaxN)], with the property that it goes to zero at no contact. It combines information from both Qci and dEGGmaxN and thus is comparable across subjects. The intra-subject means of all three metrics were computed and visualized by colour-coding over the fo-SPL plane, in cells of 1 semitone × 1 dB. Results: Overall, the use of flow ball-assisted phonation had a small yet significant effect on overall vocal fold contact across the whole messa di voce exercise. Larger effects were evident locally, i.e., in parts of the voice range. Comparing the pre-post flow-ball conditions, there were differences in Qci and/or dEGGmaxN. These differences were generally larger in male than in female voices. Ic typically decreased after flow ball use, for males but not for females. Conclusion: Flow ball-assisted training seems to modify vocal fold contacting gestures, especially in male singers.
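
    The compound metric follows directly from the stated definition; a tiny worked example under that definition only:

```python
import numpy as np

def index_of_contact(qci, degg_max_n):
    """Ic = Qci * log10(dEGGmaxN); tends to zero as vocal fold contact vanishes."""
    return qci * np.log10(degg_max_n)

# e.g. index_of_contact(0.45, 5.0) ≈ 0.45 * 0.70 ≈ 0.31
```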

  • 20.
    Hallström, Eric
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Mossmyr, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Vegeborn, Victor
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Wedin, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    From Jigs and Reels to Schottisar och Polskor: Generating Scandinavian-like Folk Music with Deep Recurrent Networks. 2019. Conference paper (Refereed)
    Abstract [en]

    The use of recurrent neural networks for modeling and generating music has been shown to be quite effective for compact, textual transcriptions of traditional music from Ireland and the UK. We explore how well these models perform for textual transcriptions of traditional music from Scandinavia. This type of music has characteristics that are similar to and different from that of Irish music, e.g., mode, rhythm, and structure. We investigate the effects of different architectures and training regimens, and evaluate the resulting models using three methods: a comparison of statistics between real and generated transcriptions, an appraisal of generated transcriptions via a semi-structured interview with an expert in Swedish folk music, and an exercise conducted with students of Scandinavian folk music. We find that some of our models can generate new transcriptions sharing characteristics with Scandinavian folk music, but which often lack the simplicity of real transcriptions. One of our models has been implemented online at http://www.folkrnn.org for anyone to try.
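
    The models are recurrent networks over tokens of textual transcriptions. A minimal token-level sketch in Keras is given below, with vocabulary size and layer widths as placeholders rather than the paper's settings:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_transcription_model(vocab_size, embed_dim=64, units=256):
    """Next-token language model over textual music transcriptions (e.g. ABC-like tokens)."""
    model = tf.keras.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.LSTM(units, return_sequences=True),
        layers.Dense(vocab_size, activation="softmax"),   # distribution over the next token
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Training pairs are token sequences shifted by one step; sampling token by token
# from the softmax output generates new transcriptions.
```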

  • 21.
    Mishra, Saumitra
    et al.
    Queen Mary University of London.
    Stoller, Daniel
    Queen Mary University of London.
    Benetos, Emmanouil
    Queen Mary University of London.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Dixon, Simon
    Queen Mary University of London.
    GAN-Based Generation and Automatic Selection of Explanations for Neural Networks. 2019. Conference paper (Refereed)
    Abstract [en]

    One way to interpret trained deep neural networks (DNNs) is by inspecting characteristics that neurons in the model respond to, such as by iteratively optimising the model input (e.g., an image) to maximally activate specific neurons. However, this requires a careful selection of hyper-parameters to generate interpretable examples for each neuron of interest, and current methods rely on a manual, qualitative evaluation of each setting, which is prohibitively slow. We introduce a new metric that uses Fréchet Inception Distance (FID) to encourage similarity between model activations for real and generated data. This provides an efficient way to evaluate a set of generated examples for each setting of hyper-parameters. We also propose a novel GAN-based method for generating explanations that enables an efficient search through the input space and imposes a strong prior favouring realistic outputs. We apply our approach to a classification model trained to predict whether a music audio recording contains singing voice. Our results suggest that this proposed metric successfully selects hyper-parameters leading to interpretable examples, avoiding the need for manual evaluation. Moreover, we see that examples synthesised to maximise or minimise the predicted probability of singing voice presence exhibit vocal or non-vocal characteristics, respectively, suggesting that our approach is able to generate suitable explanations for understanding concepts learned by a neural network.
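
    The proposed metric builds on the standard Fréchet distance between Gaussians fitted to model activations for real and generated examples. A minimal sketch of that core computation (not the authors' full selection procedure):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_acts, gen_acts):
    """Fréchet distance between two activation sets, each of shape (n_samples, n_features)."""
    mu_r, mu_g = real_acts.mean(axis=0), gen_acts.mean(axis=0)
    cov_r = np.cov(real_acts, rowvar=False)
    cov_g = np.cov(gen_acts, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```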

  • 22.
    Székely, Éva
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    How to train your fillers: uh and um in spontaneous speech synthesis. 2019. Conference paper (Refereed)
  • 23. Finkel, Sebastian
    et al.
    Veit, Ralf
    Lotze, Martin
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Vuust, Peter
    Soekadar, Surjo
    Birbaumer, Niels
    Kleber, Boris
    Intermittent theta burst stimulation over right somatosensory larynx cortex enhances vocal pitch regulation in nonsingers. 2019. In: Human Brain Mapping, ISSN 1065-9471, E-ISSN 1097-0193. Article in journal (Refereed)
    Abstract [en]

    While the significance of auditory cortical regions for the development and maintenance of speech motor coordination is well established, the contribution of somatosensory brain areas to learned vocalizations such as singing is less well understood. To address these mechanisms, we applied intermittent theta burst stimulation (iTBS), a facilitatory repetitive transcranial magnetic stimulation (rTMS) protocol, over right somatosensory larynx cortex (S1) and a nonvocal dorsal S1 control area in participants without singing experience. A pitch‐matching singing task was performed before and after iTBS to assess corresponding effects on vocal pitch regulation. When participants could monitor auditory feedback from their own voice during singing (Experiment I), no difference in pitch‐matching performance was found between iTBS sessions. However, when auditory feedback was masked with noise (Experiment II), only larynx‐S1 iTBS enhanced pitch accuracy (50–250 ms after sound onset) and pitch stability (>250 ms after sound onset until the end). Results indicate that somatosensory feedback plays a dominant role in vocal pitch regulation when acoustic feedback is masked. The acoustic changes moreover suggest that right larynx‐S1 stimulation affected the preparation and involuntary regulation of vocal pitch accuracy, and that kinesthetic‐proprioceptive processes play a role in the voluntary control of pitch stability in nonsingers. Together, these data provide evidence for a causal involvement of right larynx‐S1 in vocal pitch regulation during singing.

  • 24.
    Sundberg, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. University College of Music Education Stockholm.
    Intonation in Singing. 2019. In: The Oxford Handbook of Singing / [ed] G Welch, DM Howard, J Nix, Oxford: Oxford University Press, 2019, 1, p. 281-296. Chapter in book (Refereed)
    Abstract [en]

    Intonation is an important aspect of sung performances. This chapter presents some investigations of vocal ensemble and solo singing. The magnitude of the standard deviation of f0 (SDF0) in a choral section of a high-quality choir is discussed, as are intonation effects of common partials, the presence of high-frequency partials, and vibrato-free tone production, which all tend to reduce SDF0, i.e., to facilitate intonation. Other topics include accuracy, expressivity, and the role of equally tempered tuning of accompanying instruments. Reaction time for f0 correction of intonation errors is slow, suggesting that copying the pitch from fellow singers in an ensemble is not a useful strategy. The f0 pattern used in legato performances of sequences of short notes is described, as well as solo singers’ intonation strategies in different musical and emotional contexts. Likewise, professional music listeners’ tolerance to intonation inaccuracy seems to vary depending on context.

  • 25.
    Shore, Todd
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Androulakaki, Theofronia
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    KTH Tangrams: A Dataset for Research on Alignment and Conceptual Pacts in Task-Oriented Dialogue. 2019. In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, Tokyo, 2019, p. 768-775. Conference paper (Refereed)
    Abstract [en]

    There is a growing body of research focused on task-oriented instructor-manipulator dialogue, whereby one dialogue participant initiates a reference to an entity in a common environment while the other participant must resolve this reference in order to manipulate said entity. Many of these works are based on disparate if nevertheless similar datasets. This paper describes an English corpus of referring expressions in relatively free, unrestricted dialogue with physical features generated in a simulation, which facilitate analysis of dialogic linguistic phenomena regarding alignment in the formation of referring expressions known as conceptual pacts.

  • 26.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kucherenko, Taras
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Ekstedt, Erik
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Learning Non-verbal Behavior for a Social Robot from YouTube Videos. 2019. Conference paper (Refereed)
    Abstract [en]

    Non-verbal behavior is crucial for positive perception of humanoid robots. If modelled well, it can improve the interaction and leave the user with a positive experience; on the other hand, if it is modelled poorly, it may impede the interaction and become a source of distraction. Most of the existing work on modeling non-verbal behavior shows limited variability due to the fact that the models employed are deterministic and the generated motion can be perceived as repetitive and predictable. In this paper, we present a novel method for generation of a limited set of facial expressions and head movements, based on a probabilistic generative deep learning architecture called Glow. We have implemented a workflow which takes videos directly from YouTube, extracts relevant features, and trains a model that generates gestures that can be realized in a robot without any post processing. A user study was conducted and illustrated the importance of having any kind of non-verbal behavior, while most differences between the ground truth, the proposed method, and a random control were not significant (however, the differences that were significant were in favor of the proposed method).

  • 27.
    Körner Gustafsson, Joakim
    et al.
    Karolinska Institutet.
    Södersten, Maria
    Ternström, Sten
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Schalling, Ellika
    Long-term effects of Lee Silverman Voice Treatment on daily voice use in Parkinson’s disease as measured with a portable voice accumulator. 2019. In: Logopedics, Phoniatrics, Vocology, ISSN 1401-5439, E-ISSN 1651-2022, Vol. 44, no 3, p. 124-133. Article in journal (Refereed)
    Abstract [en]

    This study examines the effects of an intensive voice treatment focusing on increasing voice intensity, LSVT LOUD® (Lee Silverman Voice Treatment), on voice use in daily life in a participant with Parkinson’s disease, using a portable voice accumulator, the VoxLog. A secondary aim was to compare voice use between the participant and a matched healthy control. Participants were an individual with Parkinson’s disease and his healthy monozygotic twin. Voice use was registered with the VoxLog during 9 weeks for the individual with Parkinson’s disease and 2 weeks for the control. This included baseline registrations for both participants, 4 weeks during LSVT LOUD for the individual with Parkinson’s disease and 1 week after treatment for both participants. For the participant with Parkinson’s disease, follow-up registrations at 3, 6, and 12 months post-treatment were made. The individual with Parkinson’s disease increased voice intensity during registrations in daily life by 4.1 dB post-treatment and by 1.4 dB at 1-year follow-up compared to before treatment. When monitored during laboratory recordings, an increase of 5.6 dB was seen post-treatment and 3.8 dB at 1-year follow-up. Changes in voice intensity were interpreted as a treatment effect as no significant correlations between changes in voice intensity and background noise were found for the individual with Parkinson’s disease. The increase in voice intensity in a laboratory setting was comparable to findings previously reported following LSVT LOUD. The increase registered using ambulatory monitoring in daily life was lower but still reflected a clinically relevant change.

  • 28.
    Clark, Leigh
    et al.
    Univ Coll Dublin, Dublin, Ireland.
    Cowan, Benjamin R.
    Univ Coll Dublin, Dublin, Ireland.
    Edwards, Justin
    Univ Coll Dublin, Dublin, Ireland.
    Munteanu, Cosmin
    Univ Toronto, Mississauga, ON, Canada; Univ Toronto, Toronto, ON, Canada.
    Murad, Christine
    Univ Toronto, Mississauga, ON, Canada; Univ Toronto, Toronto, ON, Canada.
    Aylett, Matthew
    CereProc Ltd, Edinburgh, Midlothian, Scotland.
    Moore, Roger K.
    Univ Sheffield, Sheffield, S Yorkshire, England.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Healey, Patrick
    Queen Mary Univ London, London, England.
    Harte, Naomi
    Trinity Coll Dublin, Dublin, Ireland.
    Torre, Ilaria
    Trinity Coll Dublin, Dublin, Ireland.
    Doyle, Philip
    Voysis Ltd, Dublin, Ireland.
    Mapping Theoretical and Methodological Perspectives for Understanding Speech Interface Interactions. 2019. In: CHI EA '19 Extended Abstracts: Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery (ACM), 2019. Conference paper (Refereed)
    Abstract [en]

    The use of speech as an interaction modality has grown considerably through the integration of Intelligent Personal Assistants (IPAs, e.g. Siri, Google Assistant) into smartphones and voice based devices (e.g. Amazon Echo). However, there remain significant gaps in using theoretical frameworks to understand user behaviours and choices and how they may be applied to specific speech interface interactions. This part-day multidisciplinary workshop aims to critically map out and evaluate theoretical frameworks and methodological approaches across a number of disciplines and establish directions for new paradigms in understanding speech interface user behaviour. In doing so, we will bring together participants from HCI and other speech related domains to establish a cohesive, diverse and collaborative community of researchers from academia and industry with interest in exploring theoretical and methodological issues in the field.

  • 29.
    Elowsson, Anders
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Modeling Music Modality with a Key-Class Invariant Pitch Chroma CNN. 2019. Conference paper (Refereed)
    Abstract [en]

    This paper presents a convolutional neural network (CNN) that uses input from a polyphonic pitch estimation system to predict perceived minor/major modality in music audio. The pitch activation input is structured to allow the first CNN layer to compute two pitch chromas focused on different octaves. The following layers perform harmony analysis across chroma and time scales. Through max pooling across pitch, the CNN becomes invariant with regards to the key class (i.e., key disregarding mode) of the music. A multilayer perceptron combines the modality activation output with spectral features for the final prediction. The study uses a dataset of 203 excerpts rated by around 20 listeners each, a small challenging data size requiring a carefully designed parameter sharing. With an R2 of about 0.71, the system clearly outperforms previous systems as well as individual human listeners. A final ablation study highlights the importance of using pitch activations processed across longer time scales, and using pooling to facilitate invariance with regards to the key class.
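
    A minimal sketch of the key-class invariance idea: a convolution spanning one octave of the pitch axis followed by a max over pitch, so that the prediction does not depend on absolute key. The shapes and layer sizes are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class KeyClassInvariantModality(nn.Module):
    def __init__(self, n_filters=16):
        super().__init__()
        # the kernel spans one octave (12 semitones) along the pitch axis
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(5, 12), padding=(2, 0))
        self.head = nn.Linear(n_filters, 1)

    def forward(self, pitch_activation):        # (batch, 1, time, pitch)
        h = torch.relu(self.conv(pitch_activation))
        h = h.max(dim=3).values                 # max over pitch -> key-class invariant
        h = h.mean(dim=2)                       # pool over time
        return self.head(h)                     # predicted minor/major modality rating
```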

  • 30.
    Stefanov, Kalin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Modeling of Human Visual Attention in Multiparty Open-World Dialogues. 2019. In: ACM Transactions on Human-Robot Interaction, ISSN 2573-9522, Vol. 8, no 2, article id 8. Article in journal (Refereed)
    Abstract [en]

    This study proposes, develops, and evaluates methods for modeling the eye-gaze direction and head orientation of a person in multiparty open-world dialogues, as a function of low-level communicative signals generated by his/her interlocutors. These signals include speech activity, eye-gaze direction, and head orientation, all of which can be estimated in real time during the interaction. By utilizing these signals and novel data representations suitable for the task and context, the developed methods can generate plausible candidate gaze targets in real time. The methods are based on Feedforward Neural Networks and Long Short-Term Memory Networks. The proposed methods are developed using several hours of unrestricted interaction data and their performance is compared with a heuristic baseline method. The study offers an extensive evaluation of the proposed methods that investigates the contribution of different predictors to the accurate generation of candidate gaze targets. The results show that the methods can accurately generate candidate gaze targets when the person being modeled is in a listening state. However, when the person being modeled is in a speaking state, the proposed methods yield significantly lower performance.

  • 31. Malisz, Zofia
    et al.
    Henter, Gustav Eje
    Valentini-Botinhao, Cassia
    Watts, Oliver
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    Modern speech synthesis for phonetic sciences: A discussion and an evaluation. 2019. In: Proceedings of ICPhS, 2019. Conference paper (Refereed)
  • 32. Arnela, Marc
    et al.
    Dabbaghchian, Saeed
    Guasch, Oriol
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    MRI-based vocal tract representations for the three-dimensional finite element synthesis of diphthongs. 2019. In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 27, no 12, p. 2173-2182. Article in journal (Refereed)
    Abstract [en]

    The synthesis of diphthongs in three dimensions (3D) involves the simulation of acoustic waves propagating through a complex 3D vocal tract geometry that deforms over time. Accurate 3D vocal tract geometries can be extracted from Magnetic Resonance Imaging (MRI), but due to long acquisition times, only static sounds can currently be studied with an adequate spatial resolution. In this work, 3D dynamic vocal tract representations are built to generate diphthongs, based on a set of cross-sections extracted from MRI-based vocal tract geometries of static vowel sounds. A diphthong can then be easily generated by interpolating the location, orientation and shape of these cross-sections, thus avoiding the interpolation of full 3D geometries. Two options are explored to extract the cross-sections. The first one is based on an adaptive grid (AG), which extracts the cross-sections perpendicular to the vocal tract midline, whereas the second one resorts to a semi-polar grid (SPG) strategy, which fixes the cross-section orientations. The finite element method (FEM) has been used to solve the mixed wave equation and synthesize the diphthongs [ɑi] and [ɑu] in the dynamic 3D vocal tracts. The outputs from a 1D acoustic model based on the Transfer Matrix Method have also been included for comparison. The results show that the SPG and AG provide very close solutions in 3D, whereas significant differences are observed when using them in 1D. The SPG dynamic vocal tract representation is recommended for 3D simulations because it helps to prevent the collision of adjacent cross-sections.
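    The cross-section interpolation idea can be pictured with a short sketch; the dictionary fields (centre, normal, area) and the plain linear blending are assumptions for illustration only, whereas the paper obtains its cross-sections from the AG or SPG strategies and feeds the resulting geometries to a 3D FEM solver.

```python
import numpy as np

def interpolate_cross_sections(vowel_a, vowel_b, t):
    """Illustrative sketch: each vowel is a list of cross-sections with a
    centre position (3,), an orientation unit vector (3,) and an area (scalar).
    Intermediate tract shapes for a diphthong are obtained by blending these
    per-section parameters; the field names are assumptions."""
    blended = []
    for sa, sb in zip(vowel_a, vowel_b):
        centre = (1 - t) * sa["centre"] + t * sb["centre"]
        normal = (1 - t) * sa["normal"] + t * sb["normal"]
        normal /= np.linalg.norm(normal)          # renormalise the orientation
        area = (1 - t) * sa["area"] + t * sb["area"]
        blended.append({"centre": centre, "normal": normal, "area": area})
    return blended

# Two toy cross-sections (one per "vowel"), blended halfway through the diphthong:
a = [{"centre": np.zeros(3), "normal": np.array([1., 0., 0.]), "area": 2.0}]
i = [{"centre": np.ones(3),  "normal": np.array([0., 1., 0.]), "area": 0.5}]
mid = interpolate_cross_sections(a, i, t=0.5)
```

    A real implementation would likely use a proper rotation interpolation for the section orientations rather than the renormalised linear blend used here.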

  • 33.
    Skantze, Gabriel
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal Conversational Interaction with Robots2019In: The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions / [ed] Sharon Oviatt, Björn Schuller, Philip R. Cohen, Daniel Sonntag, Gerasimos Potamianos, Antonio Krüger, ACM Press, 2019Chapter in book (Refereed)
  • 34.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal Language Grounding for Human-Robot Collaboration: YRRSDS 2019 - Dimosthenis Kontogiorgos2019In: Young Researchers Roundtable on Spoken Dialogue Systems, 2019Conference paper (Refereed)
  • 35.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Normalized time-domain parameters for electroglottographic waveforms2019In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 146, no 1, p. EL65-EL70, article id 1.5117174Article in journal (Refereed)
    Abstract [en]

    The electroglottographic waveform is of interest for characterizing phonation non-invasively. Existing parameterizations tend to give disparate results because they rely on somewhat arbitrary thresholds and/or contacting events. It is shown that neither is needed for formulating a normalized contact quotient and a normalized peak derivative. A heuristic combination of the two also resolves the ambiguity of a moderate contact quotient with regard to whether vocal fold contacting is firm versus weak or absent. As preliminaries, schemes for electroglottography signal preconditioning and time-domain period detection are described that improve somewhat on similar methods. The algorithms are simple and compute quickly.
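    A sketch of what threshold-free, per-cycle parameters might look like is given below; the definitions used here (contact quotient as the mean of the amplitude-normalised waveform over one period, and a peak derivative normalised by amplitude and normalised time) are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def normalized_egg_parameters(cycle):
    """Threshold-free per-cycle EGG measures under assumed definitions
    (the paper's exact formulations may differ)."""
    x = cycle - cycle.min()
    amp = x.max()
    if amp == 0.0:
        return 0.0, 0.0
    xn = x / amp                                    # amplitude-normalised waveform, 0..1
    cq = float(xn.mean())                           # area-based contact quotient
    peak_d = float(np.max(np.diff(xn)) * len(xn))   # peak derivative w.r.t. normalised time
    return cq, peak_d

# One synthetic cycle: a half-wave-rectified sinusoid as a crude contact pulse
fs, f0 = 44100, 110
t = np.arange(int(fs / f0)) / fs
cycle = np.maximum(0.0, np.sin(2 * np.pi * f0 * t))
cq, npd = normalized_egg_parameters(cycle)
```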

  • 36.
    Székely, Éva
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Off the cuff: Exploring extemporaneous speech delivery with TTS2019Conference paper (Refereed)
  • 37.
    Székely, Éva
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Off the cuff: Exploring extemporaneous speech delivery with TTS2019Conference paper (Refereed)
  • 38.
    Kucherenko, Taras
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Hasegawa, Dai
    Hokkai Gakuen University, Sapporo, Japan.
    Naoshi, Kaneko
    Aoyama Gakuin University, Sagamihara, Japan.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract2019Conference paper (Refereed)
    Abstract [en]

    This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features as input and produces gestures in the form of sequences of 3D joint coordinates representing motion as output. The results of objective and subjective evaluations confirm the benefits of the representation learning.
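    The representation-learning pipeline described above can be pictured with a minimal PyTorch sketch; the network sizes, the 45-joint skeleton and the 26-dimensional speech feature vector are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the representation-learning idea (not the authors' code):
# (1) an autoencoder learns a compact motion representation from 3D joint
# coordinates; (2) a speech encoder is trained to predict that representation;
# (3) gestures are generated by decoding the predicted representation.
class MotionAutoencoder(nn.Module):
    def __init__(self, n_joints=45, dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_joints * 3, 64), nn.ReLU(), nn.Linear(64, dim))
        self.dec = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_joints * 3))

    def forward(self, pose):
        return self.dec(self.enc(pose))

class SpeechToMotion(nn.Module):
    def __init__(self, n_speech=26, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_speech, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, speech):
        return self.net(speech)

ae, s2m = MotionAutoencoder(), SpeechToMotion()
speech_frame = torch.rand(1, 26)            # e.g. MFCC-like features (assumed)
generated_pose = ae.dec(s2m(speech_frame))  # (1, 135) -> 45 joints x 3 coordinates
```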

  • 39.
    Dubois, Juliette
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. ENSTA ParisTech.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Predicting Perceived Dissonance of Piano Chords Using a Chord-Class Invariant CNN and Deep Layered Learning2019In: Proceedings of 16th Sound & Music Computing Conference (SMC), Malaga, Spain, 2019, p. 530-536Conference paper (Refereed)
    Abstract [en]

    This paper presents a convolutional neural network (CNN) able to predict the perceived dissonance of piano chords. Ratings of dissonance for short audio excerpts were combined from two different datasets and groups of listeners. The CNN uses two branches in a directed acyclic graph (DAG). The first branch receives input from a pitch estimation algorithm, restructured into a pitch chroma. The second branch analyses interactions between close partials, known to affect our perception of dissonance and roughness. The analysis is pitch invariant in both branches, facilitated by convolution across log-frequency and octave-wide max-pooling. Ensemble learning was used to improve the accuracy of the predictions. The coefficient of determination (R2) between ratings and predictions is close to 0.7 in a cross-validation test of the combined dataset. The system significantly outperforms recent computational models. An ablation study tested the impact of the pitch chroma and partial analysis branches separately, concluding that the deep layered learning approach with a pitch chroma was driving the high performance.
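    To make the two-branch design easier to picture, here is a minimal PyTorch sketch; the layer sizes, the 120-bin log-frequency input and the single octave-wide pooling step are illustrative assumptions rather than the published model.

```python
import torch
import torch.nn as nn

class DissonanceCNN(nn.Module):
    """Illustrative two-branch sketch (not the published architecture): one
    branch processes a pitch chroma, the other convolves across log-frequency
    to capture interactions between close partials; octave-wide max pooling
    gives pitch invariance before the branches are merged."""
    def __init__(self, n_chroma=12, n_logfreq=120, bins_per_octave=12):
        super().__init__()
        self.chroma_branch = nn.Sequential(nn.Linear(n_chroma, 16), nn.ReLU())
        self.partial_branch = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(kernel_size=bins_per_octave),   # octave-wide pooling
        )
        n_pooled = 8 * (n_logfreq // bins_per_octave)
        self.head = nn.Sequential(nn.Linear(16 + n_pooled, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, chroma, log_spectrum):
        a = self.chroma_branch(chroma)                    # (batch, 16)
        b = self.partial_branch(log_spectrum).flatten(1)  # (batch, 8 * octaves)
        return self.head(torch.cat([a, b], dim=1)).squeeze(1)

model = DissonanceCNN()
rating = model(torch.rand(4, 12), torch.rand(4, 1, 120))  # predicted dissonance
```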

  • 40.
    Friberg, Anders
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Bisesi, Erica
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Inst Pasteur, France.
    Addessi, Anna Rita
    Univ Bologna, Dept Educ Studies, Bologna, Italy..
    Baroni, Mario
    Univ Bologna, Dept Arts, Bologna, Italy..
    Probing the Underlying Principles of Perceived Immanent Accents Using a Modeling Approach2019In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 10, article id 1024Article in journal (Refereed)
    Abstract [en]

    This article deals with the question of how the perception of the "immanent accents" can be predicted and modeled. By immanent accent we mean any musical event in the score that is related to important points in the musical structure (e.g., tactus positions, melodic peaks) and is therefore able to capture the attention of a listener. Our aim was to investigate the underlying principles of these accented notes by combining quantitative modeling, music analysis and experimental methods. A listening experiment was conducted where 30 participants indicated perceived accented notes for 60 melodies, vocal and instrumental, selected from Baroque, Romantic and Posttonal styles. This produced a large and unique collection of perceptual data about the perceived immanent accents, organized by styles consisting of vocal and instrumental melodies within Western art music. The music analysis of the indicated accents provided a preliminary list of musical features that could be identified as possible reasons for the raters' perception of the immanent accents. These features related to the score in different ways, e.g., repeated fragments, single notes, or overall structure. A modeling approach was used to quantify the influence of feature groups related to pitch contour, tempo, timing, simple phrasing, and meter. A set of 43 computational features was defined from the music analysis and previous studies and extracted from the score representation. The mean ratings of the participants were predicted using multiple linear regression and support vector regression. The latter method (using cross-validation) obtained the best result of about 66% explained variance (r = 0.81) across all melodies and for a selected group of raters. The independent contribution of each feature group was relatively high for pitch contour and timing (9.6 and 7.0%). There were also significant contributions from tempo (4.5%), simple phrasing (4.4%), and meter (3.9%). Interestingly, the independent contribution varied greatly across participants, implying different listener strategies, and also some variability across different styles. The large differences among listeners emphasize the importance of considering the individual listener's perception in future research in music perception.
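    The modelling step, predicting mean accent ratings from score-based features with multiple linear regression and support vector regression under cross-validation, can be sketched with scikit-learn as below; the data are random stand-ins, and only the 43-feature dimensionality is taken from the abstract.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

# Rows are notes, columns are score-based features (pitch contour, timing,
# tempo, phrasing, meter, ...); y is the mean accent rating per note.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 43))                         # 43 computational features
y = X[:, 0] * 0.5 + rng.normal(scale=0.5, size=500)    # synthetic stand-in ratings

for model in (LinearRegression(), SVR(kernel="rbf", C=1.0)):
    pred = cross_val_predict(model, X, y, cv=10)       # cross-validated predictions
    print(type(model).__name__, "explained variance (R^2):", round(r2_score(y, pred), 3))
```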

  • 41.
    Stefanov, Kalin
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology. University of Southern California.
    Salvi, Giampiero (Contributor)
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition2019In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920Article in journal (Refereed)
    Abstract [en]

    This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, and thus remains consistent with a cognitive-developmental setting; instead, it uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting. However, in a speaker-independent setting the proposed method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
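    The self-supervision idea, using the acoustic modality to label the visual one, can be sketched as a single training step; the network, image size and voice-activity labels below are placeholders, not the authors' pipeline.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the self-supervision idea (not the authors' pipeline):
# an acoustic voice-activity signal, attributed to a person, provides the
# training targets for a purely visual speaking/not-speaking classifier, so
# no manual annotation is needed.
visual_net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 1))
optimiser = torch.optim.Adam(visual_net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

face_crops = torch.rand(32, 1, 64, 64)                    # grey-scale face crops (assumed format)
audio_vad = torch.randint(0, 2, (32, 1)).float()          # acoustic speech activity as pseudo-labels

logits = visual_net(face_crops)
loss = loss_fn(logits, audio_vad)
loss.backward()
optimiser.step()                                          # the visual model learns from the audio modality
```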

  • 42.
    Kalpakchi, Dmytro
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Boye, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    SpaceRefNet: a neural approach to spatial reference resolution in a real city environment2019In: Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Association for Computational Linguistics , 2019, p. 422-431Conference paper (Refereed)
    Abstract [en]

    Adding interactive capabilities to pedestrian wayfinding systems in the form of spoken dialogue will make them more natural to humans. Such an interactive wayfinding system needs to continuously understand and interpret the pedestrian's utterances referring to the spatial context. Achieving this requires the system to identify exophoric referring expressions in the utterances, and link these expressions to the geographic entities in the vicinity. This exophoric spatial reference resolution problem is difficult, as there are often several dozens of candidate referents. We present a neural network-based approach for identifying the pedestrian's references (using a network called RefNet) and resolving them to the appropriate geographic objects (using a network called SpaceRefNet). Both methods show promising results, beating the respective baselines and earlier results reported in the literature.
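    As a rough picture of the resolution step, the sketch below scores each candidate geographic object against an utterance embedding; the class name, feature dimensions and scoring network are hypothetical and are not the RefNet/SpaceRefNet architectures described in the paper.

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Hypothetical sketch of reference resolution: an utterance embedding is
    scored against each candidate geographic object's features, and the
    highest-scoring candidate is taken as the referent."""
    def __init__(self, d_utt=64, d_obj=16):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d_utt + d_obj, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, utt, objects):
        # utt: (d_utt,); objects: (n_candidates, d_obj), e.g. distance, bearing, type
        utt_rep = utt.unsqueeze(0).expand(objects.size(0), -1)
        return self.score(torch.cat([utt_rep, objects], dim=1)).squeeze(1)

scorer = CandidateScorer()
scores = scorer(torch.rand(64), torch.rand(30, 16))   # several dozen candidates
referent = scores.argmax().item()                     # index of the chosen object
```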

  • 43.
    Székely, Éva
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Spontaneous conversational speech synthesis from found data2019Conference paper (Refereed)
  • 44.
    Székely, Éva
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Spontaneous conversational speech synthesis from found data2019Conference paper (Refereed)
  • 45.
    Sundberg, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    The Acoustics of Different Genres of Singing2019In: The Oxford Handbook of Singing / [ed] G Welch, DM Howard, J Nix, Oxford: Oxford University Press, 2019, 1, p. 167-188Chapter in book (Refereed)
    Abstract [en]

    This chapter reviews several studies of singing in different styles. The studies reveal that the differences concern all the main dimensions of phonation: pitch, loudness, phonation type, and formant frequencies. Most vocal styles differ substantially from normal speech, though in quite different ways. A difficulty in describing the characteristics of the styles of singing, which are typical of different musical genres, is that the same term does not always mean the same to all experts. Some diverging results in voice research on styles of singing emerge from such terminological issues. The author suggests that descriptions of different styles of singing should be related to objective findings on the overall phonatory and articulatory potentials of the voice.

  • 46.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Andersson, Olle
    KTH.
    Koivisto, Marco
    KTH.
    Gonzalez Rabal, Elena
    KTH.
    Vartiainen, Ville
    KTH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The effects of anthropomorphism and non-verbal social behaviour in virtual assistants2019In: IVA 2019 - Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, Association for Computing Machinery (ACM), 2019, p. 133-140Conference paper (Refereed)
    Abstract [en]

    The adoption of virtual assistants is growing at a rapid pace. However, these assistants are not optimised to simulate key social aspects of human conversational environments. Humans are intellectually biased toward social activity when facing anthropomorphic agents or when presented with subtle social cues. In this paper, we test whether humans respond the same way to assistants in guided tasks, when in different forms of embodiment and social behaviour. In a within-subject study (N=30), we asked subjects to engage in dialogue with a smart speaker and a social robot. We observed shifting of interactive behaviour, as shown in behavioural and subjective measures. Our findings indicate that it is not always favourable for agents to be anthropomorphised or to communicate with nonverbal cues. We found a trade-off between task performance and perceived sociability when controlling for anthropomorphism and social behaviour.

  • 47.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The Effects of Embodiment and Social Eye-Gaze in Conversational Agents2019In: Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci), 2019Conference paper (Refereed)
    Abstract [en]

    The adoption of conversational agents is growing at a rapid pace. Agents, however, are not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this work, we explore the effects of simulating anthropomorphism and social eye-gaze in three conversational agents. We tested whether subjects' visual attention towards the agents would be similar across different forms of embodiment and social eye-gaze. In a within-subject situated interaction study (N=30), we asked subjects to engage in task-oriented dialogue with a smart speaker and two variations of a social robot. We observed shifting of interactive behaviour by human users, as shown in differences in behavioural and objective measures. With a trade-off in task performance, social facilitation is higher with more anthropomorphic social agents when performing the same task.

  • 48. Betz, Simon
    et al.
    Zarrieß, Sina
    Székely, Éva
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Wagner, Petra
    The greennn tree - lengthening position influences uncertainty perception2019In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, The International Speech Communication Association (ISCA), 2019, p. 3990-3994Conference paper (Refereed)
    Abstract [en]

    Synthetic speech can be used to express uncertainty in dialogue systems by means of hesitation. If a phrase like “Next to the green tree” is uttered in a hesitant way, that is, containing lengthening, silences, and fillers, the listener can infer that the speaker is not certain about the concepts referred to. However, we do not know anything about the referential domain of the uncertainty; if only a particular word in this sentence would be uttered hesitantly, e.g. “the greee:n tree”, the listener could infer that the uncertainty refers to the color in the statement, but not to the object. In this study, we show that the domain of the uncertainty is controllable. We conducted an experiment in which color words in sentences like “search for the green tree” were lengthened in two different positions: word onsets or final consonants, and participants were asked to rate the uncertainty regarding color and object. The results show that initial lengthening is predominantly associated with uncertainty about the word itself, whereas final lengthening is primarily associated with the following object. These findings enable dialogue system developers to finely control the attitudinal display of uncertainty, adding nuances beyond the lexical content to message delivery.

  • 49.
    Sundberg, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. University College of Music Education Stockholm.
    The Singing Voice2019In: The Oxford Handbook of Voice Perception / [ed] S Früholz and P Belin, Oxford: Oxford University Press, 2019, 1, p. 117-142Chapter in book (Refereed)
    Abstract [en]

    The sound quality of singing is determined by three basic factors: the air pressure under the vocal folds (or the subglottal pressure), the mechanical properties of the vocal folds, and the resonance properties of the vocal tract. Subglottal pressure is controlled by the respiratory apparatus. It regulates vocal loudness and is varied with pitch in singing. Together with the mechanical properties of the folds, which are controlled by laryngeal muscles, it has a decisive influence on vocal fold vibrations, which convert the tracheal airstream to a pulsating airflow, the voice source. The voice source determines pitch, vibrato, and register, and also the overall slope of the spectrum. The sound of the voice source is filtered by the resonances of the vocal tract, or the formants, of which the two lowest determine the vowel quality and the higher ones the personal voice quality. Timing is crucial for creating emotional expressivity; it uses an acoustic code that shows striking similarities to that used in speech. The perceived loudness of a vowel sound seems more closely related to the subglottal pressure with which it was produced than with the acoustical sound level. Some investigations of acoustical correlates of tone placement and variation of larynx height are described, as are properties that affect the perceived naturalness of synthesized singing. Finally, subglottal pressure, voice source, and formant-frequency characteristics of some non-classical styles of singing are discussed.

  • 50.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The Trade-off between Interaction Time and Social Facilitation with Collaborative Social Robots2019In: The Challenges of Working on Social Robots that Collaborate with People, 2019Conference paper (Refereed)
    Abstract [en]

    The adoption of social robots and conversational agents is growing at a rapid pace. These agents, however, are still not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this paper, we discuss the effects of simulating anthropomorphism and non-verbal social behaviour in social robots and its implications for human-robot collaborative guided tasks. Our results indicate that it is not always favourable for agents to be anthropomorphised or to communicate with nonverbal behaviour. We found a clear trade-off between interaction time and social facilitation when controlling for anthropomorphism and social behaviour.
