  • 1. Ambrazaitis, G.
    et al.
    House, David
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal prominences: Exploring the patterning and usage of focal pitch accents, head beats and eyebrow beats in Swedish television news readings. 2017. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 95, p. 100-113. Article in journal (Refereed)
    Abstract [en]

    Facial beat gestures align with pitch accents in speech, functioning as visual prominence markers. However, it is not yet well understood whether and how gestures and pitch accents might be combined to create different types of multimodal prominence, and how specifically visual prominence cues are used in spoken communication. In this study, we explore the use and possible interaction of eyebrow (EB) and head (HB) beats with so-called focal pitch accents (FA) in a corpus of 31 brief news readings from Swedish television (four news anchors, 986 words in total), focusing on effects of position in text, information structure as well as speaker expressivity. Results reveal an inventory of four primary (combinations of) prominence markers in the corpus: FA+HB+EB, FA+HB, FA only (i.e., no gesture), and HB only, implying that eyebrow beats tend to occur only in combination with the other two markers. In addition, head beats occur significantly more frequently in the second than in the first part of a news reading. A functional analysis of the data suggests that the distribution of head beats might to some degree be governed by information structure, as the text-initial clause often defines a common ground or presents the theme of the news story. In the rheme part of the news story, FA, HB, and FA+HB are all common prominence markers. The choice between them is subject to variation which we suggest might represent a degree of freedom for the speaker to use the markers expressively. A second main observation concerns eyebrow beats, which seem to be used mainly as a kind of intensification marker for highlighting not only contrast, but also value, magnitude, or emotionally loaded words; it is applicable in any position in a text. We thus observe largely different patterns of occurrence and usage of head beats on the one hand and eyebrow beats on the other, suggesting that the two represent two separate modalities of visual prominence cuing.

  • 2. Borin, L.
    et al.
    Forsberg, M.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Domeij, R.
    Språkbanken 2018: Research resources for text, speech, & society. 2018. In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 504-506. Conference paper (Refereed)
    Abstract [en]

    We introduce an expanded version of the Swedish research resource Språkbanken (the Swedish Language Bank). In 2018, Språkbanken, which has supported national and international research for over four decades, adds two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text. 

  • 3.
    Bresin, Roberto
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Friberg, Anders
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Dahl, Sofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Toward a new model for sound control. 2001. In: Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, December 6-8, 2001 / [ed] Fernström, M., Brazil, E., & Marshall, M., 2001, p. 45-49. Conference paper (Refereed)
    Abstract [en]

    The control of sound synthesis is a well-known problem. This is particularly true if the sounds are generated with physical modeling techniques that typically need specification of numerous control parameters. In the present work outcomes from studies on automatic music performance are used for tackling this problem. 

  • 4.
    Chettri, Bhusan
    et al.
    Queen Mary Univ London, Sch EECS, London, England.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Queen Mary Univ London, Sch EECS, London, England.
    Benetos, Emmanouil
    Queen Mary Univ London, Sch EECS, London, England.
    Analysing replay spoofing countermeasure performance under varied conditions. 2018. In: 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP) / [ed] Pustelnik, N., Ma, Z., Tan, Z. H., Larsen, J., IEEE, 2018. Conference paper (Refereed)
    Abstract [en]

    In this paper, we aim to understand what makes replay spoofing detection difficult in the context of the ASVspoof 2017 corpus. We use FFT spectra, mel frequency cepstral coefficients (MFCC) and inverted MFCC (IMFCC) frontends and investigate different back-ends based on Convolutional Neural Networks (CNNs), Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs). On this database, we find that IMFCC frontend based systems show smaller equal error rate (EER) for high quality replay attacks but higher EER for low quality replay attacks in comparison to the baseline. However, we find that it is not straightforward to understand the influence of an acoustic environment (AE), a playback device (PD) and a recording device (RD) of a replay spoofing attack. One reason is the unavailability of metadata for genuine recordings. Second, it is difficult to account for the effects of the factors: AE, PD and RD, and their interactions. Finally, our frame-level analysis shows that the presence of cues (recording artefacts) in the first few frames of genuine signals (missing from replayed ones) influence class prediction.
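
    For readers unfamiliar with this type of countermeasure, the following is a minimal, illustrative sketch (not the authors' system) of a two-class GMM back-end over MFCC features with EER scoring; the feature settings, model sizes and file handling are assumptions.

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture
    from sklearn.metrics import roc_curve

    def mfcc_features(path, n_mfcc=20):
        # Frame-level MFCCs (frames x coefficients) for one utterance.
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    def train_gmm(feature_matrices, n_components=64):
        # Fit one GMM per class on stacked frame-level features.
        return GaussianMixture(n_components=n_components,
                               covariance_type='diag').fit(np.vstack(feature_matrices))

    def llr_score(gmm_genuine, gmm_spoof, feats):
        # Average log-likelihood ratio; positive values favour "genuine".
        return gmm_genuine.score(feats) - gmm_spoof.score(feats)

    def equal_error_rate(labels, scores):
        # EER: the operating point where false-accept and false-reject rates meet.
        fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = genuine, 0 = spoof
        fnr = 1.0 - tpr
        i = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[i] + fnr[i]) / 2.0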

  • 5.
    Dabbaghchian, Saeed
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Computational Modeling of the Vocal Tract: Applications to Speech Production. 2018. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Human speech production is a complex process, involving neuromuscular control signals, the effects of articulators' biomechanical properties and acoustic wave propagation in a vocal tract tube of intricate shape. Modeling these phenomena may play an important role in advancing our understanding of the involved mechanisms, and may also have future medical applications, e.g., guiding doctors in diagnosis, treatment planning, and surgery prediction for related disorders such as oral cancer, cleft palate, obstructive sleep apnea, and dysphagia.

    A more complete understanding requires models that are as truthful representations as possible of the phenomena. Due to the complexity of such modeling, simplifications have nevertheless been used extensively in speech production research: phonetic descriptors (such as the position and degree of the most constricted part of the vocal tract) are used as control signals, the articulators are represented as two-dimensional geometrical models, the vocal tract is considered as a smooth tube and plane wave propagation is assumed, etc.

    This thesis aims at firstly investigating the consequences of such simplifications, and secondly at contributing to establishing unified modeling of the speech production process, by connecting three-dimensional biomechanical modeling of the upper airway with three-dimensional acoustic simulations. The investigation on simplifying assumptions demonstrated the influence of vocal tract geometry features — such as shape representation, bending and lip shape — on its acoustic characteristics, and that the type of modeling — geometrical or biomechanical — affects the spatial trajectories of the articulators, as well as the transition of formant frequencies in the spectrogram.

    The unification of biomechanical and acoustic modeling in three dimensions makes it possible to realistically control the acoustic output of dynamic sounds, such as vowel-vowel utterances, by contraction of relevant muscles. This moves and shapes the speech articulators that in turn define the vocal tract tube in which the wave propagation occurs. The main contribution of the thesis in this line of work is a novel and complex method that automatically reconstructs the shape of the vocal tract from the biomechanical model. This step is essential to link biomechanical and acoustic simulations, since the vocal tract, which anatomically is a cavity enclosed by different structures, is only implicitly defined in a biomechanical model constituted of several distinct articulators.

  • 6.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Arnela, Marc
    GTM Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, Barcelona, Spain.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Guasch, Oriol
    GTM Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, Barcelona, Spain.
    Reconstruction of vocal tract geometries from biomechanical simulations. 2018. In: International Journal for Numerical Methods in Biomedical Engineering, ISSN 2040-7939, E-ISSN 2040-7947. Article in journal (Refereed)
    Abstract [en]

    Medical imaging techniques are usually utilized to acquire the vocal tract geometry in 3D, which may then be used, e.g., for acoustic/fluid simulation. As an alternative, such a geometry may also be acquired from a biomechanical simulation, which makes it possible to alter the anatomy and/or articulation to study a variety of configurations. In a biomechanical model, each physical structure is described by its geometry and its properties (such as mass, stiffness, and muscles). In such a model, the vocal tract itself does not have an explicit representation, since it is a cavity rather than a physical structure. Instead, its geometry is defined implicitly by all the structures surrounding the cavity, and such an implicit representation may not be suitable for visualization or for acoustic/fluid simulation. In this work, we propose a method to reconstruct the vocal tract geometry at each time step during the biomechanical simulation. Complexity of the problem, which arises from model alignment artifacts, is addressed by the proposed method. In addition to the main cavity, other small cavities, including the piriform fossa, the sublingual cavity, and the interdental space, can be reconstructed. These cavities may appear or disappear depending on the position of the larynx, the mandible, and the tongue. To illustrate our method, various static and temporal geometries of the vocal tract are reconstructed and visualized. As a proof of concept, the reconstructed geometries of three cardinal vowels are further used in an acoustic simulation, and the corresponding transfer functions are derived.

  • 7.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Arnela, Marc
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Guasch, Oriol
    Synthesis of vowels and vowel-vowel utterances using a 3D biomechanical-acoustic model. 2018. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, ISSN 2329-9290. Article in journal (Refereed)
    Abstract [en]

    A link is established between a 3D biomechanical and acoustic model allowing for the numerical synthesis of vowel sounds by contraction of the relevant muscles. That is, the contraction of muscles in the biomechanical model displaces and deforms the articulators, which in turn deform the vocal tract shape. The mixed wave equation for the acoustic pressure and particle velocity is formulated in an arbitrary Lagrangian-Eulerian framework to account for moving boundaries. The equations are solved numerically using the finite element method. Since the activation of muscles is not fully known for a given vowel sound, an inverse method is employed to calculate a plausible activation pattern. For vowel-vowel utterances, two different approaches are utilized: linear interpolation in either muscle activation or geometrical space. Although the former is the natural choice for biomechanical modeling, the latter is used to investigate the contribution of biomechanical modeling to speech acoustics. Six vowels [ɑ, ə, ɛ, e, i, ɯ] and three vowel-vowel utterances [ɑi, ɑɯ, ɯi] are synthesized using the 3D model. Results, including articulation, formants, and spectrograms of vowel-vowel sounds, are in agreement with previous studies. Comparing the spectrograms of interpolation in muscle and geometrical space reveals differences at all frequencies, with the most extended difference in the second formant transition.
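
    As an aside, the two interpolation strategies mentioned for vowel-vowel utterances can be sketched as follows (a simplification under assumed inputs, not the authors' code): act_a/act_b are muscle activation vectors and geo_a/geo_b are vocal tract geometries given as vertex arrays.

    import numpy as np

    def interpolate_linear(start, end, n_steps):
        # Linear interpolation between two state arrays over n_steps frames.
        alphas = np.linspace(0.0, 1.0, n_steps)[:, None]
        frames = (1.0 - alphas) * start.ravel() + alphas * end.ravel()
        return frames.reshape((n_steps,) + start.shape)

    # Strategy 1: interpolate in muscle-activation space, then run the
    # biomechanical model at every step to obtain the tract geometry.
    #   activations = interpolate_linear(act_a, act_b, n_steps=50)
    # Strategy 2: interpolate the tract geometry directly, bypassing biomechanics.
    #   geometries = interpolate_linear(geo_a, geo_b, n_steps=50)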

  • 8.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Nilsson, Isak
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    From Tongue Movement Data to Muscle Activation – A Preliminary Study of Artisynth's Inverse Modelling. 2014. Conference paper (Other academic)
    Abstract [en]

    Finding the muscle activations during speech production is an important part of developing a comprehensive biomechanical model of speech production. Although there are some direct ways, like electromyography, of measuring muscle activations, these methods are usually highly invasive and sometimes not reliable. They are moreover impossible to use for all muscles. In this study we therefore explore an indirect way to estimate tongue muscle activations during speech production by combining Electromagnetic Articulography (EMA) measurements of tongue movements and the inverse modeling in Artisynth. With EMA we measure the time-changing 3D positions of four sensors attached to the tongue surface for a Swedish female subject producing vowel-vowel and vowel-consonant-vowel (VCV) sequences. The measured sensor positions are used as target points for corresponding virtual sensors introduced in the tongue model of Artisynth’s inverse modelling framework, which computes one possible combination of muscle activations that results in the observed sequence of tongue articulations. We present resynthesized tongue movements in the Artisynth model and verify the results by comparing the calculated muscle activations with the literature.

  • 9.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Deep Layered Learning in MIR. Manuscript (preprint) (Other academic)
    Abstract [en]

    Deep learning has boosted the performance of many music information retrieval (MIR) systems in recent years. Yet, the complex hierarchical arrangement of music makes end-to-end learning hard for some MIR tasks – a very deep and structurally flexible processing chain is necessary to extract high-level features from a spectrogram representation. Mid-level representations such as tones, pitched onsets, chords, and beats are fundamental building blocks of music. This paper discusses how these can be used as intermediate representations in MIR to facilitate deep processing that generalizes well: each music concept is predicted individually in learning modules that are connected through latent representations in a directed acyclic graph. It is suggested that this strategy for inference, defined as deep layered learning (DLL), can help generalization by (1) – enforcing the validity of intermediate representations during processing, and by (2) – letting the inferred representations establish disentangled structures that support high-level invariant processing. A background to DLL and modular music processing is provided, and relevant concepts such as pruning, skip connections, and layered performance supervision are reviewed.

  • 10.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Modeling Music: Studies of Music Transcription, Music Perception and Music Production. 2018. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    This dissertation presents ten studies focusing on three important subfields of music information retrieval (MIR): music transcription (Part A), music perception (Part B), and music production (Part C).

    In Part A, systems capable of transcribing rhythm and polyphonic pitch are described. The first two publications present methods for tempo estimation and beat tracking. A method is developed for computing the most salient periodicity (the “cepstroid”), and the computed cepstroid is used to guide the machine learning processing. The polyphonic pitch tracking system uses novel pitch-invariant and tone-shift-invariant processing techniques. Furthermore, the neural flux is introduced – a latent feature for onset and offset detection. The transcription systems use a layered learning technique with separate intermediate networks of varying depth.  Important music concepts are used as intermediate targets to create a processing chain with high generalization. State-of-the-art performance is reported for all tasks.

    Part B is devoted to perceptual features of music, which can be used as intermediate targets or as parameters for exploring fundamental music perception mechanisms. Systems are proposed that can predict the perceived speed and performed dynamics of an audio file with high accuracy, using the average ratings from around 20 listeners as ground truths. In Part C, aspects related to music production are explored. The first paper analyzes long-term average spectrum (LTAS) in popular music. A compact equation is derived to describe the mean LTAS of a large dataset, and the variation is visualized. Further analysis shows that the level of the percussion is an important factor for LTAS. The second paper examines songwriting and composition through the development of an algorithmic composer of popular music. Various factors relevant for writing good compositions are encoded, and a listening test is employed that shows the validity of the proposed methods.

    The dissertation is concluded by Part D - Looking Back and Ahead, which acts as a discussion and provides a road-map for future work. The first paper discusses the deep layered learning (DLL) technique, outlining concepts and pointing out a direction for future MIR implementations. It is suggested that DLL can help generalization by enforcing the validity of intermediate representations, and by letting the inferred representations establish disentangled structures supporting high-level invariant processing. The second paper proposes an architecture for tempo-invariant processing of rhythm with convolutional neural networks. Log-frequency representations of rhythm-related activations are suggested at the main stage of processing. Methods relying on magnitude, relative phase, and raw phase information are described for a wide variety of rhythm processing tasks.

  • 11.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Polyphonic Pitch Tracking with Deep Layered Learning. Manuscript (preprint) (Other academic)
    Abstract [en]

    This paper presents a polyphonic pitch tracking system able to extract both framewise and note-based estimates from audio. The system uses six artificial neural networks in a deep layered learning setup. First, cascading networks are applied to a spectrogram for framewise fundamental frequency (f0) estimation. A sparse receptive field is learned by the first network and then used for weight-sharing throughout the system. The f0 activations are connected across time to extract pitch ridges. These ridges define a framework, within which subsequent networks perform tone-shift-invariant onset and offset detection. The networks convolve the pitch ridges across time, using as input, e.g., variations of latent representations from the f0 estimation networks, defined as the “neural flux.” Finally, incorrect tentative notes are removed one by one in an iterative procedure that allows a network to classify notes within an accurate context. The system was evaluated on four public test sets: MAPS, Bach10, TRIOS, and the MIREX Woodwind quintet, and achieved state-of-the-art results for all four datasets. It performs well across all subtasks: f0, pitched onset, and pitched offset tracking.
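
    As an illustration of the pitch-ridge idea (greatly simplified, not the paper's learned networks), a greedy rule can link framewise f0 activations across time; the threshold and the allowed bin jump are arbitrary assumptions.

    import numpy as np

    def track_ridges(activation, threshold=0.5, max_jump=1):
        # activation: (n_frames, n_pitch_bins) f0 salience in [0, 1].
        # Returns a list of ridges, each a list of (frame, pitch_bin) pairs.
        ridges, active = [], {}              # active: pitch_bin -> ridge index
        for t, frame in enumerate(activation):
            next_active = {}
            for p in np.flatnonzero(frame > threshold):
                near = [b for b in active if abs(b - p) <= max_jump]
                if near:                     # continue the nearest open ridge
                    idx = active[min(near, key=lambda b: abs(b - p))]
                else:                        # otherwise start a new ridge
                    idx = len(ridges)
                    ridges.append([])
                ridges[idx].append((t, int(p)))
                next_active[int(p)] = idx
            active = next_active
        return ridges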

  • 12.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Tempo-Invariant Processing of Rhythm with Convolutional Neural Networks. Manuscript (preprint) (Other academic)
    Abstract [en]

    Rhythm patterns can be performed with a wide variation of tempi. This presents a challenge for many music information retrieval (MIR) systems; ideally, perceptually similar rhythms should be represented and processed similarly, regardless of the specific tempo at which they were performed. Several recent systems for tempo estimation, beat tracking, and downbeat tracking have therefore sought to process rhythm in a tempo-invariant way, often by sampling input vectors according to a precomputed pulse level. This paper describes how a log-frequency representation of rhythm-related activations instead can promote tempo invariance when processed with convolutional neural networks. The strategy incorporates invariance at a fundamental level and can be useful for most tasks related to rhythm processing. Different methods are described, relying on magnitude, phase relationships of different rhythm channels, as well as raw phase information. Several variations are explored to provide direction for future implementations.
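
    A minimal sketch of the underlying idea (an assumption-laden illustration, not the manuscript's method): resampling a periodicity function onto a logarithmically spaced lag axis turns a change of tempo into a translation, which convolutional weights can then share across tempi.

    import numpy as np

    def log_lag_periodicity(onset_env, fps, lag_min=0.1, lag_max=2.0, n_bins=96):
        # Autocorrelation of an onset-strength envelope, resampled to
        # logarithmically spaced lags (in seconds); tempo scaling becomes a shift.
        env = onset_env - onset_env.mean()
        acf = np.correlate(env, env, mode='full')[len(env) - 1:]
        acf = acf / (acf[0] + 1e-9)                    # normalise at lag zero
        lags = np.geomspace(lag_min, lag_max, n_bins)  # log-spaced lags in seconds
        return np.interp(lags * fps, np.arange(len(acf)), acf), lags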

  • 13.
    Fallgren, P.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Z.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A tool for exploring large amounts of found audio data. 2018. In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 499-503. Conference paper (Refereed)
    Abstract [en]

    We demonstrate a method and a set of open source tools (beta) for nonsequential browsing of large amounts of audio data. The demonstration will contain versions of a set of functionalities in their first stages, and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

  • 14.
    Fermoselle, Leonor
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Designing joint attention systems for robots that assist children with autism spectrum disorders. 2018. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Joint attention behaviours play a central role in natural and believable human-robot interactions. This research presents the design decisions of a semi-autonomous joint attention robotic system, together with the evaluation of its effectiveness and perceived social presence across different cognitive ability groups. For this purpose, two different studies were carried out: first with adults, and then with children between 10 and 12 years old.

    The overall results for both studies reflect a system that is perceived as socially present and engaging which can successfully establish joint attention with the participants. When comparing the performance results between the two groups, children achieved higher joint attention scores and reported a higher level of enjoyment and helpfulness in the interaction.

    Furthermore, a detailed literature review on robot-assisted therapies for children with autism spectrum disorders is presented, focusing on the development of joint attention skills. The children’s positive interaction results from the studies, together with state-of-the-art research therapies and the input from an autism therapist, guided the author to elaborate some design guidelines for a robotic system to assist in joint attention focused autism therapies.

  • 15.
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Commentary on Polak: How short is the shortest metric subdivision? 2017. In: Empirical Musicology Review, ISSN 1559-5749, E-ISSN 1559-5749, Vol. 12, no 3-4, p. 227-228. Article in journal (Other academic)
    Abstract [en]

    This commentary relates to the target paper by Polak on the shortest metric subdivision by presenting measurements on West-African drum music. It provides new evidence that the perceptual lower limit of tone duration is within the range 80-100 ms. Using fairly basic measurement techniques in combination with a musical analysis of the content, the original results in this study represent a valuable addition to the literature. Considering the relevance for music listening, further research would be valuable for determining and understanding the nature of this perceptual limit.

  • 16.
    Friberg, Anders
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Lindeberg, Tony
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Hellwagner, Martin
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Helgason, Pétur
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Salomão, Gláucia Laís
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Lemaitre, Guillaume
    Institute for Research and Coordination in Acoustics and Music, Paris, France.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Prediction of three articulatory categories in vocal sound imitations using models for auditory receptive fields. 2018. In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524. Article in journal (Refereed)
    Abstract [en]

    Vocal sound imitations provide a new challenge for understanding the coupling between articulatory mechanisms and the resulting audio. In this study, we have modeled the classification of three articulatory categories, phonation, supraglottal myoelastic vibrations, and turbulence from audio recordings. Two data sets were assembled, consisting of different vocal imitations by four professional imitators and four non-professional speakers in two different experiments. The audio data were manually annotated by two experienced phoneticians using a detailed articulatory description scheme. A separate set of audio features was developed specifically for each category using both time-domain and spectral methods. For all time-frequency transformations, and for some secondary processing, the recently developed Auditory Receptive Fields Toolbox was used. Three different machine learning methods were applied for predicting the final articulatory categories. The result with the best generalization was found using an ensemble of multilayer perceptrons. The cross-validated classification accuracy was 96.8 % for phonation, 90.8 % for supraglottal myoelastic vibrations, and 89.0 % for turbulence using all the 84 developed features. A final feature reduction to 22 features yielded similar results.
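
    The evaluation described (cross-validated classification with an ensemble of multilayer perceptrons) could be sketched roughly as below; the feature matrix X, labels y, network size and fold count are placeholders, not the study's actual configuration.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import StandardScaler

    def cv_accuracy_mlp_ensemble(X, y, n_mlps=5, folds=10):
        # Mean cross-validated accuracy of an MLP ensemble (probability averaging).
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
        accs = []
        for tr, te in skf.split(X, y):
            scaler = StandardScaler().fit(X[tr])
            X_tr, X_te = scaler.transform(X[tr]), scaler.transform(X[te])
            probs = np.mean([MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                                           random_state=s).fit(X_tr, y[tr]).predict_proba(X_te)
                             for s in range(n_mlps)], axis=0)
            classes = np.unique(y)           # predict_proba columns follow sorted labels
            accs.append(np.mean(classes[probs.argmax(axis=1)] == y[te]))
        return float(np.mean(accs))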

  • 17.
    Holzapfel, André
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Media Technology and Interaction Design, MID.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Coeckelbergh, Mark
    Department of Philosophy, University of Vienna, Vienna, Austria.
    Ethical Dimensions of Music Information Retrieval Technology. 2018. In: Transactions of the International Society for Music Information Retrieval, E-ISSN 2514-3298, Vol. 1, no 1, p. 44-55. Article in journal (Refereed)
    Abstract [en]

    This article examines ethical dimensions of Music Information Retrieval (MIR) technology.  It uses practical ethics (especially computer ethics and engineering ethics) and socio-technical approaches to provide a theoretical basis that can inform discussions of ethics in MIR. To help ground the discussion, the article engages with concrete examples and discourse drawn from the MIR field. This article argues that MIR technology is not value-neutral but is influenced by design choices, and so has unintended and ethically relevant implications. These can be invisible unless one considers how the technology relates to wider society. The article points to the blurring of boundaries between music and technology, and frames music as “informationally enriched” and as a “total social fact.” The article calls attention to biases that are introduced by algorithms and data used for MIR technology, cultural issues related to copyright, and ethical problems in MIR as a scientific practice. The article concludes with tentative ethical guidelines for MIR developers, and calls for addressing key ethical problems with MIR technology and practice, especially those related to forms of bias and the remoteness of the technology development from end users.

  • 18.
    Jansson, Erik V.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kabała, A.
    On the influence of arching and material on the vibration of a shell - Towards understanding the soloist violin. 2018. In: Vibrations in Physical Systems, ISSN 0860-6897, Vol. 29, article id 2018027. Article in journal (Refereed)
    Abstract [en]

    Results of FEM simulations of plate and shell models are presented with reference to violin vibration problems. The influence of arching, variable thickness and damping was considered. The ABAQUS/Explicit “Dynamic Explicit” procedure was used in the simulation. Anisotropy in the material properties (spruce) was taken into account (9 elastic constants).

  • 19.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Bystedt, Mattias
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Lopes, José David Aguas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Mascarenhas, Samuel
    GAIPS INESC-ID, Lisbon, Portugal.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Raveh, Eran
    Multimodal Computing and Interaction, Saarland University, Germany.
    Shore, Todd
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    FARMI: A Framework for Recording Multi-Modal Interactions. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris: European Language Resources Association, 2018, p. 3969-3974. Conference paper (Refereed)
    Abstract [en]

    In this paper we present (1) a processing architecture used to collect multi-modal sensor data, both for corpora collection and real-time processing, (2) an open-source implementation thereof and (3) a use-case where we deploy the architecture in a multi-party deception game, featuring six human players and one robot. The architecture is agnostic to the choice of hardware (e.g. microphones, cameras, etc.) and programming languages, although our implementation is mostly written in Python. In our use-case, different methods of capturing verbal and non-verbal cues from the participants were used. These were processed in real-time and used to inform the robot about the participants’ deceptive behaviour. The framework is of particular interest for researchers who are interested in the collection of multi-party, richly recorded corpora and the design of conversational systems. Moreover, for researchers who are interested in human-robot interaction, the available modules offer the possibility to easily create both autonomous and Wizard-of-Oz interactions.

  • 20.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Crowdsourced Multimodal Corpora Collection Tool. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 728-734. Conference paper (Refereed)
    Abstract [en]

    In recent years, more and more multimodal corpora have been created. To our knowledge there is no publicly available tool which allows for acquiring controlled multimodal data of people in a rapid and scalable fashion. We are therefore proposing (1) a novel tool which will enable researchers to rapidly gather large amounts of multimodal data spanning a wide demographic range, and (2) an example of how we used this tool for corpus collection of our "Attentive listener" multimodal corpus. The code is released under an Apache License 2.0 and available as an open-source repository, which can be found at https://github.com/kth-social-robotics/multimodal-crowdsourcing-tool. This tool will allow researchers to set up their own multimodal data collection system quickly and create their own multimodal corpora. Finally, this paper provides a discussion of the advantages and disadvantages of a crowd-sourced data collection tool, especially in comparison to lab-recorded corpora.

  • 21.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Avramova, Vanya
    KTH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data, in that we captured speech, eye gaze and gesture data using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion, from the perspectives of both speaker and listener. We annotated the groups’ object references during object manipulation tasks and analysed the group’s proportional referential eye-gaze with regard to the referent object. When investigating the distributions of gaze during and before referring expressions we could corroborate the differences in time between speakers’ and listeners’ eye gaze found in earlier studies. This corpus is of particular interest to researchers who are interested in social eye-gaze patterns in turn-taking and referring language in situated multi-party interaction.

  • 22.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Sibirtseva, Elena
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Pereira, André
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal reference resolution in collaborative assembly tasks. 2018. Conference paper (Refereed)
  • 23.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Żygis, M.
    Lexical stress in Polish: Evidence from focus and phrase-position differentiated production data. 2018. In: Proceedings of the International Conference on Speech Prosody, International Speech Communication Association, 2018, p. 1008-1012. Conference paper (Refereed)
    Abstract [en]

    We examine acoustic patterns of word stress in Polish in data with carefully separated phrase- and word-level prominences. We aim to verify claims in the literature regarding the phonetic and phonological status of lexical stress (both primary and secondary) in Polish and to contribute to a better understanding of prosodic prominence and boundary interactions. Our results show significant effects of primary stress on acoustic parameters such as duration, f0 functionals and spectral emphasis expected for a stress language. We do not find clear and systematic acoustic evidence for secondary stress.
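
    For illustration only, the three parameter families named above (duration, f0 functionals, spectral emphasis) could be computed per vowel segment along these lines; the 400 Hz emphasis cutoff and the f0 search range are arbitrary assumptions, not the paper's definitions.

    import numpy as np
    import librosa

    def stress_parameters(y, sr):
        # y: one vowel segment (mono), assumed longer than one analysis frame.
        duration_ms = 1000.0 * len(y) / sr
        f0 = librosa.yin(y, fmin=75, fmax=400, sr=sr)          # frame-wise f0 in Hz
        f0_functionals = {'mean': float(np.mean(f0)),
                          'max': float(np.max(f0)),
                          'range': float(np.max(f0) - np.min(f0))}
        spectrum = np.abs(np.fft.rfft(y)) ** 2
        freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
        high = spectrum[freqs >= 400.0].sum()                  # energy above the cutoff
        emphasis_db = 10.0 * np.log10(high / (spectrum.sum() + 1e-12) + 1e-12)
        return duration_ms, f0_functionals, emphasis_db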

  • 24.
    Näslund, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Artificial Neural Networks in Swedish Speech Synthesis. 2018. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Text-to-speech (TTS) systems have entered our daily lives in the form of smart assistants and many other applications. Contemporary research applies machine learning and artificial neural networks (ANNs) to synthesize speech. It has been shown that these systems outperform the older concatenative and parametric methods.

    In this paper, ANN-based methods for speech synthesis are explored and one of the methods is implemented for the Swedish language. The implemented method is dubbed “Tacotron” and is a first step towards end-to-end ANN-based TTS which puts many different ANN-techniques to work. The resulting system is compared to a parametric TTS through a strength-of-preference test that is carried out with 20 Swedish speaking subjects. A statistically significant preference for the ANN-based TTS is found. Test subjects indicate that the ANN-based TTS performs better than the parametric TTS when it comes to audio quality and naturalness but sometimes lacks in intelligibility.
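
    The preference result could, for instance, be backed by a simple two-sided sign (binomial) test over per-subject preferences; the sketch below is one such analysis under a 50/50 null, and the example counts are purely illustrative (the thesis does not state which test was used).

    from scipy.stats import binom

    def sign_test_p(n_prefer_a, n_prefer_b):
        # Exact two-sided binomial test against a 50/50 null (ties excluded).
        n = n_prefer_a + n_prefer_b
        k = max(n_prefer_a, n_prefer_b)
        return min(1.0, 2.0 * binom.sf(k - 1, n, 0.5))

    # e.g. 16 of 20 listeners preferring one system: sign_test_p(16, 4) ~= 0.012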

  • 25.
    Pabon, Peter
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Institute of Sonology, Royal Conservatory.
    Mapping Individual Voice Quality over the Voice Range: The Measurement Paradigm of the Voice Range Profile. 2018. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    The acoustic signal of voiced sounds has two primary attributes: fundamental frequency and sound level. It also has very many secondary attributes, or ‘voice qualities’, that can be derived from the acoustic signal, in particular from its periodicity and its spectrum. Acoustic voice analysis as a discipline is largely concerned with identifying and quantifying those qualities or parameters that are relevant for assessing the health or training status of a voice or that characterize the individual quality. The thesis presented here is that all such voice qualities covary essentially and individually with the fundamental frequency and the sound level, and that methods for assessing the voice must account for this covariation and individuality. The central interest of the "voice field" measurement paradigm becomes mapping the proportional dependencies that exist between voice parameters. The five studies contribute to ways of doing this in practice, while the framework text presents the theoretical basis for the analysis model in relation to the practical principles.

  • 26.
    Pabon, Peter
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics. Royal Conservatoire, The Hague, Netherlands.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Feature maps of the acoustic spectrum of the voice. 2018. In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588. Article in journal (Refereed)
    Abstract [en]

    The change in the spectrum of sustained /a/ vowels was mapped over the voice range from low to high fundamental frequency and low to high sound pressure level (SPL), in the form of the so-called voice range profile (VRP). In each interval of one semitone and one decibel, narrowband spectra were averaged both within and across subjects. The subjects were groups of 7 male and 12 female singing students, as well as a group of 16 untrained female voices. For each individual and also for each group, pairs of VRP recordings were made, with stringent separation of the modal/chest and falsetto/head registers. Maps are presented of eight scalar metrics, each of which was chosen to quantify a particular feature of the voice spectrum, over fundamental frequency and SPL. Metrics 1 and 2 chart the role of the fundamental in relation to the rest of the spectrum. Metrics 3 and 4 are used to explore the role of resonances in relation to SPL. Metrics 5 and 6 address the distribution of high frequency energy, while metrics 7 and 8 seek to describe the distribution of energy at the low end of the voice spectrum. Several examples are observed of phenomena that are difficult to predict from linear source-filter theory, and of the voice source being less uniform over the voice range than is conventionally assumed. These include a high-frequency band-limiting at high SPL and an unexpected persistence of the second harmonic at low SPL. The two voice registers give rise to clearly different maps. Only a few effects of training were observed, in the low frequency end below 2 kHz. The results are of potential interest in voice analysis, voice synthesis and for new insights into the voice production mechanism.
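
    The per-cell averaging described above (one semitone by one decibel) can be sketched as follows; the semitone reference frequency and the input arrays are assumptions for illustration.

    import numpy as np

    def voice_range_profile(f0_hz, spl_db, metric, f0_ref=55.0):
        # Average a per-frame spectrum metric over a (semitone, dB) grid.
        semis = np.round(12.0 * np.log2(np.asarray(f0_hz) / f0_ref)).astype(int)
        dbs = np.round(np.asarray(spl_db)).astype(int)
        cell_sum, cell_n = {}, {}
        for s, d, m in zip(semis, dbs, metric):
            cell_sum[(s, d)] = cell_sum.get((s, d), 0.0) + m
            cell_n[(s, d)] = cell_n.get((s, d), 0) + 1
        return {cell: cell_sum[cell] / cell_n[cell] for cell in cell_sum}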

  • 27.
    Schoonderwaldt, Erwin
    et al.
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Friberg, Anders
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Bresin, Roberto
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Juslin, P. N.
    Uppsala University.
    A system for improving the communication of emotion in music performance by feedback learning. 2002. In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 111, no 5, p. 2471. Article in journal (Refereed)
    Abstract [en]

    Expressivity is one of the most important aspects of music performance. However, in music education, expressivity is often overlooked in favor of technical abilities. This could possibly depend on the difficulty in describing expressivity, which makes it problematic to provide the student with specific feedback. The aim of this project is to develop a computer program, which will improve the students’ ability in communicating emotion in music performance. The expressive intention of a performer can be coded in terms of performance parameters (cues), such as tempo, sound level, timbre, and articulation. Listeners’ judgments can be analyzed in the same terms. An algorithm was developed for automatic cue extraction from audio signals. Using note onset–offset detection, the algorithm yields values of sound level, articulation, IOI, and onset velocity for each note. In previous research, Juslin has developed a method for quantitative evaluation of performer–listener communication. This framework forms the basis of the present program. Multiple regression analysis on performances of the same musical fragment, played with different intentions, determines the relative importance of each cue and the consistency of cue utilization. Comparison with built‐in listener models, simulating perceived expression using a regression equation, provides detailed feedback regarding the performers’ cue utilization.
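
    The cue-weight idea can be sketched with an ordinary multiple regression of listener judgements on per-performance cue values (standardised coefficients as relative importance); the data layout is hypothetical and this is not the project's software.

    import numpy as np

    def cue_weights(cues, ratings):
        # cues: (n_performances, n_cues) array, e.g. tempo, sound level, articulation;
        # ratings: listener judgements, one value per performance.
        X = (cues - cues.mean(axis=0)) / cues.std(axis=0)     # standardise cues
        y = (ratings - ratings.mean()) / ratings.std()        # standardise ratings
        X1 = np.column_stack([np.ones(len(X)), X])            # add an intercept column
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return beta[1:]                                       # one weight per cue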

  • 28.
    Selamtzis, Andreas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Castellana, Antonella
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Carullo, A.
    Astolfi, Arianna
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Effect of vowel context in cepstral and entropy analysis of pathological voices. 2019. In: Biomedical Signal Processing and Control, ISSN 1746-8094, E-ISSN 1746-8108, Vol. 47, p. 350-357. Article in journal (Refereed)
    Abstract [en]

    This study investigates the effect of vowel context (excerpted from speech versus sustained) on two voice quality measures: the cepstral peak prominence smoothed (CPPS) and sample entropy (SampEn). Thirty-one dysphonic subjects with different types of organic dysphonia and thirty-one controls read a phonetically balanced text and phonated sustained [a:] vowels in comfortable pitch and loudness. All the [a:] vowels of the read text were excerpted by automatic speech recognition and phonetic (forced) alignment. CPPS and SampEn were calculated for all excerpted vowels of each subject, forming one distribution of CPPS and SampEn values per subject. The sustained vowels were analyzed using a 41 ms window, forming another distribution of CPPS and SampEn values per subject. Two speech-language pathologists performed a perceptual evaluation of the dysphonic subjects’ voice quality from the recorded text. The power of discriminating the dysphonic group from the controls for SampEn and CPPS was assessed for the excerpted and sustained vowels with the Receiver-Operator Characteristic (ROC) analysis. The best discrimination in terms of Area Under Curve (AUC) for CPPS occurred using the mean of the excerpted vowel distributions (AUC=0.85) and for SampEn using the 95th percentile of the sustained vowel distributions (AUC=0.84). CPPS and SampEn were found to be negatively correlated, and the largest correlation was found between the corresponding 95th percentiles of their distributions (Pearson, r=−0.83, p < 10−3). A strong correlation was also found between the 95th percentile of SampEn distributions and the perceptual quality of breathiness (Pearson, r=0.83, p < 10−3). The results suggest that depending on the acoustic voice quality measure, sustained vowels can be more effective than excerpted vowels for detecting dysphonia. Additionally, when using CPPS or SampEn there is an advantage of using the measures’ distributions rather than their average values.
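
    The ROC analysis described can be sketched as follows: one scalar summary per subject (e.g. the mean CPPS over excerpted vowels) is scored for how well it separates the two groups; the orientation handling is an assumption of this sketch.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def discrimination_auc(values_dysphonic, values_control):
        # AUC of one summary statistic per subject for dysphonic-vs-control separation.
        scores = np.concatenate([values_dysphonic, values_control])
        labels = np.concatenate([np.ones(len(values_dysphonic)),
                                 np.zeros(len(values_control))])
        auc = roc_auc_score(labels, scores)
        return max(auc, 1.0 - auc)    # direction-free: low CPPS may indicate dysphonia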

  • 29.
    Selamtzis, Andreas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Castellana, Antonella
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Carullo, Alessio
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Astolfi, Arianna
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Effect of vowel context in cepstral and entropy analysis of pathological voices. 2019. In: Biomedical Signal Processing and Control, ISSN 1746-8094, E-ISSN 1746-8108, Vol. 47, p. 350-357. Article in journal (Refereed)
    Abstract [en]

    This study investigates the effect of vowel context (excerpted from speech versus sustained) on two voice quality measures: the cepstral peak prominence smoothed (CPPS) and sample entropy (SampEn). Thirty-one dysphonic subjects with different types of organic dysphonia and thirty-one controls read a phonetically balanced text and phonated sustained [a:] vowels in comfortable pitch and loudness. All the [a:] vowels of the read text were excerpted by automatic speech recognition and phonetic (forced) alignment. CPPS and SampEn were calculated for all excerpted vowels of each subject, forming one distribution of CPPS and SampEn values per subject. The sustained vowels were analyzed using a 41 ms window, forming another distribution of CPPS and SampEn values per subject. Two speech-language pathologists performed a perceptual evaluation of the dysphonic subjects’ voice quality from the recorded text. The power of discriminating the dysphonic group from the controls for SampEn and CPPS was assessed for the excerpted and sustained vowels with the Receiver-Operator Characteristic (ROC) analysis. The best discrimination in terms of Area Under Curve (AUC) for CPPS occurred using the mean of the excerpted vowel distributions (AUC=0.85) and for SampEn using the 95th percentile of the sustained vowel distributions (AUC=0.84). CPPS and SampEn were found to be negatively correlated, and the largest correlation was found between the corresponding 95th percentiles of their distributions (Pearson, r=−0.83, p < 10−3). A strong correlation was also found between the 95th percentile of SampEn distributions and the perceptual quality of breathiness (Pearson, r=0.83, p < 10−3). The results suggest that depending on the acoustic voice quality measure, sustained vowels can be more effective than excerpted vowels for detecting dysphonia. Additionally, when using CPPS or SampEn there is an advantage of using the measures’ distributions rather than their average values.

  • 30.
    Selamtzis, Andreas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Richter, Bernard
    Burk, Fabian
    Köberlein, Maria
    Echternach, Matthias
    A comparison of electroglottographic and glottal area waveforms for phonation type differentiation in male professional singers. 2018. Manuscript (preprint) (Other academic)
    Abstract [en]

    This study investigates the use of glottographic signals (EGG and GAW) to study phonation in different vibratory states as produced by professionally trained singers. Six western classical tenors were asked to phonate pitch glides from modal to falsetto phonation, or modal to their stage voice above the passaggio (SVaP). For each pitch glide the sample entropy (SampEn) of the EGG signal was calculated to establish a “ground truth” for the performed phonation type; the cycles before the maximum SampEn peak were labeled as modal, and the cycles after the peak as falsetto, or SVaP. Three classifications of vibratory state were performed using clustering: one based only on the EGG, one based on the GAW, and one based on their combination. The classification error rate (clustering vs ground truth) was on average smaller than 10%, for any of the three settings, revealing no special advantage of the GAW over EGG, or vice versa. The EGG-based time domain metric analysis revealed a larger contact quotient and larger normalized EGG derivative peak ratio in modal, compared to SVaP and falsetto. The glottographic waveform comparison of SVaP with falsetto and modal suggests that SVaP resembles falsetto more than modal, though with a larger contact quotient.
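
    For reference, sample entropy (SampEn) of a signal can be computed along these lines; m and r are the usual template length and tolerance (r as a fraction of the standard deviation), and this brute-force version is illustrative rather than the study's implementation.

    import numpy as np

    def sample_entropy(x, m=2, r=0.2):
        # SampEn = -ln(A / B), where B and A count similar templates of
        # length m and m + 1 (Chebyshev distance, self-matches excluded).
        x = np.asarray(x, dtype=float)
        tol = r * x.std()
        n_templates = len(x) - m            # same template count for both lengths

        def count_matches(length):
            templates = np.array([x[i:i + length] for i in range(n_templates)])
            count = 0
            for i in range(n_templates):
                dist = np.max(np.abs(templates - templates[i]), axis=1)
                count += int(np.sum(dist <= tol)) - 1      # drop the self-match
            return count

        b, a = count_matches(m), count_matches(m + 1)
        return -np.log(a / b) if a > 0 and b > 0 else np.inf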

  • 31.
    Shore, Todd
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Androulakaki, Theofronia
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    KTH Tangrams: A Dataset for Research on Alignment and Conceptual Pacts in Task-Oriented Dialogue. 2018. Conference paper (Refereed)
    Abstract [en]

    There is a growing body of research focused on task-oriented instructor-manipulator dialogue, whereby one dialogue participant initiates a reference to an entity in a common environment while the other participant must resolve this reference in order to manipulate said entity. Many of these works are based on disparate if nevertheless similar datasets. This paper describes an English corpus of referring expressions in relatively free, unrestricted dialogue with physical features generated in a simulation, which facilitate analysis of dialogic linguistic phenomena regarding alignment in the formation of referring expressions known as conceptual pacts.

  • 32.
    Sibirtseva, Elena
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Nykvist, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Karaoguz, Hakan
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kragic, Danica
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction2018In: 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2018Conference paper (Refereed)
    Abstract [en]

    Picking up objects requested by a human user is a common task in human-robot interaction. When multiple objects match the user's verbal description, the robot needs to clarify which object the user is referring to before executing the action. Previous research has focused on perceiving the user's multimodal behaviour to complement verbal commands, or on minimising the number of follow-up questions to reduce task time. In this paper, we propose a system for reference disambiguation based on visualisation and compare three methods of disambiguating natural language instructions. In a controlled experiment with a YuMi robot, we investigated real-time augmentations of the workspace in three conditions - head-mounted display, projector, and a monitor as the baseline - using objective measures such as time and accuracy, and subjective measures such as engagement, immersion, and display interference. Significant differences were found in accuracy and engagement between the conditions, but no differences were found in task time. Despite the higher error rates in the head-mounted display condition, participants found that modality more engaging than the other two, but overall showed a preference for the projector condition over the monitor and head-mounted display conditions.
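
    As a sketch of how such a between-condition comparison might be run (the study’s actual statistical procedure is not specified here, and the numbers below are synthetic):

    # Compare an objective measure (accuracy) across the three display
    # conditions with a one-way ANOVA on synthetic per-participant scores.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    accuracy = {
        "head-mounted display": rng.normal(0.80, 0.08, size=20),  # toy values
        "projector": rng.normal(0.92, 0.05, size=20),
        "monitor": rng.normal(0.90, 0.06, size=20),
    }
    f_stat, p_val = stats.f_oneway(*accuracy.values())
    print(f"one-way ANOVA across conditions: F={f_stat:.2f}, p={p_val:.4f}")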

  • 33.
    Stefanov, Kalin
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Recognition and Generation of Communicative Signals: Modeling of Hand Gestures, Speech Activity and Eye-Gaze in Human-Machine Interaction2018Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Nonverbal communication is essential for natural and effective face-to-face human-human interaction. It is the process of communicating through sending and receiving wordless (mostly visual, but also auditory) signals between people. Consequently, natural and effective face-to-face human-machine interaction requires machines (e.g., robots) to understand and produce such human-like signals. There are many types of nonverbal signals used in this form of communication, including body postures, hand gestures, facial expressions, eye movements, touch, and the use of space. This thesis investigates two of these nonverbal signals: hand gestures and eye-gaze. The main goal of the thesis is to propose computational methods for real-time recognition and generation of these two signals in order to facilitate natural and effective human-machine interaction.

    The first topic addressed in the thesis is the real-time recognition of hand gestures and its application to the recognition of isolated sign language signs. Hand gestures can also provide important cues during human-robot interaction; for example, emblems are a type of hand gesture with a specific meaning, used to substitute for spoken words. The thesis has two main contributions with respect to the recognition of hand gestures: 1) a newly collected dataset of isolated Swedish Sign Language signs, and 2) a real-time hand gesture recognition method.

    The second topic addressed in the thesis is the general problem of real-time speech activity detection in noisy and dynamic environments and its application to socially-aware language acquisition. Speech activity can also provide important information during human-robot interaction; for example, the currently active speaker's hand gestures, eye-gaze direction, or head orientation can play an important role in understanding the state of the interaction. The thesis has one main contribution with respect to speech activity detection: a real-time vision-based speech activity detection method.
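
    For illustration only, a naive frame-level activity score built from mouth-region motion energy is sketched below; the thesis proposes a learned real-time method, and this toy heuristic merely shows the input/output shape of the task.

    # Toy vision-based "speech activity" heuristic (not the thesis method):
    # threshold the mean frame-to-frame motion energy of mouth-region crops.
    import numpy as np

    def motion_energy_activity(mouth_frames, threshold=5.0):
        # mouth_frames: array of shape (T, H, W), grayscale mouth crops.
        frames = np.asarray(mouth_frames, dtype=float)
        energy = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
        active = np.concatenate([[False], energy > threshold])  # length T
        return active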

    The third topic addressed in the thesis is the real-time generation of eye-gaze direction or head orientation and its application to human-robot interaction. Eye-gaze direction or head orientation can provide important cues during human-robot interaction; for example, it can regulate who is allowed to speak when and coordinate changes in the roles on the conversational floor (e.g., speaker, addressee, and bystander). The thesis has two main contributions with respect to the generation of eye-gaze direction or head orientation: 1) a newly collected dataset of face-to-face interactions, and 2) a real-time eye-gaze direction or head orientation generation method.

  • 34.
    Stefanov, Kalin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Self-Supervised Vision-Based Detection of the Active Speaker as a Prerequisite for Socially-Aware Language AcquisitionManuscript (preprint) (Other academic)
  • 35.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    What do these 5,599,881 parameters mean?: An analysis of a specific LSTM music transcription model, starting with the 70,281 parameters of its softmax layer2018In: Proceedings of the 6th International Workshop on Musical Metacreation (MUME 2018), 2018Conference paper (Refereed)
    Abstract [en]

    A folk-rnn model is a long short-term memory network (LSTM) that generates music transcriptions. We have evaluated these models in a variety of ways – from statistical analyses of generated transcriptions, to their use in music practice – but have yet to understand how their behaviours precipitate from their parameters. This knowledge is essential for improving such models, calibrating them, and broadening their applicability. In this paper, we analyse the parameters of the softmax output layer of a specific model realisation. We discover some key aspects of the model’s local and global behaviours, for instance, that its ability to construct a melody is highly reliant on a few symbols. We also derive a way to adjust the output of the last hidden layer of the model to attenuate its probability of producing specific outputs.
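
    The 70,281 softmax parameters in the title are consistent with 137 output symbols times (512 hidden units plus one bias), since 137 × 513 = 70,281. The sketch below uses those shapes to show, in general terms, how shifting the last hidden state against one symbol’s softmax weight row lowers that symbol’s output probability; the weights here are random stand-ins, and the adjustment the paper actually derives may differ from this heuristic.

    # Sketch: attenuating one output symbol's probability by shifting the
    # last hidden state against that symbol's softmax weight row.
    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    rng = np.random.default_rng(2)
    vocab, hidden = 137, 512
    V = rng.normal(scale=0.1, size=(vocab, hidden))  # softmax weights (random stand-in)
    b = np.zeros(vocab)                              # softmax biases
    h = rng.normal(size=hidden)                      # last hidden layer output

    target = 42                                      # symbol index to attenuate (arbitrary)
    alpha = 2.0                                      # attenuation strength (illustrative)
    h_adj = h - alpha * V[target] / np.linalg.norm(V[target])

    print("P(target) before:", softmax(V @ h + b)[target])
    print("P(target) after: ", softmax(V @ h_adj + b)[target])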

  • 36.
    Sturm, Bob
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Ben-Tal, Oded
    Kingston University, UK.
    Let’s Have Another Gan Ainm: An experimental album of Irish traditional music and computer-generated tunes2018Report (Other academic)
    Abstract [en]

    This technical report details the creation and public release of an album of folk music, most of which comes from material generated by computer models trained on transcriptions of traditional music from Ireland and the UK. For each computer-generated tune appearing on the album, we provide below the original version and the alterations made.

  • 37.
    Sturm, Bob
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Ben-Tal, Oded
    Kingston University, UK.
    Monaghan, Úna
    Cambridge University, UK.
    Collins, Nick
    Durham University, UK.
    Herremans, Dorien
    University of Technology and Design, Singapore.
    Chew, Elaine
    Queen Mary University of London, UK.
    Hadjeres, Gäetan
    Sony CSL, Paris.
    Deruty, Emmanuel
    Sony CSL, Paris.
    Pachet, François
    Spotify, Paris.
    Machine Learning Research that Matters for Music Creation: A Case StudyIn: Journal of New Music Research, ISSN 0929-8215, E-ISSN 1744-5027Article in journal (Refereed)
    Abstract [en]

    Research applying machine learning to music modeling and generation typically proposes model architectures, training methods and datasets, and gauges system performance using quantitative measures like sequence likelihoods and/or qualitative listening tests. Rarely does such work explicitly question and analyse its usefulness for and impact on real-world practitioners, and then build on those outcomes to inform the development and application of machine learning. This article attempts to do these things for machine learning applied to music creation. Together with practitioners, we develop and use several applications of machine learning for music creation, and present a public concert of the results. We reflect on the entire experience to arrive at several ways of advancing these and similar applications of machine learning to music creation.

  • 38.
    Sundberg, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Flow Glottogram and Subglottal Pressure Relationship in Singers and Untrained Voices2018In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588, Vol. 32, no 1, p. 23-31Article in journal (Refereed)
    Abstract [en]

    This article combines results from three earlier investigations of the glottal voice source during phonation at varying degrees of vocal loudness (1) in five classically trained baritone singers (Sundberg et al., 1999), (2) in 15 female and 14 male untrained voices (Sundberg et al., 2005), and (3) in voices rated as hyperfunctional by an expert panel (Millgard et al., 2015). Voice source data were obtained by inverse filtering. Associated subglottal pressures were estimated from oral pressure during the occlusion for the consonant /p/. Five flow glottogram parameters, (1) maximum flow declination rate (MFDR), (2) peak-to-peak pulse amplitude, (3) level difference between the first and the second harmonics of the voice source, (4) closed quotient, and (5) normalized amplitude quotient, were averaged across the singer subjects and related to associated MFDR values. Strong, quantitative relations, expressed as equations, are found between subglottal pressure and MFDR and between MFDR and each of the other flow glottogram parameters. The values for the untrained voices, as well as those for the voices rated as hyperfunctional, deviate systematically from the values derived from the equations.
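
    As a sketch of how such an empirical relation can be expressed as an equation (the data below are synthetic; the article’s equations come from the measured voices):

    # Fit a power-law relation MFDR = k * Psub^a by linear regression in
    # log-log coordinates; toy data, illustrative only.
    import numpy as np

    psub = np.array([5.0, 8.0, 12.0, 18.0, 25.0, 35.0])            # cm H2O (toy)
    mfdr = np.array([180.0, 320.0, 520.0, 850.0, 1250.0, 1900.0])  # (toy)

    slope, intercept = np.polyfit(np.log10(psub), np.log10(mfdr), 1)
    print(f"MFDR ~ {10 ** intercept:.1f} * Psub^{slope:.2f}")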

  • 39.
    Ternström, Sten
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    D'Amario, Sara
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. University of York.
    Selamtzis, Andreas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Effects of the lung volume on the electroglottographic waveform in trained female singers2018In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588Article in journal (Refereed)
    Abstract [en]

    Objectives: To determine if in singing there is an effect of lung volume on the electroglottographic waveform, and if so, how it varies over the voice range. Study design: Eight trained female singers sang the tune “Frère Jacques” in 18 conditions: three phonetic contexts, three dynamic levels, and high or low lung volume. Conditions were randomized and replicated. Methods: The audio and EGG signals were recorded in synchrony with signals tracking respiration and vertical larynx position. The first 10 Fourier descriptors of every EGG cycle were computed. These spectral data were clustered statistically, and the clusters were mapped by color into a voice range profile display, thus visualizing the EGG waveform changes under the influence of fo and SPL. The rank correlations and effect sizes of the relationships between relative lung volume and several adduction-related EGG wave shape metrics were similarly rendered on a color scale, in voice range profile-style ʻvoice maps.ʼ Results: In most subjects, EGG waveforms varied considerably over the voice range. Within subjects, reproducibility was high, not only across the replications, but also across the phonetic contexts. The EGG waveforms were quite individual, as was the nature of the EGG shape variation across the range. EGG metrics were significantly correlated to changes in lung volume, in parts of the range of the song, and in most subjects. However, the effect sizes of the relative lung volume were generally much smaller than the effects of fo and SPL, and the relationships always varied, even changing polarity from one part of the range to another. Conclusions: Most subjects exhibited small, reproducible effects of the relative lung volume on the EGG waveform. Some hypothesized influences of tracheal pull were seen, mostly at the lowest SPLs. The effects were however highly variable, both across the moderately wide fo-SPL range and across subjects. Different singers may be applying different techniques and compensatory behaviors with changing lung volume. The outcomes emphasize the importance of making observations over a substantial part of the voice range, and not only of phonations sustained at a few fundamental frequencies and sound levels.
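
    A minimal sketch of the per-cycle spectral clustering described above; the cycle segmentation, descriptor definition and toy waveforms are illustrative, and the study’s exact feature set and clustering procedure are not reproduced.

    # Resample each EGG cycle, take the first 10 Fourier magnitudes as a
    # descriptor, and cluster the descriptors; cluster identity could then
    # be colour-coded into a voice range profile (fo vs. SPL) display.
    import numpy as np
    from sklearn.cluster import KMeans

    def fourier_descriptors(cycle, n_coeff=10, n_resample=256):
        t = np.linspace(0.0, 1.0, n_resample, endpoint=False)
        src = np.linspace(0.0, 1.0, len(cycle), endpoint=False)
        spectrum = np.fft.rfft(np.interp(t, src, cycle))
        return np.abs(spectrum[1:n_coeff + 1]) / np.abs(spectrum[1])  # normalised

    rng = np.random.default_rng(3)
    cycles = []  # toy EGG cycles with varying waveform shape
    for _ in range(300):
        n = int(rng.integers(80, 120))
        t = np.linspace(0.0, 1.0, n, endpoint=False)
        cycles.append(np.sin(2 * np.pi * t) + rng.uniform(0.0, 0.5) * np.sin(4 * np.pi * t))

    X = np.array([fourier_descriptors(c) for c in cycles])
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)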

  • 40.
    Ternström, Sten
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Nordmark, Jan
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Intonation preferences for major thirds with non-beating ensemble sounds1996In: Proc. of Nordic Acoustical Meeting: NAM'96, Helsinki, 1996, p. 359-365, article id F2Conference paper (Refereed)
    Abstract [en]

    The frequency ratios, or intervals, of the twelve-tone scale can be mathematically defined in several slightly different ways, each of which may be more or less appropriate in different musical contexts. For maximum mobility in musical key, instruments of our time with fixed tuning are typically tuned in equal temperament, except for performances of early music or avant-garde contemporary music. Some contend that pure intonation, being free of beats, is more natural, and would be preferred on instruments with variable tuning. The sound of choirs is such that beats are very unlikely to serve as cues for intonation. Choral performers have access to variable tuning, yet have not been shown to prefer pure intonation. The difference between alternative intonation schemes is largest for the major third interval. Choral directors and other musically expert subjects were asked to adjust to their preference the intonation of 20 major third intervals in synthetic ensemble sounds. The preferred size of the major third was 395.4 cents, with intra-subject averages ranging from 388 to 407 cents.
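
    For reference, interval sizes in cents follow 1200·log2(f2/f1): the pure major third (5:4) is about 386.3 cents, the equal-tempered third is exactly 400 cents, and the Pythagorean third (81:64) about 407.8 cents, so the preferred 395.4 cents lies between pure and equal temperament.

    # Worked example: interval size in cents = 1200 * log2(frequency ratio).
    import math

    def cents(ratio):
        return 1200.0 * math.log2(ratio)

    print(f"pure major third (5:4):          {cents(5 / 4):.1f} cents")        # ~386.3
    print(f"equal-tempered major third:      {cents(2 ** (4 / 12)):.1f} cents")  # 400.0
    print(f"Pythagorean major third (81:64): {cents(81 / 64):.1f} cents")       # ~407.8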

  • 41.
    Vijayan, Aravind Elanjimattathil
    et al.
    KTH. KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Stockholm, Sweden.
    Alexanderson, Simon
    KTH. KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Stockholm, Sweden.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Stockholm, Sweden.
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL. KTH Royal Inst Technol, Sch Elect Engn & Comp Sci, Stockholm, Sweden.
    Using Constrained Optimization for Real-Time Synchronization of Verbal and Nonverbal Robot Behavior2018In: 2018 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE Computer Society, 2018, p. 1955-1961Conference paper (Refereed)
    Abstract [en]

    Most motion re-targeting techniques are grounded in virtual character animation research, which means that they typically assume that the target embodiment has unconstrained joint angular velocities. However, because robots often do have such constraints, traditional re-targeting approaches can introduce irregular delays in the robot motion. With the goal of ensuring synchronization between verbal and nonverbal behavior, this paper proposes an optimization framework for processing re-targeted motion sequences that addresses constraints such as joint angle and angular velocity limits. The proposed framework was evaluated on a humanoid robot using both objective and subjective metrics. The analysis of the joint motion trajectories provides evidence that our framework successfully performs the desired modifications to ensure verbal and nonverbal behavior synchronization, and results from a perceptual study showed that participants found the robot motion generated by our method more natural, elegant, and lifelike than a control condition.
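
    A generic sketch of this kind of constrained post-processing for a single joint (a least-squares fit to the re-targeted reference under angle bounds and a velocity limit); the paper’s actual formulation, and how it couples the optimization to speech timing, may differ.

    # Keep one joint trajectory close to its re-targeted reference while
    # respecting angle bounds and a maximum angular velocity.
    import numpy as np
    from scipy.optimize import minimize

    def constrain_trajectory(q_ref, dt, q_min, q_max, v_max):
        n = len(q_ref)

        def vel_ineq(q):
            dq = np.diff(q)  # per-step angle change
            return np.concatenate([v_max * dt - dq, v_max * dt + dq])  # must be >= 0

        res = minimize(lambda q: np.sum((q - q_ref) ** 2),
                       x0=np.clip(q_ref, q_min, q_max),
                       jac=lambda q: 2.0 * (q - q_ref),
                       bounds=[(q_min, q_max)] * n,
                       constraints={"type": "ineq", "fun": vel_ineq},
                       method="SLSQP")
        return res.x

    # Example: a step-like reference that would exceed a 1.5 rad/s limit.
    dt = 0.05
    q_ref = np.concatenate([np.zeros(10), np.ones(10)])
    q_constrained = constrain_trajectory(q_ref, dt, q_min=-2.0, q_max=2.0, v_max=1.5)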

  • 42. Wistbacka, Greta
    et al.
    Andrade, Pedro Amarante
    Simberg, Susanna
    Hammarberg, Britta
    Sodersten, Maria
    Svec, Jan G.
    Granqvist, Svante
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Resonance Tube Phonation in Water-the Effect of Tube Diameter and Water Depth on Back Pressure and Bubble Characteristics at Different Airflows2018In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588, Vol. 32, no 1, article id UNSP 126.e11Article in journal (Refereed)
    Abstract [en]

    Objectives: Resonance tube phonation with the tube end in water is a voice therapy method in which the patient phonates through a glass tube, keeping the free end of the tube submerged in water, creating bubbles. The purpose of this experimental study was to determine the flow-pressure relationship, the flow thresholds between bubble types, and the bubble frequency as a function of flow and back volume. Methods: A flow-driven vocal tract simulator was used for recording the back pressure produced by resonance tubes with inner diameters of 8 and 9 mm submerged at water depths of 0-7 cm. Visual inspection of bubble types through video recordings was also performed. Results: The static back pressure was largely determined by the water depth. The narrower tube provided a slightly higher back pressure for a given flow and depth. The amplitude of the pressure oscillations increased with flow and depth. With increasing flow, the bubbles were emitted from the tube in three distinct types: one by one, pairwise, and in a chaotic manner. The bubble frequency was slightly higher for the narrower tube. An increase in back volume led to a decrease in bubble frequency. Conclusions: This study provides data on the physical properties of resonance tube phonation with the tube end in water. This information will be useful in future research looking into the possible effects of this type of voice training.
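
    A worked example of the depth dependence noted above: the static component of the back pressure is approximately the hydrostatic pressure ρ·g·h, which is why a submersion depth of h cm corresponds to roughly h cm H2O (the flow- and tube-dependent contribution is not modelled here).

    # Static back pressure from submersion depth: p = rho * g * h.
    RHO_WATER = 1000.0  # kg/m^3
    G = 9.81            # m/s^2

    for depth_cm in (1, 3, 5, 7):
        p_pa = RHO_WATER * G * (depth_cm / 100.0)
        print(f"{depth_cm} cm depth -> {p_pa:.0f} Pa (~ {depth_cm} cm H2O)")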
