  • 1. Ambrazaitis, G.
    et al.
    House, David
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal prominences: Exploring the patterning and usage of focal pitch accents, head beats and eyebrow beats in Swedish television news readings. 2017. In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 95, p. 100-113. Article in journal (Refereed)
    Abstract [en]

    Facial beat gestures align with pitch accents in speech, functioning as visual prominence markers. However, it is not yet well understood whether and how gestures and pitch accents might be combined to create different types of multimodal prominence, and how specifically visual prominence cues are used in spoken communication. In this study, we explore the use and possible interaction of eyebrow (EB) and head (HB) beats with so-called focal pitch accents (FA) in a corpus of 31 brief news readings from Swedish television (four news anchors, 986 words in total), focusing on effects of position in text, information structure as well as speaker expressivity. Results reveal an inventory of four primary (combinations of) prominence markers in the corpus: FA+HB+EB, FA+HB, FA only (i.e., no gesture), and HB only, implying that eyebrow beats tend to occur only in combination with the other two markers. In addition, head beats occur significantly more frequently in the second than in the first part of a news reading. A functional analysis of the data suggests that the distribution of head beats might to some degree be governed by information structure, as the text-initial clause often defines a common ground or presents the theme of the news story. In the rheme part of the news story, FA, HB, and FA+HB are all common prominence markers. The choice between them is subject to variation which we suggest might represent a degree of freedom for the speaker to use the markers expressively. A second main observation concerns eyebrow beats, which seem to be used mainly as a kind of intensification marker for highlighting not only contrast, but also value, magnitude, or emotionally loaded words; it is applicable in any position in a text. We thus observe largely different patterns of occurrence and usage of head beats on the one hand and eyebrow beats on the other, suggesting that the two represent two separate modalities of visual prominence cuing.

  • 2.
    Bisesi, Erica
    et al.
    Centre for Systematic Musicology, University of Graz, Graz, Austria.
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Parncutt, Richard
    Centre for Systematic Musicology, University of Graz, Graz, Austria.
    A Computational Model of Immanent Accent Salience in Tonal Music. 2019. In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 10, no 317, p. 1-19. Article in journal (Refereed)
    Abstract [en]

    Accents are local musical events that attract the attention of the listener, and can be either immanent (evident from the score) or performed (added by the performer). Immanent accents involve temporal grouping (phrasing), meter, melody, and harmony; performed accents involve changes in timing, dynamics, articulation, and timbre. In the past, grouping, metrical and melodic accents were investigated in the context of expressive music performance. We present a novel computational model of immanent accent salience in tonal music that automatically predicts the positions and saliences of metrical, melodic and harmonic accents. The model extends previous research by improving on preliminary formulations of metrical and melodic accents and introducing a new model for harmonic accents that combines harmonic dissonance and harmonic surprise. In an analysis-by-synthesis approach, model predictions were compared with data from two experiments, involving 239 and 638 sonorities, respectively, and 16 musicians and 5 experts in music theory. Average pair-wise correlations between raters were lower for metrical (0.27) and melodic accents (0.37) than for harmonic accents (0.49). In both experiments, when combining all the raters into a single measure expressing their consensus, correlations between ratings and model predictions ranged from 0.43 to 0.62. When the different accent categories were combined, correlations were higher than for the separate categories (r = 0.66). This suggests that raters might use strategies different from individual metrical, melodic or harmonic accent models to mark the musical events.
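
    For illustration only (not the authors' code): the inter-rater agreement figures quoted above can be computed as the average pairwise Pearson correlation between raters, and the consensus measure as the mean rating across raters. A minimal numpy sketch with placeholder array shapes:

        import numpy as np
        from itertools import combinations

        rng = np.random.default_rng(0)
        ratings = rng.normal(size=(16, 239))   # placeholder: 16 raters x 239 sonorities

        # Average pairwise correlation between raters.
        pairs = combinations(range(ratings.shape[0]), 2)
        pairwise_r = [np.corrcoef(ratings[i], ratings[j])[0, 1] for i, j in pairs]
        print("average pairwise correlation:", np.mean(pairwise_r))

        # Consensus measure: mean rating per sonority, to be correlated with model predictions.
        consensus = ratings.mean(axis=0)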

  • 3. Borin, L.
    et al.
    Forsberg, M.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Domeij, R.
    Språkbanken 2018: Research resources for text, speech, & society. 2018. In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 504-506. Conference paper (Refereed)
    Abstract [en]

    We introduce an expanded version of the Swedish research resource Språkbanken (the Swedish Language Bank). In 2018, Språkbanken, which has supported national and international research for over four decades, adds two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text. 

  • 4.
    Bresin, Roberto
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Friberg, Anders
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Dahl, Sofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Toward a new model for sound control. 2001. In: Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, December 6-8, 2001 / [ed] Fernström, M., Brazil, E., & Marshall, M., 2001, p. 45-49. Conference paper (Refereed)
    Abstract [en]

    The control of sound synthesis is a well-known problem. This is particularly true if the sounds are generated with physical modeling techniques that typically need specification of numerous control parameters. In the present work outcomes from studies on automatic music performance are used for tackling this problem. 

  • 5.
    Chettri, Bhusan
    et al.
    Queen Mary Univ London, Sch EECS, London, England..
    Mishra, Saumitra
    Queen Mary Univ London, Sch EECS, London, England..
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Queen Mary Univ London, Sch EECS, London, England..
    Analysing the predictions of a CNN-based replay spoofing detection system. 2018. In: 2018 IEEE Workshop on Spoken Language Technology (SLT 2018), IEEE, 2018, p. 92-97. Conference paper (Refereed)
    Abstract [en]

    Playing recorded speech samples of an enrolled speaker – “replay attack” – is a simple approach to bypass an automatic speaker verification (ASV) system. The vulnerability of ASV systems to such attacks has been acknowledged and studied, but there has been no research into what spoofing detection systems are actually learning to discriminate. In this paper, we analyse the local behaviour of a replay spoofing detection system based on convolutional neural networks (CNNs) adapted from a state-of-the-art CNN (LCNN-FFT) submitted at the ASVspoof 2017 challenge. We generate temporal and spectral explanations for predictions of the model using the SLIME algorithm. Our findings suggest that in most instances of spoofing the model is using information in the first 400 milliseconds of each audio instance to make the class prediction. Knowledge of the characteristics that spoofing detection systems are exploiting can help build less vulnerable ASV systems, other spoofing detection systems, as well as better evaluation databases.

  • 6.
    Chettri, Bhusan
    et al.
    Queen Mary Univ London, Sch EECS, London, England..
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Queen Mary Univ London, Sch EECS, London, England..
    Benetos, Emmanouil
    Queen Mary Univ London, Sch EECS, London, England..
    Analysing replay spoofing countermeasure performance under varied conditions. 2018. In: 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP) / [ed] Pustelnik, N., Ma, Z., Tan, Z.-H., Larsen, J., IEEE, 2018. Conference paper (Refereed)
    Abstract [en]

    In this paper, we aim to understand what makes replay spoofing detection difficult in the context of the ASVspoof 2017 corpus. We use FFT spectra, mel frequency cepstral coefficients (MFCC) and inverted MFCC (IMFCC) frontends and investigate different back-ends based on Convolutional Neural Networks (CNNs), Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs). On this database, we find that IMFCC frontend based systems show smaller equal error rate (EER) for high quality replay attacks but higher EER for low quality replay attacks in comparison to the baseline. However, we find that it is not straightforward to understand the influence of an acoustic environment (AE), a playback device (PD) and a recording device (RD) of a replay spoofing attack. One reason is the unavailability of metadata for genuine recordings. Second, it is difficult to account for the effects of the factors: AE, PD and RD, and their interactions. Finally, our frame-level analysis shows that the presence of cues (recording artefacts) in the first few frames of genuine signals (missing from replayed ones) influence class prediction.
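
    As a sketch of the kind of MFCC frontend plus GMM back-end referred to above (illustrative only, not the ASVspoof baseline or the authors' code; the feature settings and the file lists passed to train_backend are assumptions):

        import numpy as np
        import librosa
        from sklearn.mixture import GaussianMixture

        def mfcc_frames(path, n_mfcc=30):
            """MFCC frontend: one feature vector per frame."""
            y, sr = librosa.load(path, sr=16000)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

        def train_backend(genuine_files, spoof_files, n_components=512):
            """Fit one GMM per class on pooled frames from the training files."""
            stack = lambda files: np.vstack([mfcc_frames(f) for f in files])
            gmm_gen = GaussianMixture(n_components=n_components, covariance_type="diag").fit(stack(genuine_files))
            gmm_spf = GaussianMixture(n_components=n_components, covariance_type="diag").fit(stack(spoof_files))
            return gmm_gen, gmm_spf

        def llr_score(path, gmm_gen, gmm_spf):
            """Average per-frame log-likelihood ratio; higher means 'more genuine'."""
            x = mfcc_frames(path)
            return gmm_gen.score(x) - gmm_spf.score(x)

    An equal error rate would then be obtained by sweeping a decision threshold over llr_score on the evaluation files.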

  • 7.
    Dabbaghchian, Saeed
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Computational Modeling of the Vocal Tract: Applications to Speech Production. 2018. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Human speech production is a complex process, involving neuromuscular control signals, the effects of articulators' biomechanical properties and acoustic wave propagation in a vocal tract tube of intricate shape. Modeling these phenomena may play an important role in advancing our understanding of the involved mechanisms, and may also have future medical applications, e.g., guiding doctors in the diagnosis, treatment planning, and surgery prediction of related disorders such as oral cancer, cleft palate, obstructive sleep apnea, and dysphagia.

    A more complete understanding requires models that are as truthful representations as possible of the phenomena. Due to the complexity of such modeling, simplifications have nevertheless been used extensively in speech production research: phonetic descriptors (such as the position and degree of the most constricted part of the vocal tract) are used as control signals, the articulators are represented as two-dimensional geometrical models, the vocal tract is considered as a smooth tube and plane wave propagation is assumed, etc.

    This thesis aims at firstly investigating the consequences of such simplifications, and secondly at contributing to establishing unified modeling of the speech production process, by connecting three-dimensional biomechanical modeling of the upper airway with three-dimensional acoustic simulations. The investigation on simplifying assumptions demonstrated the influence of vocal tract geometry features — such as shape representation, bending and lip shape — on its acoustic characteristics, and that the type of modeling — geometrical or biomechanical — affects the spatial trajectories of the articulators, as well as the transition of formant frequencies in the spectrogram.

    The unification of biomechanical and acoustic modeling in three dimensions allows the acoustic output of dynamic sounds, such as vowel-vowel utterances, to be controlled realistically by contraction of the relevant muscles. This moves and shapes the speech articulators that in turn define the vocal tract tube in which the wave propagation occurs. The main contribution of the thesis in this line of work is a novel and complex method that automatically reconstructs the shape of the vocal tract from the biomechanical model. This step is essential to link biomechanical and acoustic simulations, since the vocal tract, which anatomically is a cavity enclosed by different structures, is only implicitly defined in a biomechanical model constituted of several distinct articulators.

  • 8.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Arnela, Marc
    GTM Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, Barcelona, Spain.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Guasch, Oriol
    GTM Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, Barcelona, Spain.
    Reconstruction of vocal tract geometries from biomechanical simulations. 2018. In: International Journal for Numerical Methods in Biomedical Engineering, ISSN 2040-7939, E-ISSN 2040-7947. Article in journal (Refereed)
    Abstract [en]

    Medical imaging techniques are usually utilized to acquire the vocal tract geometry in 3D, which may then be used, e.g., for acoustic/fluid simulation. As an alternative, such a geometry may also be acquired from a biomechanical simulation, which allows the anatomy and/or articulation to be altered in order to study a variety of configurations. In a biomechanical model, each physical structure is described by its geometry and its properties (such as mass, stiffness, and muscles). In such a model, the vocal tract itself does not have an explicit representation, since it is a cavity rather than a physical structure. Instead, its geometry is defined implicitly by all the structures surrounding the cavity, and such an implicit representation may not be suitable for visualization or for acoustic/fluid simulation. In this work, we propose a method to reconstruct the vocal tract geometry at each time step during the biomechanical simulation. The complexity of the problem, which arises from model alignment artifacts, is addressed by the proposed method. In addition to the main cavity, other small cavities, including the piriform fossa, the sublingual cavity, and the interdental space, can be reconstructed. These cavities may appear or disappear depending on the position of the larynx, the mandible, and the tongue. To illustrate our method, various static and temporal geometries of the vocal tract are reconstructed and visualized. As a proof of concept, the reconstructed geometries of three cardinal vowels are further used in an acoustic simulation, and the corresponding transfer functions are derived.

  • 9.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Arnela, Marc
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Guasch, Oriol
    Synthesis of vowels and vowel-vowel utterances using a 3D biomechanical-acoustic model. 2018. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, ISSN 2329-9290. Article in journal (Refereed)
    Abstract [en]

    A link is established between a 3D biomechanical model and an acoustic model, allowing for the numerical synthesis of vowel sounds by contraction of the relevant muscles. That is, the contraction of muscles in the biomechanical model displaces and deforms the articulators, which in turn deform the vocal tract shape. The mixed wave equation for the acoustic pressure and particle velocity is formulated in an arbitrary Lagrangian-Eulerian framework to account for moving boundaries. The equations are solved numerically using the finite element method. Since the muscle activations are not fully known for a given vowel sound, an inverse method is employed to calculate a plausible activation pattern. For vowel-vowel utterances, two different approaches are utilized: linear interpolation in either muscle activation or geometrical space. Although the former is the natural choice for biomechanical modeling, the latter is used to investigate the contribution of biomechanical modeling to speech acoustics. Six vowels [ɑ, ə, ɛ, e, i, ɯ] and three vowel-vowel utterances [ɑi, ɑɯ, ɯi] are synthesized using the 3D model. Results, including articulation, formants, and spectrograms of the vowel-vowel sounds, are in agreement with previous studies. Comparing the spectrograms of interpolation in muscle and geometrical space reveals differences at all frequencies, with the most extended difference in the second formant transition.

  • 10.
    Dabbaghchian, Saeed
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Nilsson, Isak
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    From Tongue Movement Data to Muscle Activation – A Preliminary Study of Artisynth's Inverse Modelling. 2014. Conference paper (Other academic)
    Abstract [en]

    Finding the muscle activations during speech production is an important part of developing a comprehensive biomechanical model of speech production. Although there are direct methods for measuring muscle activations, such as electromyography, these are usually highly invasive and sometimes unreliable. They are moreover impossible to use for all muscles. In this study we therefore explore an indirect way to estimate tongue muscle activations during speech production by combining Electromagnetic Articulography (EMA) measurements of tongue movements with inverse modeling in Artisynth. With EMA we measure the time-changing 3D positions of four sensors attached to the tongue surface for a Swedish female subject producing vowel-vowel and vowel-consonant-vowel (VCV) sequences. The measured sensor positions are used as target points for corresponding virtual sensors introduced in the tongue model of Artisynth's inverse modelling framework, which computes one possible combination of muscle activations that results in the observed sequence of tongue articulations. We present resynthesized tongue movements in the Artisynth model and verify the results by comparing the calculated muscle activations with the literature.

  • 11.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Deep Layered Learning in MIR. Manuscript (preprint) (Other academic)
    Abstract [en]

    Deep learning has boosted the performance of many music information retrieval (MIR) systems in recent years. Yet, the complex hierarchical arrangement of music makes end-to-end learning hard for some MIR tasks – a very deep and structurally flexible processing chain is necessary to extract high-level features from a spectrogram representation. Mid-level representations such as tones, pitched onsets, chords, and beats are fundamental building blocks of music. This paper discusses how these can be used as intermediate representations in MIR to facilitate deep processing that generalizes well: each music concept is predicted individually in learning modules that are connected through latent representations in a directed acyclic graph. It is suggested that this strategy for inference, defined as deep layered learning (DLL), can help generalization by (1) enforcing the validity of intermediate representations during processing, and by (2) letting the inferred representations establish disentangled structures that support high-level invariant processing. A background to DLL and modular music processing is provided, and relevant concepts such as pruning, skip connections, and layered performance supervision are reviewed.

  • 12.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Modeling Music: Studies of Music Transcription, Music Perception and Music Production. 2018. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    This dissertation presents ten studies focusing on three important subfields of music information retrieval (MIR): music transcription (Part A), music perception (Part B), and music production (Part C).

    In Part A, systems capable of transcribing rhythm and polyphonic pitch are described. The first two publications present methods for tempo estimation and beat tracking. A method is developed for computing the most salient periodicity (the “cepstroid”), and the computed cepstroid is used to guide the machine learning processing. The polyphonic pitch tracking system uses novel pitch-invariant and tone-shift-invariant processing techniques. Furthermore, the neural flux is introduced – a latent feature for onset and offset detection. The transcription systems use a layered learning technique with separate intermediate networks of varying depth.  Important music concepts are used as intermediate targets to create a processing chain with high generalization. State-of-the-art performance is reported for all tasks.

    Part B is devoted to perceptual features of music, which can be used as intermediate targets or as parameters for exploring fundamental music perception mechanisms. Systems are proposed that can predict the perceived speed and performed dynamics of an audio file with high accuracy, using the average ratings from around 20 listeners as ground truths. In Part C, aspects related to music production are explored. The first paper analyzes long-term average spectrum (LTAS) in popular music. A compact equation is derived to describe the mean LTAS of a large dataset, and the variation is visualized. Further analysis shows that the level of the percussion is an important factor for LTAS. The second paper examines songwriting and composition through the development of an algorithmic composer of popular music. Various factors relevant for writing good compositions are encoded, and a listening test employed that shows the validity of the proposed methods.

    The dissertation is concluded by Part D - Looking Back and Ahead, which acts as a discussion and provides a road-map for future work. The first paper discusses the deep layered learning (DLL) technique, outlining concepts and pointing out a direction for future MIR implementations. It is suggested that DLL can help generalization by enforcing the validity of intermediate representations, and by letting the inferred representations establish disentangled structures supporting high-level invariant processing. The second paper proposes an architecture for tempo-invariant processing of rhythm with convolutional neural networks. Log-frequency representations of rhythm-related activations are suggested at the main stage of processing. Methods relying on magnitude, relative phase, and raw phase information are described for a wide variety of rhythm processing tasks.

  • 13.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Polyphonic Pitch Tracking with Deep Layered Learning. Manuscript (preprint) (Other academic)
    Abstract [en]

    This paper presents a polyphonic pitch tracking system able to extract both framewise and note-based estimates from audio. The system uses six artificial neural networks in a deep layered learning setup. First, cascading networks are applied to a spectrogram for framewise fundamental frequency (f0) estimation. A sparse receptive field is learned by the first network and then used for weight-sharing throughout the system. The f0 activations are connected across time to extract pitch ridges. These ridges define a framework, within which subsequent networks perform tone-shift-invariant onset and offset detection. The networks convolve the pitch ridges across time, using as input, e.g., variations of latent representations from the f0 estimation networks, defined as the “neural flux.” Finally, incorrect tentative notes are removed one by one in an iterative procedure that allows a network to classify notes within an accurate context. The system was evaluated on four public test sets: MAPS, Bach10, TRIOS, and the MIREX Woodwind quintet, and achieved state-of-the-art results for all four datasets. It performs well across all subtasks: f0, pitched onset, and pitched offset tracking.

  • 14.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Tempo-Invariant Processing of Rhythm with Convolutional Neural Networks. Manuscript (preprint) (Other academic)
    Abstract [en]

    Rhythm patterns can be performed with a wide variation of tempi. This presents a challenge for many music information retrieval (MIR) systems; ideally, perceptually similar rhythms should be represented and processed similarly, regardless of the specific tempo at which they were performed. Several recent systems for tempo estimation, beat tracking, and downbeat tracking have therefore sought to process rhythm in a tempo-invariant way, often by sampling input vectors according to a precomputed pulse level. This paper describes how a log-frequency representation of rhythm-related activations instead can promote tempo invariance when processed with convolutional neural networks. The strategy incorporates invariance at a fundamental level and can be useful for most tasks related to rhythm processing. Different methods are described, relying on magnitude, phase relationships of different rhythm channels, as well as raw phase information. Several variations are explored to provide direction for future implementations.

  • 15.
    Fallgren, P.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Z.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A tool for exploring large amounts of found audio data. 2018. In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 499-503. Conference paper (Refereed)
    Abstract [en]

    We demonstrate a method and a set of open source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will contain versions of a set of functionalities in their first stages, and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

  • 16.
    Fermoselle, Leonor
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Designing joint attention systems for robots that assist children with autism spectrum disorders. 2018. Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Joint attention behaviours play a central role in natural and believable human-robot interactions. This research presents the design decisions of a semi-autonomous joint attention robotic system, together with the evaluation of its effectiveness and perceived social presence across different cognitive ability groups. For this purpose, two different studies were carried out: first with adults, and then with children between 10 and 12 years old.

    The overall results for both studies reflect a system that is perceived as socially present and engaging which can successfully establish joint attention with the participants. When comparing the performance results between the two groups, children achieved higher joint attention scores and reported a higher level of enjoyment and helpfulness in the interaction.

    Furthermore, a detailed literature review on robot-assisted therapies for children with autism spectrum disorders is presented, focusing on the development of joint attention skills. The children’s positive interaction results from the studies, together with state-of-the-art research therapies and the input from an autism therapist, guided the author to elaborate some design guidelines for a robotic system to assist in joint attention focused autism therapies.

  • 17. Finkel, Sebastian
    et al.
    Veit, Ralf
    Lotze, Martin
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Vuust, Peter
    Soekadar, Surjo
    Birbaumer, Niels
    Kleber, Boris
    Intermittent theta burst stimulation over right somatosensory larynx cortex enhances vocal pitch‐regulation in nonsingers. 2019. In: Human Brain Mapping, ISSN 1065-9471, E-ISSN 1097-0193. Article in journal (Refereed)
    Abstract [en]

    While the significance of auditory cortical regions for the development and maintenance of speech motor coordination is well established, the contribution of somatosensory brain areas to learned vocalizations such as singing is less well understood. To address these mechanisms, we applied intermittent theta burst stimulation (iTBS), a facilitatory repetitive transcranial magnetic stimulation (rTMS) protocol, over right somatosensory larynx cortex (S1) and a nonvocal dorsal S1 control area in participants without singing experience. A pitch‐matching singing task was performed before and after iTBS to assess corresponding effects on vocal pitch regulation. When participants could monitor auditory feedback from their own voice during singing (Experiment I), no difference in pitch‐matching performance was found between iTBS sessions. However, when auditory feedback was masked with noise (Experiment II), only larynx‐S1 iTBS enhanced pitch accuracy (50–250 ms after sound onset) and pitch stability (>250 ms after sound onset until the end). Results indicate that somatosensory feedback plays a dominant role in vocal pitch regulation when acoustic feedback is masked. The acoustic changes moreover suggest that right larynx‐S1 stimulation affected the preparation and involuntary regulation of vocal pitch accuracy, and that kinesthetic‐proprioceptive processes play a role in the voluntary control of pitch stability in nonsingers. Together, these data provide evidence for a causal involvement of right larynx‐S1 in vocal pitch regulation during singing.

  • 18.
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Commentary on Polak: How short is the shortest metric subdivision? 2017. In: Empirical Musicology Review, ISSN 1559-5749, E-ISSN 1559-5749, Vol. 12, no 3-4, p. 227-228. Article in journal (Other academic)
    Abstract [en]

    This commentary relates to the target paper by Polak on the shortest metric subdivision by presenting measurements on West African drum music. It provides new evidence that the perceptual lower limit of tone duration is within the range 80-100 ms. Using fairly basic measurement techniques in combination with a musical analysis of the content, the original results in this study represent a valuable addition to the literature. Considering the relevance for music listening, further research would be valuable for determining and understanding the nature of this perceptual limit.

  • 19.
    Friberg, Anders
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Bisesi, Erica
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Inst Pasteur, France.
    Addessi, Anna Rita
    Univ Bologna, Dept Educ Studies, Bologna, Italy..
    Baroni, Mario
    Univ Bologna, Dept Arts, Bologna, Italy..
    Probing the Underlying Principles of Perceived Immanent Accents Using a Modeling Approach. 2019. In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 10, article id 1024. Article in journal (Refereed)
    Abstract [en]

    This article deals with the question of how the perception of the "immanent accents" can be predicted and modeled. By immanent accent we mean any musical event in the score that is related to important points in the musical structure (e.g., tactus positions, melodic peaks) and is therefore able to capture the attention of a listener. Our aim was to investigate the underlying principles of these accented notes by combining quantitative modeling, music analysis and experimental methods. A listening experiment was conducted where 30 participants indicated perceived accented notes for 60 melodies, vocal and instrumental, selected from Baroque, Romantic and Posttonal styles. This produced a large and unique collection of perceptual data about the perceived immanent accents, organized by styles consisting of vocal and instrumental melodies within Western art music. The music analysis of the indicated accents provided a preliminary list of musical features that could be identified as possible reasons for the raters' perception of the immanent accents. These features related to the score in different ways, e.g., repeated fragments, single notes, or overall structure. A modeling approach was used to quantify the influence of feature groups related to pitch contour, tempo, timing, simple phrasing, and meter. A set of 43 computational features was defined from the music analysis and previous studies and extracted from the score representation. The mean ratings of the participants were predicted using multiple linear regression and support vector regression. The latter method (using cross-validation) obtained the best result of about 66% explained variance (r = 0.81) across all melodies and for a selected group of raters. The independent contribution of each feature group was relatively high for pitch contour and timing (9.6 and 7.0%). There were also significant contributions from tempo (4.5%), simple phrasing (4.4%), and meter (3.9%). Interestingly, the independent contribution varied greatly across participants, implying different listener strategies, and also some variability across different styles. The large differences among listeners emphasize the importance of considering the individual listener's perception in future research in music perception.
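
    A minimal sketch of the regression setup described above (random placeholder data stands in for the 43 score-based features and the mean ratings; this is not the authors' implementation):

        import numpy as np
        from sklearn.svm import SVR
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.model_selection import cross_val_predict

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 43))   # placeholder: one row of 43 features per note
        y = rng.normal(size=500)         # placeholder: mean accent rating per note

        model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
        pred = cross_val_predict(model, X, y, cv=10)    # cross-validated predictions

        r = np.corrcoef(y, pred)[0, 1]
        print(f"r = {r:.2f}, explained variance = {r**2:.2f}")   # the paper reports r = 0.81 (~66%)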

  • 20.
    Friberg, Anders
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Lindeberg, Tony
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Hellwagner, Martin
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Helgason, Pétur
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Salomão, Gláucia Laís
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Elovsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Lemaitre, Guillaume
    Institute for Research and Coordination in Acoustics and Music, Paris, France.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Prediction of three articulatory categories in vocal sound imitations using models for auditory receptive fields. 2018. In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 144, no 3, p. 1467-1483. Article in journal (Refereed)
    Abstract [en]

    Vocal sound imitations provide a new challenge for understanding the coupling between articulatory mechanisms and the resulting audio. In this study, we have modeled the classification of three articulatory categories, phonation, supraglottal myoelastic vibrations, and turbulence from audio recordings. Two data sets were assembled, consisting of different vocal imitations by four professional imitators and four non-professional speakers in two different experiments. The audio data were manually annotated by two experienced phoneticians using a detailed articulatory description scheme. A separate set of audio features was developed specifically for each category using both time-domain and spectral methods. For all time-frequency transformations, and for some secondary processing, the recently developed Auditory Receptive Fields Toolbox was used. Three different machine learning methods were applied for predicting the final articulatory categories. The result with the best generalization was found using an ensemble of multilayer perceptrons. The cross-validated classification accuracy was 96.8 % for phonation, 90.8 % for supraglottal myoelastic vibrations, and 89.0 % for turbulence using all the 84 developed features. A final feature reduction to 22 features yielded similar results.
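
    For illustration, an ensemble of multilayer perceptrons with cross-validated accuracy for one articulatory category could be set up as below (a sketch with placeholder data standing in for the 84 features; not the published system):

        import numpy as np
        from sklearn.ensemble import VotingClassifier
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 84))        # placeholder feature matrix
        y = rng.integers(0, 2, size=1000)      # placeholder labels (e.g., phonation yes/no)

        # Ensemble of 10 MLPs with different random initializations, combined by soft voting.
        members = [(f"mlp{i}", make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=i)))
                   for i in range(10)]
        ensemble = VotingClassifier(estimators=members, voting="soft")

        acc = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
        print("cross-validated accuracy:", acc.mean())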

  • 21.
    Hallström, Eric
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Mossmyr, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Vegeborn, Victor
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Wedin, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    From Jigs and Reels to Schottisar och Polskor: Generating Scandinavian-like Folk Music with Deep Recurrent Networks. 2019. Conference paper (Refereed)
    Abstract [en]

    The use of recurrent neural networks for modeling and generating music has been shown to be quite effective for compact, textual transcriptions of traditional music from Ireland and the UK. We explore how well these models perform for textual transcriptions of traditional music from Scandinavia. This type of music has characteristics that are similar to and different from that of Irish music, e.g., mode, rhythm, and structure. We investigate the effects of different architectures and training regimens, and evaluate the resulting models using three methods: a comparison of statistics between real and generated transcriptions, an appraisal of generated transcriptions via a semi-structured interview with an expert in Swedish folk music, and an exercise conducted with students of Scandinavian folk music. We find that some of our models can generate new transcriptions sharing characteristics with Scandinavian folk music, but which often lack the simplicity of real transcriptions. One of our models has been implemented online at http://www.folkrnn.org for anyone to try.
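
    The core of such models can be illustrated with a character-level LSTM language model over the textual transcriptions (a minimal PyTorch sketch; the tiny placeholder corpus, vocabulary handling and layer sizes are assumptions, not the folkrnn implementation):

        import torch
        import torch.nn as nn

        corpus = "X:1\nM:3/4\nK:Gmaj\n|:GABc dedB|c2Ac BAGB|"   # placeholder transcription text
        vocab = sorted(set(corpus))
        stoi = {c: i for i, c in enumerate(vocab)}

        class CharLSTM(nn.Module):
            def __init__(self, vocab_size, emb=64, hidden=256, layers=3):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb)
                self.lstm = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
                self.out = nn.Linear(hidden, vocab_size)

            def forward(self, x, state=None):
                h, state = self.lstm(self.embed(x), state)
                return self.out(h), state

        model = CharLSTM(len(vocab))
        ids = torch.tensor([[stoi[c] for c in corpus]])          # (1, sequence length)
        logits, _ = model(ids[:, :-1])                           # predict the next character
        loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)),
                                           ids[:, 1:].reshape(-1))
        loss.backward()   # an optimizer step would follow; sampling then generates new tunes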

  • 22.
    Holzapfel, André
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Media Technology and Interaction Design, MID.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Coeckelbergh, Mark
    Department of Philosophy, University of Vienna, Vienna, Austria.
    Ethical Dimensions of Music Information Retrieval Technology. 2018. In: Transactions of the International Society for Music Information Retrieval, E-ISSN 2514-3298, Vol. 1, no 1, p. 44-55. Article in journal (Refereed)
    Abstract [en]

    This article examines ethical dimensions of Music Information Retrieval (MIR) technology.  It uses practical ethics (especially computer ethics and engineering ethics) and socio-technical approaches to provide a theoretical basis that can inform discussions of ethics in MIR. To help ground the discussion, the article engages with concrete examples and discourse drawn from the MIR field. This article argues that MIR technology is not value-neutral but is influenced by design choices, and so has unintended and ethically relevant implications. These can be invisible unless one considers how the technology relates to wider society. The article points to the blurring of boundaries between music and technology, and frames music as “informationally enriched” and as a “total social fact.” The article calls attention to biases that are introduced by algorithms and data used for MIR technology, cultural issues related to copyright, and ethical problems in MIR as a scientific practice. The article concludes with tentative ethical guidelines for MIR developers, and calls for addressing key ethical problems with MIR technology and practice, especially those related to forms of bias and the remoteness of the technology development from end users.

  • 23.
    Jansson, Erik V.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kabała, A.
    On the influence of arching and material on the vibration of a shell - Towards understanding the soloist violin. 2018. In: Vibrations in Physical Systems, ISSN 0860-6897, Vol. 29, article id 2018027. Article in journal (Refereed)
    Abstract [en]

    Results of FEM simulations of plate and shell models are presented with reference to violin vibration problems. The influence of arching, variable thickness and damping was considered. The ABAQUS/Explicit “Dynamic Explicit” procedure was used in the simulations. Anisotropy in the material properties (spruce) was taken into account (9 elastic constants).

  • 24.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Bystedt, Mattias
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Fallgren, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Lopes, José David Aguas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Mascarenhas, Samuel
    GAIPS INESC-ID, Lisbon, Portugal.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Raveh, Eran
    Multimodal Computing and Interaction, Saarland University, Germany.
    Shore, Todd
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    FARMI: A Framework for Recording Multi-Modal Interactions. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris: European Language Resources Association, 2018, p. 3969-3974. Conference paper (Refereed)
    Abstract [en]

    In this paper we present (1) a processing architecture used to collect multi-modal sensor data, both for corpora collection and real-time processing, (2) an open-source implementation thereof and (3) a use-case where we deploy the architecture in a multi-party deception game, featuring six human players and one robot. The architecture is agnostic to the choice of hardware (e.g. microphones, cameras, etc.) and programming languages, although our implementation is mostly written in Python. In our use-case, different methods of capturing verbal and non-verbal cues from the participants were used. These were processed in real-time and used to inform the robot about the participants’ deceptive behaviour. The framework is of particular interest for researchers who are interested in the collection of multi-party, richly recorded corpora and the design of conversational systems. Moreover, for researchers who are interested in human-robot interaction, the available modules offer the possibility to easily create both autonomous and Wizard-of-Oz interactions.

  • 25.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Crowdsourced Multimodal Corpora Collection Tool. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 728-734. Conference paper (Refereed)
    Abstract [en]

    In recent years, more and more multimodal corpora have been created. To our knowledge there is no publicly available tool which allows for acquiring controlled multimodal data of people in a rapid and scalable fashion. We therefore are proposing (1) a novel tool which will enable researchers to rapidly gather large amounts of multimodal data spanning a wide demographic range, and (2) an example of how we used this tool for corpus collection of our “Attentive listener” multimodal corpus. The code is released under an Apache License 2.0 and available as an open-source repository, which can be found at https://github.com/kth-social-robotics/multimodal-crowdsourcing-tool. This tool will allow researchers to set up their own multimodal data collection system quickly and create their own multimodal corpora. Finally, this paper provides a discussion of the advantages and disadvantages of a crowd-sourced data collection tool, especially in comparison to lab-recorded corpora.

  • 26.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The Trade-off between Interaction Time and Social Facilitation with Collaborative Social Robots. 2019. In: The Challenges of Working on Social Robots that Collaborate with People, 2019. Conference paper (Refereed)
    Abstract [en]

    The adoption of social robots and conversational agents is growing at a rapid pace. These agents, however, are still not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this paper, we discuss the effects of simulating anthropomorphism and non-verbal social behaviour in social robots and its implications for human-robot collaborative guided tasks. Our results indicate that it is not always favourable for agents to be anthropomorphised or to communicate with nonverbal behaviour. We found a clear trade-off between interaction time and social facilitation when controlling for anthropomorphism and social behaviour.

  • 27.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Avramova, Vanya
    KTH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction. 2018. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data, in that we captured speech, eye gaze and gesture data using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion, from the perspectives of both speaker and listener. We annotated the groups’ object references during object manipulation tasks and analysed the groups’ proportional referential eye-gaze with regard to the referent object. When investigating the distributions of gaze during and before referring expressions we could corroborate the differences in time between speakers’ and listeners’ eye gaze found in earlier studies. This corpus is of particular interest to researchers who are interested in social eye-gaze patterns in turn-taking and referring language in situated multi-party interaction.

  • 28.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Sibirtseva, Elena
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Pereira, André
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal reference resolution in collaborative assembly tasks. 2018. In: Multimodal reference resolution in collaborative assembly tasks, ACM Digital Library, 2018. Conference paper (Refereed)
  • 29.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The Effects of Embodiment and Social Eye-Gaze in Conversational Agents. 2019. In: Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci), 2019. Conference paper (Refereed)
    Abstract [en]

    The adoption of conversational agents is growing at a rapid pace. Agents, however, are not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this work, we explore the effects of simulating anthropomorphism and social eye-gaze in three conversational agents. We tested whether subjects’ visual attention would be similar to agents in different forms of embodiment and social eye-gaze. In a within-subject situated interaction study (N=30), we asked subjects to engage in task-oriented dialogue with a smart speaker and two variations of a social robot. We observed shifting of interactive behaviour by human users, as shown in differences in behavioural and objective measures. With a trade-off in task performance, social facilitation is higher with more anthropomorphic social agents when performing the same task.

  • 30.
    Kragic, Danica
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Karaoǧuz, Hakan
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Jensfelt, Patric
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Krug, Robert
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Interactive, collaborative robots: Challenges and opportunities. 2018. In: IJCAI International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, 2018, p. 18-25. Conference paper (Refereed)
    Abstract [en]

    Robotic technology has transformed the manufacturing industry ever since the first industrial robot was put into use at the beginning of the 1960s. The challenge of developing flexible solutions where production lines can be quickly re-planned, adapted and structured for new or slightly changed products is still an important open problem. Industrial robots today are still largely preprogrammed for their tasks, not able to detect errors in their own performance or to robustly interact with a complex environment and a human worker. The challenges are even more serious when it comes to various types of service robots. Full robot autonomy, including natural interaction, learning from and with humans, and safe and flexible performance for challenging tasks in unstructured environments, will remain out of reach for the foreseeable future. In the envisioned future factory setups, home and office environments, humans and robots will share the same workspace and perform different object manipulation tasks in a collaborative manner. We discuss some of the major challenges of developing such systems and provide examples of the current state of the art.

  • 31.
    Kucherenko, Taras
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Hasegawa, Dai
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kaneko, Naoshi
    Kjellström, Hedvig
    KTH, Superseded Departments (pre-2005), Numerical Analysis and Computer Science, NADA. KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Analyzing Input and Output Representations for Speech-Driven Gesture Generation. 2019. In: 19th ACM International Conference on Intelligent Virtual Agents, New York, NY, USA: ACM Publications, 2019. Conference paper (Refereed)
    Abstract [en]

    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.

    Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.

    We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
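    As a purely illustrative sketch of the two-step approach described above (this is not the authors' released implementation; the module names follow the abstract, while the pose dimensionality, speech-feature dimensionality, latent size and layer widths are assumptions), the pipeline could be organised roughly as follows in PyTorch:

    import torch
    import torch.nn as nn

    class MotionE(nn.Module):              # motion encoder: pose frame -> latent code
        def __init__(self, pose_dim=45, latent_dim=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        def forward(self, x):
            return self.net(x)

    class MotionD(nn.Module):              # motion decoder: latent code -> pose frame
        def __init__(self, pose_dim=45, latent_dim=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, pose_dim))
        def forward(self, z):
            return self.net(z)

    class SpeechE(nn.Module):              # speech encoder: speech features -> motion latent
        def __init__(self, speech_dim=26, latent_dim=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(speech_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        def forward(self, s):
            return self.net(s)

    # Step 1: train MotionE + MotionD as a denoising autoencoder on motion frames.
    # Step 2: train SpeechE to predict MotionE's latent codes from speech features.
    # At test time: gestures = MotionD(SpeechE(speech_features)).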

  • 32.
    Kucherenko, Taras
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Hasegawa, Dai
    Hokkai Gakuen University, Sapporo, Japan.
    Naoshi, Kaneko
    Aoyama Gakuin University, Sagamihara, Japan.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract2019Conference paper (Refereed)
    Abstract [en]

    This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features as input and produces gestures in the form of sequences of 3D joint coordinates representing motion as output. The results of objective and subjective evaluations confirm the benefits of the representation learning.

  • 33.
    Li, Chengjie
    et al.
    KTH.
    Androulakaki, Theofronia
    KTH.
    Gao, Alex Yuan
    Yang, Fangkai
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Saikia, Himangshu
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Peters, Christopher
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Effects of Posture and Embodiment on Social Distance in Human-Agent Interaction in Mixed Reality2018In: Proceedings of the 18th International Conference on Intelligent Virtual Agents, ACM Digital Library, 2018, p. 191-196Conference paper (Refereed)
    Abstract [en]

    Mixed reality offers new potentials for social interaction experiences with virtual agents. In addition, it can be used to experiment with the design of physical robots. However, while previous studies have investigated comfortable social distances between humans and artificial agents in real and virtual environments, there is little data with regard to mixed reality environments. In this paper, we conducted an experiment in which participants were asked to walk up to an agent to ask a question, in order to investigate the social distances maintained, as well as the subjects' experience of the interaction. We manipulated both the embodiment of the agent (robot vs. human and virtual vs. physical) and the closed vs. open posture of the agent. The virtual agent was displayed using a mixed reality headset. Our experiment involved 35 participants in a within-subject design. We show that, in the context of social interactions, mixed reality fares well against physical environments, and robots fare well against humans, barring a few technical challenges.

  • 34.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Żygis, M.
    Lexical stress in Polish: Evidence from focus and phrase-position differentiated production data2018In: Proceedings of the International Conference on Speech Prosody, International Speech Communication Association, 2018, p. 1008-1012Conference paper (Refereed)
    Abstract [en]

    We examine acoustic patterns of word stress in Polish in data with carefully separated phrase- and word-level prominences. We aim to verify claims in the literature regarding the phonetic and phonological status of lexical stress (both primary and secondary) in Polish and to contribute to a better understanding of prosodic prominence and boundary interactions. Our results show significant effects of primary stress on acoustic parameters such as duration, f0 functionals and spectral emphasis expected for a stress language. We do not find clear and systematic acoustic evidence for secondary stress.

  • 35.
    Näslund, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Artificial Neural Networks in Swedish Speech Synthesis2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Text-to-speech (TTS) systems have entered our daily lives in the form of smart assistants and many other applications. Contemporary research applies machine learning and artificial neural networks (ANNs) to synthesize speech. It has been shown that these systems outperform the older concatenative and parametric methods.

    In this paper, ANN-based methods for speech synthesis are explored and one of the methods is implemented for the Swedish language. The implemented method is dubbed “Tacotron” and is a first step towards end-to-end ANN-based TTS which puts many different ANN techniques to work. The resulting system is compared to a parametric TTS through a strength-of-preference test that is carried out with 20 Swedish-speaking subjects. A statistically significant preference for the ANN-based TTS is found. Test subjects indicate that the ANN-based TTS performs better than the parametric TTS when it comes to audio quality and naturalness but sometimes lacks in intelligibility.
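    As a worked example of how a preference among 20 listeners can be checked for statistical significance (an illustration only: the thesis used a strength-of-preference test, whereas a plain two-sided sign test on binary preferences is shown here, and the count of 16 is hypothetical):

    from scipy.stats import binomtest

    n_subjects = 20       # as in the study
    n_prefer_ann = 16     # hypothetical count, for illustration only
    result = binomtest(n_prefer_ann, n_subjects, p=0.5, alternative='two-sided')
    print(result.pvalue)  # about 0.012, i.e. significant at the 0.05 level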

  • 36.
    Pabon, Peter
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Institute of Sonology, Royal Conservatory.
    Mapping Individual Voice Quality over the Voice Range: The Measurement Paradigm of the Voice Range Profile2018Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    The acoustic signal of voiced sounds has two primary attributes: fundamental frequency and sound level. It also has very many secondary attributes, or ‘voice qualities’, that can be derived from the acoustic signal, in particular from its periodicity and its spectrum. Acoustic voice analysis as a discipline is largely concerned with identifying and quantifying those qualities or parameters that are relevant for assessing the health or training status of a voice, or that characterize its individual quality. The thesis presented here is that all such voice qualities covary essentially and individually with the fundamental frequency and the sound level, and that methods for assessing the voice must account for this covariation and individuality. The central interest in the "voice field" measurement paradigm thus becomes mapping the proportional dependencies that exist between voice parameters. The five studies contribute to ways of doing this in practice, while the framework text presents the theoretical basis for the analysis model in relation to the practical principles.

  • 37.
    Pabon, Peter
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics. Royal Conservatoire, The Hague, Netherlands.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Feature maps of the acoustic spectrum of the voice2018In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588Article in journal (Refereed)
    Abstract [en]

    The change in the spectrum of sustained /a/ vowels was mapped over the voice range from low to high fundamental frequency and low to high sound pressure level (SPL), in the form of the so-called voice range profile (VRP). In each interval of one semitone and one decibel, narrowband spectra were averaged both within and across subjects. The subjects were groups of 7 male and 12 female singing students, as well as a group of 16 untrained female voices. For each individual and also for each group, pairs of VRP recordings were made, with stringent separation of the modal/chest and falsetto/head registers. Maps are presented of eight scalar metrics, each of which was chosen to quantify a particular feature of the voice spectrum, over fundamental frequency and SPL. Metrics 1 and 2 chart the role of the fundamental in relation to the rest of the spectrum. Metrics 3 and 4 are used to explore the role of resonances in relation to SPL. Metrics 5 and 6 address the distribution of high frequency energy, while metrics 7 and 8 seek to describe the distribution of energy at the low end of the voice spectrum. Several examples are observed of phenomena that are difficult to predict from linear source-filter theory, and of the voice source being less uniform over the voice range than is conventionally assumed. These include a high-frequency band-limiting at high SPL and an unexpected persistence of the second harmonic at low SPL. The two voice registers give rise to clearly different maps. Only a few effects of training were observed, in the low frequency end below 2 kHz. The results are of potential interest in voice analysis, voice synthesis and for new insights into the voice production mechanism.
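    A minimal sketch of the per-cell averaging idea (not the authors' code; it assumes that per-frame fundamental frequency, SPL and narrowband spectra have already been computed, and the 55 Hz reference frequency is an arbitrary choice):

    import numpy as np

    def voice_field_average(f0_hz, spl_db, spectra, f_ref=55.0):
        """f0_hz, spl_db: (n_frames,); spectra: (n_frames, n_bins)."""
        semitone = np.round(12 * np.log2(f0_hz / f_ref)).astype(int)
        level = np.round(spl_db).astype(int)
        cells = {}
        for st, db, spec in zip(semitone, level, spectra):
            cells.setdefault((st, db), []).append(spec)
        # mean narrowband spectrum per 1-semitone x 1-dB voice-field cell
        return {cell: np.mean(s, axis=0) for cell, s in cells.items()}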

  • 38.
    Per, Fallgren
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data2019In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA), 2019, p. 4307-4311Conference paper (Refereed)
    Abstract [en]

    We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so-called found data: data not recorded with the specific purpose of being analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers face when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method is a combination of audio browsing through massively multi-object sound environments and a well-known unsupervised dimensionality reduction algorithm, the self-organising map (SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.
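    A sketch of the dimensionality-reduction step only, using the MiniSom package as one possible SOM implementation (an assumption; the paper states only that a SOM is used, and the grid size, feature dimensionality and training length below are placeholders):

    import numpy as np
    from minisom import MiniSom

    features = np.random.rand(500, 20)     # placeholder per-clip feature vectors
    som = MiniSom(10, 10, input_len=20, sigma=1.0, learning_rate=0.5)
    som.random_weights_init(features)
    som.train_random(features, num_iteration=5000)

    # each clip gets a 2-D grid cell; nearby cells hold acoustically similar clips
    grid_positions = [som.winner(x) for x in features]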

  • 39.
    Peters, Christopher
    et al.
    KTH.
    Li, Chengjie
    KTH.
    Yang, Fangkai
    KTH.
    Avramova, Vanya
    KTH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Investigating Social Distances between Humans, Virtual Humans and Virtual Robots in Mixed Reality2018In: Proceedings of 17th International Conference on Autonomous Agents and MultiAgent Systems, 2018, p. 2247-2249Conference paper (Refereed)
    Abstract [en]

    Mixed reality environments offer new potentials for the design of compelling social interaction experiences with virtual characters. In this paper, we summarise initial experiments we are conducting in which we measure comfortable social distances between humans, virtual humans and virtual robots in mixed reality environments. We consider a scenario in which participants walk within a comfortable distance of a virtual character that has its appearance varied between a male and female human, and a standard- and human-height virtual Pepper robot. Our studies in mixed reality thus far indicate that humans adopt social zones with artificial agents that are similar in manner to human-human social interactions and interactions in virtual reality.

  • 40. Roddy, M.
    et al.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Harte, N.
    Investigating speech features for continuous turn-taking prediction using LSTMs2018In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2018, p. 586-590Conference paper (Refereed)
    Abstract [en]

    For spoken dialog systems to conduct fluid conversational interactions with users, the systems must be sensitive to turn-taking cues produced by a user. Models should be designed so that effective decisions can be made as to when it is appropriate, or not, for the system to speak. Traditional end-of-turn models, where decisions are made at utterance end-points, are limited in their ability to model fast turn-switches and overlap. A more flexible approach is to model turn-taking in a continuous manner using RNNs, where the system predicts speech probability scores for discrete frames within a future window. The continuous predictions represent generalized turn-taking behaviors observed in the training data and can be applied to make decisions that are not just limited to end-of-turn detection. In this paper, we investigate optimal speech-related feature sets for making predictions at pauses and overlaps in conversation. We find that while traditional acoustic features perform well, part-of-speech features generally perform worse than word features. We show that our current models outperform previously reported baselines.
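    An illustrative sketch of the continuous prediction setup (not the authors' model; the feature dimensionality, hidden size and 60-frame future window are assumptions): an LSTM reads per-frame features and, at every frame, outputs speech-activity probabilities for a window of future frames.

    import torch
    import torch.nn as nn

    class ContinuousTurnTaking(nn.Module):
        def __init__(self, feat_dim=130, hidden=64, future_window=60):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, future_window)

        def forward(self, x):                   # x: (batch, frames, feat_dim)
            h, _ = self.lstm(x)
            return torch.sigmoid(self.head(h))  # (batch, frames, future_window)

    model = ContinuousTurnTaking()
    probs = model(torch.randn(2, 200, 130))     # per-frame future speech probabilities
    # training would minimise binary cross-entropy against observed voice activity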

  • 41.
    Rodríguez-Algarra, Francisco
    et al.
    Queen Mary University of London.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Dixon, Simon
    Queen Mary University of London.
    Characterising Confounding Effects in Music Classification Experiments through Interventions2019In: Transactions of the International Society for Music Information Retrieval, p. 52-66Article in journal (Refereed)
    Abstract [en]

    We address the problem of confounding in the design of music classification experiments, that is, the inability to distinguish the effects of multiple potential influencing variables in the measurements. Confounding affects the validity of conclusions at many levels, and so must be properly accounted for. We propose a procedure for characterising effects of confounding in the results of music classification experiments by creating regulated test conditions through interventions in the experimental pipeline, including a novel resampling strategy. We demonstrate this procedure on the GTZAN genre collection, which is known to give rise to confounding effects.
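    As a generic illustration of one kind of intervention (this is not the paper's specific resampling strategy): a candidate confounder such as recording artist can be kept disjoint between training and test partitions, so that classifier performance cannot lean on that variable. The data below are placeholders.

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    X = np.random.rand(1000, 40)                   # placeholder audio features
    y = np.random.randint(0, 10, size=1000)        # genre labels
    artist = np.random.randint(0, 100, size=1000)  # candidate confounding variable

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=artist))
    assert set(artist[train_idx]).isdisjoint(artist[test_idx])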

  • 42. Rossing, T D
    et al.
    Sundberg, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Ternström, Sten
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Acoustic comparison of soprano solo and choir singing.1987In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 82, no 3, p. 830-836Article in journal (Refereed)
    Abstract [en]

    Five soprano singers were recorded while singing similar texts in both choir and solo modes of performance. A comparison of long-term-average spectra of similar passages in both modes indicates that subjects used different tactics to achieve somewhat higher concentrations of energy in the 2- to 4-kHz range when singing in the solo mode. It is likely that this effect resulted, at least in part, from a slight change of the voice source from choir to solo singing. The subjects used slightly more vibrato when singing in the solo mode.

  • 43. Rossing, T D
    et al.
    Sundberg, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Ternström, Sten
    Acoustic comparison of voice use in solo and choir singing.1986In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 79, no 6, p. 1975-1981Article in journal (Refereed)
    Abstract [en]

    An experiment was carried out in which eight bass/baritone singers were recorded while singing in both choral and solo modes. Together with their own voice, they heard the sound of the rest of the choir and a piano accompaniment, respectively. The recordings were analyzed in several ways, including computation of long-time-average spectra for each passage, analysis of the sound levels in the frequency ranges corresponding to the fundamental and the "singer's formant," and a comparison of the sung levels with the levels heard by the singers. Matching pairs of vowels in the two modes were inverse filtered to determine the voice source spectra and formant frequencies for comparison. Differences in both phonation and articulation between the two modes were observed. Subjects generally sang with more power in the singer's formant region in the solo mode and with more power in the fundamental region in the choral mode. Most singers used a reduced frequency distance between the third and fifth formants for increasing the power in the singer's formant range, while the difference in the fundamental was mostly a voice source effect. In a choral singing mode, subjects usually adjusted their voice levels to the levels they heard from the other singers, whereas in a solo singing mode the level sung depended much less on the level of an accompaniment.
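    The kind of analysis described here can be sketched as follows (an illustration only, not the authors' procedure; the sampling rate, window length and band edges are rough assumptions): a long-time-average spectrum is obtained by averaging power spectra over a passage, after which band levels around the fundamental and in the singer's-formant region can be compared.

    import numpy as np
    from scipy.signal import welch

    def band_level_db(x, fs, lo, hi, nperseg=4096):
        f, pxx = welch(x, fs=fs, nperseg=nperseg)        # long-time-average power spectrum
        band = (f >= lo) & (f < hi)
        return 10 * np.log10(np.sum(pxx[band]) + 1e-12)  # relative band level in dB

    fs = 16000
    x = np.random.randn(fs * 30)                         # placeholder: 30 s of recorded singing
    fundamental_level = band_level_db(x, fs, 80, 300)
    singers_formant_level = band_level_db(x, fs, 2000, 4000)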

  • 44. Saponaro, Giovanni
    Bernardino, Alexandre (Contributor)
    Salvi, Giampiero (Contributor)
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beyond the Self: Using Grounded Affordances to Interpret and Describe Others’ Actions2019In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920Article in journal (Refereed)
    Abstract [en]

    We propose a developmental approach that allows a robot to interpret and describe the actions of human agents by reusing previous experience. The robot first learns the association between words and object affordances by manipulating the objects in its environment. It then uses this information to learn a mapping between its own actions and those performed by a human in a shared environment. It finally fuses the information from these two models to interpret and describe human actions in light of its own experience. In our experiments, we show that the model can be used flexibly to do inference on different aspects of the scene. We can predict the effects of an action on the basis of object properties. We can revise the belief that a certain action occurred, given the observed effects of the human action. In an early action recognition fashion, we can anticipate the effects when the action has only been partially observed. By estimating the probability of words given the evidence and feeding them into a pre-defined grammar, we can generate relevant descriptions of the scene. We believe that this is a step towards providing robots with the fundamental skills to engage in social collaboration with humans.

  • 45.
    Schoonderwaldt, Erwin
    et al.
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Friberg, Anders
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Bresin, Roberto
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Juslin, P. N.
    Uppsala University.
    A system for improving the communication of emotion in music performance by feedback learning2002In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 111, no 5, p. 2471-Article in journal (Refereed)
    Abstract [en]

    Expressivity is one of the most important aspects of music performance. However, in music education, expressivity is often overlooked in favor of technical abilities. This could possibly depend on the difficulty in describing expressivity, which makes it problematic to provide the student with specific feedback. The aim of this project is to develop a computer program, which will improve the students’ ability in communicating emotion in music performance. The expressive intention of a performer can be coded in terms of performance parameters (cues), such as tempo, sound level, timbre, and articulation. Listeners’ judgments can be analyzed in the same terms. An algorithm was developed for automatic cue extraction from audio signals. Using note onset–offset detection, the algorithm yields values of sound level, articulation, IOI, and onset velocity for each note. In previous research, Juslin has developed a method for quantitative evaluation of performer–listener communication. This framework forms the basis of the present program. Multiple regression analysis on performances of the same musical fragment, played with different intentions, determines the relative importance of each cue and the consistency of cue utilization. Comparison with built‐in listener models, simulating perceived expression using a regression equation, provides detailed feedback regarding the performers’ cue utilization.
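    A simplified sketch of the regression step (not the actual feedback program; the cue names follow the abstract, and the data below are placeholders): listener ratings of an intended emotion are regressed on the extracted performance cues to estimate each cue's relative weight and the consistency of its use.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # rows = performances; columns = tempo, sound level, articulation, onset velocity
    cues = np.random.rand(30, 4)
    ratings = np.random.rand(30)               # listeners' perceived-emotion ratings

    model = LinearRegression().fit(cues, ratings)
    cue_weights = model.coef_                  # relative importance of each cue
    r_squared = model.score(cues, ratings)     # how consistently the cues explain the ratings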

  • 46.
    Selamtzis, Andreas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Castellana, Antonella
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Carullo, Alessio
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Astolfi, Arianna
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Effect of vowel context in cepstral and entropy analysis of pathological voices2019In: Biomedical Signal Processing and Control, ISSN 1746-8094, E-ISSN 1746-8108, Vol. 47, p. 350-357Article in journal (Refereed)
    Abstract [en]

    This study investigates the effect of vowel context (excerpted from speech versus sustained) on two voice quality measures: the cepstral peak prominence smoothed (CPPS) and sample entropy (SampEn). Thirty-one dysphonic subjects with different types of organic dysphonia and thirty-one controls read a phonetically balanced text and phonated sustained [a:] vowels at comfortable pitch and loudness. All the [a:] vowels of the read text were excerpted by automatic speech recognition and phonetic (forced) alignment. CPPS and SampEn were calculated for all excerpted vowels of each subject, forming one distribution of CPPS and SampEn values per subject. The sustained vowels were analyzed using a 41 ms window, forming another distribution of CPPS and SampEn values per subject. Two speech-language pathologists performed a perceptual evaluation of the dysphonic subjects’ voice quality from the recorded text. The power of discriminating the dysphonic group from the controls for SampEn and CPPS was assessed for the excerpted and sustained vowels with Receiver Operating Characteristic (ROC) analysis. The best discrimination in terms of Area Under the Curve (AUC) for CPPS occurred using the mean of the excerpted vowel distributions (AUC = 0.85) and for SampEn using the 95th percentile of the sustained vowel distributions (AUC = 0.84). CPPS and SampEn were found to be negatively correlated, and the largest correlation was found between the corresponding 95th percentiles of their distributions (Pearson, r = −0.83, p < 10⁻³). A strong correlation was also found between the 95th percentile of the SampEn distributions and the perceptual quality of breathiness (Pearson, r = 0.83, p < 10⁻³). The results suggest that, depending on the acoustic voice quality measure, sustained vowels can be more effective than excerpted vowels for detecting dysphonia. Additionally, when using CPPS or SampEn there is an advantage in using the measures’ distributions rather than their average values.
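    A minimal sample-entropy sketch plus group discrimination via ROC AUC (an illustration only; the embedding dimension m=2 and tolerance r=0.2·SD are common defaults rather than the paper's settings, and the window length and per-subject values are placeholders):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def sample_entropy(x, m=2, r=0.2):
        # direct O(N^2) SampEn for a short analysis window
        x = np.asarray(x, dtype=float)
        tol = r * np.std(x)
        def match_count(length):
            t = np.array([x[i:i + length] for i in range(len(x) - length)])
            d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=2)
            return (np.sum(d <= tol) - len(t)) / 2   # template pairs within tolerance, excluding self-matches
        B, A = match_count(m), match_count(m + 1)
        return -np.log(A / B) if A > 0 and B > 0 else np.inf

    window = np.random.randn(656)                # placeholder: ~41 ms at 16 kHz
    sampen_value = sample_entropy(window)

    # discriminating dysphonic subjects from controls with per-subject summary values
    values = np.random.rand(62)                  # placeholder per-subject summaries
    labels = np.concatenate([np.ones(31), np.zeros(31)])
    auc = roc_auc_score(labels, values)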

  • 47.
    Selamtzis, Andreas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Richter, Bernard
    Burk, Fabian
    Köberlein, Maria
    Echternach, Matthias
    A comparison of electroglottographic and glottal area waveforms for phonation type differentiation in male professional singers2018Manuscript (preprint) (Other academic)
    Abstract [en]

    This study investigates the use of glottographic signals (EGG and GAW) to study phonation in different vibratory states as produced by professionally trained singers. Six western classical tenors were asked to phonate pitch glides from modal to falsetto phonation, or from modal to their stage voice above the passaggio (SVaP). For each pitch glide, the sample entropy (SampEn) of the EGG signal was calculated to establish a “ground truth” for the performed phonation type; the cycles before the maximum SampEn peak were labeled as modal, and the cycles after the peak as falsetto or SVaP. Three classifications of vibratory state were performed using clustering: one based only on the EGG, one based on the GAW, and one based on their combination. The classification error rate (clustering vs ground truth) was on average smaller than 10% for any of the three settings, revealing no special advantage of the GAW over the EGG, or vice versa. The EGG-based time-domain metric analysis revealed a larger contact quotient and a larger normalized EGG derivative peak ratio in modal, compared to SVaP and falsetto. The glottographic waveform comparison of SVaP with falsetto and modal suggests that SVaP resembles falsetto more than modal, though with a larger contact quotient.

  • 48.
    Shore, Todd
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Androulakaki, Theofronia
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    KTH Tangrams: A Dataset for Research on Alignment and Conceptual Pacts in Task-Oriented Dialogue2019In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, Tokyo, 2019, p. 768-775Conference paper (Refereed)
    Abstract [en]

    There is a growing body of research focused on task-oriented instructor-manipulator dialogue, whereby one dialogue participant initiates a reference to an entity in a common environment while the other participant must resolve this reference in order to manipulate said entity. Many of these works are based on disparate if nevertheless similar datasets. This paper describes an English corpus of referring expressions in relatively free, unrestricted dialogue with physical features generated in a simulation, which facilitate analysis of dialogic linguistic phenomena regarding alignment in the formation of referring expressions known as conceptual pacts.

  • 49. Shore, Todd
    et al.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Using Lexical Alignment and Referring Ability to Address Data Sparsity in Situated Dialog Reference Resolution2018In: Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, p. 2288-2297Conference paper (Refereed)
  • 50.
    Sibirtseva, Elena
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Nykvist, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Karaoguz, Hakan
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kragic, Danica
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction2018In: 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2018Conference paper (Refereed)
    Abstract [en]

    Picking up objects requested by a human user is a common task in human-robot interaction. When multiple objects match the user's verbal description, the robot needs to clarify which object the user is referring to before executing the action. Previous research has focused on perceiving the user's multimodal behaviour to complement verbal commands, or on minimising the number of follow-up questions to reduce task time. In this paper, we propose a system for reference disambiguation based on visualisation and compare three methods to disambiguate natural language instructions. In a controlled experiment with a YuMi robot, we investigated real-time augmentations of the workspace in three conditions - head-mounted display, projector, and a monitor as the baseline - using objective measures such as time and accuracy, and subjective measures like engagement, immersion, and display interference. Significant differences were found in accuracy and engagement between the conditions, but no differences were found in task time. Despite the higher error rates in the head-mounted display condition, participants found that modality more engaging than the other two, but overall showed a preference for the projector condition over the monitor and head-mounted display conditions.
