1 - 50 of 118
• 1.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
TU Delft Delft, Netherlands. TNO Den Haag, Netherlands. Furhat Robotics Stockholm, Sweden. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
Effects of Different Interaction Contexts when Evaluating Gaze Models in HRI, 2020, Conference paper (Refereed)

uses multimodal information from users engaged in a spatial reasoning task with a robot and communicates joint attention via the robot’s gaze behavior [25]. An initial evaluation of our system with adults showed it to improve users’ perceptions of the robot’s social presence. To investigate the repeatability of our prior findings across settings and populations, here we conducted two further studies employing the same gaze system with the same robot and task but in different contexts: evaluation of the system with external observers and evaluation with children. The external observer study suggests that third-person perspectives over videos of gaze manipulations can be used either as a manipulation check before committing to costly real-time experiments or to further establish previous findings. However, the replication of our original adults study with children in school did not confirm the effectiveness of our gaze manipulation, suggesting that different interaction contexts can affect the generalizability of results in human-robot interaction gaze studies.

• 2.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
Computer-Human Interaction Lab for Learning & Instruction, Ecole Polytechnique Federale de Lausanne, Switzerland. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
Responsive Joint Attention in Human-Robot Interaction, 2019, Conference paper (Refereed)

Joint attention has been shown to be not only crucial for human-human interaction but also human-robot interaction. Joint attention can help to make cooperation more efficient, support disambiguation in instances of uncertainty and make interactions appear more natural and familiar. In this paper, we present an autonomous gaze system that uses multimodal perception capabilities to model responsive joint attention mechanisms. We investigate the effects of our system on people’s perception of a robot within a problem-solving task. Results from a user study suggest that responsive joint attention mechanisms evoke higher perceived feelings of social presence on scales that regard the direction of the robot’s perception.

• 3. Alexanderson, Simon
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows, 2020, Conference paper (Refereed)

Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just like humans, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and well match the input speech. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.

• 4. Ambrazaitis, G.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Multimodal prominences: Exploring the patterning and usage of focal pitch accents, head beats and eyebrow beats in Swedish television news readings, 2017, In: Speech Communication, ISSN 0167-6393, E-ISSN 1872-7182, Vol. 95, p. 100-113, Article in journal (Refereed)

Facial beat gestures align with pitch accents in speech, functioning as visual prominence markers. However, it is not yet well understood whether and how gestures and pitch accents might be combined to create different types of multimodal prominence, and how specifically visual prominence cues are used in spoken communication. In this study, we explore the use and possible interaction of eyebrow (EB) and head (HB) beats with so-called focal pitch accents (FA) in a corpus of 31 brief news readings from Swedish television (four news anchors, 986 words in total), focusing on effects of position in text, information structure as well as speaker expressivity. Results reveal an inventory of four primary (combinations of) prominence markers in the corpus: FA+HB+EB, FA+HB, FA only (i.e., no gesture), and HB only, implying that eyebrow beats tend to occur only in combination with the other two markers. In addition, head beats occur significantly more frequently in the second than in the first part of a news reading. A functional analysis of the data suggests that the distribution of head beats might to some degree be governed by information structure, as the text-initial clause often defines a common ground or presents the theme of the news story. In the rheme part of the news story, FA, HB, and FA+HB are all common prominence markers. The choice between them is subject to variation which we suggest might represent a degree of freedom for the speaker to use the markers expressively. A second main observation concerns eyebrow beats, which seem to be used mainly as a kind of intensification marker for highlighting not only contrast, but also value, magnitude, or emotionally loaded words; it is applicable in any position in a text. We thus observe largely different patterns of occurrence and usage of head beats on the one hand and eyebrow beats on the other, suggesting that the two represent two separate modalities of visual prominence cuing.

• 5.
KTH.
KTH. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
A Collaborative Previsualization Tool for Filmmaking in Virtual Reality, 2019, In: Proceedings - CVMP 2019: 16th ACM SIGGRAPH European Conference on Visual Media Production, ACM Digital Library, 2019, Conference paper (Refereed)

Previsualization is a process within pre-production of filmmaking where filmmakers can visually plan specific scenes with camera works, lighting, character movements, etc. The costs of computer graphics-based effects are substantial within film production. Using previsualization, these scenes can be planned in detail to reduce the amount of work put on effects in the later production phase. We develop and assess a prototype for previsualization in virtual reality for collaborative purposes where multiple filmmakers can be present in a virtual environment to share a creative work experience, remotely. By performing a within-group study on 20 filmmakers, our findings show that the use of virtual reality for distributed, collaborative previsualization processes is useful for real-life pre-production purposes.

• 6. Arnela, Marc
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
MRI-based vocal tract representations for the three-dimensional finite element synthesis of diphthongs, 2019, In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 27, no 12, p. 2173-2182, Article in journal (Refereed)

The synthesis of diphthongs in three-dimensions (3D) involves the simulation of acoustic waves propagating through a complex 3D vocal tract geometry that deforms over time. Accurate 3D vocal tract geometries can be extracted from Magnetic Resonance Imaging (MRI), but due to long acquisition times, only static sounds can be currently studied with an adequate spatial resolution. In this work, 3D dynamic vocal tract representations are built to generate diphthongs, based on a set of cross-sections extracted from MRI-based vocal tract geometries of static vowel sounds. A diphthong can then be easily generated by interpolating the location, orientation and shape of these cross-sections, thus avoiding the interpolation of full 3D geometries. Two options are explored to extract the cross-sections. The first one is based on an adaptive grid (AG), which extracts the cross-sections perpendicular to the vocal tract midline, whereas the second one resorts to a semi-polar grid (SPG) strategy, which fixes the cross-section orientations. The finite element method (FEM) has been used to solve the mixed wave equation and synthesize diphthongs [ɑi] and [ɑu] in the dynamic 3D vocal tracts. The outputs from a 1D acoustic model based on the Transfer Matrix Method have also been included for comparison. The results show that the SPG and AG provide very close solutions in 3D, whereas significant differences are observed when using them in 1D. The SPG dynamic vocal tract representation is recommended for 3D simulations because it helps to prevent the collision of adjacent cross-sections.
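The interpolation step described in this abstract can be sketched in a few lines: each cross-section keeps its identity while its descriptors are blended between two static vowel configurations. Below is a toy stdlib-only sketch with hypothetical two-section tracts and only area and midline position as descriptors (the paper also interpolates orientation and full cross-sectional shape):

```python
def lerp(a, b, t):
    """Linear interpolation between scalars a and b at fraction t in [0, 1]."""
    return a + (b - a) * t

def interpolate_tract(vowel_a, vowel_b, t):
    """Blend per-cross-section descriptors (here just area and a midline
    position) between two static vowel tracts, section by section."""
    return [
        {"area": lerp(sa["area"], sb["area"], t),
         "pos": tuple(lerp(pa, pb, t) for pa, pb in zip(sa["pos"], sb["pos"]))}
        for sa, sb in zip(vowel_a, vowel_b)
    ]

# Hypothetical two-section tracts for the endpoints of a diphthong.
vowel_a = [{"area": 4.0, "pos": (0.0, 0.0)}, {"area": 1.0, "pos": (1.0, 0.0)}]
vowel_i = [{"area": 1.0, "pos": (0.0, 0.5)}, {"area": 3.0, "pos": (1.0, 0.2)}]
mid = interpolate_tract(vowel_a, vowel_i, 0.5)
```

Sampling t from 0 to 1 over the utterance duration yields a smoothly deforming geometry without ever interpolating full 3D meshes, which is the point of the cross-section representation.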

• 7.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
Modelling Adaptive Presentations in Human-Robot Interaction using Behaviour Trees, 2019, In: 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue: Proceedings of the Conference / [ed] Satoshi Nakamura, Stroudsburg, PA, 2019, p. 345-352, Conference paper (Refereed)

In dialogue, speakers continuously adapt their speech to accommodate the listener, based on the feedback they receive. In this paper, we explore the modelling of such behaviours in the context of a robot presenting a painting. A Behaviour Tree is used to organise the behaviour on different levels, and allow the robot to adapt its behaviour in real-time; the tree organises engagement, joint attention, turn-taking, feedback and incremental speech processing. An initial implementation of the model is presented, and the system is evaluated in a user study, where the adaptive robot presenter is compared to a non-adaptive version. The adaptive version is found to be more engaging by the users, although no effects are found on the retention of the presented material.
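The behaviour-tree organisation this abstract describes can be illustrated with a minimal sketch. This is a generic Sequence/Fallback tick model with hypothetical node names, not the authors' implementation:

```python
# Minimal behaviour-tree sketch: a Sequence ticks children in order and
# fails fast; a Fallback tries children until one succeeds.
SUCCESS, FAILURE = "SUCCESS", "FAILURE"

class Sequence:
    def __init__(self, *children):
        self.children = children
    def tick(self, ctx):
        for child in self.children:
            if child.tick(ctx) == FAILURE:
                return FAILURE
        return SUCCESS

class Fallback:
    def __init__(self, *children):
        self.children = children
    def tick(self, ctx):
        for child in self.children:
            if child.tick(ctx) == SUCCESS:
                return SUCCESS
        return FAILURE

class Condition:
    def __init__(self, key):
        self.key = key
    def tick(self, ctx):
        return SUCCESS if ctx.get(self.key) else FAILURE

class Action:
    def __init__(self, name):
        self.name = name
    def tick(self, ctx):
        ctx.setdefault("log", []).append(self.name)  # record what was done
        return SUCCESS

# Hypothetical presenter logic: re-engage a distracted listener, then speak.
presenter = Sequence(
    Fallback(Condition("user_attending"), Action("elicit_attention")),
    Action("present_next_segment"),
)

ctx = {"user_attending": False}
presenter.tick(ctx)
# ctx["log"] == ["elicit_attention", "present_next_segment"]
```

Because conditions are re-evaluated on every tick, a tree structured like this can interleave engagement checks, turn-taking and feedback handling with the ongoing presentation, which is what enables real-time adaptation.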

• 8. Betz, Simon
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
The greennn tree - lengthening position influences uncertainty perception, 2019, In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, The International Speech Communication Association (ISCA), 2019, p. 3990-3994, Conference paper (Refereed)

Synthetic speech can be used to express uncertainty in dialogue systems by means of hesitation. If a phrase like “Next to the green tree” is uttered in a hesitant way, that is, containing lengthening, silences, and fillers, the listener can infer that the speaker is not certain about the concepts referred to. However, we do not know anything about the referential domain of the uncertainty; if only a particular word in this sentence would be uttered hesitantly, e.g. “the greee:n tree”, the listener could infer that the uncertainty refers to the color in the statement, but not to the object. In this study, we show that the domain of the uncertainty is controllable. We conducted an experiment in which color words in sentences like “search for the green tree” were lengthened in two different positions: word onsets or final consonants, and participants were asked to rate the uncertainty regarding color and object. The results show that initial lengthening is predominantly associated with uncertainty about the word itself, whereas final lengthening is primarily associated with the following object. These findings enable dialogue system developers to finely control the attitudinal display of uncertainty, adding nuances beyond the lexical content to message delivery.

• 9.
Centre for Systematic Musicology, University of Graz, Graz, Austria.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Centre for Systematic Musicology, University of Graz, Graz, Austria.
A Computational Model of Immanent Accent Salience in Tonal Music, 2019, In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 10, no 317, p. 1-19, Article in journal (Refereed)

Accents are local musical events that attract the attention of the listener, and can be either immanent (evident from the score) or performed (added by the performer). Immanent accents involve temporal grouping (phrasing), meter, melody, and harmony; performed accents involve changes in timing, dynamics, articulation, and timbre. In the past, grouping, metrical and melodic accents were investigated in the context of expressive music performance. We present a novel computational model of immanent accent salience in tonal music that automatically predicts the positions and saliences of metrical, melodic and harmonic accents. The model extends previous research by improving on preliminary formulations of metrical and melodic accents and introducing a new model for harmonic accents that combines harmonic dissonance and harmonic surprise. In an analysis-by-synthesis approach, model predictions were compared with data from two experiments, involving 239 and 638 sonorities rated by 16 musicians and 5 experts in music theory, respectively. Average pair-wise correlations between raters were lower for metrical (0.27) and melodic accents (0.37) than for harmonic accents (0.49). In both experiments, when combining all the raters into a single measure expressing their consensus, correlations between ratings and model predictions ranged from 0.43 to 0.62. When the different accent categories were combined, correlations were higher than for the separate categories (r = 0.66). This suggests that raters might use strategies different from individual metrical, melodic or harmonic accent models to mark the musical events.
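The agreement figures in this abstract are Pearson correlations between rating sequences; as a reminder of what those r values measure, here is a minimal stdlib implementation (toy ratings, not the study's data):

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Perfectly concordant toy ratings give r = 1.0;
# perfectly discordant ones give r = -1.0.
r_up = pearson([1, 2, 3], [2, 4, 6])
r_down = pearson([1, 2, 3], [3, 2, 1])
```

Averaging this quantity over all rater pairs gives the pair-wise agreement values reported above, and computing it between the consensus ratings and model predictions gives the model-fit correlations.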

• 10. Borin, L.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Språkbanken 2018: Research resources for text, speech, & society, 2018, In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 504-506, Conference paper (Refereed)

We introduce an expanded version of the Swedish research resource Språkbanken (the Swedish Language Bank). In 2018, Språkbanken, which has supported national and international research for over four decades, adds two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text.

• 11.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, Superseded Departments (pre-2005), Speech, Music and Hearing. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Toward a new model for sound control, 2001, In: Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, December 6-8, 2001 / [ed] Fernström, M., Brazil, E., & Marshall, M., 2001, p. 45-49, Conference paper (Refereed)

The control of sound synthesis is a well-known problem. This is particularly true if the sounds are generated with physical modeling techniques that typically need specification of numerous control parameters. In the present work outcomes from studies on automatic music performance are used for tackling this problem.

• 12. Chettri, B.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
Ensemble models for spoofing detection in automatic speaker verification, 2019, In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2019, International Speech Communication Association, 2019, p. 1018-1022, Conference paper (Refereed)

Detecting spoofing attempts of automatic speaker verification (ASV) systems is challenging, especially when using only one modelling approach. For robustness, we use both deep neural networks and traditional machine learning models and combine them as ensemble models through logistic regression. They are trained to detect logical access (LA) and physical access (PA) attacks on the dataset released as part of the ASV Spoofing and Countermeasures Challenge 2019. We propose dataset partitions that ensure different attack types are present during training and validation to improve system robustness. Our ensemble model outperforms all our single models and the baselines from the challenge for both attack types. We investigate why some models on the PA dataset strongly outperform others and find that spoofed recordings in the dataset tend to have longer silences at the end than genuine ones. By removing them, the PA task becomes much more challenging, with the tandem detection cost function (t-DCF) of our best single model rising from 0.1672 to 0.5018 and equal error rate (EER) increasing from 5.98% to 19.8% on the development set.
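The equal error rate (EER) quoted in this abstract is the operating point where the false-acceptance and false-rejection rates coincide. Below is a stdlib-only sketch of estimating it from raw detection scores (toy scores for illustration; this is not the challenge's official scoring tool, which also computes the t-DCF):

```python
def equal_error_rate(genuine, spoof):
    """Sweep candidate thresholds over the observed scores and return the
    EER: the point where the false-acceptance rate (spoofed trials accepted)
    and false-rejection rate (genuine trials rejected) are closest."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(genuine) | set(spoof)):
        far = sum(s >= t for s in spoof) / len(spoof)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example: one spoof trial scores above one genuine trial,
# so the two error rates cross at 1/4.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.2], [0.1, 0.3, 0.4, 0.75])
```

A jump from 5.98% to 19.8% EER after removing the trailing silences, as reported above, therefore means the detector had been exploiting a dataset artefact rather than the spoofing itself.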

• 13.
Queen Mary Univ London, Sch EECS, London, England.
Queen Mary Univ London, Sch EECS, London, England. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Queen Mary Univ London, Sch EECS, London, England.
Analysing the predictions of a CNN-based replay spoofing detection system, 2018, In: 2018 IEEE Workshop on Spoken Language Technology (SLT 2018), IEEE, 2018, p. 92-97, Conference paper (Refereed)

Playing recorded speech samples of an enrolled speaker – “replay attack” – is a simple approach to bypass an automatic speaker verification (ASV) system. The vulnerability of ASV systems to such attacks has been acknowledged and studied, but there has been no research into what spoofing detection systems are actually learning to discriminate. In this paper, we analyse the local behaviour of a replay spoofing detection system based on convolutional neural networks (CNNs) adapted from a state-of-the-art CNN (LCNN-FFT) submitted at the ASVspoof 2017 challenge. We generate temporal and spectral explanations for predictions of the model using the SLIME algorithm. Our findings suggest that in most instances of spoofing the model is using information in the first 400 milliseconds of each audio instance to make the class prediction. Knowledge of the characteristics that spoofing detection systems are exploiting can help build less vulnerable ASV systems, other spoofing detection systems, as well as better evaluation databases.

• 14.
Queen Mary Univ London, Sch EECS, London, England.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Queen Mary Univ London, Sch EECS, London, England. Queen Mary Univ London, Sch EECS, London, England.
Analysing replay spoofing countermeasure performance under varied conditions, 2018, In: 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP) / [ed] Pustelnik, N., Ma, Z., Tan, Z.-H., Larsen, J., IEEE, 2018, Conference paper (Refereed)

In this paper, we aim to understand what makes replay spoofing detection difficult in the context of the ASVspoof 2017 corpus. We use FFT spectra, mel frequency cepstral coefficients (MFCC) and inverted MFCC (IMFCC) frontends and investigate different back-ends based on Convolutional Neural Networks (CNNs), Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs). On this database, we find that IMFCC frontend based systems show smaller equal error rate (EER) for high quality replay attacks but higher EER for low quality replay attacks in comparison to the baseline. However, we find that it is not straightforward to understand the influence of an acoustic environment (AE), a playback device (PD) and a recording device (RD) of a replay spoofing attack. One reason is the unavailability of metadata for genuine recordings. Second, it is difficult to account for the effects of the factors: AE, PD and RD, and their interactions. Finally, our frame-level analysis shows that the presence of cues (recording artefacts) in the first few frames of genuine signals (missing from replayed ones) influence class prediction.

• 15.
Univ Coll Dublin, Dublin, Ireland.
Univ Coll Dublin, Dublin, Ireland. Univ Coll Dublin, Dublin, Ireland. Univ Toronto, Mississauga, ON, Canada; Univ Toronto, Toronto, ON, Canada. Univ Toronto, Mississauga, ON, Canada; Univ Toronto, Toronto, ON, Canada. CereProc Ltd, Edinburgh, Midlothian, Scotland. Univ Sheffield, Sheffield, S Yorkshire, England. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Queen Mary Univ London, London, England. Trinity Coll Dublin, Dublin, Ireland. Trinity Coll Dublin, Dublin, Ireland. Voysis Ltd, Dublin, Ireland.
Mapping Theoretical and Methodological Perspectives for Understanding Speech Interface Interactions, 2019, In: CHI EA '19 Extended Abstracts: Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, 2019, Conference paper (Refereed)

The use of speech as an interaction modality has grown considerably through the integration of Intelligent Personal Assistants (IPAs, e.g. Siri, Google Assistant) into smartphones and voice-based devices (e.g. Amazon Echo). However, there remain significant gaps in using theoretical frameworks to understand user behaviours and choices and how they may be applied to specific speech interface interactions. This part-day multidisciplinary workshop aims to critically map out and evaluate theoretical frameworks and methodological approaches across a number of disciplines and to establish directions for new paradigms in understanding speech interface user behaviour. In doing so, we will bring together participants from HCI and other speech-related domains to establish a cohesive, diverse and collaborative community of researchers from academia and industry with an interest in exploring theoretical and methodological issues in the field.

• 16.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Computational Modeling of the Vocal Tract: Applications to Speech Production, 2018, Doctoral thesis, comprehensive summary (Other academic)

Human speech production is a complex process, involving neuromuscular control signals, the effects of articulators' biomechanical properties and acoustic wave propagation in a vocal tract tube of intricate shape. Modeling these phenomena may play an important role in advancing our understanding of the involved mechanisms, and may also have future medical applications, e.g., guiding doctors in the diagnosis, treatment planning, and surgery prediction of related disorders, such as oral cancer, cleft palate, obstructive sleep apnea, and dysphagia.

A more complete understanding requires models that are as truthful representations as possible of the phenomena. Due to the complexity of such modeling, simplifications have nevertheless been used extensively in speech production research: phonetic descriptors (such as the position and degree of the most constricted part of the vocal tract) are used as control signals, the articulators are represented as two-dimensional geometrical models, the vocal tract is considered as a smooth tube and plane wave propagation is assumed, etc.

This thesis aims at firstly investigating the consequences of such simplifications, and secondly at contributing to establishing unified modeling of the speech production process, by connecting three-dimensional biomechanical modeling of the upper airway with three-dimensional acoustic simulations. The investigation on simplifying assumptions demonstrated the influence of vocal tract geometry features — such as shape representation, bending and lip shape — on its acoustic characteristics, and that the type of modeling — geometrical or biomechanical — affects the spatial trajectories of the articulators, as well as the transition of formant frequencies in the spectrogram.

The unification of biomechanical and acoustic modeling in three dimensions makes it possible to realistically control the acoustic output of dynamic sounds, such as vowel-vowel utterances, by contraction of the relevant muscles. This moves and shapes the speech articulators that in turn define the vocal tract tube in which the wave propagation occurs. The main contribution of the thesis in this line of work is a novel and complex method that automatically reconstructs the shape of the vocal tract from the biomechanical model. This step is essential to link biomechanical and acoustic simulations, since the vocal tract, which anatomically is a cavity enclosed by different structures, is only implicitly defined in a biomechanical model constituted of several distinct articulators.

• 17.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
GTM Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, Barcelona, Spain. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. GTM Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, Barcelona, Spain.
Reconstruction of vocal tract geometries from biomechanical simulations, 2018, In: International Journal for Numerical Methods in Biomedical Engineering, ISSN 2040-7939, E-ISSN 2040-7947, Article in journal (Refereed)

Medical imaging techniques are usually utilized to acquire the vocal tract geometry in 3D, which may then be used, e.g., for acoustic/fluid simulation. As an alternative, such a geometry may also be acquired from a biomechanical simulation, which allows the anatomy and/or articulation to be altered to study a variety of configurations. In a biomechanical model, each physical structure is described by its geometry and its properties (such as mass, stiffness, and muscles). In such a model, the vocal tract itself does not have an explicit representation, since it is a cavity rather than a physical structure. Instead, its geometry is defined implicitly by all the structures surrounding the cavity, and such an implicit representation may not be suitable for visualization or for acoustic/fluid simulation. In this work, we propose a method to reconstruct the vocal tract geometry at each time step during the biomechanical simulation. The complexity of the problem, which arises from model alignment artifacts, is addressed by the proposed method. In addition to the main cavity, other small cavities, including the piriform fossa, the sublingual cavity, and the interdental space, can be reconstructed. These cavities may appear or disappear depending on the position of the larynx, the mandible, and the tongue. To illustrate our method, various static and temporal geometries of the vocal tract are reconstructed and visualized. As a proof of concept, the reconstructed geometries of three cardinal vowels are further used in an acoustic simulation, and the corresponding transfer functions are derived.

• 18.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Synthesis of vowels and vowel-vowel utterances using a 3D biomechanical-acoustic model, 2018, In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, ISSN 2329-9290, Article in journal (Refereed)

A link is established between a 3D biomechanical and acoustic model allowing for the numerical synthesis of vowel sounds by contraction of the relevant muscles. That is, the contraction of muscles in the biomechanical model displaces and deforms the articulators, which in turn deform the vocal tract shape. The mixed wave equation for the acoustic pressure and particle velocity is formulated in an arbitrary Lagrangian-Eulerian framework to account for moving boundaries. The equations are solved numerically using the finite element method. Since the activation of muscles is not fully known for a given vowel sound, an inverse method is employed to calculate a plausible activation pattern. For vowel-vowel utterances, two different approaches are utilized: linear interpolation in either muscle activation or geometrical space. Although the former is the natural choice for biomechanical modeling, the latter is used to investigate the contribution of biomechanical modeling to speech acoustics. Six vowels [ɑ, ə, ɛ, e, i, ɯ] and three vowel-vowel utterances [ɑi, ɑɯ, ɯi] are synthesized using the 3D model. Results, including articulation, formants, and spectrograms of vowel-vowel sounds, are in agreement with previous studies. Comparing the spectrograms of interpolation in muscle and geometrical space reveals differences in all frequencies, with the most extended difference in the second formant transition.

• 19.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
From Tongue Movement Data to Muscle Activation – A Preliminary Study of Artisynth's Inverse Modelling, 2014, Conference paper (Other academic)

Finding the muscle activations during speech production is an important part of developing a comprehensive biomechanical model of speech production. Although there are direct methods for measuring muscle activations, such as electromyography, these are usually highly invasive, sometimes unreliable, and moreover impossible to use for all muscles. In this study we therefore explore an indirect way to estimate tongue muscle activations during speech production by combining Electromagnetic Articulography (EMA) measurements of tongue movements with inverse modeling in Artisynth. With EMA we measure the time-changing 3D positions of four sensors attached to the tongue surface of a Swedish female subject producing vowel-vowel and vowel-consonant-vowel (VCV) sequences. The measured sensor positions are used as target points for corresponding virtual sensors introduced in the tongue model of Artisynth’s inverse modelling framework, which computes one possible combination of muscle activations that results in the observed sequence of tongue articulations. We present resynthesized tongue movements in the Artisynth model and verify the results by comparing the calculated muscle activations with the literature.

• 20.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. ENSTA ParisTech.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Predicting Perceived Dissonance of Piano Chords Using a Chord-Class Invariant CNN and Deep Layered Learning2019In: Proceedings of 16th Sound & Music Computing Conference (SMC), Malaga, Spain, 2019, p. 530-536Conference paper (Refereed)

This paper presents a convolutional neural network (CNN) able to predict the perceived dissonance of piano chords. Ratings of dissonance for short audio excerpts were combined from two different datasets and groups of listeners. The CNN uses two branches in a directed acyclic graph (DAG). The first branch receives input from a pitch estimation algorithm, restructured into a pitch chroma. The second branch analyses interactions between close partials, known to affect our perception of dissonance and roughness. The analysis is pitch invariant in both branches, facilitated by convolution across log-frequency and octave-wide max-pooling. Ensemble learning was used to improve the accuracy of the predictions. The coefficient of determination (R2) between ratings and predictions is close to 0.7 in a cross-validation test of the combined dataset. The system significantly outperforms recent computational models. An ablation study tested the impact of the pitch chroma and partial analysis branches separately, concluding that the deep layered learning approach with a pitch chroma was driving the high performance.
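The octave-wide max-pooling that gives both branches their pitch invariance can be sketched in a few lines. This is a simplified illustration assuming 12 bins per octave; the paper's actual layer sizes and activations are not reproduced here:

```python
def octave_max_pool(activations, bins_per_octave=12):
    """Max-pool a log-frequency activation vector one octave at a time, so the
    pooled output keeps only the strongest response per octave."""
    return [max(activations[i:i + bins_per_octave])
            for i in range(0, len(activations), bins_per_octave)]

# Two octaves of hypothetical log-frequency activations:
acts = [0.0] * 24
acts[5], acts[20] = 3.0, 7.0
pooled = octave_max_pool(acts)  # one value per octave
```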

• 21.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Deep Layered Learning in MIRManuscript (preprint) (Other academic)

Deep learning has boosted the performance of many music information retrieval (MIR) systems in recent years. Yet, the complex hierarchical arrangement of music makes end-to-end learning hard for some MIR tasks – a very deep and structurally flexible processing chain is necessary to extract high-level features from a spectrogram representation. Mid-level representations such as tones, pitched onsets, chords, and beats are fundamental building blocks of music. This paper discusses how these can be used as intermediate representations in MIR to facilitate deep processing that generalizes well: each music concept is predicted individually in learning modules that are connected through latent representations in a directed acyclic graph. It is suggested that this strategy for inference, defined as deep layered learning (DLL), can help generalization by (1) enforcing the validity of intermediate representations during processing, and by (2) letting the inferred representations establish disentangled structures that support high-level invariant processing. A background to DLL and modular music processing is provided, and relevant concepts such as pruning, skip connections, and layered performance supervision are reviewed.

• 22.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
Modeling Music: Studies of Music Transcription, Music Perception and Music Production2018Doctoral thesis, comprehensive summary (Other academic)

This dissertation presents ten studies focusing on three important subfields of music information retrieval (MIR): music transcription (Part A), music perception (Part B), and music production (Part C).

In Part A, systems capable of transcribing rhythm and polyphonic pitch are described. The first two publications present methods for tempo estimation and beat tracking. A method is developed for computing the most salient periodicity (the “cepstroid”), and the computed cepstroid is used to guide the machine learning processing. The polyphonic pitch tracking system uses novel pitch-invariant and tone-shift-invariant processing techniques. Furthermore, the neural flux is introduced – a latent feature for onset and offset detection. The transcription systems use a layered learning technique with separate intermediate networks of varying depth.  Important music concepts are used as intermediate targets to create a processing chain with high generalization. State-of-the-art performance is reported for all tasks.

Part B is devoted to perceptual features of music, which can be used as intermediate targets or as parameters for exploring fundamental music perception mechanisms. Systems are proposed that can predict the perceived speed and performed dynamics of an audio file with high accuracy, using the average ratings from around 20 listeners as ground truths. In Part C, aspects related to music production are explored. The first paper analyzes long-term average spectrum (LTAS) in popular music. A compact equation is derived to describe the mean LTAS of a large dataset, and the variation is visualized. Further analysis shows that the level of the percussion is an important factor for LTAS. The second paper examines songwriting and composition through the development of an algorithmic composer of popular music. Various factors relevant for writing good compositions are encoded, and a listening test employed that shows the validity of the proposed methods.
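The LTAS analysis in Part C rests on a simple computation: averaging per-frame magnitude spectra over the whole track. A minimal sketch, assuming the magnitude spectra have already been computed per frame (the paper's exact windowing and scaling are not specified here):

```python
import math

def ltas(frame_spectra):
    """Long-term average spectrum: the mean magnitude per frequency bin across
    all frames, returned in dB."""
    n_frames = len(frame_spectra)
    n_bins = len(frame_spectra[0])
    mean_mag = [sum(frame[k] for frame in frame_spectra) / n_frames
                for k in range(n_bins)]
    return [20.0 * math.log10(m) for m in mean_mag]
```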

The dissertation is concluded by Part D - Looking Back and Ahead, which acts as a discussion and provides a road-map for future work. The first paper discusses the deep layered learning (DLL) technique, outlining concepts and pointing out a direction for future MIR implementations. It is suggested that DLL can help generalization by enforcing the validity of intermediate representations, and by letting the inferred representations establish disentangled structures supporting high-level invariant processing. The second paper proposes an architecture for tempo-invariant processing of rhythm with convolutional neural networks. Log-frequency representations of rhythm-related activations are suggested at the main stage of processing. Methods relying on magnitude, relative phase, and raw phase information are described for a wide variety of rhythm processing tasks.

• 23.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
Polyphonic Pitch Tracking with Deep Layered LearningManuscript (preprint) (Other academic)

This paper presents a polyphonic pitch tracking system able to extract both framewise and note-based estimates from audio. The system uses six artificial neural networks in a deep layered learning setup. First, cascading networks are applied to a spectrogram for framewise fundamental frequency (f0) estimation. A sparse receptive field is learned by the first network and then used for weight-sharing throughout the system. The f0 activations are connected across time to extract pitch ridges. These ridges define a framework, within which subsequent networks perform tone-shift-invariant onset and offset detection. The networks convolve the pitch ridges across time, using as input, e.g., variations of latent representations from the f0 estimation networks, defined as the “neural flux.” Finally, incorrect tentative notes are removed one by one in an iterative procedure that allows a network to classify notes within an accurate context. The system was evaluated on four public test sets: MAPS, Bach10, TRIOS, and the MIREX Woodwind quintet, and achieved state-of-the-art results for all four datasets. It performs well across all subtasks: f0, pitched onset, and pitched offset tracking.
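The ridge-extraction step (connecting framewise f0 activations across time) can be sketched greedily. This is an illustrative simplification, not the paper's actual algorithm; `tol`, the matching tolerance in frequency bins, is an assumed parameter:

```python
def connect_ridges(frames, tol=1):
    """Greedily link framewise f0 bin detections across time into pitch ridges:
    a detection joins an open ridge if that ridge ended in the previous frame
    within `tol` bins; otherwise it starts a new ridge."""
    ridges = []
    for t, detections in enumerate(frames):
        for b in detections:
            for ridge in ridges:
                last_t, last_b = ridge[-1]
                if last_t == t - 1 and abs(last_b - b) <= tol:
                    ridge.append((t, b))
                    break
            else:
                ridges.append([(t, b)])
    return ridges

# Four frames of hypothetical f0 bin detections:
ridges = connect_ridges([[10], [10], [11], [30]])
```

In the paper, subsequent networks then convolve along each such ridge for onset and offset detection, which is what makes that processing tone-shift invariant.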

• 24.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
Tempo-Invariant Processing of Rhythm with Convolutional Neural NetworksManuscript (preprint) (Other academic)

Rhythm patterns can be performed with a wide variation of tempi. This presents a challenge for many music information retrieval (MIR) systems; ideally, perceptually similar rhythms should be represented and processed similarly, regardless of the specific tempo at which they were performed. Several recent systems for tempo estimation, beat tracking, and downbeat tracking have therefore sought to process rhythm in a tempo-invariant way, often by sampling input vectors according to a precomputed pulse level. This paper describes how a log-frequency representation of rhythm-related activations instead can promote tempo invariance when processed with convolutional neural networks. The strategy incorporates invariance at a fundamental level and can be useful for most tasks related to rhythm processing. Different methods are described, relying on magnitude, phase relationships of different rhythm channels, as well as raw phase information. Several variations are explored to provide direction for future implementations.
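The core idea, that a log-spaced periodicity axis turns a tempo change into a pure shift, can be illustrated with a toy mapping (the bin counts and reference period below are assumptions for illustration):

```python
import math

def log_period_bin(period_s, ref_period=1.0, bins_per_octave=12):
    """Map an inter-beat period (in seconds) to a log-spaced bin index.
    Halving the period (doubling the tempo) shifts the index by exactly
    -bins_per_octave, which is what lets shift-invariant convolution across
    this axis treat all tempi alike."""
    return round(bins_per_octave * math.log2(period_s / ref_period))
```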

• 25.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Modeling Music Modality with a Key-Class Invariant Pitch Chroma CNN2019Conference paper (Refereed)

This paper presents a convolutional neural network (CNN) that uses input from a polyphonic pitch estimation system to predict perceived minor/major modality in music audio. The pitch activation input is structured to allow the first CNN layer to compute two pitch chromas focused on different octaves. The following layers perform harmony analysis across chroma and time scales. Through max pooling across pitch, the CNN becomes invariant with regards to the key class (i.e., key disregarding mode) of the music. A multilayer perceptron combines the modality activation output with spectral features for the final prediction. The study uses a dataset of 203 excerpts rated by around 20 listeners each, a small challenging data size requiring a carefully designed parameter sharing. With an R2 of about 0.71, the system clearly outperforms previous systems as well as individual human listeners. A final ablation study highlights the importance of using pitch activations processed across longer time scales, and using pooling to facilitate invariance with regards to the key class.

• 26.
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
Åhlund, Anna (Contributor)
Robot interaction styles for conversation practice in second language learning2020In: International Journal of Social Robotics, ISSN 1875-4791, E-ISSN 1875-4805Article in journal (Refereed)

Four different interaction styles for the social robot Furhat acting as a host in spoken conversation practice with two simultaneous language learners have been developed, based on interaction styles of human moderators of language cafés. We first investigated, through a survey and recorded sessions of three-party language café style conversations, how the interaction styles of human moderators are influenced by different factors (e.g., the participants' language level and familiarity). Using this knowledge, four distinct interaction styles were developed for the robot: sequentially asking one participant questions at a time (Interviewer); the robot speaking about itself, robots and Sweden or asking quiz questions about Sweden (Narrator); attempting to make the participants talk with each other (Facilitator); and trying to establish a three-party robot-learner-learner interaction with equal participation (Interlocutor). A user study with 32 participants, conversing in pairs with the robot, was carried out to investigate how the post-session ratings of the robot's behavior along different dimensions (e.g., the robot's conversational skills and friendliness, the value of the practice) are influenced by the robot's interaction style and participant variables (e.g., level in the target language, gender, origin). The general findings were that Interviewer received the highest mean rating, but that different factors influenced the ratings substantially, indicating that the preferences of individual participants need to be anticipated in order to improve learner satisfaction with the practice. We conclude with a list of recommendations for robot-hosted conversation practice in a second language.

• 27.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
A tool for exploring large amounts of found audio data2018In: CEUR Workshop Proceedings, CEUR-WS , 2018, p. 499-503Conference paper (Refereed)

We demonstrate a method and a set of open source tools (beta) for nonsequential browsing of large amounts of audio data. The demonstration will contain versions of a set of functionalities in their first stages, and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

• 28.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Designing joint attention systems for robots that assist children with autism spectrum disorders2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis

Joint attention behaviours play a central role in natural and believable human-robot interactions. This research presents the design decisions of a semi-autonomous joint attention robotic system, together with the evaluation of its effectiveness and perceived social presence across different cognitive ability groups. For this purpose, two different studies were carried out: first with adults, and then with children between 10 and 12 years-old.

The overall results for both studies reflect a system that is perceived as socially present and engaging which can successfully establish joint attention with the participants. When comparing the performance results between the two groups, children achieved higher joint attention scores and reported a higher level of enjoyment and helpfulness in the interaction.

Furthermore, a detailed literature review on robot-assisted therapies for children with autism spectrum disorders is presented, focusing on the development of joint attention skills. The children’s positive interaction results from the studies, together with state-of-the-art research therapies and the input from an autism therapist, guided the author to elaborate some design guidelines for a robotic system to assist in joint attention focused autism therapies.

• 29. Finkel, Sebastian
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Intermittent theta burst stimulation over right somatosensory larynx cortex enhances vocal pitch‐regulation in nonsingers2019In: Human Brain Mapping, ISSN 1065-9471, E-ISSN 1097-0193Article in journal (Refereed)

While the significance of auditory cortical regions for the development and maintenance of speech motor coordination is well established, the contribution of somatosensory brain areas to learned vocalizations such as singing is less well understood. To address these mechanisms, we applied intermittent theta burst stimulation (iTBS), a facilitatory repetitive transcranial magnetic stimulation (rTMS) protocol, over right somatosensory larynx cortex (S1) and a nonvocal dorsal S1 control area in participants without singing experience. A pitch‐matching singing task was performed before and after iTBS to assess corresponding effects on vocal pitch regulation. When participants could monitor auditory feedback from their own voice during singing (Experiment I), no difference in pitch‐matching performance was found between iTBS sessions. However, when auditory feedback was masked with noise (Experiment II), only larynx‐S1 iTBS enhanced pitch accuracy (50–250 ms after sound onset) and pitch stability (>250 ms after sound onset until the end). Results indicate that somatosensory feedback plays a dominant role in vocal pitch regulation when acoustic feedback is masked. The acoustic changes moreover suggest that right larynx‐S1 stimulation affected the preparation and involuntary regulation of vocal pitch accuracy, and that kinesthetic‐proprioceptive processes play a role in the voluntary control of pitch stability in nonsingers. Together, these data provide evidence for a causal involvement of right larynx‐S1 in vocal pitch regulation during singing.

• 30.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Commentary on Polak How short is the shortest metric subdivision?2017In: Empirical Musicology Review, ISSN 1559-5749, E-ISSN 1559-5749, Vol. 12, no 3-4, p. 227-228Article in journal (Other academic)

This commentary relates to the target paper by Polak on the shortest metric subdivision by presenting measurements on West-African drum music. It provides new evidence that the perceptual lower limit of tone duration is within the range 80-100 ms. Using fairly basic measurement techniques in combination with a musical analysis of the content, the original results in this study represent a valuable addition to the literature. Considering the relevance for music listening, further research would be valuable for determining and understanding the nature of this perceptual limit.

• 31.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Inst Pasteur, France. Univ Bologna, Dept Educ Studies, Bologna, Italy.. Univ Bologna, Dept Arts, Bologna, Italy..
Probing the Underlying Principles of Perceived Immanent Accents Using a Modeling Approach2019In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 10, article id 1024Article in journal (Refereed)

This article deals with the question of how the perception of the "immanent accents" can be predicted and modeled. By immanent accent we mean any musical event in the score that is related to important points in the musical structure (e.g., tactus positions, melodic peaks) and is therefore able to capture the attention of a listener. Our aim was to investigate the underlying principles of these accented notes by combining quantitative modeling, music analysis and experimental methods. A listening experiment was conducted where 30 participants indicated perceived accented notes for 60 melodies, vocal and instrumental, selected from Baroque, Romantic and Posttonal styles. This produced a large and unique collection of perceptual data about the perceived immanent accents, organized by styles consisting of vocal and instrumental melodies within Western art music. The music analysis of the indicated accents provided a preliminary list of musical features that could be identified as possible reasons for the raters' perception of the immanent accents. These features related to the score in different ways, e.g., repeated fragments, single notes, or overall structure. A modeling approach was used to quantify the influence of feature groups related to pitch contour, tempo, timing, simple phrasing, and meter. A set of 43 computational features was defined from the music analysis and previous studies and extracted from the score representation. The mean ratings of the participants were predicted using multiple linear regression and support vector regression. The latter method (using cross-validation) obtained the best result of about 66% explained variance (r = 0.81) across all melodies and for a selected group of raters. The independent contribution of each feature group was relatively high for pitch contour and timing (9.6 and 7.0%). There were also significant contributions from tempo (4.5%), simple phrasing (4.4%), and meter (3.9%). 
Interestingly, the independent contribution varied greatly across participants, implying different listener strategies, and also some variability across different styles. The large differences among listeners emphasize the importance of considering the individual listener's perception in future research in music perception.

• 32.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Institute for Research and Coordination in Acoustics and Music, Paris, France. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Prediction of three articulatory categories in vocal sound imitations using models for auditory receptive fields2018In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 144, no 3, p. 1467-1483Article in journal (Refereed)

Vocal sound imitations provide a new challenge for understanding the coupling between articulatory mechanisms and the resulting audio. In this study, we have modeled the classification of three articulatory categories, phonation, supraglottal myoelastic vibrations, and turbulence from audio recordings. Two data sets were assembled, consisting of different vocal imitations by four professional imitators and four non-professional speakers in two different experiments. The audio data were manually annotated by two experienced phoneticians using a detailed articulatory description scheme. A separate set of audio features was developed specifically for each category using both time-domain and spectral methods. For all time-frequency transformations, and for some secondary processing, the recently developed Auditory Receptive Fields Toolbox was used. Three different machine learning methods were applied for predicting the final articulatory categories. The result with the best generalization was found using an ensemble of multilayer perceptrons. The cross-validated classification accuracy was 96.8 % for phonation, 90.8 % for supraglottal myoelastic vibrations, and 89.0 % for turbulence using all the 84 developed features. A final feature reduction to 22 features yielded similar results.
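The best-generalizing classifier was an ensemble of multilayer perceptrons, and the ensembling itself is just an average of the members' outputs. A minimal sketch (the member models below are stand-in callables, not the paper's trained networks):

```python
def ensemble_predict(models, x):
    """Average the class-probability outputs of several models and return the
    index of the most probable class."""
    outputs = [model(x) for model in models]
    n_classes = len(outputs[0])
    avg = [sum(o[k] for o in outputs) / len(outputs) for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three stand-in "MLPs" that each emit [P(no phonation), P(phonation)]:
members = [lambda x: [0.6, 0.4], lambda x: [0.2, 0.8], lambda x: [0.3, 0.7]]
label = ensemble_predict(members, None)
```

Averaging probabilities rather than hard votes lets a confident member outweigh two lukewarm ones, which tends to improve generalization.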

• 33. Gill, Brian P.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Stockholm Musikpedagogiska Institut.
Spectrum Effects of a Velopharyngeal Opening in Singing2018In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588Article in journal (Refereed)

The question whether or not a velopharyngeal opening is advantageous in singing has been discussed for a very long time among teachers of singing. The present investigation analyzes the acoustic consequences of a large, a narrow, and a nonexistent velopharyngeal opening (VPO). A divided flow mask (nasal and oral) connected to flow transducers recorded the nasal and oral DC flows in four female and four male classically trained singers while they sang vowel sequences at different pitches under these three experimental conditions. Acoustic effects were analyzed in three long-term average spectra parameters: (i) the sound level at the fundamental frequency, (ii) the level of the highest peak below 1 kHz, and (iii) the level of the highest peak in the 2–4 kHz region. For a narrow VPO, an increase in the level of the highest peak in the 2–4 kHz region was observed. As this peak is an essential voice component in the classical singing tradition, a narrow VPO seems beneficial in this type of singing.

• 34.
Royal College of Music in Stockholm, Sweden.
KTH, School of Electrical Engineering and Computer Science (EECS), Media Technology and Interaction Design, MID. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Developing a Method for Identifying Improvisation Strategies in Jazz Duos2019In: Proc. of the 14th International Symposium on CMMR / [ed] M. Aramaki, O. Derrien, R. Kronland-Martinet, S. Ystad, Marseille Cedex, 2019, p. 482-489Conference paper (Refereed)

The primary purpose of this paper is to describe a method to investigate the communication process between musicians performing improvisation in jazz. The method was applied in a first case study. The paper contributes to jazz improvisation theory by embracing more of the artistic expressions and choices made in real-life musical situations. In jazz, applied improvisation theory usually consists of scale and harmony studies within quantized rhythmic patterns. The ensembles in the study were duos performed by the author at the piano and horn players (trumpet, alto saxophone, clarinet and trombone). Recording sessions involving the ensembles were conducted. The recordings were transcribed using software, and the produced score together with the audio recording was used when conducting in-depth interviews to identify the horn players’ underlying musical strategies. The strategies were coded according to previous research.

• 35.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
SpaceRef: a corpus of street-level geographic descriptions2016In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 2016, p. 3822-3827Conference paper (Refereed)

This article describes SPACEREF, a corpus of street-level geographic descriptions. Pedestrians are walking a route in a (real) urban environment, describing their actions. Their position is automatically logged, their speech is manually transcribed, and their references to objects are manually annotated with respect to a crowdsourced geographic database. We describe how the data was collected and annotated, and how it has been used in the context of creating resources for an automatic pedestrian navigation system.

• 36.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
From Jigs and Reels to Schottisar och Polskor: Generating Scandinavian-like Folk Music with Deep Recurrent Networks2019Conference paper (Refereed)

The use of recurrent neural networks for modeling and generating music has been shown to be quite effective for compact, textual transcriptions of traditional music from Ireland and the UK. We explore how well these models perform for textual transcriptions of traditional music from Scandinavia. This type of music has characteristics that are similar to and different from that of Irish music, e.g., mode, rhythm, and structure. We investigate the effects of different architectures and training regimens, and evaluate the resulting models using three methods: a comparison of statistics between real and generated transcriptions, an appraisal of generated transcriptions via a semi-structured interview with an expert in Swedish folk music, and an exercise conducted with students of Scandinavian folk music. We find that some of our models can generate new transcriptions sharing characteristics with Scandinavian folk music, but which often lack the simplicity of real transcriptions. One of our models has been implemented online at http://www.folkrnn.org for anyone to try.

• 37.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
Moglow: Probabilistic and controllable motion synthesis using normalising flows2019In: arXiv preprint arXiv:1905.06598Article in journal (Other academic)
• 38.
KTH, School of Electrical Engineering and Computer Science (EECS), Media Technology and Interaction Design, MID.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Department of Philosophy, University of Vienna, Vienna, Austria.
Ethical Dimensions of Music Information Retrieval Technology2018In: Transactions of the International Society for Music Information Retrieval, E-ISSN 2514-3298, Vol. 1, no 1, p. 44-55Article in journal (Refereed)

This article examines ethical dimensions of Music Information Retrieval (MIR) technology.  It uses practical ethics (especially computer ethics and engineering ethics) and socio-technical approaches to provide a theoretical basis that can inform discussions of ethics in MIR. To help ground the discussion, the article engages with concrete examples and discourse drawn from the MIR field. This article argues that MIR technology is not value-neutral but is influenced by design choices, and so has unintended and ethically relevant implications. These can be invisible unless one considers how the technology relates to wider society. The article points to the blurring of boundaries between music and technology, and frames music as “informationally enriched” and as a “total social fact.” The article calls attention to biases that are introduced by algorithms and data used for MIR technology, cultural issues related to copyright, and ethical problems in MIR as a scientific practice. The article concludes with tentative ethical guidelines for MIR developers, and calls for addressing key ethical problems with MIR technology and practice, especially those related to forms of bias and the remoteness of the technology development from end users.

• 39.
URPP Language and Space, University of Zurich, Switzerland.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. Department of comparative linguistics, University of Zurich, Switzerland. Department of computational linguistics, University of Zurich, Switzerland.
Fundamental frequency accommodation in multi-party human-robot game interactions: The effect of winning or losing2019In: Proceedings Interspeech 2019, International Speech Communication Association, 2019, p. 3980-3984Conference paper (Refereed)

In human-human interactions, the situational context plays a large role in the degree of speakers’ accommodation. In this paper, we investigate whether the degree of accommodation in a human-robot computer game is affected by (a) the duration of the interaction and (b) the success of the players in the game. 30 teams of two players played two card games with a conversational robot in which they had to find the correct order of five cards. After game 1, the players received the result of the game on a success scale from 1 (lowest success) to 5 (highest). Speakers’ f0 accommodation was measured as the Euclidean distance between the two human speakers and between each human and the robot. Results revealed that (a) the duration of the game had no influence on the degree of f0 accommodation and (b) the result of game 1 correlated with the degree of f0 accommodation in game 2 (higher success corresponded to lower Euclidean distance). We argue that game success is most likely interpreted as a sign of successful cooperation between the players during the discussion, which leads to stronger accommodation behavior in speech.
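The distance-based accommodation measure described above can be sketched in a few lines; the per-turn mean f0 values and the contour alignment below are illustrative assumptions, not the paper's exact feature pipeline.

```python
import math

def f0_distance(f0_a, f0_b):
    """Euclidean distance between two aligned f0 contours (Hz).

    A lower distance indicates stronger accommodation, i.e. the
    speakers' pitch converging over the interaction.
    """
    if len(f0_a) != len(f0_b):
        raise ValueError("contours must be aligned to the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f0_a, f0_b)))

# Hypothetical per-turn mean f0 values (Hz) for two speakers in a game:
speaker1 = [210.0, 205.0, 198.0]
speaker2 = [180.0, 188.0, 195.0]
print(f0_distance(speaker1, speaker2))
```

The same function can be applied human-to-human and human-to-robot, which is how the paper compares accommodation toward each interlocutor.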

• 40.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
On the influence of arching and material on the vibration of a shell - Towards understanding the soloist violin2018In: Vibrations in Physical Systems, ISSN 0860-6897, Vol. 29, article id 2018027Article in journal (Refereed)

Results of FEM simulations of plate and shell models are presented with reference to violin vibration problems. The influence of arching, variable thickness and damping was considered. The “Dynamic Explicit” procedure of ABAQUS/Explicit was used in the simulations. Anisotropy in the material properties (spruce) was taken into account (9 elastic constants).

• 41.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Using Social and Physiological Signals for User Adaptation in Conversational Agents2019In: AAMAS '19: PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS, ASSOC COMPUTING MACHINERY , 2019, p. 2420-2422Conference paper (Refereed)

In face-to-face communication, humans subconsciously emit social signals which are picked up and used by their interlocutors as feedback on how well the previously communicated messages have been received. This feedback is then used to adapt how the coming messages are produced and sent to the interlocutor, making the communication as efficient and enjoyable as possible. Currently, however, it is rare to find conversational agents utilizing this feedback channel to alter how their multimodal output is produced during interactions with users, largely due to the complex nature of the problem. In most regards, humans have a significant advantage over conversational agents in interpreting and acting on social signals. Humans are, however, restricted to a limited set of sensors, "the five senses", which conversational agents are not. This makes it possible for conversational agents to use specialized sensors to pick up physiological signals, such as skin temperature, respiratory rate or pupil dilation, which carry valuable information about the user with respect to the conversation. This thesis work aims at developing methods for utilizing both social and physiological signals emitted by humans in order to adapt the output of the conversational agent, allowing for an increase in conversation quality. These methods will primarily be based on automatically learning adaptive behavior from examples of real human interactions using machine learning methods.

• 42.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
Learning Non-verbal Behavior for a Social Robot from YouTube Videos2019Conference paper (Refereed)

Non-verbal behavior is crucial for positive perception of humanoid robots. If modeled well, it can improve the interaction and leave the user with a positive experience; if modeled poorly, it may impede the interaction and become a source of distraction. Most of the existing work on modeling non-verbal behavior shows limited variability because the models employed are deterministic, so the generated motion can be perceived as repetitive and predictable. In this paper, we present a novel method for generating a limited set of facial expressions and head movements, based on a probabilistic generative deep learning architecture called Glow. We have implemented a workflow which takes videos directly from YouTube, extracts relevant features, and trains a model that generates gestures that can be realized in a robot without any post-processing. A user study illustrated the importance of having some kind of non-verbal behavior: most differences between the ground truth, the proposed method, and a random control were not significant, but the differences that were significant favored the proposed method.

• 43.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. KTH. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
Crowdsourcing a self-evolving dialog graph2019In: CUI '19: Proceedings of the 1st International Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2019, article id 14Conference paper (Refereed)

In this paper we present a crowdsourcing-based approach for collecting dialog data for a social chat dialog system, which gradually builds a dialog graph from actual user responses and crowd-sourced system answers, conditioned on a given persona and other instructions. This approach was tested during the second instalment of the Amazon Alexa Prize 2018 (AP2018), both for the data collection and to feed a simple dialog system which would use the graph to provide answers. As users interacted with the system, a graph which maintained the structure of the dialogs was built, identifying parts where more coverage was needed. In an offline evaluation, we compared the corpus collected during the competition with other potential corpora for training chatbots, including movie subtitles, online chat forums and conversational data. The results show that the proposed methodology creates data that is more representative of actual user utterances, and leads to more coherent and engaging answers from the agent. An implementation of the proposed method is available as open-source code.
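The self-evolving loop described above, matching user utterances against existing graph nodes and queueing unmatched ones for crowd-sourced answers, can be sketched roughly as follows; the class and the exact-match lookup are illustrative assumptions, not the authors' actual implementation.

```python
# Toy sketch of a self-evolving dialog graph: answered utterances form
# nodes, and unmatched user input marks coverage gaps for crowd workers.

class DialogGraph:
    def __init__(self):
        self.answers = {}        # normalized utterance -> system answer
        self.needs_answer = []   # coverage gaps awaiting crowd-sourcing

    def respond(self, utterance):
        norm = utterance.lower().strip()
        if norm in self.answers:
            return self.answers[norm]
        self.needs_answer.append(norm)   # record where coverage is missing
        return "Hmm, tell me more."      # fallback until an answer arrives

    def add_crowd_answer(self, utterance, answer):
        self.answers[utterance.lower().strip()] = answer

g = DialogGraph()
g.add_crowd_answer("What's your favourite movie?", "I love old sci-fi films.")
print(g.respond("what's your favourite movie?"))  # matched node
print(g.respond("do you like sports?"))           # gap -> queued for crowd
print(g.needs_answer)
```

A real system would use fuzzier matching than exact string lookup and keep the dialog history as graph edges, but the grow-from-gaps cycle is the same.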

• 44.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. GAIPS INESC-ID, Lisbon, Portugal. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Multimodal Computing and Interaction, Saarland University, Germany. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
FARMI: A Framework for Recording Multi-Modal Interactions2018In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris: European Language Resources Association, 2018, p. 3969-3974Conference paper (Refereed)

In this paper we present (1) a processing architecture used to collect multi-modal sensor data, both for corpora collection and real-time processing, (2) an open-source implementation thereof and (3) a use-case where we deploy the architecture in a multi-party deception game, featuring six human players and one robot. The architecture is agnostic to the choice of hardware (e.g. microphones, cameras, etc.) and programming languages, although our implementation is mostly written in Python. In our use-case, different methods of capturing verbal and non-verbal cues from the participants were used. These were processed in real-time and used to inform the robot about the participants’ deceptive behaviour. The framework is of particular interest for researchers who are interested in the collection of multi-party, richly recorded corpora and the design of conversational systems. Moreover, for researchers who are interested in human-robot interaction, the available modules offer the possibility to easily create both autonomous and wizard-of-Oz interactions.
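A hardware-agnostic architecture of this kind typically routes timestamped sensor messages to any number of consumers (corpus loggers, real-time classifiers). The sketch below illustrates that idea only; the class and message format are assumptions, not FARMI's actual API.

```python
# Minimal publish/subscribe sketch of a hardware-agnostic multimodal bus.
import json
import time

class SensorBus:
    """Delivers timestamped sensor messages to all subscribers,
    independent of the sensor hardware or its originating language."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, modality, payload):
        msg = {"t": time.time(), "modality": modality, "data": payload}
        for cb in self.subscribers:
            cb(msg)

log = []
bus = SensorBus()
bus.subscribe(lambda m: log.append(json.dumps(m)))   # corpus logger
bus.publish("audio", {"speaker": 3, "vad": True})    # e.g. voice activity
bus.publish("gaze", {"target": "robot"})
print(len(log))
```

Because subscribers only see a uniform message dictionary, the same pipeline can feed both offline corpus collection and a real-time wizard-of-Oz interface.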

• 45.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. KTH, Superseded Departments (pre-2005), Speech, Music and Hearing. KTH, Superseded Departments (pre-2005), Speech, Music and Hearing. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. KTH, Superseded Departments (pre-2005), Numerical Analysis and Computer Science, NADA. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
Machine Learning and Social Robotics for Detecting Early Signs of Dementia2017Other (Other academic)
• 46.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Crowdsourced Multimodal Corpora Collection Tool2018In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 728-734Conference paper (Refereed)

In recent years, more and more multimodal corpora have been created. To our knowledge, there is no publicly available tool which allows for acquiring controlled multimodal data of people in a rapid and scalable fashion. We are therefore proposing (1) a novel tool which will enable researchers to rapidly gather large amounts of multimodal data spanning a wide demographic range, and (2) an example of how we used this tool for the collection of our "Attentive listener" multimodal corpus. The code is released under an Apache License 2.0 and available as an open-source repository, which can be found at https://github.com/kth-social-robotics/multimodal-crowdsourcing-tool. This tool will allow researchers to set up their own multimodal data collection system quickly and create their own multimodal corpora. Finally, this paper discusses the advantages and disadvantages of a crowd-sourced data collection tool, especially in comparison to lab-recorded corpora.

• 47.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
SpaceRefNet: a neural approach to spatial reference resolution in a real city environment2019In: Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Association for Computational Linguistics , 2019, p. 422-431Conference paper (Refereed)

Adding interactive capabilities to pedestrian wayfinding systems in the form of spoken dialogue will make them more natural to humans. Such an interactive wayfinding system needs to continuously understand and interpret pedestrians’ utterances referring to the spatial context. Achieving this requires the system to identify exophoric referring expressions in the utterances and link these expressions to the geographic entities in the vicinity. This exophoric spatial reference resolution problem is difficult, as there are often several dozen candidate referents. We present a neural network-based approach for identifying pedestrians’ references (using a network called RefNet) and resolving them to appropriate geographic objects (using a network called SpaceRefNet). Both methods show promising results, beating the respective baselines and earlier results reported in the literature.
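The core of the resolution task, scoring each nearby geographic candidate against a referring expression, can be illustrated with a toy heuristic; the word-overlap and distance scoring below is a stand-in for illustration only, not the paper's neural SpaceRefNet model.

```python
# Toy exophoric reference resolution: rank candidate geographic entities
# by lexical match with the mention and proximity to the pedestrian.
import math

def resolve_reference(mention_tokens, pedestrian_pos, candidates):
    """Return the candidate entity with the best combined score of
    word overlap (with the mention) and nearness (to the pedestrian)."""
    def score(c):
        overlap = len(set(mention_tokens) & set(c["name"].lower().split()))
        dist = math.dist(pedestrian_pos, c["pos"])
        return overlap - 0.01 * dist   # better match and shorter distance win
    return max(candidates, key=score)

candidates = [
    {"name": "red brick church", "pos": (10.0, 5.0)},
    {"name": "glass office tower", "pos": (3.0, 2.0)},
]
best = resolve_reference(["the", "church"], (0.0, 0.0), candidates)
print(best["name"])  # red brick church
```

With several dozen real candidates, hand-tuned heuristics like this break down quickly, which is what motivates learning the scoring function from data as in the paper.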

• 48.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
Multimodal Language Grounding for Human-Robot Collaboration: YRRSDS 2019 - Dimosthenis Kontogiorgos2019In: Young Researchers Roundtable on Spoken Dialogue Systems, 2019Conference paper (Refereed)
• 49.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. KTH. KTH. KTH. KTH. KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
The effects of anthropomorphism and non-verbal social behaviour in virtual assistants2019In: IVA 2019 - Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, Association for Computing Machinery (ACM), 2019, p. 133-140Conference paper (Refereed)

The adoption of virtual assistants is growing at a rapid pace. However, these assistants are not optimised to simulate key social aspects of human conversational environments. Humans are intellectually biased toward social activity when facing anthropomorphic agents or when presented with subtle social cues. In this paper, we test whether humans respond the same way to assistants in guided tasks across different forms of embodiment and social behaviour. In a within-subject study (N=30), we asked subjects to engage in dialogue with a smart speaker and a social robot. We observed shifts in interactive behaviour, as shown in behavioural and subjective measures. Our findings indicate that it is not always favourable for agents to be anthropomorphised or to communicate with non-verbal cues. We found a trade-off between task performance and perceived sociability when controlling for anthropomorphism and social behaviour.

• 50.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
Estimating Uncertainty in Task Oriented Dialogue2019In: ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction / [ed] Wen Gao, Helen Mei Ling Meng, Matthew Turk, Susan R. Fussell, ACM Digital Library, 2019, p. 414-418Conference paper (Refereed)

Situated multimodal systems that instruct humans need to handle user uncertainties, as expressed in behaviour, and plan their actions accordingly. Speakers’ decision to reformulate or repair previous utterances depends greatly on the listeners’ signals of uncertainty. In this paper, we estimate uncertainty in a situated guided task, as conveyed by non-verbal cues expressed by the listener, and predict whether the speaker will reformulate their utterance. We use a corpus where people instruct how to assemble furniture, and extract their multimodal features. While uncertainty is in some cases verbally expressed, most instances are expressed non-verbally, which indicates the importance of multimodal approaches. In this work, we present a model for uncertainty estimation. Our findings indicate that uncertainty estimation from non-verbal cues works well, and can exceed human annotator performance when verbal features cannot be perceived.
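The prediction pipeline above, estimating uncertainty from non-verbal cues and triggering a reformulation when it is high, can be sketched as follows; the feature names, weights and threshold are illustrative assumptions, not the paper's learned model.

```python
# Toy sketch: score listener uncertainty from non-verbal cues, then
# predict whether the speaker should reformulate the utterance.

def uncertainty_score(features, weights=None):
    """Weighted sum of non-verbal cue values, clamped to [0, 1]."""
    weights = weights or {"gaze_aversion": 0.4, "long_pause": 0.3,
                          "head_tilt": 0.2, "self_touch": 0.1}
    s = sum(weights[k] * features.get(k, 0.0) for k in weights)
    return min(max(s, 0.0), 1.0)

def speaker_should_reformulate(features, threshold=0.5):
    """Predict a reformulation when estimated uncertainty is high."""
    return uncertainty_score(features) >= threshold

print(speaker_should_reformulate({"gaze_aversion": 1, "long_pause": 1}))  # True
print(speaker_should_reformulate({"head_tilt": 1}))                        # False
```

In the paper the mapping from cues to an uncertainty estimate is learned from the furniture-assembly corpus rather than hand-weighted, but the decision structure is the same.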
