  • 1.
    Bisesi, Erica
    et al.
    Centre for Systematic Musicology, University of Graz, Graz, Austria.
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Parncutt, Richard
    Centre for Systematic Musicology, University of Graz, Graz, Austria.
    A Computational Model of Immanent Accent Salience in Tonal Music (2019). In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 10, no. 317, p. 1-19. Article in journal (Refereed)
    Abstract [en]

    Accents are local musical events that attract the attention of the listener, and can be either immanent (evident from the score) or performed (added by the performer). Immanent accents involve temporal grouping (phrasing), meter, melody, and harmony; performed accents involve changes in timing, dynamics, articulation, and timbre. In the past, grouping, metrical and melodic accents were investigated in the context of expressive music performance. We present a novel computational model of immanent accent salience in tonal music that automatically predicts the positions and saliences of metrical, melodic and harmonic accents. The model extends previous research by improving on preliminary formulations of metrical and melodic accents and introducing a new model for harmonic accents that combines harmonic dissonance and harmonic surprise. In an analysis-by-synthesis approach, model predictions were compared with data from two experiments, involving 239 and 638 sonorities and, respectively, 16 musicians and 5 experts in music theory. Average pair-wise correlations between raters were lower for metrical (0.27) and melodic accents (0.37) than for harmonic accents (0.49). In both experiments, when combining all the raters into a single measure expressing their consensus, correlations between ratings and model predictions ranged from 0.43 to 0.62. When the different accent categories were combined, correlations were higher than for the separate categories (r = 0.66). This suggests that raters may use strategies different from the individual metrical, melodic or harmonic accent models when marking musical events.

  • 2.
    Ternström, Sten
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Pabon, Peter
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Royal Conservatoire, The Hague, NL.
    Accounting for variability over the voice range (2019). In: Proceedings of the ICA 2019 and EAA Euroregio / [ed] Martin Ochmann, Michael Vorländer, Janina Fels, Aachen, DE: Deutsche Gesellschaft für Akustik (DEGA e.V.), 2019, p. 4146-4151. Conference paper (Refereed)
    Abstract [en]

    Researchers from the natural sciences interested in the performing arts often seek quantitative findings with explanatory power and practical relevance to performers and educators. However, the complexity of singing voice production continues to challenge us. On their own, entities that are readily measurable in the domain of physics are rarely of direct relevance to excellence in the domain of performance, because information on one level of representation (e.g., acoustic) is artistically meaningful mostly when interpreted in a context at a higher level of representation (e.g., emotional or semantic). Also, practically any acoustic or physiologic metric derived from the sound of a voice, or from other signals or images, will exhibit considerable variation both across individuals and across the voice range, from soft to loud or from low to high pitch. Here, we review some recent research based on the sampling paradigm of the voice field, also known as the voice range profile. Despite large inter-subject variation, localizing measurements by fo and SPL in the voice field makes the recorded values highly reproducible within subjects. We demonstrate some technical possibilities, and argue the importance of making physical measurements that provide a more encompassing and individual-centric view of singing voice production.

  • 3.
    Kucherenko, Taras
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Hasegawa, Dai
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kaneko, Naoshi
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Analyzing Input and Output Representations for Speech-Driven Gesture Generation (2019). In: 19th ACM International Conference on Intelligent Virtual Agents, New York, NY, USA: ACM Publications, 2019. Conference paper (Refereed)
    Abstract [en]

    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.

    Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.

    We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
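
    As an illustration of the two-step pipeline described in this abstract, the sketch below (not the authors' released code) wires a MotionE/MotionD autoencoder pair to a SpeechE encoder; the layer sizes, the plain feed-forward layers and the 26-dimensional MFCC input are illustrative assumptions, and the denoising-noise injection and training loops are omitted.

```python
# Minimal sketch of the described pipeline; sizes and layers are assumptions, not the paper's.
import torch
import torch.nn as nn

class MotionE(nn.Module):
    """Motion encoder: one pose frame -> low-dimensional representation."""
    def __init__(self, pose_dim=45, repr_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(), nn.Linear(128, repr_dim))
    def forward(self, x):
        return self.net(x)

class MotionD(nn.Module):
    """Motion decoder: representation -> reconstructed pose frame (3D joint coordinates)."""
    def __init__(self, repr_dim=32, pose_dim=45):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(repr_dim, 128), nn.ReLU(), nn.Linear(128, pose_dim))
    def forward(self, z):
        return self.net(z)

class SpeechE(nn.Module):
    """Speech encoder: one MFCC frame -> predicted motion representation."""
    def __init__(self, mfcc_dim=26, repr_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mfcc_dim, 128), nn.ReLU(), nn.Linear(128, repr_dim))
    def forward(self, s):
        return self.net(s)

def synthesize_gestures(speech_e, motion_d, mfcc_frames):
    """Test time: SpeechE predicts representations, MotionD decodes them to poses."""
    with torch.no_grad():
        return motion_d(speech_e(mfcc_frames))
```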

  • 4.
    Sturm, Bob
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Iglesias, Maria
    Joint Research Centre, European Commission.
    Ben-Tal, Oded
    Kingston University.
    Miron, Marius
    Joint Research Centre, European Commission.
    Gómez, Emilia
    Joint Research Centre, European Commission.
    Artificial Intelligence and Music: Open Questions of Copyright Law and Engineering Praxis (2019). In: MDPI Arts, ISSN 2076-0752, Vol. 8, no. 3, article id 115. Article in journal (Refereed)
    Abstract [en]

    The application of artificial intelligence (AI) to music stretches back many decades, and presents numerous unique opportunities for a variety of uses, such as the recommendation of recorded music from massive commercial archives, or the (semi-)automated creation of music. Due to unparalleled access to music data and effective learning algorithms running on high-powered computational hardware, AI is now producing surprising outcomes in a domain fully entrenched in human creativity—not to mention a revenue source around the globe. These developments call for a close inspection of what is occurring, and consideration of how it is changing and can change our relationship with music for better and for worse. This article looks at AI applied to music from two perspectives: copyright law and engineering praxis. It grounds its discussion in the development and use of a specific application of AI in music creation, which raises further and unanticipated questions. Most of the questions collected in this article are open as their answers are not yet clear at this time, but they are nonetheless important to consider as AI technologies develop and are applied more widely to music, not to mention other domains centred on human creativity.

  • 5. Saponaro, Giovanni
    et al.
    Jamone, Lorenzo
    Bernardino, Alexandre
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beyond the Self: Using Grounded Affordances to Interpret and Describe Others’ Actions (2019). In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920. Article in journal (Refereed)
    Abstract [en]

    We propose a developmental approach that allows a robot to interpret and describe the actions of human agents by reusing previous experience. The robot first learns the association between words and object affordances by manipulating the objects in its environment. It then uses this information to learn a mapping between its own actions and those performed by a human in a shared environment. It finally fuses the information from these two models to interpret and describe human actions in light of its own experience. In our experiments, we show that the model can be used flexibly to do inference on different aspects of the scene. We can predict the effects of an action on the basis of object properties. We can revise the belief that a certain action occurred, given the observed effects of the human action. In an early action recognition fashion, we can anticipate the effects when the action has only been partially observed. By estimating the probability of words given the evidence and feeding them into a pre-defined grammar, we can generate relevant descriptions of the scene. We believe that this is a step towards providing robots with the fundamental skills to engage in social collaboration with humans.

  • 6.
    Fallgren, Per
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data (2019). In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA), 2019, p. 4307-4311. Conference paper (Refereed)
    Abstract [en]

    We present a novel and general approach for fast and efficient non-sequential browsing of sound in large archives that we know little or nothing about, e.g. so-called found data: data not recorded with the specific purpose of being analysed or used as training data. Our main motivation is to address some of the problems speech and speech technology researchers encounter when they try to capitalise on the huge quantities of speech data that reside in public archives. Our method is a combination of audio browsing through massively multi-object sound environments and a well-known unsupervised dimensionality reduction algorithm, the self-organizing map (SOM). We test the process chain on four data sets of different nature (speech, speech and music, farm animals, and farm animals mixed with farm sounds). The methods are shown to combine well, resulting in rapid and readily interpretable observations. Finally, our initial results are demonstrated in prototype software which is freely available.
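
    The pipeline this abstract outlines, per-clip audio features fed to a self-organizing map so that acoustically similar material lands in neighbouring cells, could look roughly like the sketch below. The MFCC summary features, the 10x10 grid and the librosa/minisom libraries are assumptions for illustration, not the authors' actual tooling.

```python
# Illustrative sketch only: map short audio clips onto a 2-D SOM grid for non-sequential browsing.
import numpy as np
import librosa
from minisom import MiniSom

def clip_features(path, sr=16000):
    """One feature vector per clip: means and standard deviations of 13 MFCCs (an assumption)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def build_som(paths, grid=(10, 10), iters=5000):
    feats = np.array([clip_features(p) for p in paths])
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-9)       # z-score per dimension
    som = MiniSom(grid[0], grid[1], feats.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(feats, iters)
    # Each clip lands in one SOM cell; neighbouring cells hold acoustically similar material.
    return {p: som.winner(f) for p, f in zip(paths, feats)}
```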

  • 7.
    Rodríguez-Algarra, Francisco
    et al.
    Queen Mary University of London.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Dixon, Simon
    Queen Mary University of London.
    Characterising Confounding Effects in Music Classification Experiments through Interventions (2019). In: Transactions of the International Society for Music Information Retrieval, p. 52-66. Article in journal (Refereed)
    Abstract [en]

    We address the problem of confounding in the design of music classification experiments, that is, the inability to distinguish the effects of multiple potential influencing variables in the measurements. Confounding affects the validity of conclusions at many levels, and so must be properly accounted for. We propose a procedure for characterising effects of confounding in the results of music classification experiments by creating regulated test conditions through interventions in the experimental pipeline, including a novel resampling strategy. We demonstrate this procedure on the GTZAN genre collection, which is known to give rise to confounding effects.

  • 8.
    Selamtzis, Andreas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Castellana, Antonella
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Carullo, Alessio
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Astolfi, Arianna
    Department of Electronics and Telecommunications, Politecnico di Torino, Italy.
    Effect of vowel context in cepstral and entropy analysis of pathological voices (2019). In: Biomedical Signal Processing and Control, ISSN 1746-8094, E-ISSN 1746-8108, Vol. 47, p. 350-357. Article in journal (Refereed)
    Abstract [en]

    This study investigates the effect of vowel context (excerpted from speech versus sustained) on two voice quality measures: the smoothed cepstral peak prominence (CPPS) and sample entropy (SampEn). Thirty-one dysphonic subjects with different types of organic dysphonia and thirty-one controls read a phonetically balanced text and phonated sustained [a:] vowels at comfortable pitch and loudness. All the [a:] vowels of the read text were excerpted by automatic speech recognition and phonetic (forced) alignment. CPPS and SampEn were calculated for all excerpted vowels of each subject, forming one distribution of CPPS and SampEn values per subject. The sustained vowels were analyzed using a 41 ms window, forming another distribution of CPPS and SampEn values per subject. Two speech-language pathologists performed a perceptual evaluation of the dysphonic subjects’ voice quality from the recorded text. The power of SampEn and CPPS to discriminate the dysphonic group from the controls was assessed for the excerpted and sustained vowels with receiver operating characteristic (ROC) analysis. The best discrimination in terms of area under the curve (AUC) for CPPS occurred using the mean of the excerpted vowel distributions (AUC = 0.85), and for SampEn using the 95th percentile of the sustained vowel distributions (AUC = 0.84). CPPS and SampEn were found to be negatively correlated, and the largest correlation was found between the corresponding 95th percentiles of their distributions (Pearson, r = −0.83, p < 10⁻³). A strong correlation was also found between the 95th percentile of SampEn distributions and the perceptual quality of breathiness (Pearson, r = 0.83, p < 10⁻³). The results suggest that, depending on the acoustic voice quality measure, sustained vowels can be more effective than excerpted vowels for detecting dysphonia. Additionally, when using CPPS or SampEn there is an advantage in using the measures’ distributions rather than their average values.
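
    The discrimination analysis described above reduces to summarising each subject's distribution of CPPS or SampEn values with a statistic (mean or 95th percentile) and measuring dysphonic-versus-control separation with ROC AUC. The sketch below is a hedged illustration of that step; scikit-learn and the helper names are assumptions.

```python
# Illustration of the group-discrimination step; helper names and scikit-learn are assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

def subject_statistic(values, stat="mean"):
    """Collapse one subject's distribution of measure values to a single number."""
    values = np.asarray(values, dtype=float)
    return values.mean() if stat == "mean" else np.percentile(values, 95)

def group_auc(dysphonic_subjects, control_subjects, stat="mean"):
    """Each argument is a list of per-subject arrays of measure values (e.g. CPPS per vowel)."""
    scores = [subject_statistic(v, stat) for v in dysphonic_subjects + control_subjects]
    labels = [1] * len(dysphonic_subjects) + [0] * len(control_subjects)
    return roc_auc_score(labels, scores)
```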

  • 9.
    Lã, Filipa M.B.
    et al.
    University of Distance-Learning, MADRID, Spain.
    Ternström, Sten
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Flow ball-assisted training: immediate effects on vocal fold contacting (2019). In: Pan-European Voice Conference 2019 / [ed] Jenny Iwarsson, Stine Løvind Thorsen, University of Copenhagen, 2019, p. 50-51. Conference paper (Refereed)
    Abstract [en]

    Background: The flow ball is a device that creates a static backpressure in the vocal tract while providing real-time visual feedback of airflow. A ball height of 0 to 10 cm corresponds to airflows of 0.2 to 0.4 L/s. These high airflows with low transglottal pressure correspond to low flow resistances, similar to those obtained when phonating into straws of 3.7 mm diameter and 2.8 cm length. Objectives: To investigate whether there are immediate effects of flow ball-assisted training on vocal fold contact. Methods: Ten singers (five males and five females) performed a messa di voce at different pitches over one octave in three conditions: before, during and after phonating with a flow ball. For all conditions, both audio and electrolaryngographic (ELG) signals were simultaneously recorded using a Laryngograph microprocessor. The vocal fold contact quotient Qci (the area under the normalized EGG cycle) and dEGGmaxN (the normalized maximum rate of change of vocal fold contact area) were obtained for all EGG cycles, using the FonaDyn system. We also introduce a compound metric Ic, the ‘index of contact’ [Qci × log10(dEGGmaxN)], which has the property that it goes to zero at no contact. It combines information from both Qci and dEGGmaxN and is thus comparable across subjects. The intra-subject means of all three metrics were computed and visualized by colour-coding over the fo-SPL plane, in cells of 1 semitone × 1 dB. Results: Overall, the use of flow ball-assisted phonation had a small yet significant effect on overall vocal fold contact across the whole messa di voce exercise. Larger effects were evident locally, i.e., in parts of the voice range. Comparing the pre- and post-flow-ball conditions, there were differences in Qci and/or dEGGmaxN. These differences were generally larger in male than in female voices. Ic typically decreased after flow ball use for males, but not for females. Conclusion: Flow ball-assisted training seems to modify vocal fold contacting gestures, especially in male singers.
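
    The compound metric and the voice-field binning described above can be written down directly, as in the sketch below; the helper names and the aggregation into cells keyed by integer (semitone, dB) indices are illustrative assumptions, not FonaDyn code.

```python
# Illustrative only: the compound 'index of contact' and 1 semitone x 1 dB binning over the voice field.
import numpy as np
from collections import defaultdict

def index_of_contact(qci, degg_max_n):
    """Ic = Qci * log10(dEGGmaxN); goes to zero when there is no vocal fold contact."""
    return qci * np.log10(degg_max_n)

def bin_by_voice_field(fo_hz, spl_db, values, ref_hz=110.0):
    """Average a per-cycle metric in cells of 1 semitone x 1 dB over the fo-SPL plane."""
    cells = defaultdict(list)
    for f, s, v in zip(fo_hz, spl_db, values):
        semitone = int(round(12.0 * np.log2(f / ref_hz)))   # pitch cell relative to an arbitrary reference
        cells[(semitone, int(round(s)))].append(v)
    return {cell: float(np.mean(vals)) for cell, vals in cells.items()}
```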

  • 10.
    Hallström, Eric
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Mossmyr, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Vegeborn, Victor
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Wedin, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    From Jigs and Reels to Schottisar och Polskor: Generating Scandinavian-like Folk Music with Deep Recurrent Networks (2019). Conference paper (Refereed)
    Abstract [en]

    The use of recurrent neural networks for modeling and generating music has been shown to be quite effective for compact, textual transcriptions of traditional music from Ireland and the UK. We explore how well these models perform for textual transcriptions of traditional music from Scandinavia. This type of music has characteristics that are similar to and different from that of Irish music, e.g., mode, rhythm, and structure. We investigate the effects of different architectures and training regimens, and evaluate the resulting models using three methods: a comparison of statistics between real and generated transcriptions, an appraisal of generated transcriptions via a semi-structured interview with an expert in Swedish folk music, and an exercise conducted with students of Scandinavian folk music. We find that some of our models can generate new transcriptions sharing characteristics with Scandinavian folk music, but which often lack the simplicity of real transcriptions. One of our models has been implemented online at http://www.folkrnn.org for anyone to try.
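
    For readers unfamiliar with this family of models, the sketch below shows a much-simplified, character-level LSTM for generating textual transcriptions one symbol at a time; the actual folkrnn models use a token vocabulary and different architectures and sizes, so treat this purely as an illustration.

```python
# Simplified character-level transcription model; not the folkrnn implementation.
import torch
import torch.nn as nn

class TranscriptionLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), state

    @torch.no_grad()
    def sample(self, start_token, steps=500, temperature=1.0):
        """Generate a new transcription by sampling one symbol at a time."""
        tok = torch.tensor([[start_token]])
        state, generated = None, []
        for _ in range(steps):
            logits, state = self.forward(tok, state)
            probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
            tok = torch.multinomial(probs, 1)
            generated.append(tok.item())
        return generated
```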

  • 11.
    Mishra, Saumitra
    et al.
    Queen Mary University of London.
    Stoller, Daniel
    Queen Mary University of London.
    Benetos, Emmanouil
    Queen Mary University of London.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Dixon, Simon
    Queen Mary University of London.
    GAN-Based Generation and Automatic Selection of Explanations for Neural Networks (2019). Conference paper (Refereed)
  • 12. Finkel, Sebastian
    et al.
    Veit, Ralf
    Lotze, Martin
    Friberg, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Vuust, Peter
    Soekadar, Surjo
    Birbaumer, Niels
    Kleber, Boris
    Intermittent theta burst stimulation over right somatosensory larynx cortex enhances vocal pitch‐regulation in nonsingers (2019). In: Human Brain Mapping, ISSN 1065-9471, E-ISSN 1097-0193. Article in journal (Refereed)
    Abstract [en]

    While the significance of auditory cortical regions for the development and maintenance of speech motor coordination is well established, the contribution of somatosensory brain areas to learned vocalizations such as singing is less well understood. To address these mechanisms, we applied intermittent theta burst stimulation (iTBS), a facilitatory repetitive transcranial magnetic stimulation (rTMS) protocol, over right somatosensory larynx cortex (S1) and a nonvocal dorsal S1 control area in participants without singing experience. A pitch‐matching singing task was performed before and after iTBS to assess corresponding effects on vocal pitch regulation. When participants could monitor auditory feedback from their own voice during singing (Experiment I), no difference in pitch‐matching performance was found between iTBS sessions. However, when auditory feedback was masked with noise (Experiment II), only larynx‐S1 iTBS enhanced pitch accuracy (50–250 ms after sound onset) and pitch stability (>250 ms after sound onset until the end). Results indicate that somatosensory feedback plays a dominant role in vocal pitch regulation when acoustic feedback is masked. The acoustic changes moreover suggest that right larynx‐S1 stimulation affected the preparation and involuntary regulation of vocal pitch accuracy, and that kinesthetic‐proprioceptive processes play a role in the voluntary control of pitch stability in nonsingers. Together, these data provide evidence for a causal involvement of right larynx‐S1 in vocal pitch regulation during singing.

  • 13.
    Shore, Todd
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Androulakaki, Theofronia
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    KTH Tangrams: A Dataset for Research on Alignment and Conceptual Pacts in Task-Oriented Dialogue (2019). In: LREC 2018 - 11th International Conference on Language Resources and Evaluation, Tokyo, 2019, p. 768-775. Conference paper (Refereed)
    Abstract [en]

    There is a growing body of research focused on task-oriented instructor-manipulator dialogue, whereby one dialogue participant initiates a reference to an entity in a common environment while the other participant must resolve this reference in order to manipulate said entity. Many of these works are based on disparate, if nevertheless similar, datasets. This paper describes an English corpus of referring expressions in relatively free, unrestricted dialogue, with physical features generated in a simulation, which facilitates analysis of dialogic linguistic phenomena regarding alignment in the formation of referring expressions known as conceptual pacts.

  • 14.
    Körner Gustafsson, Joakim
    et al.
    Karolinska Institutet.
    Södersten, Maria
    Ternström, Sten
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Schalling, Ellika
    Long-term effects of Lee Silverman Voice Treatment on daily voice use in Parkinson’s disease as measured with a portable voice accumulator (2019). In: Logopedics, Phoniatrics, Vocology, ISSN 1401-5439, E-ISSN 1651-2022, Vol. 44, no. 3, p. 124-133. Article in journal (Refereed)
    Abstract [en]

    This study examines the effects of an intensive voice treatment focusing on increasing voice intensity, LSVT LOUD (Lee Silverman Voice Treatment), on voice use in daily life in a participant with Parkinson’s disease, using a portable voice accumulator, the VoxLog. A secondary aim was to compare voice use between the participant and a matched healthy control. Participants were an individual with Parkinson’s disease and his healthy monozygotic twin. Voice use was registered with the VoxLog during 9 weeks for the individual with Parkinson’s disease and 2 weeks for the control. This included baseline registrations for both participants, 4 weeks during LSVT LOUD for the individual with Parkinson’s disease, and 1 week after treatment for both participants. For the participant with Parkinson’s disease, follow-up registrations were made at 3, 6, and 12 months post-treatment. The individual with Parkinson’s disease increased voice intensity during registrations in daily life by 4.1 dB post-treatment and 1.4 dB at 1-year follow-up compared to before treatment. When monitored during laboratory recordings, an increase of 5.6 dB was seen post-treatment and 3.8 dB at 1-year follow-up. Changes in voice intensity were interpreted as a treatment effect, as no significant correlations between changes in voice intensity and background noise were found for the individual with Parkinson’s disease. The increase in voice intensity in a laboratory setting was comparable to findings previously reported following LSVT LOUD. The increase registered using ambulatory monitoring in daily life was lower, but still reflects a clinically relevant change.

  • 15.
    Stefanov, Kalin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Salvi, Giampiero
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Beskow, Jonas
    KTH, Superseded Departments (pre-2005), Speech, Music and Hearing.
    Modeling of Human Visual Attention in Multiparty Open-World Dialogues (2019). In: ACM Transactions on Human-Robot Interaction, ISSN 2573-9522, Vol. 8, no. 2, article id UNSP 8. Article in journal (Refereed)
    Abstract [en]

    This study proposes, develops, and evaluates methods for modeling the eye-gaze direction and head orientation of a person in multiparty open-world dialogues, as a function of low-level communicative signals generated by his or her interlocutors. These signals include speech activity, eye-gaze direction, and head orientation, all of which can be estimated in real time during the interaction. By utilizing these signals and novel data representations suitable for the task and context, the developed methods can generate plausible candidate gaze targets in real time. The methods are based on feedforward neural networks and long short-term memory networks. The proposed methods are developed using several hours of unrestricted interaction data, and their performance is compared with a heuristic baseline method. The study offers an extensive evaluation of the proposed methods that investigates the contribution of different predictors to the accurate generation of candidate gaze targets. The results show that the methods can accurately generate candidate gaze targets when the person being modeled is in a listening state. However, when the person being modeled is in a speaking state, the proposed methods yield significantly lower performance.
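
    A minimal sketch of the sequence-model variant described above is given below; the feature layout, layer sizes and the number of candidate gaze targets are illustrative assumptions rather than the study's implementation.

```python
# Minimal sketch: LSTM mapping interlocutors' low-level signals to a candidate gaze target.
import torch
import torch.nn as nn

class GazeTargetLSTM(nn.Module):
    def __init__(self, feat_dim=12, hidden_dim=64, n_targets=4):
        # feat_dim: per-frame speech activity, eye-gaze and head-orientation features (assumed layout)
        # n_targets: candidate gaze targets, e.g. each interlocutor, the table, elsewhere (assumed)
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_targets)

    def forward(self, signals):
        """signals: (batch, time, feat_dim) -> per-frame logits over candidate gaze targets."""
        h, _ = self.lstm(signals)
        return self.head(h)
```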

  • 16.
    Skantze, Gabriel
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal Conversational Interaction with Robots (2019). In: The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions / [ed] Sharon Oviatt, Björn Schuller, Philip R. Cohen, Daniel Sonntag, Gerasimos Potamianos, Antonio Krüger, ACM Press, 2019. Chapter in book (Refereed)
  • 17.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal Language Grounding for Human-Robot Collaboration: YRRSDS 2019 - Dimosthenis Kontogiorgos (2019). In: Young Researchers Roundtable on Spoken Dialogue Systems, 2019. Conference paper (Refereed)
  • 18.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Normalized time-domain parameters for electroglottographic waveforms (2019). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 146, no. 1, p. EL65-EL70, article id 1.5117174. Article in journal (Refereed)
    Abstract [en]

    The electroglottographic waveform is of interest for characterizing phonation non-invasively. Existing parameterizations tend to give disparate results because they rely on somewhat arbitrary thresholds and/or contacting events. It is shown that neither is needed for formulating a normalized contact quotient and a normalized peak derivative. A heuristic combination of the two also resolves the ambiguity of a moderate contact quotient with regard to whether vocal fold contacting is firm versus weak or absent. As preliminaries, schemes for electroglottographic signal preconditioning and time-domain period detection are described that improve somewhat on similar methods. The algorithms are simple and compute quickly.
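
    The two parameters named in this abstract can be illustrated for a single, already-segmented EGG cycle as in the sketch below; the paper's signal preconditioning and period detection are not reproduced, and the helper name is hypothetical.

```python
# Illustrative computation for one EGG cycle; not the paper's algorithm.
import numpy as np

def normalized_egg_parameters(cycle):
    """cycle: 1-D array of EGG samples covering exactly one glottal period."""
    cycle = np.asarray(cycle, dtype=float)
    norm = (cycle - cycle.min()) / (cycle.max() - cycle.min() + 1e-12)  # amplitude-normalize to 0..1
    qci = float(np.mean(norm))                 # contact quotient as area under the normalized cycle
    d = np.diff(norm) * len(norm)              # derivative with respect to normalized time
    peak_derivative = float(d.max())           # normalized peak rate of increasing contact
    return qci, peak_derivative
```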

  • 19.
    Kucherenko, Taras
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Hasegawa, Dai
    Hokkai Gakuen University, Sapporo, Japan.
    Naoshi, Kaneko
    Aoyama Gakuin University, Sagamihara, Japan.
    Henter, Gustav Eje
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kjellström, Hedvig
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract (2019). Conference paper (Refereed)
    Abstract [en]

    This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features as input and produces gestures in the form of sequences of 3D joint coordinates representing motion as output. The results of objective and subjective evaluations confirm the benefits of the representation learning.

  • 20.
    Friberg, Anders
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Bisesi, Erica
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Inst Pasteur, France.
    Addessi, Anna Rita
    Univ Bologna, Dept Educ Studies, Bologna, Italy..
    Baroni, Mario
    Univ Bologna, Dept Arts, Bologna, Italy..
    Probing the Underlying Principles of Perceived Immanent Accents Using a Modeling Approach (2019). In: Frontiers in Psychology, ISSN 1664-1078, E-ISSN 1664-1078, Vol. 10, article id 1024. Article in journal (Refereed)
    Abstract [en]

    This article deals with the question of how the perception of the "immanent accents" can be predicted and modeled. By immanent accent we mean any musical event in the score that is related to important points in the musical structure (e.g., tactus positions, melodic peaks) and is therefore able to capture the attention of a listener. Our aim was to investigate the underlying principles of these accented notes by combining quantitative modeling, music analysis and experimental methods. A listening experiment was conducted where 30 participants indicated perceived accented notes for 60 melodies, vocal and instrumental, selected from Baroque, Romantic and Posttonal styles. This produced a large and unique collection of perceptual data about the perceived immanent accents, organized by styles consisting of vocal and instrumental melodies within Western art music. The music analysis of the indicated accents provided a preliminary list of musical features that could be identified as possible reasons for the raters' perception of the immanent accents. These features related to the score in different ways, e.g., repeated fragments, single notes, or overall structure. A modeling approach was used to quantify the influence of feature groups related to pitch contour, tempo, timing, simple phrasing, and meter. A set of 43 computational features was defined from the music analysis and previous studies and extracted from the score representation. The mean ratings of the participants were predicted using multiple linear regression and support vector regression. The latter method (using cross-validation) obtained the best result of about 66% explained variance (r = 0.81) across all melodies and for a selected group of raters. The independent contribution of each feature group was relatively high for pitch contour and timing (9.6 and 7.0%). There were also significant contributions from tempo (4.5%), simple phrasing (4.4%), and meter (3.9%). Interestingly, the independent contribution varied greatly across participants, implying different listener strategies, and also some variability across different styles. The large differences among listeners emphasize the importance of considering the individual listener's perception in future research in music perception.
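
    The regression step described above, predicting the mean accent ratings from score-based features with support vector regression under cross-validation, can be sketched as below; scikit-learn, the RBF kernel and the correlation summary are assumptions, not the authors' exact setup.

```python
# Sketch of rating prediction with cross-validated support vector regression; settings are assumptions.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict

def evaluate_accent_model(features, mean_ratings, folds=10):
    """features: (n_notes, n_features) matrix; mean_ratings: mean accent rating per note."""
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
    predicted = cross_val_predict(model, features, mean_ratings, cv=folds)
    r = np.corrcoef(predicted, mean_ratings)[0, 1]
    return r, r ** 2        # correlation and explained variance
```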

  • 21.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The Effects of Embodiment and Social Eye-Gaze in Conversational Agents (2019). In: Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci), 2019. Conference paper (Refereed)
    Abstract [en]

    The adoption of conversational agents is growing at a rapid pace. Agents however, are not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this work, we explore the effects of simulating anthropomorphism and social eye-gaze in three conversational agents. We tested whether subjects’ visual attention would be similar to agents in different forms of embodiment and social eye-gaze. In a within-subject situated interaction study (N=30), we asked subjects to engage in task-oriented dialogue with a smart speaker and two variations of a social robot. We observed shifting of interactive behaviour by human users, as shown in differences in behavioural and objective measures. With a trade-off in task performance, social facilitation is higher with more anthropomorphic social agents when performing the same task.

  • 22.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    The Trade-off between Interaction Time and Social Facilitation with Collaborative Social Robots (2019). In: The Challenges of Working on Social Robots that Collaborate with People, 2019. Conference paper (Refereed)
    Abstract [en]

    The adoption of social robots and conversational agents is growing at a rapid pace. These agents, however, are still not optimised to simulate key social aspects of situated human conversational environments. Humans are intellectually biased towards social activity when facing more anthropomorphic agents or when presented with subtle social cues. In this paper, we discuss the effects of simulating anthropomorphism and non-verbal social behaviour in social robots and its implications for human-robot collaborative guided tasks. Our results indicate that it is not always favourable for agents to be anthropomorphised or to communicate with nonverbal behaviour. We found a clear trade-off between interaction time and social facilitation when controlling for anthropomorphism and social behaviour.

  • 23.
    Selamtzis, Andreas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Richter, Bernard
    Burk, Fabian
    Köberlein, Maria
    Echternach, Matthias
    A comparison of electroglottographic and glottal area waveforms for phonation type differentiation in male professional singers (2018). Manuscript (preprint) (Other academic)
    Abstract [en]

    This study investigates the use of glottographic signals (EGG and GAW) to study phonation in different vibratory states as produced by professionally trained singers. Six western classical tenors were asked to phonate pitch glides from modal to falsetto phonation, or modal to their stage voice above the passaggio (SVaP). For each pitch glide the sample entropy (SampEn) of the EGG signal was calculated to establish a “ground truth” for the performed phonation type; the cycles before the maximum SampEn peak were labeled as modal, and the cycles after the peak as falsetto, or SVaP. Three classifications of vibratory state were performed using clustering: one based only on the EGG, one based on the GAW, and one based on their combination. The classification error rate (clustering vs ground truth) was on average smaller than 10% for any of the three settings, revealing no special advantage of the GAW over EGG, and vice versa. The EGG-based time domain metric analysis revealed a larger contact quotient and larger normalized EGG derivative peak ratio in modal, compared to SVaP and falsetto. The glottographic waveform comparison of SVaP with falsetto and modal suggests that SVaP resembles falsetto more than modal, though with a larger contact quotient.

  • 24.
    Selamtzis, Andreas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Ternström, Sten
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Richter, Bernard
    Burk, Fabian
    Köberlein, Marie
    Echternach, Matthias
    A comparison of electroglottographic and glottal area waveforms for phonation type differentiation in male professional singers (2018). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 144, no. 6, p. 3275-3288. Article in journal (Refereed)
    Abstract [en]

    This study compares the use of electroglottograms (EGGs) and glottal area waveforms (GAWs) to study phonation in different vibratory states as produced by professionally trained singers. Six western classical tenors were asked to phonate pitch glides from modal to falsetto phonation, or from modal to their stage voice above the passaggio (SVaP). For each pitch glide the sample entropy (SampEn) of the EGG signal was calculated to detect the occurrence of phonatory instabilities and establish a “ground truth” for the performed phonation type. The cycles before the maximum SampEn were labeled as modal, and the cycles after the peak were labeled as either falsetto or SVaP. Three automatic categorizations of vibratory state were performed using clustering: one based only on the EGG, one based on the GAW, and one based on their combination. The error rate (clustering vs ground truth) was, on average, lower than 10% for all three settings, revealing no special advantage of the GAW over EGG, and vice versa.

  • 25.
    Sibirtseva, Elena
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Nykvist, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Karaoguz, Hakan
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kragic, Danica
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    A Comparison of Visualisation Methods for Disambiguating Verbal Requests in Human-Robot Interaction (2018). In: 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2018. Conference paper (Refereed)
    Abstract [en]

    Picking up objects requested by a human user is a common task in human-robot interaction. When multiple objects match the user's verbal description, the robot needs to clarify which object the user is referring to before executing the action. Previous research has focused on perceiving user's multimodal behaviour to complement verbal commands or minimising the number of follow up questions to reduce task time. In this paper, we propose a system for reference disambiguation based on visualisation and compare three methods to disambiguate natural language instructions. In a controlled experiment with a YuMi robot, we investigated realtime augmentations of the workspace in three conditions - head-mounted display, projector, and a monitor as the baseline - using objective measures such as time and accuracy, and subjective measures like engagement, immersion, and display interference. Significant differences were found in accuracy and engagement between the conditions, but no differences were found in task time. Despite the higher error rates in the head-mounted display condition, participants found that modality more engaging than the other two, but overall showed preference for the projector condition over the monitor and head-mounted display conditions.

  • 26.
    Hultén, Magnus
    et al.
    Linköpings universitet.
    Artman, Henrik
    KTH, School of Computer Science and Communication (CSC), Media Technology and Interaction Design, MID.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    A model to analyse students’ cooperative idea generation in conceptual design (2018). In: International Journal of Technology and Design Education, ISSN 0957-7572, E-ISSN 1573-1804, Vol. 28, no. 2, p. 451-470. Article in journal (Refereed)
    Abstract [en]

    In this article we focus on the co-creation of ideas. Through the use of concepts from collaborative learning and communication theory, we suggest a model that will enable the cooperative nature of creative design tasks to emerge. Four objectives of the model are stated and elaborated on in the paper: that the model should be anchored in previous research; that it should allow for collaborative aspects of creative design to be accounted for; that it should address the mechanisms by which new ideas are generated, embraced and cultivated during actual design; and that it should have a firm theoretical grounding. The model is also exemplified by two test sessions in which two student pairs perform a time-constrained design task. We hope that the model can serve as an educational tool for students and teachers in design education, but primarily as a model for analysing students' cooperative idea generation in conceptual design.

  • 27.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Avramova, Vanya
    KTH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction (2018). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data, in that we captured speech, eye gaze and gesture data using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion by both perspectives of a speaker and a listener. We annotated the groups’ object references during object manipulation tasks and analysed the group’s proportional referential eye-gaze with regards to the referent object. When investigating the distributions of gaze during and before referring expressions we could corroborate the differences in time between speakers’ and listeners’ eye gaze found in earlier studies. This corpus is of particular interest to researchers who are interested in social eye-gaze patterns in turn-taking and referring language in situated multi-party interaction.

  • 28.
    Fallgren, P.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Z.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Edlund, Jens
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A tool for exploring large amounts of found audio data (2018). In: CEUR Workshop Proceedings, CEUR-WS, 2018, p. 499-503. Conference paper (Refereed)
    Abstract [en]

    We demonstrate a method and a set of open source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will contain early versions of a set of functionalities, and will provide good insight into how the method can be used to browse through large quantities of audio data efficiently.

  • 29.
    Chettri, Bhusan
    et al.
    Queen Mary Univ London, Sch EECS, London, England..
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Queen Mary Univ London, Sch EECS, London, England..
    Benetos, Emmanouil
    Queen Mary Univ London, Sch EECS, London, England..
    Analysing replay spoofing countermeasure performance under varied conditions (2018). In: 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP) / [ed] Pustelnik, N., Ma, Z., Tan, Z.-H., Larsen, J., IEEE, 2018. Conference paper (Refereed)
    Abstract [en]

    In this paper, we aim to understand what makes replay spoofing detection difficult in the context of the ASVspoof 2017 corpus. We use FFT spectra, mel frequency cepstral coefficients (MFCC) and inverted MFCC (IMFCC) frontends and investigate different back-ends based on Convolutional Neural Networks (CNNs), Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs). On this database, we find that IMFCC frontend based systems show smaller equal error rate (EER) for high quality replay attacks but higher EER for low quality replay attacks in comparison to the baseline. However, we find that it is not straightforward to understand the influence of an acoustic environment (AE), a playback device (PD) and a recording device (RD) of a replay spoofing attack. One reason is the unavailability of metadata for genuine recordings. Second, it is difficult to account for the effects of the factors: AE, PD and RD, and their interactions. Finally, our frame-level analysis shows that the presence of cues (recording artefacts) in the first few frames of genuine signals (missing from replayed ones) influence class prediction.
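
    One of the front-end/back-end combinations mentioned above, an MFCC front-end with a two-class GMM back-end scored by a log-likelihood ratio, can be sketched as below together with an EER computation; librosa, scikit-learn and all hyperparameters here are illustrative assumptions rather than the paper's systems.

```python
# Illustrative MFCC + GMM spoofing countermeasure and EER computation; not the paper's systems.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_curve

def mfcc_frames(path, sr=16000, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T      # (frames, n_mfcc)

def train_gmms(genuine_feats, spoof_feats, components=64):
    gmm_gen = GaussianMixture(components, covariance_type="diag").fit(np.vstack(genuine_feats))
    gmm_spf = GaussianMixture(components, covariance_type="diag").fit(np.vstack(spoof_feats))
    return gmm_gen, gmm_spf

def llr_score(gmm_gen, gmm_spf, feats):
    """Average per-frame log-likelihood ratio; positive values favour 'genuine'."""
    return gmm_gen.score(feats) - gmm_spf.score(feats)

def equal_error_rate(labels, scores):
    """labels: 1 = genuine, 0 = spoof; scores: higher = more genuine-like."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.nanargmin(np.abs(fpr - (1 - tpr)))
    return (fpr[idx] + (1 - tpr[idx])) / 2
```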

  • 30.
    Chettri, Bhusan
    et al.
    Queen Mary Univ London, Sch EECS, London, England..
    Mishra, Saumitra
    Queen Mary Univ London, Sch EECS, London, England..
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Queen Mary Univ London, Sch EECS, London, England..
    Analysing the predictions of a CNN-based replay spoofing detection system (2018). In: 2018 IEEE Workshop on Spoken Language Technology (SLT 2018), IEEE, 2018, p. 92-97. Conference paper (Refereed)
    Abstract [en]

    Playing recorded speech samples of an enrolled speaker – “replay attack” – is a simple approach to bypass an automatic speaker verification (ASV) system. The vulnerability of ASV systems to such attacks has been acknowledged and studied, but there has been no research into what spoofing detection systems are actually learning to discriminate. In this paper, we analyse the local behaviour of a replay spoofing detection system based on convolutional neural networks (CNNs), adapted from a state-of-the-art CNN (LCNN-FFT) submitted to the ASVspoof 2017 challenge. We generate temporal and spectral explanations for predictions of the model using the SLIME algorithm. Our findings suggest that in most instances of spoofing the model is using information in the first 400 milliseconds of each audio instance to make the class prediction. Knowledge of the characteristics that spoofing detection systems are exploiting can help build less vulnerable ASV systems, other spoofing detection systems, as well as better evaluation databases.

  • 31.
    Näslund, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Artificial Neural Networks in Swedish Speech Synthesis (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Text-to-speech (TTS) systems have entered our daily lives in the form of smart assistants and many other applications. Contemporary research applies machine learning and artificial neural networks (ANNs) to synthesize speech. It has been shown that these systems outperform the older concatenative and parametric methods.

    In this paper, ANN-based methods for speech synthesis are explored and one of the methods is implemented for the Swedish language. The implemented method is dubbed “Tacotron” and is a first step towards end-to-end ANN-based TTS which puts many different ANN techniques to work. The resulting system is compared to a parametric TTS through a strength-of-preference test that is carried out with 20 Swedish-speaking subjects. A statistically significant preference for the ANN-based TTS is found. Test subjects indicate that the ANN-based TTS performs better than the parametric TTS when it comes to audio quality and naturalness but sometimes lacks in intelligibility.

  • 32.
    Dabbaghchian, Saeed
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Computational Modeling of the Vocal Tract: Applications to Speech Production (2018). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Human speech production is a complex process, involving neuromuscular control signals, the effects of articulators' biomechanical properties and acoustic wave propagation in a vocal tract tube of intricate shape. Modeling these phenomena may play an important role in advancing our understanding of the involved mechanisms, and may also have future medical applications, e.g., guiding doctors in diagnosis, treatment planning, and surgery prediction for related disorders such as oral cancer, cleft palate, obstructive sleep apnea, and dysphagia.

    A more complete understanding requires models that are as truthful representations as possible of the phenomena. Due to the complexity of such modeling, simplifications have nevertheless been used extensively in speech production research: phonetic descriptors (such as the position and degree of the most constricted part of the vocal tract) are used as control signals, the articulators are represented as two-dimensional geometrical models, the vocal tract is considered as a smooth tube and plane wave propagation is assumed, etc.

    This thesis aims at firstly investigating the consequences of such simplifications, and secondly at contributing to establishing unified modeling of the speech production process, by connecting three-dimensional biomechanical modeling of the upper airway with three-dimensional acoustic simulations. The investigation on simplifying assumptions demonstrated the influence of vocal tract geometry features — such as shape representation, bending and lip shape — on its acoustic characteristics, and that the type of modeling — geometrical or biomechanical — affects the spatial trajectories of the articulators, as well as the transition of formant frequencies in the spectrogram.

    The unification of biomechanical and acoustic modeling in three dimensions allows the acoustic output of dynamic sounds, such as vowel-vowel utterances, to be controlled realistically by contraction of the relevant muscles. This moves and shapes the speech articulators that in turn define the vocal tract tube in which the wave propagation occurs. The main contribution of the thesis in this line of work is a novel and complex method that automatically reconstructs the shape of the vocal tract from the biomechanical model. This step is essential to link biomechanical and acoustic simulations, since the vocal tract, which anatomically is a cavity enclosed by different structures, is only implicitly defined in a biomechanical model constituted of several distinct articulators.

  • 33.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Crowdsourced Multimodal Corpora Collection Tool2018In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 728-734Conference paper (Refereed)
    Abstract [en]

    In recent years, more and more multimodal corpora have been created. To our knowledge there is no publicly available tool which allows for acquiring controlled multimodal data of people in a rapid and scalable fashion. We therefore propose (1) a novel tool which will enable researchers to rapidly gather large amounts of multimodal data spanning a wide demographic range, and (2) an example of how we used this tool for the collection of our "Attentive listener" multimodal corpus. The code is released under an Apache License 2.0 and available as an open-source repository, which can be found at https://github.com/kth-social-robotics/multimodal-crowdsourcing-tool. This tool will allow researchers to set up their own multimodal data collection system quickly and create their own multimodal corpora. Finally, this paper provides a discussion of the advantages and disadvantages of a crowd-sourced data collection tool, especially in comparison to lab-recorded corpora.

  • 34.
    Fermoselle, Leonor
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Designing joint attention systems for robots that assist children with autism spectrum disorders2018Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    Joint attention behaviours play a central role in natural and believable human-robot interactions. This research presents the design decisions behind a semi-autonomous joint attention robotic system, together with an evaluation of its effectiveness and perceived social presence across different cognitive ability groups. For this purpose, two studies were carried out: first with adults, and then with children between 10 and 12 years old.

    The overall results for both studies reflect a system that is perceived as socially present and engaging which can successfully establish joint attention with the participants. When comparing the performance results between the two groups, children achieved higher joint attention scores and reported a higher level of enjoyment and helpfulness in the interaction.

    Furthermore, a detailed literature review on robot-assisted therapies for children with autism spectrum disorders is presented, focusing on the development of joint attention skills. The children’s positive interaction results from the studies, together with state-of-the-art research therapies and input from an autism therapist, led the author to formulate design guidelines for a robotic system to assist in joint-attention-focused autism therapies.

  • 35.
    Li, Chengjie
    et al.
    KTH.
    Androulakaki, Theofronia
    KTH.
    Gao, Alex Yuan
    Yang, Fangkai
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Saikia, Himangshu
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Peters, Christopher
    KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Effects of Posture and Embodiment on Social Distance in Human-Agent Interaction in Mixed Reality2018In: Proceedings of the 18th International Conference on Intelligent Virtual Agents, ACM Digital Library, 2018, p. 191-196Conference paper (Refereed)
    Abstract [en]

    Mixed reality offers new potentials for social interaction experiences with virtual agents. In addition, it can be used to experiment with the design of physical robots. However, while previous studies have investigated comfortable social distances between humans and artificial agents in real and virtual environments, there is little data with regard to mixed reality environments. In this paper, we conducted an experiment in which participants were asked to walk up to an agent to ask a question, in order to investigate the social distances maintained, as well as the participants' experience of the interaction. We manipulated both the embodiment of the agent (robot vs. human and virtual vs. physical) and the agent's closed vs. open posture. The virtual agent was displayed using a mixed reality headset. Our experiment involved 35 participants in a within-subject design. We show that, in the context of social interactions, mixed reality fares well against physical environments, and robots fare well against humans, barring a few technical challenges.

  • 36.
    Ternström, Sten
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    D'Amario, Sara
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. University of York.
    Selamtzis, Andreas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Effects of the lung volume on the electroglottographic waveform in trained female singers2018In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588Article in journal (Refereed)
    Abstract [en]

    Objectives: To determine if in singing there is an effect of lung volume on the electroglottographic waveform, and if so, how it varies over the voice range. Study design: Eight trained female singers sang the tune “Frère Jacques” in 18 conditions: three phonetic contexts, three dynamic levels, and high or low lung volume. Conditions were randomized and replicated. Methods: The audio and EGG signals were recorded in synchrony with signals tracking respiration and vertical larynx position. The first 10 Fourier descriptors of every EGG cycle were computed. These spectral data were clustered statistically, and the clusters were mapped by color into a voice range profile display, thus visualizing the EGG waveform changes under the influence of fo and SPL. The rank correlations and effect sizes of the relationships between relative lung volume and several adduction-related EGG wave shape metrics were similarly rendered on a color scale, in voice range profile-style ʻvoice maps.ʼ Results: In most subjects, EGG waveforms varied considerably over the voice range. Within subjects, reproducibility was high, not only across the replications, but also across the phonetic contexts. The EGG waveforms were quite individual, as was the nature of the EGG shape variation across the range. EGG metrics were significantly correlated to changes in lung volume, in parts of the range of the song, and in most subjects. However, the effect sizes of the relative lung volume were generally much smaller than the effects of fo and SPL, and the relationships always varied, even changing polarity from one part of the range to another. Conclusions: Most subjects exhibited small, reproducible effects of the relative lung volume on the EGG waveform. Some hypothesized influences of tracheal pull were seen, mostly at the lowest SPLs. The effects were however highly variable, both across the moderately wide fo-SPL range and across subjects. Different singers may be applying different techniques and compensatory behaviors with changing lung volume. The outcomes emphasize the importance of making observations over a substantial part of the voice range, and not only of phonations sustained at a few fundamental frequencies and sound levels.
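
    To illustrate the kind of pipeline described above (cycle-wise Fourier descriptors followed by statistical clustering, colored into a voice map), here is a hedged sketch; it is not the authors' implementation, and the use of k-means, the number of clusters and the DC normalization are assumptions:

```python
# Illustrative sketch of cycle-wise Fourier descriptors followed by clustering;
# not the authors' exact method. K-means, n_clusters=5 and the DC normalization
# are assumptions made for this example.
import numpy as np
from sklearn.cluster import KMeans

def fourier_descriptors(cycle, n=10):
    """First n complex Fourier components of one EGG cycle, normalized by the
    DC term so that the descriptor is insensitive to overall amplitude."""
    spec = np.fft.rfft(np.asarray(cycle, dtype=float))
    return spec[1:n + 1] / (abs(spec[0]) + 1e-12)

def cluster_cycles(cycles, n_clusters=5, n=10):
    """cycles: list of 1-D arrays, one per EGG cycle (lengths may differ).
    Returns one cluster label per cycle, e.g. for coloring a voice map."""
    feats = np.array([np.concatenate([d.real, d.imag])
                      for d in (fourier_descriptors(c, n) for c in cycles)])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)
```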

  • 37.
    Holzapfel, André
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Media Technology and Interaction Design, MID.
    Sturm, Bob
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Coeckelbergh, Mark
    Department of Philosophy, University of Vienna, Vienna, Austria.
    Ethical Dimensions of Music Information Retrieval Technology2018In: Transactions of the International Society for Music Information Retrieval, E-ISSN 2514-3298, Vol. 1, no 1, p. 44-55Article in journal (Refereed)
    Abstract [en]

    This article examines ethical dimensions of Music Information Retrieval (MIR) technology.  It uses practical ethics (especially computer ethics and engineering ethics) and socio-technical approaches to provide a theoretical basis that can inform discussions of ethics in MIR. To help ground the discussion, the article engages with concrete examples and discourse drawn from the MIR field. This article argues that MIR technology is not value-neutral but is influenced by design choices, and so has unintended and ethically relevant implications. These can be invisible unless one considers how the technology relates to wider society. The article points to the blurring of boundaries between music and technology, and frames music as “informationally enriched” and as a “total social fact.” The article calls attention to biases that are introduced by algorithms and data used for MIR technology, cultural issues related to copyright, and ethical problems in MIR as a scientific practice. The article concludes with tentative ethical guidelines for MIR developers, and calls for addressing key ethical problems with MIR technology and practice, especially those related to forms of bias and the remoteness of the technology development from end users.

  • 38.
    Jonell, Patrik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Mattias, Bystedt
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Per, Fallgren
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Kontogiorgos, Dimosthenis
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    David Aguas Lopes, José
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Malisz, Zofia
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Samuel, Mascarenhas
    GAIPS INESC-ID, Lisbon, Portugal.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Eran, Raveh
    Multimodal Computing and Interaction, Saarland University, Germany.
    Shore, Todd
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    FARMI: A Framework for Recording Multi-Modal Interactions2018In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris: European Language Resources Association, 2018, p. 3969-3974Conference paper (Refereed)
    Abstract [en]

    In this paper we present (1) a processing architecture used to collect multi-modal sensor data, both for corpora collection and real-time processing, (2) an open-source implementation thereof and (3) a use-case where we deploy the architecture in a multi-party deception game, featuring six human players and one robot. The architecture is agnostic to the choice of hardware (e.g. microphones, cameras, etc.) and programming languages, although our implementation is mostly written in Python. In our use-case, different methods of capturing verbal and non-verbal cues from the participants were used. These were processed in real-time and used to inform the robot about the participants’ deceptive behaviour. The framework is of particular interest for researchers who are interested in the collection of multi-party, richly recorded corpora and the design of conversational systems. Moreover, for researchers interested in human-robot interaction, the available modules make it easy to create both autonomous and Wizard-of-Oz interactions.
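
    The paper describes the architecture itself; the sketch below is a generic, hypothetical illustration of the underlying idea of hardware-agnostic, timestamped capture through a shared queue and a common clock. It is not the FARMI API, and the names sensor_worker and record are invented for illustration only:

```python
# Generic, hypothetical sketch of hardware-agnostic, timestamped capture via a
# shared queue and a common clock; this is NOT the FARMI API, and the names
# sensor_worker/record are invented for illustration only.
import queue
import threading
import time

def sensor_worker(name, read_sample, out_q, stop, poll_s=0.01):
    """Poll one sensor (read_sample is any callable) and tag each sample
    with a timestamp from a clock shared by all sensors."""
    while not stop.is_set():
        out_q.put({"sensor": name, "t": time.monotonic(), "data": read_sample()})
        time.sleep(poll_s)                      # crude fixed polling rate for the sketch

def record(sensors, duration_s=5.0):
    """sensors: dict {name: read_sample_callable}. Returns timestamped samples."""
    out_q, stop, threads = queue.Queue(), threading.Event(), []
    for name, fn in sensors.items():
        t = threading.Thread(target=sensor_worker, args=(name, fn, out_q, stop), daemon=True)
        t.start()
        threads.append(t)
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join(timeout=1.0)
    return sorted((out_q.get() for _ in range(out_q.qsize())), key=lambda s: s["t"])

# e.g. record({"mic": lambda: 0.0, "gaze": lambda: (0.0, 0.0)}, duration_s=1.0)
```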

  • 39.
    Pabon, Peter
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics. Royal Conservatoire, The Hague, Netherlands.
    Ternström, Sten
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Feature maps of the acoustic spectrum of the voice2018In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588Article in journal (Refereed)
    Abstract [en]

    The change in the spectrum of sustained /a/ vowels was mapped over the voice range from low to high fundamental frequency and low to high sound pressure level (SPL), in the form of the so-called voice range profile (VRP). In each interval of one semitone and one decibel, narrowband spectra were averaged both within and across subjects. The subjects were groups of 7 male and 12 female singing students, as well as a group of 16 untrained female voices. For each individual and also for each group, pairs of VRP recordings were made, with stringent separation of the modal/chest and falsetto/head registers. Maps are presented of eight scalar metrics, each of which was chosen to quantify a particular feature of the voice spectrum, over fundamental frequency and SPL. Metrics 1 and 2 chart the role of the fundamental in relation to the rest of the spectrum. Metrics 3 and 4 are used to explore the role of resonances in relation to SPL. Metrics 5 and 6 address the distribution of high frequency energy, while metrics 7 and 8 seek to describe the distribution of energy at the low end of the voice spectrum. Several examples are observed of phenomena that are difficult to predict from linear source-filter theory, and of the voice source being less uniform over the voice range than is conventionally assumed. These include a high-frequency band-limiting at high SPL and an unexpected persistence of the second harmonic at low SPL. The two voice registers give rise to clearly different maps. Only a few effects of training were observed, in the low frequency end below 2 kHz. The results are of potential interest in voice analysis, voice synthesis and for new insights into the voice production mechanism.
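
    As an illustration of the binning step described above (averaging narrowband spectra within one-semitone by one-decibel cells of the voice range profile), a hedged sketch follows; the function name, the 55 Hz semitone reference and the simple rounding scheme are assumptions, not the authors' implementation:

```python
# Illustrative sketch of averaging per-frame spectra within one-semitone by
# one-decibel cells of a voice range profile; not the authors' implementation.
# The 55 Hz semitone reference and the simple rounding are assumptions.
import numpy as np
from collections import defaultdict

def vrp_average_spectra(spectra, f0_hz, spl_db, f_ref=55.0):
    """spectra: (n_frames, n_bins) magnitude spectra; f0_hz, spl_db: per-frame values.
    Returns {(semitone_cell, dB_cell): mean spectrum}."""
    semis = np.round(12.0 * np.log2(np.asarray(f0_hz) / f_ref)).astype(int)
    dbs = np.round(np.asarray(spl_db)).astype(int)
    cells = defaultdict(list)
    for spec, s, d in zip(np.asarray(spectra), semis, dbs):
        cells[(s, d)].append(spec)
    return {cell: np.mean(frames, axis=0) for cell, frames in cells.items()}
```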

  • 40.
    Sundberg, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Flow Glottogram and Subglottal Pressure Relationship in Singers and Untrained Voices2018In: Journal of Voice, ISSN 0892-1997, E-ISSN 1873-4588, Vol. 32, no 1, p. 23-31Article in journal (Refereed)
    Abstract [en]

    This article combines results from three earlier investigations of the glottal voice source during phonation at varying degrees of vocal loudness (1) in five classically trained baritone singers (Sundberg et al., 1999), (2) in 15 female and 14 male untrained voices (Sundberg et al., 2005), and (3) in voices rated as hyperfunctional by an expert panel (Millgard et al., 2015). Voice source data were obtained by inverse filtering. Associated subglottal pressures were estimated from oral pressure during the occlusion for the consonant /p/. Five flow glottogram parameters, (1) maximum flow declination rate (MFDR), (2) peak-to-peak pulse amplitude, (3) level difference between the first and the second harmonics of the voice source, (4) closed quotient, and (5) normalized amplitude quotient, were averaged across the singer subjects and related to associated MFDR values. Strong, quantitative relations, expressed as equations, are found between subglottal pressure and MFDR and between MFDR and each of the other flow glottogram parameters. The values for the untrained voices, as well as those for the voices rated as hyperfunctional, deviate systematically from the values derived from the equations.

  • 41.
    Ternström, Sten
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Music Acoustics.
    Johansson, Dennis
    Selamtzis, Andreas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    FonaDyn - A system for real-time analysis of the electroglottogram, over the voice range2018In: SoftwareX, ISSN 2352-7110, Vol. 7, p. 74-80Article in journal (Refereed)
    Abstract [en]

    From soft to loud and low to high, the mechanisms of the human voice have many degrees of freedom, making it difficult to assess phonation from the acoustic signal alone. FonaDyn is a research tool that combines acoustics with electroglottography (EGG). It characterizes and visualizes in real time the dynamics of EGG waveforms, using statistical clustering of the cycle-synchronous EGG Fourier components, and their sample entropy. The prevalence and stability of different EGG waveshapes are mapped as colored regions into a so-called voice range profile, without needing pre-defined thresholds or categories. With appropriately ‘trained’ clusters, FonaDyn can classify and map voice regimes. This is of potential scientific, clinical and pedagogical interest.
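
    FonaDyn's per-cycle features include the sample entropy of the EGG Fourier components. As a hedged illustration of the sample entropy measure itself (a standard textbook formulation, not the FonaDyn implementation; the defaults m = 2 and r = 0.2 times the standard deviation are conventional choices):

```python
# Standard sample entropy, shown only to illustrate the measure that FonaDyn
# uses; this is NOT the FonaDyn implementation. m=2 and r=0.2*SD are the
# conventional default choices.
import numpy as np

def sample_entropy(x, m=2, r=None):
    """SampEn(m, r) of a 1-D series x: -ln(A/B), where B counts pairs of
    length-m templates within tolerance r (Chebyshev distance) and A does
    the same for length m+1. Self-matches are excluded."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * np.std(x)

    def pair_count(length):
        templates = np.array([x[i:i + length] for i in range(len(x) - length)])
        dist = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        return (np.sum(dist < r) - len(templates)) / 2.0   # drop self-matches

    B, A = pair_count(m), pair_count(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else float("inf")
```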

  • 42.
    Kragic, Danica
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Karaoǧuz, Hakan
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Jensfelt, Patric
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Krug, Robert
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Interactive, collaborative robots: Challenges and opportunities2018In: IJCAI International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence , 2018, p. 18-25Conference paper (Refereed)
    Abstract [en]

    Robotic technology has transformed the manufacturing industry ever since the first industrial robot was put into use at the beginning of the 1960s. The challenge of developing flexible solutions where production lines can be quickly re-planned, adapted and structured for new or slightly changed products is still an important open problem. Industrial robots today are still largely preprogrammed for their tasks, not able to detect errors in their own performance or to robustly interact with a complex environment and a human worker. The challenges are even more serious when it comes to various types of service robots. Full robot autonomy, including natural interaction, learning from and with humans, and safe and flexible performance for challenging tasks in unstructured environments, will remain out of reach for the foreseeable future. In the envisioned future factory setups, home and office environments, humans and robots will share the same workspace and perform different object manipulation tasks in a collaborative manner. We discuss some of the major challenges of developing such systems and provide examples of the current state of the art.

  • 43.
    Peters, Christopher
    et al.
    KTH.
    Li, Chengjie
    KTH.
    Yang, Fangkai
    KTH.
    Avramova, Vanya
    KTH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Investigating Social Distances between Humans, Virtual Humans and Virtual Robots in Mixed Reality2018In: Proceedings of 17th International Conference on Autonomous Agents and MultiAgent Systems, 2018, p. 2247-2249Conference paper (Refereed)
    Abstract [en]

    Mixed reality environments offer new potentials for the design of compelling social interaction experiences with virtual characters. In this paper, we summarise initial experiments we are conducting in which we measure comfortable social distances between humans, virtual humans and virtual robots in mixed reality environments. We consider a scenario in which participants walk within a comfortable distance of a virtual character whose appearance is varied between a male and a female human, and between a standard-height and a human-height virtual Pepper robot. Our studies in mixed reality thus far indicate that humans adopt social zones with artificial agents that are similar to those in human-human social interactions and in interactions in virtual reality.

  • 44. Roddy, M.
    et al.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Harte, N.
    Investigating speech features for continuous turn-taking prediction using LSTMs2018In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, International Speech Communication Association, 2018, p. 586-590Conference paper (Refereed)
    Abstract [en]

    For spoken dialog systems to conduct fluid conversational interactions with users, the systems must be sensitive to turn-taking cues produced by a user. Models should be designed so that effective decisions can be made as to when it is appropriate, or not, for the system to speak. Traditional end-of-turn models, where decisions are made at utterance end-points, are limited in their ability to model fast turn-switches and overlap. A more flexible approach is to model turn-taking in a continuous manner using RNNs, where the system predicts speech probability scores for discrete frames within a future window. The continuous predictions represent generalized turn-taking behaviors observed in the training data and can be applied to make decisions that are not just limited to end-of-turn detection. In this paper, we investigate optimal speech-related feature sets for making predictions at pauses and overlaps in conversation. We find that while traditional acoustic features perform well, part-of-speech features generally perform worse than word features. We show that our current models outperform previously reported baselines.
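
    The continuous formulation outputs, for every input frame, speech probability scores over a window of future frames. A minimal PyTorch sketch of that output structure follows; the feature size, hidden size, window length and single-layer LSTM are assumptions for illustration, not the authors' model:

```python
# Minimal PyTorch sketch of a continuous turn-taking predictor that outputs,
# for every input frame, speech probabilities for a window of future frames.
# Feature size, hidden size, window length and the single-layer LSTM are
# assumptions for illustration, not the authors' model.
import torch
import torch.nn as nn

class ContinuousTurnTaking(nn.Module):
    def __init__(self, n_features=64, hidden=128, future_window=20):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, future_window)      # one score per future frame

    def forward(self, feats):                 # feats: (batch, time, n_features)
        h, _ = self.lstm(feats)
        return torch.sigmoid(self.head(h))    # (batch, time, future_window), in [0, 1]

# Training would typically use nn.BCELoss against binary voice-activity targets:
# probs = ContinuousTurnTaking()(torch.randn(2, 100, 64))
```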

  • 45.
    Sturm, Bob
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Ben-Tal, Oded
    Kingston University, UK.
    Let’s Have Another Gan Ainm: An experimental album of Irish traditional music and computer-generated tunes2018Report (Other academic)
    Abstract [en]

    This technical report details the creation and public release of an album of folk music, most of which comes from material generated by computer models trained on transcriptions of traditional music of Ireland and the UK. For each computer-generated tune appearing on the album, we provide below the original version and the alterations made.

  • 46.
    Malisz, Zofia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Żygis, M.
    Lexical stress in Polish: Evidence from focus and phrase-position differentiated production data2018In: Proceedings of the International Conference on Speech Prosody, International Speech Communications Association , 2018, p. 1008-1012Conference paper (Refereed)
    Abstract [en]

    We examine acoustic patterns of word stress in Polish in data with carefully separated phrase- and word-level prominences. We aim to verify claims in the literature regarding the phonetic and phonological status of lexical stress (both primary and secondary) in Polish and to contribute to a better understanding of prosodic prominence and boundary interactions. Our results show significant effects of primary stress on acoustic parameters such as duration, f0 functionals and spectral emphasis expected for a stress language. We do not find clear and systematic acoustic evidence for secondary stress.
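
    One common operationalization of spectral emphasis is the level difference between the full-band signal and the band below a cutoff tied to f0. The sketch below uses that definition with an assumed cutoff factor of 1.5 times f0; it is not necessarily the exact measure used in the paper:

```python
# Hedged sketch of one common spectral-emphasis measure: the overall level minus
# the level of the band below a cutoff tied to f0. The 1.5*f0 cutoff and the
# Hann window are assumptions; this is not necessarily the paper's exact measure.
import numpy as np

def spectral_emphasis(frame, sr, f0, cutoff_factor=1.5):
    """Level difference (dB) between the full band and the band at or below
    cutoff_factor * f0, for one speech frame sampled at sr Hz."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    total = np.sum(spec)
    low = np.sum(spec[freqs <= cutoff_factor * f0])
    return 10.0 * np.log10(total / (low + 1e-12))
```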

  • 47.
    Pabon, Peter
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Institute of Sonology, Royal Conservatory.
    Mapping Individual Voice Quality over the Voice Range: The Measurement Paradigm of the Voice Range Profile2018Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    The acoustic signal of voiced sounds has two primary attributes: fundamental frequency and sound level. It also has very many secondary attributes, or ‘voice qualities’, that can be derived from the acoustic signal, in particular from its periodicity and its spectrum. Acoustic voice analysis as a discipline is largely concerned with identifying and quantifying those qualities or parameters that are relevant for assessing the health or training status of a voice or that characterize the individual quality. The thesis presented here is that all such voice qualities covary essentially and individually with the fundamental frequency and the sound level, and that methods for assessing the voice must account for this covariation and individuality. The central interest of the "voice field" measurement paradigm then becomes mapping the proportional dependencies that exist between voice parameters. The five studies contribute to ways of doing this in practice, while the framework text presents the theoretical basis for the analysis model in relation to the practical principles.

  • 48.
    Elowsson, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics.
    Modeling Music: Studies of Music Transcription, Music Perception and Music Production2018Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    This dissertation presents ten studies focusing on three important subfields of music information retrieval (MIR): music transcription (Part A), music perception (Part B), and music production (Part C).

    In Part A, systems capable of transcribing rhythm and polyphonic pitch are described. The first two publications present methods for tempo estimation and beat tracking. A method is developed for computing the most salient periodicity (the “cepstroid”), and the computed cepstroid is used to guide the machine learning processing. The polyphonic pitch tracking system uses novel pitch-invariant and tone-shift-invariant processing techniques. Furthermore, the neural flux is introduced – a latent feature for onset and offset detection. The transcription systems use a layered learning technique with separate intermediate networks of varying depth.  Important music concepts are used as intermediate targets to create a processing chain with high generalization. State-of-the-art performance is reported for all tasks.

    Part B is devoted to perceptual features of music, which can be used as intermediate targets or as parameters for exploring fundamental music perception mechanisms. Systems are proposed that can predict the perceived speed and performed dynamics of an audio file with high accuracy, using the average ratings from around 20 listeners as ground truths. In Part C, aspects related to music production are explored. The first paper analyzes the long-term average spectrum (LTAS) of popular music. A compact equation is derived to describe the mean LTAS of a large dataset, and the variation is visualized. Further analysis shows that the level of the percussion is an important factor for LTAS. The second paper examines songwriting and composition through the development of an algorithmic composer of popular music. Various factors relevant for writing good compositions are encoded, and a listening test is employed that shows the validity of the proposed methods.

    The dissertation is concluded by Part D - Looking Back and Ahead, which acts as a discussion and provides a road-map for future work. The first paper discusses the deep layered learning (DLL) technique, outlining concepts and pointing out a direction for future MIR implementations. It is suggested that DLL can help generalization by enforcing the validity of intermediate representations, and by letting the inferred representations establish disentangled structures supporting high-level invariant processing. The second paper proposes an architecture for tempo-invariant processing of rhythm with convolutional neural networks. Log-frequency representations of rhythm-related activations are suggested at the main stage of processing. Methods relying on magnitude, relative phase, and raw phase information are described for a wide variety of rhythm processing tasks.
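
    As a rough illustration of the general idea in Part A of finding the most salient periodicity of an onset-strength envelope (not the dissertation's cepstroid method; the autocorrelation approach, BPM range and function name are assumptions):

```python
# Rough illustration of picking the most salient periodicity of an onset-strength
# envelope via autocorrelation; this is NOT the dissertation's cepstroid method,
# and the BPM range and function name are assumptions.
import numpy as np

def salient_tempo(onset_env, frame_rate, bpm_range=(50.0, 200.0)):
    """onset_env: 1-D onset-strength envelope sampled at frame_rate frames/s.
    Returns the tempo (BPM) whose lag maximizes the autocorrelation."""
    x = np.asarray(onset_env, dtype=float) - np.mean(onset_env)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]          # non-negative lags only
    lags = np.arange(len(ac), dtype=float)
    bpm = 60.0 * frame_rate / np.where(lags == 0, np.inf, lags)
    mask = (bpm >= bpm_range[0]) & (bpm <= bpm_range[1])
    return float(bpm[mask][np.argmax(ac[mask])])
```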

  • 49. Roddy, Matthew
    et al.
    Skantze, Gabriel
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Harte, Naomi
    Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs2018In: ICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction, 2018, p. 186-190Conference paper (Refereed)
    Abstract [en]

    In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.

  • 50.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Sibirtseva, Elena
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, perception and learning, RPL.
    Pereira, André
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Multimodal reference resolution in collaborative assembly tasks2018In: Multimodal reference resolution in collaborative assembly tasks, ACM Digital Library, 2018Conference paper (Refereed)