  • 1.
    Abelho Pereira, André Tiago
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    TU Delft, Delft, Netherlands.
    Fermoselle, Leonor
    TNO, Den Haag, Netherlands.
    Mendelson, Joe
    Furhat Robotics, Stockholm, Sweden.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Effects of Different Interaction Contexts when Evaluating Gaze Models in HRI (2020). Conference paper (Refereed)
    Abstract [en]

    We previously introduced a responsive joint attention system that uses multimodal information from users engaged in a spatial reasoning task with a robot and communicates joint attention via the robot’s gaze behavior [25]. An initial evaluation of our system with adults showed it to improve users’ perceptions of the robot’s social presence. To investigate the repeatability of our prior findings across settings and populations, here we conducted two further studies employing the same gaze system with the same robot and task but in different contexts: evaluation of the system with external observers and evaluation with children. The external observer study suggests that third-person perspectives over videos of gaze manipulations can be used either as a manipulation check before committing to costly real-time experiments or to further establish previous findings. However, the replication of our original adults study with children in school did not confirm the effectiveness of our gaze manipulation, suggesting that different interaction contexts can affect the generalizability of results in human-robot interaction gaze studies.

  • 2.
    Abelho Pereira, André Tiago
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Oertel, Catharine
    Computer-Human Interaction Lab for Learning & Instruction, Ecole Polytechnique Federale de Lausanne, Switzerland.
    Fermoselle, Leonor
    Mendelson, Joe
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Responsive Joint Attention in Human-Robot Interaction (2019). Conference paper (Refereed)
    Abstract [en]

    Joint attention has been shown to be crucial not only for human-human interaction but also for human-robot interaction. Joint attention can help to make cooperation more efficient, support disambiguation in instances of uncertainty and make interactions appear more natural and familiar. In this paper, we present an autonomous gaze system that uses multimodal perception capabilities to model responsive joint attention mechanisms. We investigate the effects of our system on people’s perception of a robot within a problem-solving task. Results from a user study suggest that responsive joint attention mechanisms evoke higher perceived feelings of social presence on scales that regard the direction of the robot’s perception.
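
To make the idea of responsive joint attention more concrete, here is a minimal Python sketch of a gaze policy that fuses multimodal cues and redirects the robot's gaze accordingly. The interfaces, the cue priority order, and all names (Percept, joint_attention_target, gaze_behavior) are illustrative assumptions, not the system described in the paper.

```python
# Minimal sketch of a responsive joint-attention gaze policy (hypothetical
# interfaces; the paper's actual system and APIs are not specified here).

from dataclasses import dataclass
from typing import Optional

@dataclass
class Percept:
    user_gaze_target: Optional[str]      # object id the user is looking at, if any
    user_pointing_target: Optional[str]  # object id the user is pointing at, if any
    mentioned_object: Optional[str]      # object referenced in the user's speech, if any


def joint_attention_target(p: Percept) -> Optional[str]:
    """Fuse multimodal cues into a single attention target.

    Priority order (an assumption for illustration): speech reference,
    then pointing, then gaze.
    """
    return p.mentioned_object or p.user_pointing_target or p.user_gaze_target


def gaze_behavior(p: Percept) -> str:
    """Decide where the robot should look on each perception update."""
    target = joint_attention_target(p)
    if target is not None:
        return f"look_at:{target}"   # share attention with the user
    return "look_at:user_face"       # otherwise maintain mutual gaze
```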

  • 3. Arnela, Marc
    et al.
    Dabbaghchian, Saeed
    Guasch, Oriol
    Engwall, Olov
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    MRI-based vocal tract representations for the three-dimensional finite element synthesis of diphthongs (2019). In: IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, E-ISSN 1558-7924, Vol. 27, no. 12, p. 2173-2182. Article in journal (Refereed)
    Abstract [en]

    The synthesis of diphthongs in three dimensions (3D) involves the simulation of acoustic waves propagating through a complex 3D vocal tract geometry that deforms over time. Accurate 3D vocal tract geometries can be extracted from Magnetic Resonance Imaging (MRI), but due to long acquisition times, only static sounds can currently be studied with an adequate spatial resolution. In this work, 3D dynamic vocal tract representations are built to generate diphthongs, based on a set of cross-sections extracted from MRI-based vocal tract geometries of static vowel sounds. A diphthong can then be easily generated by interpolating the location, orientation and shape of these cross-sections, thus avoiding the interpolation of full 3D geometries. Two options are explored to extract the cross-sections. The first one is based on an adaptive grid (AG), which extracts the cross-sections perpendicular to the vocal tract midline, whereas the second one resorts to a semi-polar grid (SPG) strategy, which fixes the cross-section orientations. The finite element method (FEM) has been used to solve the mixed wave equation and synthesize diphthongs [ɑi] and [ɑu] in the dynamic 3D vocal tracts. The outputs from a 1D acoustic model based on the Transfer Matrix Method have also been included for comparison. The results show that the SPG and AG provide very close solutions in 3D, whereas significant differences are observed when using them in 1D. The SPG dynamic vocal tract representation is recommended for 3D simulations because it helps to prevent the collision of adjacent cross-sections.
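
As a rough illustration of the cross-section interpolation idea described above, the sketch below blends two static vocal-tract configurations over time. The data layout, the linear blending, and the array shapes are assumptions made for illustration; they are not the paper's actual representation or implementation.

```python
# A minimal numerical sketch of interpolating vocal-tract cross-sections
# between two static vowel configurations (illustrative assumptions only).

import numpy as np

def interpolate_cross_sections(sections_a, sections_b, t):
    """Blend two static vocal-tract configurations at time fraction t in [0, 1].

    sections_a, sections_b: arrays of shape (N, 4) holding, per cross-section,
    a midline position (x, y, z) and the section area, extracted from the
    MRI-based geometries of two static vowels (e.g. [ɑ] and [i]).
    """
    return (1.0 - t) * sections_a + t * sections_b

# Example: a coarse [ɑi]-like trajectory sampled at 5 time steps.
sections_a = np.random.rand(44, 4)   # stand-in for the [ɑ] cross-sections
sections_i = np.random.rand(44, 4)   # stand-in for the [i] cross-sections
trajectory = [interpolate_cross_sections(sections_a, sections_i, t)
              for t in np.linspace(0.0, 1.0, 5)]
# In the paper, the time-varying cross-sections define the deforming 3D
# geometry that the FEM acoustic simulation is run on.
```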

  • 4.
    Götze, Jana
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Boye, Johan
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    SpaceRef: a corpus of street-level geographic descriptions (2016). In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 2016, p. 3822-3827. Conference paper (Refereed)
    Abstract [en]

    This article describes SPACEREF, a corpus of street-level geographic descriptions. Pedestrians are walking a route in a (real) urban environment, describing their actions. Their position is automatically logged, their speech is manually transcribed, and their references to objects are manually annotated with respect to a crowdsourced geographic database. We describe how the data was collected and annotated, and how it has been used in the context of creating resources for an automatic pedestrian navigation system.
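
The abstract does not specify the corpus file format; purely as an illustration of the kind of record such a corpus might contain (logged position, transcript, annotated references), here is a hypothetical Python layout. All field names, and the OpenStreetMap-style identifier, are assumptions rather than the actual SpaceRef schema.

```python
# Hypothetical record layout for a street-level description corpus
# (illustrative only; not the actual SpaceRef data format).

from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectReference:
    span: str      # transcribed referring expression, e.g. "the church on the left"
    osm_id: int    # id of the referenced entry in the crowdsourced geographic
                   # database (an OpenStreetMap-style id is assumed here)

@dataclass
class CorpusSegment:
    timestamp: float     # seconds into the walk
    latitude: float      # automatically logged position
    longitude: float
    transcript: str      # manually transcribed speech
    references: List[ObjectReference] = field(default_factory=list)
```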

  • 5.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Abelho Pereira, André Tiago
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Sahindal, Boran
    KTH.
    van Waveren, Sanne
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
    Behavioural Responses to Robot Conversational Failures (2020). Conference paper (Refereed)
    Abstract [en]

    Humans and robots will increasingly collaborate in domestic environments, which will cause users to encounter more failures in interactions. Robots should be able to infer conversational failures by detecting human users’ behavioural and social signals. In this paper, we study and analyse these behavioural cues in response to robot conversational failures. Using a guided task corpus, where robot embodiment and time pressure are manipulated, we ask human annotators to estimate whether user affective states differ during various types of robot failures. We also train a random forest classifier to detect whether a robot failure has occurred and compare results to human annotator benchmarks. Our findings show that human-like robots augment users’ reactions to failures, as shown in users’ visual attention, in comparison to non-human-like smart-speaker embodiments. The results further suggest that speech behaviours are utilised more in responses to failures when non-human-like designs are present. This is particularly important for robot failure detection mechanisms that may need to consider the robot’s physical design in their failure detection models.
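
Below is a minimal sketch of the classification setup mentioned above, i.e. a random forest predicting whether a robot failure occurred from behavioural cues. The feature names and the data are random placeholders, not the study's actual feature set or corpus.

```python
# Sketch of training a failure detector on behavioural features
# (placeholder data; assumptions only).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder features per interaction window, e.g. gaze-on-robot ratio,
# speech activity, response delay, facial action intensity.
X = rng.random((200, 4))
y = rng.integers(0, 2, size=200)   # 1 = a robot failure occurred in the window

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```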

  • 6.
    Näslund, Per
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Artificial Neural Networks in Swedish Speech Synthesis (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Text-to-speech (TTS) systems have entered our daily lives in the form of smart assistants and many other applications. Contemporary research applies machine learning and artificial neural networks (ANNs) to synthesize speech. It has been shown that these systems outperform the older concatenative and parametric methods.

    In this paper, ANN-based methods for speech synthesis are explored and one of the methods is implemented for the Swedish language. The implemented method is dubbed “Tacotron” and is a first step towards end-to-end ANN-based TTS which puts many different ANN techniques to work. The resulting system is compared to a parametric TTS through a strength-of-preference test that is carried out with 20 Swedish-speaking subjects. A statistically significant preference for the ANN-based TTS is found. Test subjects indicate that the ANN-based TTS performs better than the parametric TTS when it comes to audio quality and naturalness but sometimes lacks in intelligibility.
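
For the listening-test result, one simple way to check significance of a binary preference outcome is a two-sided binomial (sign) test, sketched below. The count of 16 preferring subjects is a made-up placeholder, and the thesis's actual strength-of-preference analysis may use a different test.

```python
# Illustrative significance check for a two-system preference test
# (a two-sided binomial/sign test; the counts here are placeholders).

from scipy.stats import binomtest

n_subjects = 20          # listeners comparing ANN-based vs. parametric TTS
prefer_ann = 16          # hypothetical count preferring the ANN-based system

result = binomtest(prefer_ann, n=n_subjects, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")   # p < 0.05 would indicate a significant preference
```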

  • 7.
    Stefanov, Kalin
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology. University of Southern California.
    Salvi, Giampiero (Contributor)
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition (2019). In: IEEE Transactions on Cognitive and Developmental Systems, ISSN 2379-8920. Article in journal (Refereed)
    Abstract [en]

    This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their face. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker dependent setting. However, in a speaker independent setting the proposed method yields a significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
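
A conceptual sketch of the self-supervision scheme described above: the auditory modality provides training targets (per-person voice activity) for a purely visual active-speaker classifier, so no manual annotation is needed. The energy-threshold "VAD", the feature shapes, and the logistic-regression stand-in are all assumptions for illustration, not the paper's actual model.

```python
# Sketch: train a visual active-speaker classifier with audio-derived labels
# (self-supervision; placeholder data and a toy VAD).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def audio_voice_activity(audio_frame: np.ndarray) -> int:
    """Toy stand-in for acoustic voice-activity detection on one person's channel."""
    return int(np.mean(np.abs(audio_frame)) > 0.5)

# Placeholder data: per-frame visual face features and the matching audio frames.
face_features = rng.random((500, 32))           # e.g. mouth-region descriptors
audio_frames = rng.random((500, 160)) * 2 - 1   # raw audio chunks, one per video frame

# Self-supervision: the audio modality provides the training targets.
targets = np.array([audio_voice_activity(a) for a in audio_frames])

visual_model = LogisticRegression(max_iter=1000).fit(face_features, targets)
print("training accuracy:", visual_model.score(face_features, targets))
```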
