1 - 22 of 22
  • 1.
    Al Moubayed, Samer
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Granström, Björn
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    A robotic head using projected animated faces (2011). In: Proceedings of the International Conference on Audio-Visual Speech Processing 2011 / [ed] Salvi, G.; Beskow, J.; Engwall, O.; Al Moubayed, S., Stockholm: KTH Royal Institute of Technology, 2011, p. 71. Conference paper (Refereed)
    Abstract [en]

    This paper presents a setup which employs virtual animated agents for robotic heads. The system uses a laser projector to project animated faces onto a three-dimensional face mask. This approach of projecting animated faces onto a three-dimensional head surface, as an alternative to using flat, two-dimensional surfaces, eliminates several deteriorating effects and illusions that come with flat surfaces for interaction purposes, such as exclusive mutual gaze and situated, multi-partner dialogues. In addition, it provides robotic heads with a flexible solution for facial animation that takes advantage of the advancements of facial animation using computer graphics over mechanically controlled heads.

  • 2.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Performance, Processing and Perception of Communicative Motion for Avatars and Agents (2017). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Artificial agents and avatars are designed with a large variety of face and body configurations. Some of these (such as virtual characters in films) may be highly realistic and human-like, while others (such as social robots) have considerably more limited expressive means. In both cases, human motion serves as the model and inspiration for the non-verbal behavior displayed. This thesis focuses on increasing the expressive capacities of artificial agents and avatars using two main strategies: 1) improving the automatic capturing of the most communicative areas for human communication, namely the face and the fingers, and 2) increasing communication clarity by proposing novel ways of eliciting clear and readable non-verbal behavior.

    The first part of the thesis covers automatic methods for capturing and processing motion data. In paper A, we propose a novel dual sensor method for capturing hands and fingers using optical motion capture in combination with low-cost instrumented gloves. The approach circumvents the main problems with marker-based systems and glove-based systems, and it is demonstrated and evaluated on a key-word signing avatar. In paper B, we propose a robust method for automatic labeling of sparse, non-rigid motion capture marker sets, and we evaluate it on a variety of marker configurations for finger and facial capture. In paper C, we propose an automatic method for annotating hand gestures using Hierarchical Hidden Markov Models (HHMMs).

    The second part of the thesis covers studies on creating and evaluating multimodal databases with clear and exaggerated motion. The main idea is that this type of motion is appropriate for agents under certain communicative situations (such as noisy environments) or for agents with reduced expressive degrees of freedom (such as humanoid robots). In paper D, we record motion capture data for a virtual talking head with variable articulation style (normal-to-over articulated). In paper E, we use techniques from mime acting to generate clear non-verbal expressions custom tailored for three agent embodiments (face-and-body, face-only and body-only).

  • 3.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions (2014). In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no 2, p. 607-618. Article in journal (Refereed)
    Abstract [en]

    In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.
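
    A note on the error-minimization step: the abstract does not specify the procedure, so the following is only a hedged sketch of one common way to drive an animated face from captured markers, namely fitting per-frame blendshape weights by linear least squares. The function and variable names (fit_blendshape_weights, basis, neutral) are hypothetical and not taken from the paper.

    ```python
    import numpy as np

    def fit_blendshape_weights(basis, neutral, markers):
        """Least-squares fit of blendshape weights w so that
        neutral + basis @ w approximates the captured marker positions."""
        w, *_ = np.linalg.lstsq(basis, markers - neutral, rcond=None)
        return np.clip(w, 0.0, 1.0)  # keep weights in a plausible animation range

    # toy usage with synthetic data (10 markers x 3 coordinates, 5 blendshapes)
    rng = np.random.default_rng(0)
    B = rng.normal(size=(30, 5))
    neutral = rng.normal(size=30)
    frame = neutral + B @ np.array([0.2, 0.0, 0.7, 0.1, 0.0])
    print(fit_blendshape_weights(B, neutral, frame))  # recovers the weights above
    ```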

  • 4.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Can Anybody Read Me? Motion Capture Recordings for an Adaptable Visual Speech Synthesizer (2012). In: Proceedings of The Listening Talker, Edinburgh, UK, 2012, p. 52-52. Conference paper (Refereed)
  • 5.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar (2015). In: ACM Transactions on Accessible Computing, ISSN 1936-7228, Vol. 7, no 2, p. 7:1-7:17. Article in journal (Refereed)
    Abstract [en]

    Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.

  • 6.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Aspects of co-occurring syllables and head nods in spontaneous dialogue (2013). In: Proceedings of the 12th International Conference on Auditory-Visual Speech Processing (AVSP2013), 2013, p. 169-172. Conference paper (Refereed)
    Abstract [en]

    This paper reports on the extraction and analysis of head nods taken from motion capture data of spontaneous dialogue in Swedish. The head nods were extracted automatically and then manually classified in terms of gestures having a beat function or multifunctional gestures. Prosodic features were extracted from syllables co-occurring with the beat gestures. While the peak rotation of the nod is on average aligned with the stressed syllable, the results show considerable variation in fine temporal synchronization. The syllables co-occurring with the gestures generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. A functional analysis shows that the majority of the syllables belong to words bearing a focal accent.
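
    To make the prosodic measurements above concrete, here is a hedged sketch of extracting mean intensity, mean F0 and F0 range for syllable windows with librosa's pYIN tracker; the syllable boundaries passed in are hypothetical placeholders and this is not the pipeline used in the study.

    ```python
    import numpy as np
    import librosa

    def syllable_prosody(wav_path, syllables, sr=16000):
        """Mean intensity (dB), mean F0 (Hz) and F0 range (Hz) per (start, end) window in seconds."""
        y, sr = librosa.load(wav_path, sr=sr)
        f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
        rms = librosa.feature.rms(y=y)[0]
        t_f0 = librosa.times_like(f0, sr=sr)
        t_rms = librosa.times_like(rms, sr=sr)
        results = []
        for start, end in syllables:
            f0_win = f0[voiced & (t_f0 >= start) & (t_f0 <= end)]
            rms_win = rms[(t_rms >= start) & (t_rms <= end)]
            results.append({
                "intensity_db": float(20 * np.log10(rms_win.mean() + 1e-10)),
                "f0_mean": float(np.nanmean(f0_win)) if f0_win.size else float("nan"),
                "f0_range": float(np.nanmax(f0_win) - np.nanmin(f0_win)) if f0_win.size else float("nan"),
            })
        return results

    # hypothetical syllable boundaries co-occurring with detected head nods:
    # syllable_prosody("dialogue.wav", [(1.20, 1.45), (3.02, 3.30)])
    ```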

  • 7.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Automatic annotation of gestural units in spontaneous face-to-face interaction (2016). In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, 2016, p. 15-19. Conference paper (Refereed)
    Abstract [en]

    Speech and gesture co-occur in spontaneous dialogue in a highly complex fashion. There is a large variability in the motion that people exhibit during a dialogue, and different kinds of motion occur during different states of the interaction. A wide range of multimodal interface applications, for example in the fields of virtual agents or social robots, can be envisioned where it is important to be able to automatically identify gestures that carry information and discriminate them from other types of motion. While it is easy for a human to distinguish and segment manual gestures from a flow of multimodal information, the same task is not trivial to perform for a machine. In this paper we present a method to automatically segment and label gestural units from a stream of 3D motion capture data. The gestural flow is modeled with a 2-level Hierarchical Hidden Markov Model (HHMM) where the sub-states correspond to gesture phases. The model is trained based on labels of complete gesture units and self-adaptive manipulators. The model is tested and validated on two datasets differing in genre and in method of capturing motion, and outperforms a state-of-the-art SVM classifier on a publicly available dataset.
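
    As a much-simplified, hedged illustration of HMM-based segmentation of a motion stream (the paper uses a 2-level hierarchical model with gesture-phase sub-states, which is not reproduced here), the sketch below fits a flat Gaussian HMM with hmmlearn on synthetic per-frame features and reads the decoded state sequence as segment labels.

    ```python
    import numpy as np
    from hmmlearn import hmm

    # synthetic stand-in for per-frame motion features (e.g. hand speed, wrist height)
    rng = np.random.default_rng(1)
    rest = rng.normal(0.0, 0.1, size=(200, 2))
    stroke = rng.normal(1.5, 0.3, size=(80, 2))
    X = np.vstack([rest, stroke, rest])

    # flat 2-state Gaussian HMM; states are intended to align with rest vs. gesture activity
    model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50, random_state=0)
    model.fit(X)
    states = model.predict(X)  # Viterbi decoding into per-frame labels

    # collapse consecutive identical states into (start_frame, end_frame, state) segments
    cuts = np.flatnonzero(np.diff(states)) + 1
    segments = [(int(s[0]), int(s[-1]), int(states[s[0]])) for s in np.split(np.arange(len(states)), cuts)]
    print(segments)
    ```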

  • 8.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Extracting and analysing co-speech head gestures from motion-capture data (2013). In: Proceedings of Fonetik 2013 / [ed] Eklund, Robert, Linköping University Electronic Press, 2013, p. 1-4. Conference paper (Refereed)
  • 9.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Extracting and analyzing head movements accompanying spontaneous dialogue (2013). In: Conference Proceedings TiGeR 2013: Tilburg Gesture Research Meeting, 2013. Conference paper (Refereed)
    Abstract [en]

    This paper reports on a method developed for extracting and analyzing head gestures taken from motion capture data of spontaneous dialogue in Swedish. Candidate head gestures with beat function were extracted automatically and then manually classified using a 3D player which displays time-synced audio and 3D point data of the motion capture markers together with animated characters. Prosodic features were extracted from syllables co-occurring with a subset of the classified gestures. The beat gestures show considerable variation in temporal synchronization with the syllables, while the syllables generally show greater intensity, higher F0, and greater F0 range when compared to the mean across the entire dialogue. Additional features for further analysis and automatic classification of the head gestures are discussed.
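
    The automatic extraction step is only summarized above; purely as an illustration, candidate nods can be picked out of a head-pitch trace with simple peak detection, as in the hedged sketch below. The thresholds and the assumption that a nod appears as a prominent downward dip are ours, not the paper's.

    ```python
    import numpy as np
    from scipy.signal import find_peaks

    def detect_nod_candidates(pitch_deg, fps=120.0, min_depth_deg=3.0, min_interval_s=0.3):
        """Times (s) and depths of nod-like dips in a per-frame head-pitch trace (degrees)."""
        # a downward nod shows up as a prominent peak in the negated pitch signal
        peaks, props = find_peaks(-np.asarray(pitch_deg),
                                  prominence=min_depth_deg,
                                  distance=int(min_interval_s * fps))
        return peaks / fps, props["prominences"]

    # toy trace: neutral head pose with two quick dips around 1.0 s and 3.2 s
    t = np.arange(0, 5, 1 / 120.0)
    trace = -4.0 * np.exp(-((t - 1.0) ** 2) / 0.01) - 5.0 * np.exp(-((t - 3.2) ** 2) / 0.01)
    print(detect_nod_candidates(trace))
    ```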

  • 10.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    O'Sullivan, C.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Robust online motion capture labeling of finger markers (2016). In: Proceedings - Motion in Games 2016: 9th International Conference on Motion in Games, MIG 2016, ACM Digital Library, 2016, p. 7-13. Conference paper (Refereed)
    Abstract [en]

    Passive optical motion capture is one of the predominant technologies for capturing high fidelity human skeletal motion, and is a workhorse in a large number of areas such as bio-mechanics, film and video games. While most state-of-the-art systems can automatically identify and track markers on the larger parts of the human body, the markers attached to fingers provide unique challenges and usually require extensive manual cleanup. In this work we present a robust online method for identification and tracking of passive motion capture markers attached to the fingers of the hands. The method is especially suited for large capture volumes and sparse marker sets of 3 to 10 markers per hand. Once trained, our system can automatically initialize and track the markers, and the subject may exit and enter the capture volume at will. By using multiple assignment hypotheses and soft decisions, it can robustly recover from a difficult situation with many simultaneous occlusions and false observations (ghost markers). We evaluate the method on a collection of sparse marker sets commonly used in industry and in the research community. We also compare the results with two of the most widely used motion capture platforms: Motion Analysis Cortex and Vicon Blade. The results show that our method is better at attaining correct marker labels and is especially beneficial for real-time applications.

  • 11.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    O'Sullivan, Carol
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Real-time labeling of non-rigid motion capture marker sets (2017). In: Computers & graphics, ISSN 0097-8493, E-ISSN 1873-7684, Vol. 69, no Supplement C, p. 59-67. Article in journal (Refereed)
    Abstract [en]

    Passive optical motion capture is one of the predominant technologies for capturing high fidelity human motion, and is a workhorse in a large number of areas such as bio-mechanics, film and video games. While most state-of-the-art systems can automatically identify and track markers on the larger parts of the human body, the markers attached to the fingers and face provide unique challenges and usually require extensive manual cleanup. In this work we present a robust online method for identification and tracking of passive motion capture markers attached to non-rigid structures. The method is especially suited for large capture volumes and sparse marker sets. Once trained, our system can automatically initialize and track the markers, and the subject may exit and enter the capture volume at will. By using multiple assignment hypotheses and soft decisions, it can robustly recover from a difficult situation with many simultaneous occlusions and false observations (ghost markers). In three experiments, we evaluate the method for labeling a variety of marker configurations for finger and facial capture. We also compare the results with two of the most widely used motion capture platforms: Motion Analysis Cortex and Vicon Blade. The results show that our method is better at attaining correct marker labels and is especially beneficial for real-time applications.
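
    The published method keeps multiple assignment hypotheses with soft decisions; as a deliberately simplified, single-hypothesis sketch of the core labeling step, the code below matches incoming unlabeled detections to the labels' predicted positions with the Hungarian algorithm and discards matches beyond a gating distance. Names and thresholds are illustrative only.

    ```python
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def assign_labels(predicted, detections, max_dist=0.03):
        """Map label index -> detection index for one frame.

        predicted  : (L, 3) predicted positions of the L labeled markers (metres)
        detections : (D, 3) unlabeled detections, possibly with ghosts or occlusions
        """
        cost = np.linalg.norm(predicted[:, None, :] - detections[None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)  # minimizes total matching distance
        return {int(r): int(c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist}

    # toy frame: three finger markers, one ghost detection, one marker occluded
    pred = np.array([[0.00, 0.0, 0.0], [0.02, 0.0, 0.0], [0.04, 0.0, 0.0]])
    det = np.array([[0.021, 0.001, 0.0], [0.50, 0.50, 0.50], [0.001, -0.001, 0.0]])
    print(assign_labels(pred, det))  # marker 2 stays unlabeled; the ghost is gated out
    ```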

  • 12.
    Alexanderson, Simon
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    O'Sullivan, Carol
    Neff, Michael
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Mimebot—Investigating the Expressibility of Non-Verbal Communication Across Agent Embodiments (2017). In: ACM Transactions on Applied Perception, ISSN 1544-3558, E-ISSN 1544-3965, Vol. 14, no 4, article id 24. Article in journal (Refereed)
    Abstract [en]

    Unlike their human counterparts, artificial agents such as robots and game characters may be deployed with a large variety of face and body configurations. Some have articulated bodies but lack facial features, and others may be talking heads ending at the neck. Generally, they have many fewer degrees of freedom than humans through which they must express themselves, and there will inevitably be a filtering effect when mapping human motion onto the agent. In this article, we investigate filtering effects on three types of embodiments: (a) an agent with a body but no facial features, (b) an agent with a head only, and (c) an agent with a body and a face. We performed a full performance capture of a mime actor enacting short interactions varying the non-verbal expression along five dimensions (e.g., level of frustration and level of certainty) for each of the three embodiments. We performed a crowd-sourced evaluation experiment comparing the video of the actor to the video of an animated robot for the different embodiments and dimensions. Our findings suggest that the face is especially important to pinpoint emotional reactions but is also most volatile to filtering effects. The body motion, on the other hand, had more diverse interpretations but tended to preserve the interpretation after mapping and thus proved to be more resilient to filtering.

  • 13.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Al Moubayed, Samer
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Edlund, Jens
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation (2011). In: Proceedings of International Conference on Audio-Visual Speech Processing 2011 / [ed] Salvi, G.; Beskow, J.; Engwall, O.; Al Moubayed, S., Stockholm: KTH Royal Institute of Technology, 2011, p. 103-106. Conference paper (Refereed)
    Abstract [en]

    Spoken face-to-face interaction is a rich and complex form of communication that includes a wide array of phenomena that are not fully explored or understood. While there have been extensive studies on many aspects of face-to-face interaction, these are traditionally of a qualitative nature, relying on hand-annotated corpora, typically rather limited in extent, which is a natural consequence of the labour-intensive task of multimodal data annotation. In this paper we present a corpus of 60 hours of unrestricted Swedish face-to-face conversations recorded with audio, video and optical motion capture, and we describe a new project setting out to exploit primarily the kinetic data in this corpus in order to gain quantitative knowledge on human face-to-face interaction.

  • 14.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Stefanov, Kalin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Claesson, Britt
    Derbring, Sandra
    Fredriksson, Morgan
    The Tivoli System - A Sign-driven Game for Children with Communicative Disorders (2013). Conference paper (Refereed)
  • 15.
    Beskow, Jonas
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Stefanov, Kalin
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Claesson, Britt
    Derbring, Sandra
    Fredriksson, Morgan
    Starck, J.
    Axelsson, E.
    Tivoli - Learning Signs Through Games and Interaction for Children with Communicative Disorders (2014). Conference paper (Refereed)
  • 16.
    Edlund, Jens
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Gustavsson, Lisa
    Heldner, Mattias
    (Stockholm University, Faculty of Humanities, Department of Linguistics, Phonetics).
    Hjalmarsson, Anna
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Kallionen, Petter
    Marklund, Ellen
    3rd party observer gaze as a continuous measure of dialogue flow (2012). In: LREC 2012 - Eighth International Conference On Language Resources And Evaluation, Istanbul, Turkey: European Language Resources Association, 2012, p. 1354-1358. Conference paper (Refereed)
    Abstract [en]

    We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency of speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication. In addition, the results also suggest that there might be differences in the distribution of 3rd party observer gaze depending on how information-rich an utterance is.

  • 17.
    Frid, Emma
    et al.
    KTH, School of Computer Science and Communication (CSC), Media Technology and Interaction Design, MID.
    Bresin, Roberto
    KTH, School of Computer Science and Communication (CSC), Media Technology and Interaction Design, MID.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Perception of Mechanical Sounds Inherent to Expressive Gestures of a NAO Robot - Implications for Movement Sonification of Humanoids (2018). In: Proceedings of the 15th Sound and Music Computing Conference / [ed] Anastasia Georgaki and Areti Andreopoulou, Limassol, Cyprus, 2018. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a pilot study carried out within the project SONAO. The SONAO project aims to compensate for limitations in robot communicative channels with an increased clarity of Non-Verbal Communication (NVC) through expressive gestures and non-verbal sounds. More specifically, the purpose of the project is to use movement sonification of expressive robot gestures to improve Human-Robot Interaction (HRI). The pilot study described in this paper focuses on mechanical robot sounds, i.e. sounds that have not been specifically designed for HRI but are inherent to robot movement. Results indicated a low correspondence between perceptual ratings of mechanical robot sounds and emotions communicated through gestures. In general, the mechanical sounds themselves appeared not to carry much emotional information compared to video stimuli of expressive gestures. However, some mechanical sounds did communicate certain emotions, e.g. frustration. In general, the sounds appeared to communicate arousal more effectively than valence. We discuss potential issues and possibilities for the sonification of expressive robot gestures and the role of mechanical sounds in such a context. Emphasis is put on the need to mask or alter sounds inherent to robot movement, using for example blended sonification.

  • 18.
    House, David
    et al.
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Beskow, Jonas
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    On the temporal domain of co-speech gestures: syllable, phrase or talk spurt? (2015). In: Proceedings of Fonetik 2015 / [ed] Lundmark Svensson, M.; Ambrazaitis, G.; van de Weijer, J., 2015, p. 63-68. Conference paper (Other academic)
    Abstract [en]

    This study explores the use of automatic methods to detect and extract hand gesture movement co-occurring with speech. Two spontaneous dyadic dialogues were analyzed using 3D motion-capture techniques to track hand movement. Automatic speech/non-speech detection was performed on the dialogues, resulting in a series of connected talk spurts for each speaker. Temporal synchrony of onset and offset of gesture and speech was studied between the automatic hand gesture tracking and talk spurts, and compared to an earlier study of head nods and syllable synchronization. The results indicated onset synchronization between head nods and the syllable in the short temporal domain and between the onset of longer gesture units and the talk spurt in a more extended temporal domain.
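
    For illustration only, the speech/non-speech detection that yields talk spurts can be approximated with an energy threshold and a minimum-duration rule, as in the hedged sketch below; the threshold values are arbitrary and this is not the detector used in the study.

    ```python
    import numpy as np
    import librosa

    def talk_spurts(wav_path, sr=16000, db_threshold=-35.0, min_dur=0.25):
        """Return (start, end) times in seconds of energy-based speech segments."""
        y, sr = librosa.load(wav_path, sr=sr)
        rms = librosa.feature.rms(y=y, frame_length=1024, hop_length=256)[0]
        db = librosa.amplitude_to_db(rms, ref=np.max)
        times = librosa.times_like(rms, sr=sr, hop_length=256)
        spurts, start = [], None
        for t, active in zip(times, db > db_threshold):
            if active and start is None:
                start = t
            elif not active and start is not None:
                if t - start >= min_dur:
                    spurts.append((float(start), float(t)))
                start = None
        if start is not None and times[-1] - start >= min_dur:
            spurts.append((float(start), float(times[-1])))
        return spurts

    # talk_spurts("speaker_A.wav")  # hypothetical per-speaker channel
    ```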

  • 19.
    Karipidou, Kelly
    et al.
    KTH, School of Computer Science and Communication (CSC), Robotics, perception and learning, RPL.
    Ahnlund, Josefin
    KTH, School of Computer Science and Communication (CSC), Robotics, perception and learning, RPL.
    Friberg, Anders
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology.
    Kjellström, Hedvig
    KTH, School of Computer Science and Communication (CSC), Robotics, perception and learning, RPL.
    Computer Analysis of Sentiment Interpretation in Musical Conducting (2017). In: Proceedings - 12th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2017, IEEE, 2017, p. 400-405, article id 7961769. Conference paper (Refereed)
    Abstract [en]

    This paper presents a unique dataset consisting of 20 recordings of the same musical piece, conducted with 4 different musical intentions in mind. The upper body and baton motion of a professional conductor was recorded, as well as the sound of each instrument in a professional string quartet following the conductor. The dataset is made available for benchmarking of motion recognition algorithms. An HMM-based emotion intent classification method is trained with subsets of the data, and classification of other subsets of the data shows, firstly, that the motion of the baton communicates energetic intention to a high degree; secondly, that the conductor's torso, head and other arm convey calm intention to a high degree; and thirdly, that positive vs negative sentiments are communicated to a high degree through other channels than the body and baton motion, most probably through facial expression and muscle tension conveyed through articulated hand and finger motion. The long-term goal of this work is to develop a computer model of the entire conductor-orchestra communication process; the studies presented here indicate that computer modeling of the conductor-orchestra communication is feasible.

  • 20.
    Kontogiorgos, Dimosthenis
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Avramova, Vanya
    KTH.
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Jonell, Patrik
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Oertel, Catharine
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH. Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Skantze, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Gustafson, Joakim
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    A Multimodal Corpus for Mutual Gaze and Joint Attention in Multiparty Situated Interaction (2018). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, 2018, p. 119-127. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a corpus of multiparty situated interaction where participants collaborated on moving virtual objects on a large touch screen. A moderator facilitated the discussion and directed the interaction. The corpus contains recordings of a variety of multimodal data, in that we captured speech, eye gaze and gesture data using a multisensory setup (wearable eye trackers, motion capture and audio/video). Furthermore, in the description of the multimodal corpus, we investigate four different types of social gaze: referential gaze, joint attention, mutual gaze and gaze aversion, from the perspectives of both speaker and listener. We annotated the groups' object references during object manipulation tasks and analysed the groups' proportional referential eye-gaze with regard to the referent object. When investigating the distributions of gaze during and before referring expressions we could corroborate the differences in time between speakers' and listeners' eye gaze found in earlier studies. This corpus is of particular interest to researchers who are interested in social eye-gaze patterns in turn-taking and referring language in situated multi-party interaction.

  • 21.
    Vijayan, Aravind Elanjimattathil
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Alexanderson, Simon
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Beskow, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH.
    Leite, Iolanda
    KTH, School of Electrical Engineering and Computer Science (EECS), Robotics, Perception and Learning, RPL.
    Using Constrained Optimization for Real-Time Synchronization of Verbal and Nonverbal Robot Behavior (2018). In: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE Computer Society, 2018, p. 1955-1961. Conference paper (Refereed)
    Abstract [en]

    Most of the motion re-targeting techniques are grounded on virtual character animation research, which means that they typically assume that the target embodiment has unconstrained joint angular velocities. However, because robots often do have such constraints, traditional re-targeting approaches can originate irregular delays in the robot motion. With the goal of ensuring synchronization between verbal and nonverbal behavior, this paper proposes an optimization framework for processing re-targeted motion sequences that addresses constraints such as joint angle and angular velocities. The proposed framework was evaluated on a humanoid robot using both objective and subjective metrics. While the analysis of the joint motion trajectories provides evidence that our framework successfully performs the desired modifications to ensure verbal and nonverbal behavior synchronization, results from a perceptual study showed that participants found the robot motion generated by our method more natural, elegant and lifelike than a control condition.
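
    To give a concrete, hedged flavour of the kind of constrained optimization described above, the sketch below solves a single-joint toy version: stay close to a re-targeted trajectory while respecting joint-angle bounds and an angular-velocity limit, using SciPy's SLSQP. The limits, trajectory and formulation are illustrative and do not reproduce the paper's framework.

    ```python
    import numpy as np
    from scipy.optimize import minimize

    def retarget_joint(q_target, dt, q_min, q_max, v_max):
        """Trajectory close to q_target (rad) under angle bounds and a velocity limit (rad/s)."""
        def objective(q):
            return np.sum((q - q_target) ** 2)

        def velocity_margin(q):
            return v_max - np.abs(np.diff(q)) / dt  # must be >= 0 elementwise

        res = minimize(objective,
                       x0=np.clip(q_target, q_min, q_max),
                       method="SLSQP",
                       bounds=[(q_min, q_max)] * len(q_target),
                       constraints=[{"type": "ineq", "fun": velocity_margin}])
        return res.x

    # toy example: a step in the target that violates a 2 rad/s velocity limit
    t = np.linspace(0, 1, 51)
    target = np.where(t < 0.5, 0.0, 1.5)
    smooth = retarget_joint(target, dt=t[1] - t[0], q_min=-1.0, q_max=1.2, v_max=2.0)
    print(smooth.max(), np.abs(np.diff(smooth)).max() / (t[1] - t[0]))  # within limits
    ```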

  • 22.
    Zellers, M.
    et al.
    House, David
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Alexanderson, Simon
    KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
    Prosody and hand gesture at turn boundaries in Swedish (2016). In: Proceedings of the International Conference on Speech Prosody, International Speech Communication Association, 2016, p. 831-835. Conference paper (Refereed)
    Abstract [en]

    In order to ensure smooth turn-taking between conversational participants, interlocutors must have ways of providing information to one another about whether they have finished speaking or intend to continue. The current work investigates Swedish speakers’ use of hand gestures in conjunction with turn change or turn hold in unrestricted, spontaneous speech. As has been reported by other researchers, we find that speakers’ gestures end before the end of speech in cases of turn change, while they may extend well beyond the end of a given speech chunk in the case of turn hold. We investigate the degree to which prosodic cues and gesture cues to turn transition in Swedish face-to-face conversation are complementary versus functioning additively. The co-occurrence of acoustic prosodic features and gesture at potential turn boundaries gives strong support for considering hand gestures as part of the prosodic system, particularly in the context of discourse-level information such as maintaining smooth turn transition.
