Performance, Processing and Perception of Communicative Motion for Avatars and Agents
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-7801-7617
2017 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Artificial agents and avatars are designed with a large variety of face and body configurations. Some of these (such as virtual characters in films) may be highly realistic and human-like, while others (such as social robots) have considerably more limited expressive means. In both cases, human motion serves as the model and inspiration for the non-verbal behavior displayed. This thesis focuses on increasing the expressive capacities of artificial agents and avatars using two main strategies: 1) improving the automatic capture of the most communicative areas of the human body, namely the face and the fingers, and 2) increasing communication clarity by proposing novel ways of eliciting clear and readable non-verbal behavior.

The first part of the thesis covers automatic methods for capturing and processing motion data. In paper A, we propose a novel dual sensor method for capturing hands and fingers using optical motion capture in combination with low-cost instrumented gloves. The approach circumvents the main problems with marker-based systems and glove-based systems, and it is demonstrated and evaluated on a key-word signing avatar. In paper B, we propose a robust method for automatic labeling of sparse, non-rigid motion capture marker sets, and we evaluate it on a variety of marker configurations for finger and facial capture. In paper C, we propose an automatic method for annotating hand gestures using Hierarchical Hidden Markov Models (HHMMs).

The second part of the thesis covers studies on creating and evaluating multimodal databases with clear and exaggerated motion. The main idea is that this type of motion is appropriate for agents under certain communicative situations (such as noisy environments) or for agents with reduced expressive degrees of freedom (such as humanoid robots). In paper D, we record motion capture data for a virtual talking head with variable articulation style (normal-to-over articulated). In paper E, we use techniques from mime acting to generate clear non-verbal expressions custom-tailored for three agent embodiments (face-and-body, face-only and body-only).

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2017, p. 73
Series
TRITA-CSC-A, ISSN 1653-5723; 24
National subject category
Computer and Information Sciences
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-218272
ISBN: 978-91-7729-608-9 (print)
OAI: oai:DiVA.org:kth-218272
DiVA id: diva2:1160154
Public defence
2017-12-15, F3, Lindstedtsvägen 26, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20171127

Available from: 2017-11-27 Created: 2017-11-24 Last updated: 2018-01-13 Bibliographically approved
List of papers
1. Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar
2015 (English) In: ACM Transactions on Accessible Computing, ISSN 1936-7228, Vol. 7, no. 2, pp. 7:1-7:17. Article in journal (Refereed) Published
Abstract [en]

Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.
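As an illustration of the dual-sensor idea described above, the sketch below combines a global hand pose (as optical motion capture would provide) with per-joint flexion angles (as an instrumented glove would provide) through simple forward kinematics. This is a minimal sketch only: the segment lengths, angle convention, and function names are assumptions, and the paper's actual pipeline additionally involves calibration and retargeting to the avatar.

```python
import numpy as np

def rot_z(theta):
    """Rotation about the local z-axis (finger flexion), as a 3x3 matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def fingertip_position(wrist_pos, wrist_rot, flexion_angles,
                       segment_lengths=(0.05, 0.03, 0.02)):
    """wrist_pos (3,), wrist_rot (3, 3): global hand pose from optical capture.
    flexion_angles: per-joint flexion in radians from the glove sensors.
    Chains one glove-measured rotation plus one bone translation per segment."""
    pos = np.asarray(wrist_pos, dtype=float)
    rot = np.asarray(wrist_rot, dtype=float)
    for angle, length in zip(flexion_angles, segment_lengths):
        rot = rot @ rot_z(angle)                          # joint rotation (glove)
        pos = pos + rot @ np.array([length, 0.0, 0.0])    # advance along the bone
    return pos

# Example: hand at the origin with identity orientation, finger slightly curled.
print(fingertip_position(np.zeros(3), np.eye(3), np.radians([20.0, 30.0, 15.0])))
```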

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2015
Keywords
Augmentative and alternative communication (AAC), Motion capture, Sign language, Virtual characters
National subject category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-180427 (URN)
10.1145/2764918 (DOI)
000360070800004 ()
2-s2.0-84935145760 (Scopus ID)
Note

 QC 2016-01-13

Available from: 2016-01-13 Created: 2016-01-13 Last updated: 2018-01-10 Bibliographically approved
2. Real-time labeling of non-rigid motion capture marker sets
2017 (English) In: Computers & Graphics, ISSN 0097-8493, E-ISSN 1873-7684, Vol. 69, Supplement C, pp. 59-67. Article in journal (Refereed) Published
Abstract [en]

Passive optical motion capture is one of the predominant technologies for capturing high fidelity human motion, and is a workhorse in a large number of areas such as bio-mechanics, film and video games. While most state-of-the-art systems can automatically identify and track markers on the larger parts of the human body, the markers attached to the fingers and face provide unique challenges and usually require extensive manual cleanup. In this work we present a robust online method for identification and tracking of passive motion capture markers attached to non-rigid structures. The method is especially suited for large capture volumes and sparse marker sets. Once trained, our system can automatically initialize and track the markers, and the subject may exit and enter the capture volume at will. By using multiple assignment hypotheses and soft decisions, it can robustly recover from a difficult situation with many simultaneous occlusions and false observations (ghost markers). In three experiments, we evaluate the method for labeling a variety of marker configurations for finger and facial capture. We also compare the results with two of the most widely used motion capture platforms: Motion Analysis Cortex and Vicon Blade. The results show that our method is better at attaining correct marker labels and is especially beneficial for real-time applications.
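The sketch below illustrates the core matching step in marker labeling, though in a much simplified form: a single-hypothesis Hungarian assignment between predicted label positions and observed markers, with a distance gate to reject ghost markers and occlusions. The paper's method instead maintains multiple assignment hypotheses with soft decisions; the function and parameter names here are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def label_frame(predicted, observed, max_dist=0.03):
    """predicted: (N, 3) expected positions of the N labeled markers.
    observed: (M, 3) unlabeled marker observations for the current frame.
    Returns a dict mapping label index -> observation index, or None when the
    best match is too far away (occlusion or ghost marker)."""
    cost = np.linalg.norm(predicted[:, None, :] - observed[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # globally optimal 1-to-1 matching
    labels = {}
    for r, c in zip(rows, cols):
        labels[int(r)] = int(c) if cost[r, c] <= max_dist else None
    return labels

# Example: three labels, one occluded marker, and one ghost observation.
predicted = np.array([[0.00, 0.0, 0.0], [0.10, 0.0, 0.0], [0.20, 0.0, 0.0]])
observed = np.array([[0.001, 0.0, 0.0], [0.21, 0.0, 0.0], [0.50, 0.5, 0.5]])
print(label_frame(predicted, observed))   # -> {0: 0, 1: None, 2: 1}
```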

Place, publisher, year, edition, pages
Elsevier, 2017
Keywords
Animation, Motion capture, Hand capture, Labeling
National subject category
Media Engineering
Identifiers
urn:nbn:se:kth:diva-218254 (URN)
10.1016/j.cag.2017.10.001 (DOI)
000418980500008 ()
2-s2.0-85032454324 (Scopus ID)
Note

QC 20171127

Available from: 2017-11-24 Created: 2017-11-24 Last updated: 2018-01-16 Bibliographically approved
3. Automatic annotation of gestural units in spontaneous face-to-face interaction
2016 (English) In: MA3HMI 2016 - Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, 2016, pp. 15-19. Conference paper, Published paper (Refereed)
Abstract [en]

Speech and gesture co-occur in spontaneous dialogue in a highly complex fashion. There is a large variability in the motion that people exhibit during a dialogue, and different kinds of motion occur during different states of the interaction. A wide range of multimodal interface applications, for example in the fields of virtual agents or social robots, can be envisioned where it is important to be able to automatically identify gestures that carry information and discriminate them from other types of motion. While it is easy for a human to distinguish and segment manual gestures from a flow of multimodal information, the same task is not trivial to perform for a machine. In this paper we present a method to automatically segment and label gestural units from a stream of 3D motion capture data. The gestural flow is modeled with a 2-level Hierarchical Hidden Markov Model (HHMM) where the sub-states correspond to gesture phases. The model is trained based on labels of complete gesture units and self-adaptive manipulators. The model is tested and validated on two datasets differing in genre and in method of capturing motion, and outperforms a state-of-the-art SVM classifier on a publicly available dataset.
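As a rough illustration of HMM-based gesture segmentation, the toy example below decodes a one-dimensional hand-speed stream into two states with a flat Gaussian HMM and reports the resulting segments. The paper's model is a two-level hierarchical HMM whose sub-states correspond to gesture phases and is trained from labeled gesture units; the flat, unsupervised model and the synthetic feature here are illustrative assumptions only.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Synthetic hand-speed feature: rest, then a gesture stroke, then rest again.
speed = np.concatenate([
    rng.normal(0.02, 0.01, 200),   # rest
    rng.normal(0.30, 0.05, 100),   # gesture stroke
    rng.normal(0.02, 0.01, 200),   # rest
]).reshape(-1, 1)

model = GaussianHMM(n_components=2, covariance_type="diag",
                    n_iter=50, random_state=0)
model.fit(speed)                    # unsupervised fit on the feature stream
states = model.predict(speed)       # Viterbi decoding -> per-frame state labels

# Report contiguous segments as (state, start frame, end frame).
changes = np.flatnonzero(np.diff(states)) + 1
bounds = np.concatenate([[0], changes, [len(states)]])
for s, e in zip(bounds[:-1], bounds[1:]):
    print(f"state {states[s]}: frames {s}-{e - 1}")
```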

Keywords
Gesture recognition, Motion capture, Spontaneous dialogue, Hidden Markov models, Man machine systems, Markov processes, Online systems, 3D motion capture, Automatic annotation, Face-to-face interaction, Hierarchical hidden markov models, Multi-modal information, Multi-modal interfaces, Classification (of information)
National subject category
Robotics
Identifiers
urn:nbn:se:kth:diva-202135 (URN)
10.1145/3011263.3011268 (DOI)
2-s2.0-85003571594 (Scopus ID)
9781450345620 (ISBN)
Conference
2016 Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI 2016, 12 November 2016 through 16 November 2016
Research funder
Swedish Research Council, 2010-4646
Note

Funding text: The work reported here is carried out within the projects: "Timing of intonation and gestures in spoken communication" (P12-0634:1), funded by the Bank of Sweden Tercentenary Foundation, and "Large-scale massively multimodal modelling of non-verbal behaviour in spontaneous dialogue" (VR 2010-4646), funded by the Swedish Research Council.

Available from: 2017-03-13 Created: 2017-03-13 Last updated: 2017-11-24 Bibliographically approved
4. Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions
2014 (English) In: Computer Speech & Language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 28, no. 2, pp. 607-618. Article in journal (Refereed) Published
Abstract [en]

In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with a similar shape and appearance to the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted.
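The abstract mentions an error-minimization procedure that drives the animated talker to match the captured performance, without giving the formulation. The sketch below shows one common way such a fit can be set up: solving, per frame, for non-negative blendshape weights that best reproduce the captured marker displacements in a least-squares sense. The shapes, names, and the non-negativity choice are assumptions, not the paper's actual method.

```python
import numpy as np
from scipy.optimize import nnls

def solve_blendshape_weights(neutral, blendshapes, markers):
    """neutral:     (M, 3) marker positions on the neutral face
    blendshapes: (K, M, 3) per-shape marker displacements from neutral
    markers:     (M, 3) captured marker positions for the current frame
    Returns (K,) non-negative blendshape weights minimizing the marker error."""
    B = blendshapes.reshape(blendshapes.shape[0], -1).T   # (3M, K) basis matrix
    d = (markers - neutral).reshape(-1)                   # (3M,) target displacement
    weights, _residual = nnls(B, d)
    return weights

# Tiny example with 2 blendshapes and 4 markers.
neutral = np.zeros((4, 3))
shape_a = np.zeros((4, 3)); shape_a[:, 0] = 0.1    # moves markers along x
shape_b = np.zeros((4, 3)); shape_b[:, 1] = 0.05   # moves markers along y
shapes = np.stack([shape_a, shape_b])
frame = neutral + 0.7 * shape_a + 0.2 * shape_b    # synthetic captured frame
print(solve_blendshape_weights(neutral, shapes, frame))   # -> approx. [0.7 0.2]
```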

Keywords
Lombard effect, Motion capture, Speech-reading, Lip-reading, Facial animation, Audio-visual intelligibility
National subject category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-141052 (URN)
10.1016/j.csl.2013.02.005 (DOI)
000329415400017 ()
2-s2.0-84890567121 (Scopus ID)
Research funder
Swedish Research Council, VR 2010-4646
Note

QC 20140212

Available from: 2014-02-12 Created: 2014-02-07 Last updated: 2018-01-11 Bibliographically approved
5. Mimebot—Investigating the Expressibility of Non-Verbal Communication Across Agent Embodiments
2017 (English) In: ACM Transactions on Applied Perception, ISSN 1544-3558, E-ISSN 1544-3965, Vol. 14, no. 4, article id 24. Article in journal (Refereed) Published
Abstract [en]

Unlike their human counterparts, artificial agents such as robots and game characters may be deployed with a large variety of face and body configurations. Some have articulated bodies but lack facial features, and others may be talking heads ending at the neck. Generally, they have many fewer degrees of freedom than humans through which they must express themselves, and there will inevitably be a filtering effect when mapping human motion onto the agent. In this article, we investigate filtering effects on three types of embodiments: (a) an agent with a body but no facial features, (b) an agent with a head only, and (c) an agent with a body and a face. We performed a full performance capture of a mime actor enacting short interactions varying the non-verbal expression along five dimensions (e.g., level of frustration and level of certainty) for each of the three embodiments. We performed a crowd-sourced evaluation experiment comparing the video of the actor to the video of an animated robot for the different embodiments and dimensions. Our findings suggest that the face is especially important to pinpoint emotional reactions but is also most volatile to filtering effects. The body motion, on the other hand, had more diverse interpretations but tended to preserve the interpretation after mapping and thus proved to be more resilient to filtering.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2017
Keywords
Motion capture, cross-mapping, perception
National subject category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-218270 (URN)
10.1145/3127590 (DOI)
000415407300003 ()
2-s2.0-85029893975 (Scopus ID)
Note

QC 20171127

Available from: 2017-11-24 Created: 2017-11-24 Last updated: 2018-01-13 Bibliographically approved
