Developing Data-Driven Models for Understanding Human Motion
Yin, Wenjie. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-7189-1336
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Humans are the primary subjects of interest in the realm of computer vision. Specifically, perceiving, generating, and understanding human activities have long been a core pursuit of machine intelligence. Over the past few decades, data-driven methods for modeling human motion have demonstrated great potential across various interactive media and social robotics domains. Despite these impressive achievements, challenges remain in analyzing multi-agent and multi-modal behaviors and in producing high-fidelity, highly varied motions. This complexity arises because human motion is inherently dynamic, uncertain, and intertwined with its environment. This thesis introduces the challenges of and data-driven methods for understanding human motion, and then elaborates on the contributions of the included papers. We present the thesis mainly in ascending order of complexity: recognition, synthesis, and transfer, which cover the tasks of perceiving, generating, and understanding human activities.

Firstly, we present methods to recognize human motion (Paper A). We consider a conversational group scenario in which people gather and stand in an environment to converse. Based on transformer-based networks and graph convolutional neural networks, we demonstrate how spatial-temporal group dynamics can be modeled and perceived on both the individual and group levels. Secondly, we investigate probabilistic autoregressive approaches to generate controllable human locomotion. We employ deep generative models, namely normalizing flows (Paper B) and diffusion models (Paper C), to generate and reconstruct 3D skeletal poses of humans over time. Finally, we address the problem of motion style transfer. We propose style transfer systems that transform motion styles while attempting to preserve motion context, using GAN-based (Paper D) and diffusion-based (Paper E) methods. Compared with previous research that mainly focuses on simple locomotion or exercise, we consider more complex dance movements and multimodal information.
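
To make the autoregressive setup of Papers B and C concrete, the following minimal PyTorch-style sketch shows only the rollout pattern: future frames are drawn one at a time from a conditional model given the past poses and a control signal. The sample_next interface is a hypothetical stand-in for one normalizing-flow or diffusion sampling step, not the implementation used in the thesis.

    import torch

    def rollout(model, past_poses, control, horizon):
        # past_poses: (1, T_past, D) observed skeleton poses
        # control:    (1, T_past + horizon, C) control signal, e.g. a root path
        # model:      hypothetical interface exposing sample_next(history, ctrl) -> (1, D)
        poses = past_poses
        t_past = past_poses.shape[1]
        for t in range(horizon):
            ctrl_window = control[:, : t_past + t + 1]
            next_pose = model.sample_next(poses, ctrl_window)  # draw x_t ~ p(x_t | x_<t, c)
            poses = torch.cat([poses, next_pose.unsqueeze(1)], dim=1)
        return poses[:, t_past:]                               # generated frames only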

In summary, this thesis proposes methods that can effectively perceive, generate, and transfer 3D human motion. In terms of network architectures, we employ a graph formulation to exploit the correlations within human skeletons, thereby introducing inductive bias through graph structures. Additionally, we leverage transformers to handle long-term data dependencies and to weigh the importance of varying data components. In terms of learning frameworks, we adopt generative models to represent joint distributions over relevant variables and multiple modalities, which are flexible enough to cover a wide range of tasks. Our experiments demonstrate the effectiveness of the proposed frameworks by evaluating the methods on our own collected dataset as well as public datasets, and we show how these methods apply to various challenging tasks.
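
As an illustration of the graph-based inductive bias mentioned above, the sketch below builds a normalized adjacency matrix for a toy five-joint skeleton and applies one graph-convolution layer so that features mix only along bones. The joint count, edges, and layer sizes are illustrative assumptions, not the architectures used in the included papers.

    import torch
    import torch.nn as nn

    # Toy five-joint skeleton (root, spine, head, left arm, right arm); only illustrative.
    EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]

    def normalized_adjacency(num_joints, edges):
        A = torch.eye(num_joints)                      # self-connections
        for i, j in edges:
            A[i, j] = A[j, i] = 1.0                    # bones connect joints symmetrically
        d_inv_sqrt = A.sum(dim=1).pow(-0.5)
        return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}

    class GraphConv(nn.Module):
        """One graph-convolution layer over joints: features mix only along bones,
        which is the inductive bias the graph formulation provides."""
        def __init__(self, in_dim, out_dim, A):
            super().__init__()
            self.register_buffer("A", A)
            self.lin = nn.Linear(in_dim, out_dim)

        def forward(self, x):                          # x: (batch, joints, features)
            return torch.relu(self.lin(self.A @ x))

    layer = GraphConv(3, 16, normalized_adjacency(5, EDGES))
    features = layer(torch.randn(2, 5, 3))             # -> (2, 5, 16)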

Abstract [sv]

Människor är av primärt intresse för studier inom ämnet datorseende. Mer specifikt, att uppfatta, generera och förstå mänskliga aktiviteter har länge varit en huvudsaklig strävan inom maskinintelligens. Under de senaste årtiondena har datadrivna metoder för modellering av mänsklig rörelse visat stor potential inom olika interaktiva medier och områden för social robotik. Trots dess imponerande framgångar kvarstår utmaningar i att analysera multiagent/multimodal-beteenden och producera högupplösta och mycket varierade rörelser. Denna komplexitet uppstår eftersom mänsklig rörelse i grunden är dynamisk, osäker och sammanflätad med sin miljö. Denna avhandling syftar till att introducera utmaningar och datadrivna metoder för att förstå mänsklig rörelse och sedan beskriva bidragen från de inkluderade artiklarna. Vi presenterar denna avhandling huvudsakligen i stigande ordning av komplexitet: igenkänning, syntes och överföring, vilket inkluderar uppgifterna att uppfatta, generera och förstå mänskliga aktiviteter.

Först presenterar vi metoder för att känna igen mänsklig rörelse (Artikel A). Vi beaktar ett konversationsgruppsscenario där människor samlas och står i en miljö för att samtala. Baserat på transformer-baserade nätverk och graf-faltade neurala nätverk visar vi hur rumsligt-temporal gruppdynamik kan modelleras och uppfattas på både individ- och gruppnivåer. För det andra undersöker vi probabilistiska autoregressiva metoder för att generera kontrollerbar mänsklig rörelse. Vi använder djupa generativa modeller, nämligen normaliserande flöden (Artikel B) och diffusionsmodeller (Artikel C), för att generera och rekonstruera 3D-skelettpositioner av människor över tid. Slutligen behandlar vi problemet med översättning av rörelsestilar. Vi föreslår ett stilöversättningssystem som möjliggör omvandling av rörelsestilar samtidigt som det försöker bevara rörelsesammanhang genom GAN-baserade (Artikel D) och diffusionsbaserade (Artikel E) metoder. Jämfört med tidigare forskning som huvudsakligen fokuserar på enkel rörelse eller träning, beaktar vi mer komplexa dansrörelser och multimodal information.

Sammanfattningsvis syftar denna avhandling till att föreslå metoder som effektivt kan uppfatta, generera och översätta mänsklig rörelse i 3D. När det gäller nätverksarkitekturer använder vi en graf-formulering för att utnyttja korrelationen hos mänskliga skelett och introducerar därigenom induktiv bias genom grafstrukturer. Dessutom utnyttjar vi transformer-modeller för att hantera långsiktiga databeroenden och väga betydelsen av varierande komponenter i datan. När det gäller ramverk för inlärning tillämpar vi generativa modeller för att representera gemensamma fördelningar över relevanta variabler och flera modaliteter, vilka är flexibla nog att täcka ett brett spektrum av uppgifter. Våra experiment visar effektiviteten hos de föreslagna ramverken genom att utvärdera metoderna på egna insamlade dataset och offentliga dataset. Vi visar hur dessa metoder tillämpas på flertalet utmanande uppgifter.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2024, p. xiii, 68
Series
TRITA-EECS-AVL ; 2024:9
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-342366, ISBN: 978-91-8040-815-8 (print), OAI: oai:DiVA.org:kth-342366, DiVA id: diva2:1828413
Public defence
2024-02-16, https://kth-se.zoom.us/j/62347635904, F3, Lindstedtsvägen 26, Stockholm, 14:00 (English)
Note

QC 20240117

Available from: 2024-01-17. Created: 2024-01-16. Last updated: 2024-02-05. Bibliographically approved.
List of papers
1. Group Behavior Recognition Using Attention- and Graph-Based Neural Networks
2020 (English). In: ECAI 2020: 24th European Conference on Artificial Intelligence, 2020. Conference paper, Published paper (Refereed)
Abstract [en]

When a conversational group is approached by a newcomer who wishes to join it, the group may dynamically react by adjusting their positions and orientations in order to accommodate the newcomer. These reactions represent important cues to the newcomer about if and how they should plan their approach. The recognition and analysis of such socially compliant dynamic group behaviors have rarely been studied in depth and remain a challenging problem in social multi-agent systems. In this paper, we present novel group behavior recognition models, attention-based and graph-based, that consider behaviors on both the individual and group levels. The attention-based category consists of Approach Group Net (AGNet) and Approach Group Transformer (AGTransformer). They share a similar architecture and use attention mechanisms to encode both temporal and spatial information on both the individual and group levels. The graph-based models consist of Approach Group Graph Convolutional Networks (AG-GCN), which combine Multi-Spatial-Temporal Graph Convolutional Networks (MST-GCN) on the individual level and Graph Convolutional Networks (GCN) on the group level, with multi-temporal stages. The individual level learns the spatial and temporal movement patterns of each agent, while the group level captures the relations and interactions of multiple agents. In order to train and evaluate these models, we collected a full-body motion-captured dataset of multiple individuals in conversational groups. Experiments performed using our models to recognize group behaviors from the collected dataset show that AG-GCN, with additional distance and orientation information, achieves the best performance. We also present a multi-agent interaction use case in a virtual environment to show how the models can be practically applied.
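
As a schematic of the two-level idea (individual-level motion encoding followed by group-level aggregation), the simplified sketch below uses a GRU per agent and mean pooling over agents. It is not the AGNet, AGTransformer, or AG-GCN implementation, and all dimensions are made up for illustration.

    import torch
    import torch.nn as nn

    class TwoLevelGroupClassifier(nn.Module):
        """Simplified two-level recognizer: an individual-level temporal encoder per
        agent, then group-level aggregation over agents (illustrative only)."""
        def __init__(self, pose_dim=45, hidden=64, num_classes=4):
            super().__init__()
            self.individual = nn.GRU(pose_dim, hidden, batch_first=True)
            self.group_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                            nn.Linear(hidden, num_classes))

        def forward(self, x):              # x: (batch, agents, time, pose_dim)
            b, n, t, d = x.shape
            _, h = self.individual(x.reshape(b * n, t, d))   # encode each agent's motion
            h = h[-1].reshape(b, n, -1)                      # (batch, agents, hidden)
            group = h.mean(dim=1)                            # pool over agents (group level)
            return self.group_head(group)                    # group-behavior logits

    logits = TwoLevelGroupClassifier()(torch.randn(2, 3, 50, 45))  # -> (2, 4)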

Series
Frontiers in Artificial Intelligence and Applications
National Category
Computer graphics and computer vision; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-287334 (URN), 10.3233/FAIA200273 (DOI), 000650971301111 (), 2-s2.0-85091792981 (Scopus ID)
Conference
The 24th European Conference on Artificial Intelligence (ECAI), 29 August - 8 September 2020
Note

QC 20210621

Available from: 2020-12-07. Created: 2020-12-07. Last updated: 2025-02-01. Bibliographically approved.
2. Graph-based Normalizing Flow for Human Motion Generation and Reconstruction
2021 (English). In: 2021 30th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 641-648. Conference paper, Published paper (Refereed)
Abstract [en]

Data-driven approaches for modeling human skeletal motion have found various applications in interactive media and social robotics. Challenges remain in these fields for generating high-fidelity samples and robustly reconstructing motion from imperfect input data, due to, e.g., missed marker detections. In this paper, we propose a probabilistic generative model to synthesize and reconstruct long-horizon motion sequences conditioned on past information and control signals, such as the path along which an individual is moving. Our method adapts the existing work MoGlow by introducing a new graph-based model. The model leverages the spatial-temporal graph convolutional network (ST-GCN) to effectively capture the spatial structure and temporal correlation of skeletal motion data at multiple scales. We evaluate the models on a mixture of motion capture datasets of human locomotion with foot-step and bone-length analysis. The results demonstrate the advantages of our model in reconstructing missing markers while achieving comparable results on generating realistic future poses. When the inputs are imperfect, our model shows improved robustness of generation.
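
For readers unfamiliar with normalizing flows, the sketch below shows one conditional affine coupling step of the kind used in MoGlow-style models, with a plain MLP as the conditioning network. Paper B replaces such networks with spatial-temporal graph convolutions, so this is a simplified stand-in under assumed shapes, not the paper's model.

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        """One affine coupling step of a conditional normalizing flow: half of the
        variables are transformed by a scale and shift predicted from the other
        half plus the conditioning signal, giving an exactly invertible map."""
        def __init__(self, dim, cond_dim, hidden=128):
            super().__init__()
            self.half = dim // 2
            self.net = nn.Sequential(
                nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * (dim - self.half)))

        def forward(self, x, cond):                    # x: (batch, dim), cond: (batch, cond_dim)
            x1, x2 = x[:, :self.half], x[:, self.half:]
            s, t = self.net(torch.cat([x1, cond], dim=-1)).chunk(2, dim=-1)
            s = torch.tanh(s)                          # keep scales well-behaved
            y2 = x2 * torch.exp(s) + t
            log_det = s.sum(dim=-1)                    # contribution to the exact log-likelihood
            return torch.cat([x1, y2], dim=-1), log_det

        def inverse(self, y, cond):                    # used when sampling new poses
            y1, y2 = y[:, :self.half], y[:, self.half:]
            s, t = self.net(torch.cat([y1, cond], dim=-1)).chunk(2, dim=-1)
            s = torch.tanh(s)
            return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)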

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Series
IEEE RO-MAN, ISSN 1944-9445
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-305502 (URN), 10.1109/RO-MAN50785.2021.9515316 (DOI), 000709817200093 (), 2-s2.0-85115049506 (Scopus ID)
Conference
30th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 8-12 August 2021, online
Note

QC 20211201

Part of proceedings: ISBN 978-1-6654-0492-1

Available from: 2021-12-01. Created: 2021-12-01. Last updated: 2024-01-17. Bibliographically approved.
3. Controllable Motion Synthesis and Reconstruction with Autoregressive Diffusion Models
2023 (English). In: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 1102-1108. Conference paper, Published paper (Refereed)
Abstract [en]

Data-driven and controllable human motion synthesis and prediction are active research areas with various applications in interactive media and social robotics. Challenges remain in these fields for generating diverse motions given past observations and dealing with imperfect poses. This paper introduces MoDiff, an autoregressive probabilistic diffusion model over motion sequences conditioned on control contexts of other modalities. Our model integrates a cross-modal Transformer encoder and a Transformer-based decoder, which are found effective in capturing temporal correlations in motion and control modalities. We also introduce a new data dropout method based on the diffusion forward process to provide richer data representations and robust generation. We demonstrate the superior performance of MoDiff in controllable motion synthesis for locomotion with respect to two baselines and show the benefits of diffusion data dropout for robust synthesis and reconstruction of high-fidelity motion close to recorded data.
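
The diffusion-based data dropout mentioned here builds on the standard DDPM forward process; the sketch below shows only that closed-form corruption q(x_t | x_0) applied to conditioning poses at a random noise level. The schedule values and tensor shapes are illustrative assumptions, not the exact MoDiff recipe.

    import torch

    def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
        """Cumulative noise schedule of a standard DDPM (linear betas)."""
        betas = torch.linspace(beta_start, beta_end, T)
        return torch.cumprod(1.0 - betas, dim=0)       # alpha_bar_t

    def forward_diffuse(x0, t, alpha_bars):
        """Closed-form forward process q(x_t | x_0): noisy poses at step t.
        Corrupting the conditioning frames this way during training is the idea
        behind diffusion-based data dropout (schematic, not the exact recipe)."""
        a = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
        eps = torch.randn_like(x0)
        return a.sqrt() * x0 + (1.0 - a).sqrt() * eps

    alpha_bars = make_alpha_bars()
    poses = torch.randn(8, 30, 63)                      # (batch, frames, pose_dim), dummy data
    t = torch.randint(0, 1000, (8,))                    # a random noise level per sequence
    noisy_context = forward_diffuse(poses, t, alpha_bars)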

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Series
IEEE RO-MAN, ISSN 1944-9445
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-341978 (URN), 10.1109/RO-MAN57019.2023.10309317 (DOI), 001108678600131 (), 2-s2.0-85186990309 (Scopus ID)
Conference
32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 28-31 August 2023, Busan, South Korea
Note

Part of proceedings: ISBN 979-8-3503-3670-2

QC 20240110

Available from: 2024-01-10. Created: 2024-01-10. Last updated: 2025-02-07. Bibliographically approved.
4. Multimodal dance style transfer
2023 (English). In: Machine Vision and Applications, ISSN 0932-8092, E-ISSN 1432-1769, Vol. 34, no. 4, article id 48. Article in journal (Refereed), Published
Abstract [en]

This paper first presents CycleDance, a novel dance style transfer system that transforms an existing motion clip in one dance style into a motion clip in another dance style while attempting to preserve the motion context of the dance. CycleDance extends existing CycleGAN architectures with multimodal transformer encoders to account for the music context. We adopt a sequence length-based curriculum learning strategy to stabilize training. Our approach captures rich and long-term intra-relations between motion frames, which is a common challenge in motion transfer and synthesis work. Building upon CycleDance, we further propose StarDance, which enables many-to-many mappings across different styles using a single generator network. Additionally, we introduce new metrics for gauging transfer strength and content preservation in the context of dance movements. To evaluate the performance of our approach, we perform an extensive ablation study and a human study with 30 participants, each with 5 or more years of dance experience. Our experimental results show that our approach can generate realistic movements with the target style, outperforming the baseline CycleGAN and its variants on naturalness, transfer strength, and content preservation. Our proposed approach has potential applications in choreography, gaming, animation, and tool development for artistic and scientific innovations in the field of dance.
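
To illustrate the CycleGAN-style training signal that CycleDance builds on, the sketch below computes a cycle-consistency loss for two hypothetical generators that also receive the music context. The function signatures are assumptions; the real system additionally uses adversarial losses, multimodal transformer encoders, and curriculum learning.

    import torch
    import torch.nn as nn

    def cycle_consistency_loss(G_ab, G_ba, motion_a, motion_b, music):
        """Cycle-consistency term of a CycleGAN-style style-transfer setup: a motion
        translated to the other style and back should return to itself.
        G_ab / G_ba are hypothetical generators that also take the music context,
        mirroring (but not reproducing) the multimodal conditioning in CycleDance."""
        l1 = nn.L1Loss()
        rec_a = G_ba(G_ab(motion_a, music), music)      # style A -> B -> A
        rec_b = G_ab(G_ba(motion_b, music), music)      # style B -> A -> B
        return l1(rec_a, motion_a) + l1(rec_b, motion_b)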

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Style transfer, Dance motion, Multimodal learning, Generative models
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-328307 (URN), 10.1007/s00138-023-01399-x (DOI), 000984951800001 (), 2-s2.0-85158999932 (Scopus ID)
Note

QC 20230607

Available from: 2023-06-07. Created: 2023-06-07. Last updated: 2024-01-17. Bibliographically approved.
5. Scalable Motion Style Transfer with Constrained Diffusion Generation
2024 (English). In: Proceedings of the 38th AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), 2024, Vol. 38, p. 10234-10242. Conference paper, Published paper (Refereed)
Abstract [en]

Current training of motion style transfer systems relies on consistency losses across style domains to preserve content, hindering scalable application to a large number of domains and to private data. Recent image transfer works show the potential of independent training on each domain by leveraging implicit bridging between diffusion models, with content preservation, however, limited to simple data patterns. We address this by imposing biased sampling in backward diffusion while maintaining domain independence in the training stage. We construct the bias from source-domain keyframes and apply it as the gradient of content constraints, yielding a framework with keyframe manifold constraint gradients (KMCGs). Our validation demonstrates the success of training separate models to transfer between as many as ten dance motion styles. Comprehensive experiments find a significant improvement in preserving motion content in comparison to baseline and ablative diffusion-based style transfer models. In addition, we perform a human study for a subjective assessment of the quality of the generated dance motions. The results validate the competitiveness of KMCGs.
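
The biased backward sampling described here can be sketched as a guidance-style constraint: at each reverse step, the gradient of an L2 distance between the model's clean-motion estimate and the source keyframes nudges the sample toward preserving content. The code below is a schematic of that idea under assumed shapes and a hypothetical denoiser callable, not the exact KMCG update.

    import torch

    def guided_denoise_step(denoiser, x_t, t, keyframes, key_idx, guidance_scale=1.0):
        """One reverse-diffusion step with a content-constraint gradient (schematic)."""
        x_t = x_t.detach().requires_grad_(True)
        x0_hat = denoiser(x_t, t)                        # model's estimate of the clean motion
        constraint = ((x0_hat[:, key_idx] - keyframes) ** 2).sum()
        grad = torch.autograd.grad(constraint, x_t)[0]
        # A full sampler would re-noise this estimate to level t-1; here we only show
        # how the keyframe constraint gradient biases the update.
        return x0_hat.detach() - guidance_scale * grad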

Place, publisher, year, edition, pages
Association for the Advancement of Artificial Intelligence (AAAI), 2024
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-342365 (URN), 10.1609/aaai.v38i9.28889 (DOI), 001241512400092 (), 2-s2.0-85189340183 (Scopus ID)
Conference
The 38th Annual AAAI Conference on Artificial Intelligence, February 20-27, 2024, Vancouver, Canada
Note

QC 20241112

Available from: 2024-01-16. Created: 2024-01-16. Last updated: 2024-11-12. Bibliographically approved.

Open Access in DiVA

fulltext (4949 kB), 534 downloads
File information
File name: FULLTEXT02.pdf
File size: 4949 kB
Checksum: SHA-512
f21ef395a89d016eb19cba46509312385d39ef03adf33a5b3478bb3c007c2dee2a33a10bc49d89b796f52afa91ab2a5da3da8fd737ce1361fdbca3b93ab5465c
Type: fulltext
Mimetype: application/pdf

Authority records

Yin, Wenjie
