Data-efficient multimodal human action recognition for proactive human–robot collaborative assembly: A cross-domain few-shot learning approach
UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China.
KTH, School of Industrial Engineering and Management (ITM), Production engineering. ORCID iD: 0000-0002-0222-912X
KTH, School of Industrial Engineering and Management (ITM), Production engineering. ORCID iD: 0000-0001-8679-8049
Global Institute of Future Technology, Shanghai Jiao Tong University, Shanghai, China.
2024 (English). In: Robotics and Computer-Integrated Manufacturing, ISSN 0736-5845, E-ISSN 1879-2537, Vol. 89, article id 102785. Article in journal (Refereed). Published.
Abstract [en]

With the recent vision of Industry 5.0, the cognitive capability of robots plays a crucial role in advancing proactive human–robot collaborative assembly. As a basis of mutual empathy, the understanding of a human operator's intention has been studied primarily through human action recognition. Existing deep learning-based methods demonstrate remarkable efficacy in handling information-rich data such as physiological measurements and videos, where the latter represents a more natural perception input. However, deploying these methods in new, unseen assembly scenarios requires first collecting abundant case-specific data, which leads to significant manual effort and poor flexibility. To address this issue, this paper proposes a novel cross-domain few-shot learning method for data-efficient multimodal human action recognition. A hierarchical data fusion mechanism is designed to jointly leverage skeletons, RGB images, and depth maps with complementary information. A temporal CrossTransformer is then developed to enable action recognition with a very limited amount of data. Lightweight domain adapters are integrated to further improve generalization with fast finetuning. Extensive experiments on a real car engine assembly case show the superior performance of the proposed method over state-of-the-art approaches regarding both accuracy and finetuning efficiency. Real-time demonstrations and an ablation study further indicate the potential of early recognition, which is beneficial for robot procedure generation in practical applications. In summary, this paper contributes to the rarely explored realm of data-efficient human action recognition for proactive human–robot collaboration.
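The cross-domain few-shot setting summarized above can be illustrated with a minimal nearest-prototype sketch in Python. This is a generic few-shot baseline, not the paper's method: naive feature concatenation stands in for the hierarchical fusion mechanism and the temporal CrossTransformer, and all features, dimensions, and class centers below are synthetic assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(skeleton, rgb, depth):
    # Naive late fusion: concatenate per-modality feature vectors.
    # (The paper proposes a hierarchical fusion mechanism; this is a stand-in.)
    return np.concatenate([skeleton, rgb, depth])

def make_sample(action_class, dim=32):
    # Synthetic per-modality features drawn around a class-specific center
    # (hypothetical data standing in for real skeleton/RGB/depth encodings).
    center = np.full(dim, 3.0 * action_class)
    return fuse(rng.normal(center, 1.0),
                rng.normal(center, 1.0),
                rng.normal(center, 1.0))

def prototypes(support, labels):
    # Class prototype = mean fused feature over the few labeled support shots.
    return {c: support[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(query, protos):
    # Nearest-prototype prediction by Euclidean distance.
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))

# A 2-way, 3-shot episode: two actions, three labeled clips each.
labels = np.array([0, 0, 0, 1, 1, 1])
support = np.stack([make_sample(c) for c in labels])
protos = prototypes(support, labels)
pred = classify(make_sample(0), protos)  # query drawn from class 0
```

The key property mirrored here is data efficiency: classification of a new scenario needs only a handful of labeled support clips per action, rather than a large case-specific training set.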

Place, publisher, year, edition, pages
Elsevier BV, 2024. Vol. 89, article id 102785
Keywords [en]
Cross-domain few-shot learning, Data-efficient, Human action recognition, Human–robot collaborative assembly, Multimodal
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-346813
DOI: 10.1016/j.rcim.2024.102785
ISI: 001242317200001
Scopus ID: 2-s2.0-85192910539
OAI: oai:DiVA.org:kth-346813
DiVA, id: diva2:1860427
Note

QC 20240626

Available from: 2024-05-24. Created: 2024-05-24. Last updated: 2024-06-26. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Liu, Zhihao; Wang, Lihui; Wang, Xi Vincent
