Empowering natural human–robot collaboration through multimodal language models and spatial intelligence: Pathways and perspectives
School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Mechanical System and Vibration, Shanghai Jiao Tong University, Shanghai, China.
Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China.
School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Mechanical System and Vibration, Shanghai Jiao Tong University, Shanghai, China.
School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Mechanical System and Vibration, Shanghai Jiao Tong University, Shanghai, China.
2026 (English). In: Robotics and Computer-Integrated Manufacturing, ISSN 0736-5845, E-ISSN 1879-2537, Vol. 97, article id 103064. Article, review/survey (Refereed). Published.
Abstract [en]

Industry 5.0 advocates human-centric smart manufacturing (HSM), with growing attention to proactive human–robot collaboration (HRC). Meanwhile, the rapid development of multimodal large language models (MLLMs) and embodied intelligence is driving an unprecedented evolution. This work aims to leverage these opportunities to enhance robots’ learning and cognitive capabilities, enabling seamless and natural interaction. However, current research often overlooks human–robot symbiosis and pays little attention to specialized models and practical applications. This review adheres to a human-centric vision, taking language as the pivot to connect humans with large models. To the best of our knowledge, this is the first attempt to integrate HRC, MLLMs and embodied intelligence into a holistic view. The review first introduces representative foundation models to provide a comprehensive summary of state-of-the-art methods in the “Perception-Cognition-Actuation” loop. It then discusses pathways and platforms for efficient spatial skills learning, followed by an analysis of four key questions from the “Why, How, What, Where” perspectives. Finally, it highlights future challenges and potential research directions. It is hoped that this work can help fill the research gap between HRC and MLLMs, offering a systematic pathway for developing human-centered collaborative systems and promoting further exploration and innovation in this exciting and crucial field. The resources are available at: https://github.com/WuDuidi/MLLM-HRC-Survey.

Place, publisher, year, edition, pages
Elsevier BV, 2026. Vol. 97, article id 103064
Keywords [en]
Embodied intelligence, Human–robot collaboration, Large language model, Robot learning, Smart manufacturing
National Category
Production Engineering, Human Work Science and Ergonomics; Robotics and automation
Identifiers
URN: urn:nbn:se:kth:diva-368601
DOI: 10.1016/j.rcim.2025.103064
ISI: 001514039100001
Scopus ID: 2-s2.0-105007620255
OAI: oai:DiVA.org:kth-368601
DiVA, id: diva2:1990132
Note

QC 20250819

Available from: 2025-08-19 Created: 2025-08-19 Last updated: 2025-09-26 Bibliographically approved

Open Access in DiVA

No full text in DiVA


Authority records

Wang, Lihui
