Vision-language model-driven scene understanding and robotic object manipulation
KTH, School of Industrial Engineering and Management (ITM), Production engineering. University of Cambridge, Cognition and Brain Sciences Unit, UK; Institute of Bioengineering, Swiss Federal School of Technology in Lausanne, Switzerland. ORCID iD: 0000-0002-1909-0507
Case Western Reserve University, Department of Mechanical and Aerospace Engineering, USA.
Case Western Reserve University, Department of Mechanical and Aerospace Engineering, USA.
KTH, School of Industrial Engineering and Management (ITM), Production engineering. ORCID iD: 0000-0001-9694-0483
2024 (English). In: 2024 IEEE 20th International Conference on Automation Science and Engineering, CASE 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 21-26. Conference paper, Published paper (Refereed)
Abstract [en]

Humans often use natural language instructions to control and interact with robots for task execution. This poses a significant challenge to robots, which must not only parse and understand human instructions but also achieve semantic understanding of an unknown environment and its constituent elements. To address this challenge, this study presents a vision-language model (VLM)-driven approach to scene understanding of an unknown environment to enable robotic object manipulation. Given language instructions, a pre-trained vision-language model built on the open-source Llama2-chat (7B) as the language model backbone is adopted for image description and scene understanding, translating visual information into text descriptions of the scene. Next, a zero-shot approach to fine-grained visual grounding and object detection is developed to extract and localise objects of interest from the scene. Upon 3D reconstruction and pose estimation of the object, a code-writing large language model (LLM) is adopted to generate high-level control code that links language instructions with robot actions for downstream tasks. The performance of the developed approach is experimentally validated through table-top object manipulation by a robot.
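The abstract outlines a four-stage pipeline: VLM-based scene description, zero-shot visual grounding and detection, 3D reconstruction with pose estimation, and LLM-generated control code. The following is a minimal Python sketch of how such stages could be chained, based only on the abstract; every function name, signature, and data structure below is a hypothetical placeholder, not the authors' implementation or any specific library API.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# None of these functions correspond to real APIs; they only mark
# where the components named in the paper would plug in.

from dataclasses import dataclass
from typing import List


@dataclass
class ObjectPose:
    name: str
    position: tuple      # assumed (x, y, z) in the robot base frame
    orientation: tuple   # assumed quaternion (x, y, z, w)


def describe_scene(image) -> str:
    """Step 1: a pre-trained VLM (Llama2-chat 7B backbone in the paper)
    converts the camera image into a text description of the scene."""
    raise NotImplementedError  # placeholder for the VLM call


def ground_objects(image, instruction: str, description: str) -> List[str]:
    """Step 2: zero-shot visual grounding / object detection extracts
    the objects of interest named in the language instruction."""
    raise NotImplementedError


def estimate_pose(image, obj: str) -> ObjectPose:
    """Step 3: 3D reconstruction and pose estimation for each object."""
    raise NotImplementedError


def generate_control_code(instruction: str, poses: List[ObjectPose]) -> str:
    """Step 4: a code-writing LLM produces high-level robot control code
    that links the instruction and object poses to downstream actions."""
    raise NotImplementedError


def run_pipeline(image, instruction: str) -> str:
    """Chain the four stages for a single table-top manipulation request."""
    description = describe_scene(image)
    objects = ground_objects(image, instruction, description)
    poses = [estimate_pose(image, o) for o in objects]
    return generate_control_code(instruction, poses)
```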

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024. p. 21-26
National Category
Robotics and automation; Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-356291
DOI: 10.1109/CASE59546.2024.10711845
Scopus ID: 2-s2.0-85208272331
OAI: oai:DiVA.org:kth-356291
DiVA, id: diva2:1912875
Conference
20th IEEE International Conference on Automation Science and Engineering, CASE 2024, Bari, Italy, August 28 - September 1, 2024
Note

Part of ISBN 9798350358513

QC 20241118

Available from: 2024-11-13 Created: 2024-11-13 Last updated: 2025-02-05 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text; Scopus

Authority records

Liu, Sichao; Wang, Xi Vincent; Wang, Lihui
