Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-6679-4021
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-7882-948X
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
Scania CV AB, Research and Development, Södertälje, Sweden, 151 87.
2024 (English). In: 35th IEEE Intelligent Vehicles Symposium, IV 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 2405-2412. Conference paper, published paper (refereed).
Abstract [en]

Addressing hard cases in autonomous driving, such as anomalous road users, extreme weather conditions, and complex traffic interactions, presents significant challenges. To ensure safety, autonomous driving systems must detect and manage these scenarios effectively. However, the rarity and high-risk nature of such cases demand extensive, diverse datasets for training robust models. Vision-Language Foundation Models (VLMs), trained on extensive datasets, have shown remarkable zero-shot capabilities. This work explores the potential of VLMs for detecting hard cases in autonomous driving. We demonstrate the capability of VLMs such as GPT-4V to detect hard cases in traffic-participant motion prediction at both the agent and scenario levels. We introduce a feasible pipeline in which a VLM, fed sequential image frames with designed prompts, effectively identifies challenging agents or scenarios, which are then verified by existing prediction models. Moreover, by taking advantage of this VLM-based hard-case detection, we further improve the training efficiency of an existing motion prediction pipeline by performing data selection on the training samples suggested by GPT. We show the effectiveness and feasibility of our pipeline, which incorporates VLMs alongside state-of-the-art methods, on the NuScenes dataset. The code is accessible at https://github.com/KTH-RPL/Detect-VLM.
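
The abstract describes the pipeline only at a high level. As a rough illustration of the VLM-query and data-selection steps it mentions, the Python sketch below feeds a short sequence of camera frames to a vision-language model, asks it to rate scenario difficulty, and keeps hard-rated samples for training. It assumes the OpenAI chat-completions API; the prompt wording, the 1-5 scale, the threshold, the model name, and the file names are all hypothetical and not taken from the paper (the authors' actual implementation is at the GitHub link above).

```python
# Illustrative sketch only: prompt text, scoring scale, threshold, model
# name, and file names are assumptions, not the paper's exact design.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_frame(path: str) -> str:
    """Base64-encode one camera frame for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def score_scenario(frame_paths: list[str]) -> int:
    """Ask the VLM to rate how hard the scenario is for motion prediction."""
    content = [{
        "type": "text",
        "text": (
            "These are sequential front-camera frames from a driving scene. "
            "Rate how difficult it is to predict the future motion of the "
            "traffic participants, from 1 (trivial) to 5 (very hard). "
            "Answer with a single digit."
        ),
    }]
    # Attach the sequential frames as base64-encoded images.
    for path in frame_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(path)}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # stand-in for the GPT-4V model named in the paper
        messages=[{"role": "user", "content": content}],
        max_tokens=5,
    )
    # Assumes the model complies and answers with a leading digit.
    return int(resp.choices[0].message.content.strip()[0])


if __name__ == "__main__":
    # Hypothetical scenario: a short sequence of consecutive frames.
    frames = ["frame_0.jpg", "frame_1.jpg", "frame_2.jpg"]
    if score_scenario(frames) >= 4:  # threshold chosen for illustration
        print("hard scenario: keep for training")
```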

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 2405-2412.
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-351756
DOI: 10.1109/IV55156.2024.10588694
ISI: 001275100902068
Scopus ID: 2-s2.0-85199784263
OAI: oai:DiVA.org:kth-351756
DiVA, id: diva2:1888723
Conference
35th IEEE Intelligent Vehicles Symposium, IV 2024, Jeju Island, Korea, June 2-5, 2024
Note

Part of ISBN: 9798350348811

QC 20240815

Available from: 2024-08-13. Created: 2024-08-13. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Yang, Yi; Zhang, Qingwen; Folkesson, John

Search in DiVA

By author/editor
Yang, Yi; Zhang, Qingwen; Ikemura, Kei; Folkesson, John
By organisation
Robotics, Perception and Learning, RPL
Computer graphics and computer vision
