Referring Atomic Video Action Recognition
Karlsruhe Institute of Technology, Karlsruhe, Germany.
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science. RISE Research Institutes of Sweden, Gothenburg, Sweden. ORCID iD: 0009-0004-3798-8603
Hunan University, Changsha, China.
Karlsruhe Institute of Technology, Karlsruhe, Germany.
2025 (English). In: Computer Vision – ECCV 2024 - 18th European Conference, Proceedings, Springer Nature, 2025, p. 166-185. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet, a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization, and harvest the prediction of the atomic actions for the referring person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion, which amplify the most relevant information across these streams and consistently surpass standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at RAVAR.

Place, publisher, year, edition, pages
Springer Nature, 2025. p. 166-185
National category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-358213
DOI: 10.1007/978-3-031-72655-2_10
Scopus ID: 2-s2.0-85213009172
OAI: oai:DiVA.org:kth-358213
DiVA, id: diva2:1924847
Conference
18th European Conference on Computer Vision, ECCV 2024, Milan, Italy, Sep 29 2024 - Oct 4 2024
Note

Part of ISBN 9783031726545

QC 20250114

Available from: 2025-01-07. Created: 2025-01-07. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Person

Fu, Jia
