Resolving References in Visually-Grounded Dialogue via Text Generation
Willemsen, Bram: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-2140-0612
Qian, Livia: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-7885-5477
Skantze, Gabriel: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-8579-1790
2023 (English). In: Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue / [ed] David Schlangen, Svetlana Stoyanchev, Shafiq Joty, Ondrej Dusek, Casey Kennington, Malihe Alikhani. Prague, Czechia: Association for Computational Linguistics (ACL), 2023, p. 457-469. Conference paper, Published paper (Refereed)
Abstract [en]

Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based on larger context windows has the potential to yield higher returns.

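The abstract describes a two-stage pipeline: a fine-tuned causal LLM first rewrites the dialogue context into a definite description of the referent, and a pretrained VLM then matches that description against candidate images zero-shot. The Python sketch below illustrates this flow under stated assumptions; the checkpoint names (gpt2, openai/clip-vit-base-patch32), the prompt format, and the candidate images are illustrative placeholders, not the models or data used in the paper.

# Minimal sketch of the two-stage pipeline outlined in the abstract:
# (1) a causal LM turns the dialogue context into a definite description
#     of the referent; (2) a pretrained VLM scores candidate images
#     against that description, zero-shot.
# Model names, prompt format, and candidate set are assumptions for
# illustration, not the paper's exact setup.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPModel, CLIPProcessor)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: generate a definite description from the dialogue context.
# In the paper this LM is fine-tuned for the task; here a generic
# pretrained checkpoint stands in.
lm_name = "gpt2"  # placeholder checkpoint
lm_tok = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name).to(device)

dialogue = (
    "A: I see a small dog next to a red couch.\n"
    "B: Is it the one with the blue collar?\n"
    "A: Yes, that one.\n"
    "Referent description:"
)
inputs = lm_tok(dialogue, return_tensors="pt").to(device)
out = lm.generate(**inputs, max_new_tokens=20, do_sample=False,
                  pad_token_id=lm_tok.eos_token_id)
description = lm_tok.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()

# Stage 2: zero-shot referent identification with a pretrained VLM (CLIP).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate images among which the referent is identified.
candidate_images = [Image.open(p) for p in ["img_0.jpg", "img_1.jpg", "img_2.jpg"]]
batch = proc(text=[description], images=candidate_images,
             return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    sims = clip(**batch).logits_per_text  # shape: (1, num_candidates)
referent_idx = sims.argmax(dim=-1).item()
print(f"Generated description: {description!r}")
print(f"Predicted referent: candidate image {referent_idx}")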
Place, publisher, year, edition, pages
Prague, Czechia: Association for Computational Linguistics (ACL), 2023, p. 457-469.
National Category
Natural Language Processing
Research subject
Computer Science; Human-computer Interaction
Identifiers
URN: urn:nbn:se:kth:diva-339204
ISI: 001274996900041
OAI: oai:DiVA.org:kth-339204
DiVA, id: diva2:1809590
Conference
The 24th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2023), Prague, Czechia, 11-15 September 2023
Projects
tmh_grounding
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20231106

Part of ISBN 979-8-89176-028-8

Available from: 2023-11-04. Created: 2023-11-04. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Full text; Proceedings

Authority records

Willemsen, Bram; Qian, Livia; Skantze, Gabriel
