Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
Willemsen, Bram (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH). ORCID iD: 0000-0003-2140-0612
Skantze, Gabriel (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH). ORCID iD: 0000-0002-8579-1790
2025 (English). Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.
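The abstract frames mention detection as marking span boundaries in running text via next-token prediction. The following is a minimal sketch of how such demarcated output could be parsed back into mention spans; the bracket markers and the `extract_mentions` helper are illustrative assumptions, not the actual annotation scheme or code from the paper:

```python
import re

def extract_mentions(tagged: str) -> list[str]:
    """Recover referring-expression spans from text in which a model
    has inserted hypothetical [ and ] boundary markers around mentions."""
    return re.findall(r"\[([^\[\]]+)\]", tagged)

# A model fine-tuned for this task would emit the dialogue turn with
# boundary markers inserted token by token, e.g.:
tagged_turn = "I see [a red cup] next to [the laptop]"
print(extract_mentions(tagged_turn))  # ['a red cup', 'the laptop']
```

In this framing, the detection task reduces to ordinary autoregressive generation: the model reproduces the input turn while deciding, at each step, whether to emit a boundary marker before the next token.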

Place, publisher, year, edition, pages
2025.
Keywords [en]
Computer Science - Computation and Language, Computer Science - Artificial Intelligence
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-374883
OAI: oai:DiVA.org:kth-374883
DiVA, id: diva2:2025322
Conference
XLLM @ ACL 2025, The 1st Joint Workshop on Large Language Models and Structure Modeling, Vienna, Austria, Aug 1st, 2025
Note

QC 20260107

Available from: 2026-01-06. Created: 2026-01-06. Last updated: 2026-01-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Accepted manuscript

Authority records

Willemsen, Bram; Skantze, Gabriel
