GenCeption: Evaluate vision LLMs with unlabeled unimodal data
Microsoft Gaming (ABK), Stockholm, Sweden; EQT Group (Motherbrain), Stockholm, Sweden.
EQT Group (Motherbrain), Stockholm, Sweden; Chapter Two, Stockholm, Sweden.
KTH. EQT Group (Motherbrain), Stockholm, Sweden; Télécom Paris, Palaiseau, France; Fever Energy, Stockholm, Sweden. ORCID iD: 0009-0001-6451-0136
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0002-3089-0345
2025 (English). In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 93, article id 101785. Article in journal (Refereed). Published.
Abstract [en]

Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks, which often lag behind the rapidly evolving demands of MLLM evaluation. This paper outlines and validates GenCeption, a novel, annotation-free evaluation method that requires only unimodal data to measure inter-modality semantic coherence and inversely assesses MLLMs’ tendency to hallucinate. This approach eliminates the need for costly data annotation, minimizes the risk of training data contamination, is expected to result in slower benchmark saturation, and avoids the illusion of emerging abilities. Inspired by the DrawCeption game, GenCeption begins with a non-textual sample and proceeds through iterative description and generation steps. The semantic drift across iterations is quantified using the GC@T metric. While GenCeption is principally applicable to MLLMs across various modalities, this paper focuses on its implementation and validation for Vision LLMs (VLLMs). Based on the GenCeption method, we establish the MMECeption benchmark for evaluating VLLMs, and compare the performance of several popular VLLMs and human annotators. Our empirical results validate GenCeption's effectiveness, demonstrating strong correlations with established VLLM benchmarks. VLLMs still significantly lag behind human performance and struggle especially with text-intensive tasks.
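Read procedurally, the abstract describes GenCeption for a VLLM as an iterative describe-then-generate loop whose semantic drift from the seed image is summarized by GC@T. The sketch below is only an illustration of that loop: describe_image, generate_image, and embed_image are hypothetical placeholders for the VLLM captioning call, a text-to-image generator, and an image encoder, and the simple mean over per-iteration similarities is an assumed stand-in for the paper's exact GC@T definition, which the abstract does not spell out.

```python
# Hypothetical sketch of the GenCeption loop for Vision LLMs, assuming
# three pluggable components (not the paper's actual implementations):
#   describe_image(img)  -> str    VLLM caption of the current image
#   generate_image(txt)  -> img    text-to-image model regenerating an image
#   embed_image(img)     -> vector image encoder used to measure drift
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def genception_score(seed_image, describe_image, generate_image, embed_image, T: int = 5) -> float:
    """Run T describe-then-generate iterations from one unlabeled image and
    return a GC@T-style score (here: mean similarity of each regenerated
    image to the seed; the paper's exact aggregation may differ)."""
    seed_emb = embed_image(seed_image)
    current = seed_image
    sims = []
    for _ in range(T):
        description = describe_image(current)   # VLLM describes the current image
        current = generate_image(description)   # regenerate an image from that description
        sims.append(cosine(seed_emb, embed_image(current)))  # drift relative to the seed
    return float(np.mean(sims))
```

Because the loop starts from a single unlabeled image and never consults a ground-truth annotation, the evaluation stays annotation-free, which is the core claim of the method.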

Place, publisher, year, edition, pages
Elsevier BV, 2025. Vol. 93, article id 101785
Keywords [en]
Benchmark, Evaluation, Multimodal large language model
National Category
Computer graphics and computer vision; Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-361201
DOI: 10.1016/j.csl.2025.101785
Scopus ID: 2-s2.0-85219499531
OAI: oai:DiVA.org:kth-361201
DiVA, id: diva2:1944156
Note

QC 20250313

Available from: 2025-03-12. Created: 2025-03-12. Last updated: 2025-03-13. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Senane, Zineb; Yang, Fangkai
