Using LLMs to Grade Clinical Reasoning for Medical Students in Virtual Patient Dialogues
KTH, School of Electrical Engineering and Computer Science (EECS); Karolinska Institute, Sweden. ORCID iD: 0009-0001-0445-630X
KTH, School of Electrical Engineering and Computer Science (EECS). ORCID iD: 0009-0004-8417-6106
Karolinska Institute, Sweden.
Karolinska Institute, Sweden. ORCID iD: 0000-0002-4875-5395
2025 (English). In: Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue / [ed] Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin, SIGDIAL, 2025, p. 750-763. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents an evaluation of large language models (LLMs) for grading clinical reasoning in rheumatology medical-history virtual patient (VP) simulations. The study explores the feasibility of using state-of-the-art LLMs, including general-purpose models with various prompting strategies (zero-shot, analysis-first, and chain-of-thought prompting), as well as reasoning models. The models' performance in grading transcribed dialogues from VP simulations conducted on a Furhat robot was evaluated against human expert annotations. Human experts initially achieved 65% inter-rater agreement, which corresponded to a pooled Cohen's Kappa of 0.71 and 82.3% correctness. The best LLM, o3-mini, achieved a pooled Kappa of 0.68 and 81.5% correctness, with response times under 30 seconds, compared to approximately 6 minutes for human grading. These results indicate that automatic assessment can approach human reliability under controlled simulation conditions while delivering time and cost efficiencies.
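For context on the agreement figures above: Cohen's Kappa corrects raw rater agreement for the agreement expected by chance, via κ = (p_o − p_e) / (1 − p_e). A minimal sketch of the computation (the function and toy grades below are illustrative only, not the paper's data or code):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement p_o: fraction of items the raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement p_e under independence, from marginal frequencies.
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Toy binary grades (hypothetical example, not from the study).
human = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
model = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
print(round(cohens_kappa(human, model), 3))  # → 0.667
```

A kappa near 0.7, as reported for both the human experts and o3-mini, is conventionally read as substantial agreement beyond chance.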

Place, publisher, year, edition, pages
SIGDIAL, 2025, p. 750-763
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-374882
OAI: oai:DiVA.org:kth-374882
DiVA, id: diva2:2025321
Conference
The 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Avignon, France, Aug 25-27, 2025
Note

Part of ISBN 979-8-89176-329-6

QC 20260107

Available from: 2026-01-06. Created: 2026-01-06. Last updated: 2026-01-07. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Paper

Authority records

Schiött, Jonathan; Ivegren, William; Skantze, Gabriel

Search in DiVA

By author/editor
Schiött, Jonathan; Ivegren, William; Parodis, Ioannis; Skantze, Gabriel
By organisation
School of Electrical Engineering and Computer Science (EECS) / Speech, Music and Hearing, TMH
Computer and Information Sciences
