Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis
Stockholm Environm Inst, S-11523 Stockholm, Sweden; Lund Univ, Environm & Energy Syst Studies, S-22100 Lund, Sweden.
Stockholm Environm Inst, S-11523 Stockholm, Sweden.
KTH, School of Industrial Engineering and Management (ITM), Energy Technology, Energy Systems. ORCID iD: 0000-0003-2896-8841
Stockholm Environm Inst, S-11523 Stockholm, Sweden.
2025 (English)
In: Environmental Evidence, E-ISSN 2047-2382, Vol. 14, no 1, article id 7
Article in journal (Refereed), Published
Abstract [en]

In this paper we show that OpenAI's Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review on electric vehicle charging infrastructure demand with almost 12,000 records, using the same eligibility criteria as the human screeners. We tested three versions of the model, each tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cutoff threshold can save more time. With the threshold chosen so that recall is still above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available rather quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.
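
The trade-off between recall and saved screening effort described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: a record is passed on to human screeners when its relevance probability is at or above the cutoff, and "time saved" is approximated as the fraction of records that fall below the cutoff. The function name `screening_metrics` and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def screening_metrics(
    probs: Sequence[float], relevant: Sequence[bool], cutoff: float
) -> tuple[float, float]:
    """Return (recall, fraction_excluded) for one probability cutoff.

    Assumption: records with probability >= cutoff are kept for manual
    screening; all others are excluded automatically.
    """
    if len(probs) != len(relevant):
        raise ValueError("probs and relevant must have the same length")
    if not probs:
        raise ValueError("at least one record is required")

    kept = [p >= cutoff for p in probs]
    n_relevant = sum(relevant)
    # Relevant records that survive the cutoff.
    true_positives = sum(1 for k, r in zip(kept, relevant) if k and r)
    # With no relevant records, nothing can be missed, so recall is 1.0.
    recall = true_positives / n_relevant if n_relevant else 1.0
    # Share of records a human no longer has to screen (proxy for time saved).
    fraction_excluded = 1 - sum(kept) / len(probs)
    return recall, fraction_excluded


# Toy data for illustration only; these are not results from the paper.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10, 0.05, 0.02]
relevant = [True, True, False, False, False, False, False, False]

for cutoff in (0.5, 0.9):
    recall, saved = screening_metrics(probs, relevant, cutoff)
    print(f"cutoff={cutoff}: recall={recall:.3f}, excluded={saved:.3f}")
```

On this toy data, raising the cutoff from 0.5 to 0.9 excludes more records but drops one relevant record, which is the same tension the abstract describes between a 100% recall setting and a more aggressive threshold.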

Place, publisher, year, edition, pages
Springer Nature, 2025. Vol. 14, no 1, article id 7
Keywords [en]
Artificial Intelligence, Large Language Model, Study selection, Systematic maps, Systematic reviews
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-364246
DOI: 10.1186/s13750-025-00360-x
ISI: 001473012600001
PubMedID: 40270055
Scopus ID: 2-s2.0-105003803065
OAI: oai:DiVA.org:kth-364246
DiVA, id: diva2:1967075
Note

QC 20250611

Available from: 2025-06-11
Created: 2025-06-11
Last updated: 2025-10-10
Bibliographically approved

Open Access in DiVA

No full text in DiVA


Authority records

Xylia, Maria