A benchmark of expert-level academic questions to assess AI capabilities
Ctr AI Safety, San Francisco, CA 94111 USA.
Ctr AI Safety, San Francisco, CA 94111 USA.
KTH, School of Engineering Sciences (SCI), Mathematics (Dept.), Mathematics (Div.).
KTH, School of Engineering Sciences (SCI), Mathematics (Dept.), Mathematics (Div.).
2026 (English). In: Nature, ISSN 0028-0836, E-ISSN 1476-4687, Vol. 649, no 8099, p. 1139-+. Article in journal (Refereed), Published
Abstract [en]

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding (1), limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

Place, publisher, year, edition, pages
Springer Nature, 2026. Vol. 649, no 8099, p. 1139-+
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-379198
DOI: 10.1038/s41586-025-09962-4
ISI: 001679529600001
PubMedID: 41606155
Scopus ID: 2-s2.0-105028928953
OAI: oai:DiVA.org:kth-379198
DiVA, id: diva2:2056543
Note

QC 20260429

Available from: 2026-04-29 Created: 2026-04-29 Last updated: 2026-04-29 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text, PubMed, Scopus

Authority records

Ängquist, Ivar; Gustafsson, Nils; Verkama, Emil
