kth.se Publications
The Nordic Pile: A 1.2 TB Nordic Dataset for Language Modeling
AI Sweden. ORCID iD: 0000-0001-6342-268X
AI Sweden.
AI Sweden.
AI Sweden. ORCID iD: 0000-0002-2236-4978
2023 (English). Manuscript (preprint) (Other academic)
Abstract [en]

Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as the Nordic ones, where the availability of text corpora is limited. To facilitate the development of LLMs in the Nordic languages, we curate a high-quality dataset consisting of 1.2 TB of text in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and Swedish), as well as some high-quality English data. This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.

Place, publisher, year, edition, pages
2023.
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-354951
DOI: 10.48550/arXiv.2303.17183
OAI: oai:DiVA.org:kth-354951
DiVA, id: diva2:1906548
Note

QC 20241023

Available from: 2024-10-17. Created: 2024-10-17. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Authority records

Öhman, Joey; Cuba Gyllensten, Amaru; Gogoulou, Evangelia

Search in DiVA

By author/editor
Öhman, Joey; Ekgren, Ariel; Cuba Gyllensten, Amaru; Isbister, Tim; Gogoulou, Evangelia; Sahlgren, Magnus