Bridging the Data Gap: Using LLMs to Augment Datasets for Text Classification
EPFL, Lausanne, Switzerland.
KTH, School of Industrial Engineering and Management (ITM), Learning, Digital Learning. ORCID iD: 0000-0002-6175-9200
EPFL, Lausanne, Switzerland.
EPFL, Lausanne, Switzerland.
2025 (English). In: Proceedings of the 18th International Conference on Educational Data Mining, 2025, p. 119-132. Conference paper, Published paper (Refereed)
Abstract [en]

Deep learning models for text classification have been increasingly used in intelligent tutoring systems and educational writing assistants. However, the scarcity of data in many educational settings, as well as certain imbalances in counts among the annotated labels of educational datasets, limits the generalizability and expressiveness of classification models. Recent research positions LLMs as promising solutions to mitigate the data scarcity issues in education. In this paper, we provide a systematic literature review of recent approaches based on LLMs for generating textual data and augmenting training datasets in the broad areas of natural language processing and educational technology research. We analyze how prior works have approached data augmentation and generation across multiple steps of the model training process, and present a taxonomy consisting of a five-stage pipeline. Each stage covers a set of possible options representing decisions in the data augmentation process. We then apply a subset of the identified methods to three educational datasets across different domains and source languages to measure the effectiveness of the suggested augmentation approaches in educational contexts, finding improvements in overall balanced accuracy across all three datasets. Based on our findings, we propose our pipeline as a conceptual framework for future researchers aiming to augment educational datasets for improving classification accuracy.¹

¹ The open-source code of our experiments, as well as the prompts used for the LLM and the detailed results, can be found at: https://github.com/epfl-ml4ed/data-aug-education
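To make the augmentation idea in the abstract concrete, the sketch below generates LLM paraphrases for under-represented labels and then measures balanced accuracy, the metric the abstract reports improvements on. It is a minimal, hypothetical example, not the authors' pipeline: the `call_llm` stub, the paraphrase prompt, and the TF-IDF plus logistic-regression classifier are assumptions made here for brevity; the paper fine-tunes deep classifiers and explores many more options across its five-stage pipeline, and the actual code and prompts are in the repository linked above.

```python
# A minimal, hypothetical sketch of LLM-based augmentation for an imbalanced
# text-classification dataset. The `call_llm` stub, the paraphrase prompt, and
# the TF-IDF + logistic-regression classifier are illustrative assumptions,
# not the authors' pipeline.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.pipeline import make_pipeline


def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completions client (OpenAI, a local model, etc.)."""
    raise NotImplementedError("plug in your LLM client here")


def augment(texts, labels, target_per_label):
    """Generate paraphrases of under-represented labels until every label
    has roughly `target_per_label` training examples."""
    counts = Counter(labels)
    aug_texts, aug_labels = list(texts), list(labels)
    for label, count in counts.items():
        seeds = [t for t, y in zip(texts, labels) if y == label]
        i = 0
        while count < target_per_label:
            seed = seeds[i % len(seeds)]
            prompt = (
                "Paraphrase the following student response so that it keeps "
                f"the same meaning and the label '{label}':\n\n{seed}"
            )
            aug_texts.append(call_llm(prompt))
            aug_labels.append(label)
            count += 1
            i += 1
    return aug_texts, aug_labels


def train_and_evaluate(train_texts, train_labels, test_texts, test_labels):
    """Train a simple stand-in classifier and report balanced accuracy."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    return balanced_accuracy_score(test_labels, clf.predict(test_texts))
```

The split between `augment` and `train_and_evaluate` mirrors the intent of the paper's staged view of augmentation: data generation decisions are made before, and independently of, classifier training, so different generation strategies can be compared on the same held-out test set.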

Place, publisher, year, edition, pages
2025. p. 119-132
Keywords [en]
Data Augmentation, Large Language Models, Fine-tuning, Natural Language Processing, Text Classification
National Category
Artificial Intelligence
Identifiers
URN: urn:nbn:se:kth:diva-367955
DOI: 10.5281/zenodo.15870195
Scopus ID: 2-s2.0-105023299158
OAI: oai:DiVA.org:kth-367955
DiVA, id: diva2:1986514
Conference
18th International Conference on Educational Data Mining (EDM 2025), July 20-23, 2025, Palermo, Italy
Note

QC 20251210

Available from: 2025-07-31. Created: 2025-07-31. Last updated: 2025-12-10. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus
Conference

Authority records

Davis, Richard Lee
