KTH Publications (kth.se / DiVA)
Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
MBZUAI, United Arab Emirates.
MBZUAI, United Arab Emirates.
EMINES-UM6P, Morocco; LINAGORA, France.
MBZUAI, United Arab Emirates.
2025 (English). In: LoResLM 2025 - 1st Workshop on Language Models for Low-Resource Languages, Proceedings of the Workshop, Association for Computational Linguistics (ACL), 2025, p. 9-30. Conference paper, Published paper (Refereed).
Abstract [en]

We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT; e.g., our 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU in our newly introduced evaluation suite for Darija, covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2025, p. 9-30
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-361971
Scopus ID: 2-s2.0-105000170936
OAI: oai:DiVA.org:kth-361971
DiVA id: diva2:1949644
Conference
1st Workshop on Language Models for Low-Resource Languages, LoResLM 2025, co-located with the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, United Arab Emirates, January 20, 2025
Note

Part of ISBN 9798891762152

QC 20250403

Available from: 2025-04-03. Created: 2025-04-03. Last updated: 2025-04-03. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Scopus

Authority records

Ennadir, Sofiane

Organisation

Software and Computer systems, SCS