COCLUBERT: Clustering Machine Learning Source Code
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer Systems (SCS).
RISE Research Institutes of Sweden, Stockholm, Sweden.
2021 (English). In: 20th IEEE International Conference on Machine Learning and Applications (ICMLA 2021) / [ed] Wani, M. A.; Sethi, I.; Shi, W.; Qu, G.; Raicu, D. S.; Jin, R. Institute of Electrical and Electronics Engineers (IEEE), 2021, pp. 151-158. Conference paper, published paper (refereed).
Abstract [en]

Nowadays, we can find machine learning (ML) applications in nearly every aspect of modern life, and more developers are engaged in the field than ever. To facilitate the development of new ML applications, it would be beneficial to provide services that enable developers to share, access, and search for source code easily. A step towards such a service is to cluster source code by functionality. In this work, we present COCLUBERT, a BERT-based model for embedding source code based on its functionality and clustering it accordingly. We build COCLUBERT using CuBERT, a variant of BERT pre-trained on source code, and present three ways to fine-tune it for the clustering task. In the experiments, we compare COCLUBERT with a baseline model in which we cluster source code using CuBERT embeddings without fine-tuning. We show that COCLUBERT significantly outperforms the baseline model, increasing the Dunn Index metric by a factor of 141, the Silhouette Score metric by a factor of two, and the Adjusted Rand Index metric by a factor of 11.
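The abstract evaluates clustering quality with three standard metrics: the Dunn Index, the Silhouette Score, and the Adjusted Rand Index. The sketch below (not the authors' code) illustrates how these metrics behave, computed on toy two-dimensional "embeddings"; KMeans and the synthetic data are assumptions standing in for CuBERT embeddings and the paper's clustering step. scikit-learn provides the Silhouette Score and Adjusted Rand Index, while the Dunn Index is implemented minimally by hand.

```python
# Hedged sketch of the three clustering metrics named in the abstract.
# The data and clustering algorithm here are illustrative assumptions,
# not the paper's setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

def dunn_index(X, labels):
    """Minimal Dunn Index: smallest inter-cluster distance divided by
    the largest intra-cluster diameter (higher is better)."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest distance between two points in the same cluster.
    diameter = max(np.linalg.norm(a - b)
                   for c in clusters for a in c for b in c)
    # Smallest distance between points in different clusters.
    separation = min(np.linalg.norm(a - b)
                     for i, ci in enumerate(clusters)
                     for cj in clusters[i + 1:]
                     for a in ci for b in cj)
    return separation / diameter

# Toy "embeddings": two well-separated blobs with known labels.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
true_labels = np.array([0, 0, 0, 1, 1, 1])

# Cluster the embeddings; KMeans stands in for the paper's clustering step.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Dunn Index:", round(dunn_index(X, pred), 3))
print("Silhouette:", round(silhouette_score(X, pred), 3))
print("ARI:       ", round(adjusted_rand_score(true_labels, pred), 3))
```

On these well-separated blobs all three metrics score near their optima; the paper's reported gains (141x Dunn, 2x Silhouette, 11x ARI over the non-fine-tuned baseline) are improvements in exactly these quantities on real source-code embeddings.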

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021, pp. 151-158.
Keywords [en]
Source Code Clustering, NLP, BERT, CuBERT
National Category
Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-312970
DOI: 10.1109/ICMLA52953.2021.00031
ISI: 000779208200023
Scopus ID: 2-s2.0-85125848071
OAI: oai:DiVA.org:kth-312970
DiVA id: diva2:1661802
Conference
20th IEEE International Conference on Machine Learning and Applications (ICMLA), December 13-16, 2021, held online.
Note

QC 20220530

Part of proceedings ISBN 978-1-6654-4337-1

Available from: 2022-05-30. Created: 2022-05-30. Last updated: 2022-06-25. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Hägglund, Marcus; Pena, Francisco J.; Payberah, Amir H.
