Source Code Representations of Deep Learning for Program Repair
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Theoretical Computer Science, TCS. ORCID iD: 0000-0002-6673-6438
2023 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Källkodsrepresentationer för djupinlärning av Programreparation (Swedish)
Abstract [en]

Deep learning, leveraging artificial neural networks, has demonstrated significant capabilities in understanding intricate patterns within data. In recent years, its prowess has been extended to the vast domain of source code, where it aids in diverse software engineering tasks such as program repair, code summarization, and vulnerability detection. However, using deep learning for analyzing source code poses unique challenges. This thesis primarily focuses on the challenges of representing source code to deep learning models for the purpose of automated program repair, a task that aims to automatically fix program bugs.

Source code, inherently different from natural languages, is both large in size and unique in vocabulary due to freely named identifiers, thus presenting the out-of-vocabulary challenge. Furthermore, its inherent precision requires exact representation; even a minor error can cause complete system failures. These characteristics underscore the importance of designing appropriate input and output representations for deep learning models, ensuring that they can efficiently and accurately process code for the purposes of program repair. The core contributions of this thesis address these challenges.

First, we propose a compact input representation that encapsulates the essential context for bug fixing. It retains the information needed to understand the bug while removing surrounding context that would only add noise to the model.
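
To make the idea concrete, here is a minimal sketch, assuming a hypothetical helper and window size; it is illustrative only, not the thesis's actual implementation:

    # Keep the buggy line plus a small window of surrounding lines,
    # dropping the rest of the file.
    def compact_context(source_lines, bug_line_index, window=2):
        start = max(0, bug_line_index - window)
        end = min(len(source_lines), bug_line_index + window + 1)
        context = source_lines[start:end]
        # Mark the buggy line so the model knows where to focus.
        context[bug_line_index - start] = "<BUG> " + context[bug_line_index - start]
        return "\n".join(context)

    buggy_file = [
        "int add(int a, int b) {",
        "    return a - b;  // bug: should be a + b",
        "}",
    ]
    print(compact_context(buggy_file, bug_line_index=1, window=1))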

Second, we tackle the out-of-vocabulary problem by harnessing techniques from natural language processing, capitalizing on existing code elements for bug fixes, and drawing parallels to the redundancy assumption in traditional program repair approaches.
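
One such technique is subword tokenization, which splits freely named identifiers into pieces drawn from a fixed vocabulary so that unseen names no longer fall outside it. A self-contained sketch, with a made-up vocabulary and a greedy splitting rule that stand in for real subword methods such as BPE:

    # Greedy longest-match subword split; unknown spans fall back to
    # single characters, so no identifier is ever out of vocabulary.
    VOCAB = {"get", "set", "User", "Name", "Id"}

    def subword_tokenize(identifier, vocab):
        tokens, i = [], 0
        while i < len(identifier):
            for j in range(len(identifier), i, -1):
                if identifier[i:j] in vocab:
                    tokens.append(identifier[i:j])
                    i = j
                    break
            else:  # no vocabulary entry matched: emit one character
                tokens.append(identifier[i])
                i += 1
        return tokens

    print(subword_tokenize("getUserName", VOCAB))  # ['get', 'User', 'Name']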

Third, to address the precision of source code, we integrate bug information into the input representation and pivot the model's output from complete code generation to concise edit instructions, offering a more focused and accurate approach.
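
As an illustration of the output side, the model can emit a small edit instruction rather than regenerating the whole function; the instruction format below is hypothetical, not the thesis's exact scheme:

    # A replace-style edit is far shorter than the full fixed function,
    # leaving the model less room to corrupt unrelated code.
    edit = {"op": "replace", "line": 2, "with": "    return a + b;"}

    def apply_edit(lines, edit):
        # Apply a single replace edit (1-indexed line numbers).
        patched = list(lines)
        if edit["op"] == "replace":
            patched[edit["line"] - 1] = edit["with"]
        return patched

    function = ["int add(int a, int b) {", "    return a - b;", "}"]
    print("\n".join(apply_edit(function, edit)))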

Last, we show that by unifying the source code representation across multiple code-related tasks, we facilitate transfer and multi-task learning. Both learning strategies can help in mitigating issues faced when training on limited datasets.
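
A sketch of what such unification buys: once every task shares one text-to-text format, a single sequence model can train on all of them together. The task prefixes and examples below are hypothetical:

    # Every task becomes the same mapping, token sequence in and token
    # sequence out, so one encoder-decoder can learn all of them.
    examples = [
        ("repair: return a - b;", "return a + b;"),
        ("summarize: int add(int a, int b) { return a + b; }", "adds two integers"),
        ("detect: strcpy(dst, src);", "vulnerable"),
    ]
    for source, target in examples:
        print(f"{source!r} -> {target!r}")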

Place, publisher, year, edition, pages
Sweden: KTH Royal Institute of Technology, 2023, p. xi, 117
Series
TRITA-EECS-AVL ; 2023:83
Keywords [en]
Code Representation, Deep Learning, Program Repair
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-339763
ISBN: 978-91-8040-764-9 (print)
OAI: oai:DiVA.org:kth-339763
DiVA, id: diva2:1812919
Public defence
2023-12-11, F3, Lindstedtsvägen 26, Stockholm, 09:00 (English)
Funder
Swedish Foundation for Strategic Research, Trustfull
Note

QC 20231117

Available from: 2023-11-17 Created: 2023-11-17 Last updated: 2023-11-21 Bibliographically approved
List of papers
1. SEQUENCER: Sequence-to-Sequence Learning for End-to-End Program Repair
2019 (English) In: IEEE Transactions on Software Engineering, ISSN 0098-5589, E-ISSN 1939-3520, Vol. 47, no. 9. Article in journal (Refereed) Published
Abstract [en]

This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a technique, called SEQUENCER, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 samples, carefully curated from commits to open-source repositories. We evaluate SEQUENCER on 4,711 independent real bug fixes, as well as on the Defects4J benchmark used in program repair research. SEQUENCER is able to perfectly predict the fixed line for 950/4,711 testing samples, and finds correct patches for 14 bugs in the Defects4J benchmark. SEQUENCER captures a wide range of repair operators without any domain-specific top-down design.
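
For intuition, copy mechanisms are commonly formulated as a pointer-generator mixture; the formula below is the standard form of that idea, not necessarily SEQUENCER's exact equations:

    P(w) = p_{\mathrm{gen}} \, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i : x_i = w} a_i

Here a_i is the attention weight on input token x_i and p_gen is a learned gate; the copy term lets the model reuse out-of-vocabulary identifiers that already appear in the buggy context.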

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
machine learning, program repair, Defects, Learning systems, Open source software, Open systems, Data driven, Domain specific, Open source repositories, Repair operator, Sequence learning, Source codes, Testing samples, Top down design, Repair
National Category
Software Engineering
Identifiers
urn:nbn:se:kth:diva-263253 (URN)
10.1109/TSE.2019.2940179 (DOI)
000696667700011 ()
2-s2.0-85072171801 (Scopus ID)
Note

QC 20211011

Available from: 2019-11-06 Created: 2019-11-06 Last updated: 2023-11-17 Bibliographically approved
2. PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair
2021 (English) In: Advances in Neural Information Processing Systems, Neural Information Processing Systems Foundation, 2021, Vol. 28, p. 23089-23101. Conference paper, Published paper (Refereed)
Abstract [en]

Machine learning for understanding and editing source code has recently attracted significant interest, with many developments in new models, new code representations, and new tasks. This proliferation can appear disparate and disconnected, making each approach seemingly unique and incompatible, thus obscuring the core machine learning challenges and contributions. In this work, we demonstrate that the landscape can be significantly simplified by taking a general approach of mapping a graph to a sequence of tokens and pointers. Our main result is to show that 16 recently published tasks of different shapes can be cast in this form, based on which a single model architecture achieves near or above state-of-the-art results on nearly all tasks, outperforming custom models like code2seq and alternative generic models like Transformers. This unification further enables multitask learning and a series of cross-cutting experiments about the importance of different modeling choices for code understanding and repair tasks. The full framework, called PLUR, is easily extensible to more tasks, and will be open-sourced (https://github.com/google-research/plur).
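
A toy sketch of the general recipe, with invented node labels: graph nodes are linearized into a token sequence, and edges (as well as outputs that refer to nodes) become pointers into that sequence:

    # Toy program graph: labeled nodes plus edges between them.
    nodes = ["FunctionDef", "Name:add", "Return", "BinOp:+"]
    edges = [(0, 1), (0, 2), (2, 3)]

    # Linearize the nodes; each edge becomes a pair of positions
    # (pointers) into the token sequence.
    tokens = list(nodes)
    pointers = [(src, dst) for src, dst in edges]

    print("tokens  :", tokens)
    print("pointers:", pointers)  # model outputs can point back at inputs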

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2021
Series
Advances in Neural Information Processing Systems, ISSN 1049-5258
Keywords
Codes (symbols), Machine learning, Repair, Code representation, Custom models, Different shapes, Generic modeling, Graph-based, Modeling architecture, Program learning, Single models, Source codes, State of the art, Graphic methods
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-316197 (URN)
000922928403005 ()
2-s2.0-85129742594 (Scopus ID)
Conference
35th Conference on Neural Information Processing Systems, NeurIPS 2021, Virtual/Online, 6-14 December 2021
Note

Part of proceedings: ISBN 978-1-7138-4539-3 

QC 20220907

Available from: 2022-09-07 Created: 2022-09-07 Last updated: 2023-11-17 Bibliographically approved
3. Neural Transfer Learning for Repairing Security Vulnerabilities in C Code
2023 (English) In: IEEE Transactions on Software Engineering, ISSN 0098-5589, E-ISSN 1939-3520, Vol. 49, no. 1, p. 147-165. Article in journal (Refereed) Published
Abstract [en]

In this paper, we address the problem of automatic repair of software vulnerabilities with deep learning. The major problem with data-driven vulnerability repair is that the few existing datasets of known confirmed vulnerabilities consist of only a few thousand examples. However, training a deep learning model often requires hundreds of thousands of examples. In this work, we leverage the intuition that the bug fixing task and the vulnerability fixing task are related, and that the knowledge learned from bug fixes can be transferred to fixing vulnerabilities. In the machine learning community, this technique is called transfer learning. In this paper, we propose an approach for repairing security vulnerabilities named VRepair, which is based on transfer learning. VRepair is first trained on a large bug fix corpus and is then fine-tuned on a vulnerability fix dataset, which is an order of magnitude smaller. In our experiments, we show that a model trained only on a bug fix corpus can already fix some vulnerabilities. Then, we demonstrate that transfer learning improves the ability to repair vulnerable C functions. We also show that the transfer learning model performs better than a model trained with a denoising task and fine-tuned on the vulnerability fixing task. To sum up, this paper shows that transfer learning works well for repairing security vulnerabilities in C, compared to training only on the small vulnerability dataset.
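
The recipe, in sketch form; the model below is a trivial stand-in rather than VRepair's actual architecture, and the datasets are dummies:

    # 1) pretrain on the large bug-fix corpus, 2) fine-tune on the much
    # smaller vulnerability-fix dataset, typically with a lower rate.
    class ToyModel:
        def __init__(self):
            self.steps = 0
        def step(self, source, target, lr):
            self.steps += 1  # a real model would update weights here

    def train(model, dataset, epochs, lr):
        for _ in range(epochs):
            for source, target in dataset:
                model.step(source, target, lr)
        return model

    bug_fix_corpus   = [("buggy code", "fixed code")] * 1000  # large, generic
    vuln_fix_dataset = [("vuln code", "patched code")] * 100  # small, specific

    model = train(ToyModel(), bug_fix_corpus, epochs=1, lr=1e-4)
    model = train(model, vuln_fix_dataset, epochs=1, lr=1e-5)
    print(model.steps)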

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
Codes, Computer bugs, seq2seq learning, Software, Task analysis, Training, transfer learning, Transformers, vulnerability fixing, C (programming language), Costs, Deep learning, Job analysis, Knowledge management, Personnel training, Program debugging, Code, Security vulnerabilities, Transformer, Repair
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-320561 (URN)
10.1109/TSE.2022.3147265 (DOI)
001020827200008 ()
2-s2.0-85124188450 (Scopus ID)
Funder
Swedish Foundation for Strategic Research, Trustfull
Note

QC 20231117

Available from: 2022-10-26 Created: 2022-10-26 Last updated: 2023-11-17 Bibliographically approved
4. Supersonic: Learning to Generate Source Code Optimizations in C/C++
2023 (English) Manuscript (preprint) (Other academic)
Abstract [en]

Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present SUPERSONIC, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, SUPERSONIC is trained on C/C++ program pairs (x_t, x_{t+1}), where x_{t+1} is an optimized version of x_t, and outputs a diff. SUPERSONIC's performance is benchmarked against OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that SUPERSONIC not only outperforms both models on the code optimization task but also minimizes the extent of the change, with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.
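
A sketch of how such a training pair and its diff target can be built with Python's standard difflib; the example programs are made up:

    import difflib

    x_t  = ["int sum = 0;", "for (int i = 0; i < n; i++) sum += a[i];"]
    x_t1 = ["int sum = std::accumulate(a, a + n, 0);"]

    # The model's target is the diff between the two versions, not the
    # full optimized program.
    target_diff = "\n".join(
        difflib.unified_diff(x_t, x_t1, fromfile="x_t", tofile="x_t+1", lineterm="")
    )
    print(target_diff)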

National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-339526 (URN)
Funder
Swedish Foundation for Strategic Research, Trustfull
Note

QC 20231120

Available from: 2023-11-13 Created: 2023-11-13 Last updated: 2023-12-01 Bibliographically approved

Open Access in DiVA

Zimin_Chen_kappa (2803 kB), 1237 downloads
File information
File name: FULLTEXT01.pdf
File size: 2803 kB
Checksum (SHA-512): 1995578fe83f0916436ee82367e081ea60d301a31d6e6634f9b28500025bba4126cbc4ff9f2fc63f78edaa2ba02c59cd08999d657439f9b2e619ee3abffe0f29
Type: fulltext
Mimetype: application/pdf

Authority records

Chen, Zimin
