Capabilities of Large Language Models in Generating Complex Code: Comparing Code Generation Capabilities of Two Large Language Models: A Software Architecture Perspective
KTH, School of Electrical Engineering and Computer Science (EECS).
2024 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Alternative title
Stora språkmodellers förmåga att generera komplex kod : Jämförelse av två stora språkmodellers förmåga att generera kod med avseende på mjukvaruarkitektur (Swedish)
Abstract [en]

In recent years, advances in the field of deep learning have led to the development of an advanced type of artificial neural network called the Large Language Model. These models have demonstrated promising capabilities in areas such as natural language processing, computer vision, and code generation. However, code generation evaluation of these models typically relies on simple benchmark datasets with standalone tasks that assess only the functional aspects of the code. Modern software development encompasses more than writing code that passes unit tests; it also involves considerations of software architecture. If industry is to adopt these models in its workflows, the generated code must not only be functional but also exhibit the desired software architecture traits.

This thesis presents a comparative evaluation of two prominent Large Language Models, GPT-3.5 Turbo and WizardCoder-15B, focusing on their efficacy in generating Python code that not only meets functional requirements but also aligns with fundamental software architecture principles such as testability, modularity, and simplicity. The evaluation framework leverages task descriptions and related functional tests gathered from the HumanEval and ClassEval datasets, supplemented by software architecture metrics derived from a literature review and workshops with domain experts. The findings reveal that while both Large Language Models struggle to generate functional programs for complex tasks, GPT-3.5 Turbo demonstrates superior performance in terms of functional correctness. However, given the limited number of tasks for which both models produce functional programs, drawing definitive conclusions about software architecture quality requires further data and analysis.
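The thesis's exact evaluation framework is not reproduced in this record, but the two-sided evaluation it describes — functional correctness via benchmark unit tests, plus an architecture metric such as the cyclomatic complexity named in the keywords — can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the candidate solution, its tests, and the complexity-counting rules are all assumptions for the example, and the complexity estimate is a rough AST-based count in the spirit of tools like radon rather than a definitive metric.

```python
import ast


def cyclomatic_complexity(source: str) -> int:
    """Rough cyclomatic-complexity estimate: 1 plus the number of
    branch points (if/for/while/except/with/assert/ternary/comprehension
    clauses, and the extra operands of boolean expressions) in the AST."""
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.With, ast.Assert, ast.IfExp, ast.comprehension)
    score = 1
    for node in ast.walk(tree):
        if isinstance(node, branch_nodes):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two decision points, not one
            score += len(node.values) - 1
    return score


def passes_tests(source: str, test_source: str) -> bool:
    """Execute a candidate solution and its unit tests in a shared
    namespace; any exception (including AssertionError) counts as a
    functional failure, mirroring HumanEval-style pass/fail scoring."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        exec(test_source, namespace)
        return True
    except Exception:
        return False


# Hypothetical model output for a HumanEval-style task (not taken
# from the thesis):
candidate = """
def has_close_elements(numbers, threshold):
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
"""
tests = """
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
"""

print(passes_tests(candidate, tests))    # functional correctness -> True
print(cyclomatic_complexity(candidate))  # architecture metric -> 5
```

Separating the two checks like this matters for the thesis's headline finding: a model can score well on `passes_tests` while producing code whose complexity or cohesion an architecture-aware evaluation would still penalize.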


Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2024, p. 36
Series
TRITA-EECS-EX ; 2024:497
Keywords [en]
Large Language Models, Code generation, Software architecture, Cohesion, Cyclomatic complexity, ClassEval, HumanEval
National Category
Computer Sciences; Computer Engineering; Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-352307
OAI: oai:DiVA.org:kth-352307
DiVA, id: diva2:1892787
External cooperation
SAAB
Educational program
Bachelor of Science in Engineering - Computer Engineering
Available from: 2024-09-27. Created: 2024-08-27. Last updated: 2024-09-27. Bibliographically approved.

Open Access in DiVA

fulltext (340 kB), 286 downloads
File information
File name: FULLTEXT01.pdf
File size: 340 kB
Checksum (SHA-512): 0aabf641d717829bffcaa5f423e6b55e3d8283b5a91176037bc156009f604c3083e4f8492abed97e7a192c39367a433dab2fc0802baa8f8a52104a991f0df4e3
Type: fulltext. Mimetype: application/pdf

By organisation
School of Electrical Engineering and Computer Science (EECS)
Computer Sciences; Computer Engineering; Computer and Information Sciences

Total: 286 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 522 hits