Parallelization of Dataset Transformation with Processing Order Constraints in Python
KTH, School of Computer Science and Communication (CSC).
2016 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Parallelisering av datamängdstransformation med ordningsbegränsningar i Python (Swedish)
Abstract [en]

Financial data is often represented as rows of values contained in a dataset. This data must be transformed into a common format before it can be compared and matched, which can take a long time for larger datasets. The main goal of this master’s thesis is to speed up these transformations through parallelization using Python multiprocessing. The datasets in question consist of rows representing trades and are transformed into a common format using rules known as filters. To devise a parallelization strategy, the filters were analyzed for ordering constraints, and the Python profiler cProfile was used to find bottlenecks and potential parallelization points. This analysis led to a task-based implementation in which the transformation is divided into an initial sequential pre-processing step, a parallel step where chunks of several trade rows are distributed among workers, and a sequential post-processing step.
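
The three-step structure described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`preprocess`, `make_chunks`, `transform_chunk`, `postprocess`, `transform_dataset`), the string-based rows, and the upper-casing "filter" are all hypothetical placeholders, and the sketch assumes that the filter applied in the parallel step depends only on the row it is given.

```python
from multiprocessing import Pool


def preprocess(rows):
    """Sequential step (hypothetical): strip whitespace and drop empty rows."""
    return [row.strip() for row in rows if row.strip()]


def make_chunks(rows, chunk_size):
    """Split the rows into consecutive chunks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def transform_chunk(chunk):
    """Parallel step: apply a placeholder row-local filter to every row."""
    return [row.upper() for row in chunk]


def postprocess(chunks):
    """Sequential step: flatten the chunks back into one list of rows."""
    return [row for chunk in chunks for row in chunk]


def transform_dataset(rows, workers=4, chunk_size=2):
    prepared = preprocess(rows)
    chunks = make_chunks(prepared, chunk_size)
    with Pool(processes=workers) as pool:
        # Pool.map returns results in the order of the input chunks,
        # so the original row order survives the parallel step.
        transformed = pool.map(transform_chunk, chunks)
    return postprocess(transformed)


if __name__ == "__main__":
    trades = [" buy,10 ", "sell,5", "", "buy,7", "sell,1 "]
    print(transform_dataset(trades, workers=2, chunk_size=2))
```

Sending whole chunks rather than single rows to the workers reduces the number of inter-process messages. A filter that needs to see several rows at once would, under this sketch, have to run in one of the sequential steps.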

The implementation was tested by transforming four datasets of differing sizes using up to 16 workers, and execution time and memory consumption were measured. The tiny, small, medium, and large datasets showed speedups of 0.5, 2.1, 3.8, and 4.81, respectively. The results also showed linearly increasing memory consumption for all datasets. The test transformations were also profiled to understand the parallel program’s behaviour for the different datasets. The experiments led to the conclusion that dataset size heavily influences the speedup, partly because the sequential parts become less significant for larger datasets. In addition, the large memory increase with a larger number of workers is noted as a major downside of multiprocessing when using caching mechanisms, as data is duplicated instead of shared.
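
One common way to reason about why a shrinking sequential share raises the achievable speedup is Amdahl's law. The sketch below is an illustration only and is not taken from the thesis: the function name `amdahl_speedup` and the serial fractions in the loop are hypothetical values, not measurements from the experiments.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    # Hypothetical serial fractions, chosen only to show the trend.
    for serial_fraction in (0.5, 0.2, 0.05):
        bound = amdahl_speedup(serial_fraction, workers=16)
        print(f"serial fraction {serial_fraction:.2f}: bound with 16 workers = {bound:.2f}")
```

The bound ignores process start-up and inter-process communication costs, so it cannot by itself account for a speedup below 1 such as the 0.5 reported for the tiny dataset.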

This thesis shows that it is possible to speed up the dataset transformations using chunks of rows as tasks, though the speedup is relatively low. 

Abstract [sv]

Finansiell data representeras ofta med rader av värden, samlade i en datamängd. Denna data måste transformeras till ett standardformat för att möjliggöra jämförelser och matchning. Detta kan ta lång tid för stora datamängder. Huvudmålet för detta examensarbete är att snabba upp dessa transformationer genom parallellisering med hjälp av Python-modulen multiprocessing. Datamängderna omvandlas med hjälp av regler, kallade filter. Dessa filter analyserades för att identifiera begränsningar på ordningen i vilken datamängden kan behandlas, och därigenom finna en parallelliseringsstrategi. Python-profileraren cProfile användes även för att hitta potentiella parallelliseringspunkter i koden. Denna analys resulterade i användandet av ett “task”-baserat tillvägagångssätt, där transformationen delades in i ett sekventiellt pre-processingsteg, ett parallellt steg där grupper av rader distribuerades ut bland arbetarprocesser, och ett sekventiellt post-processingsteg.

Implementationen testades genom transformation av fyra datamängder av olika storlekar, med upp till 16 arbetarprocesser. Resultaten för de fyra datamängderna var en speedup på 0.5, 2.1, 3.8 respektive 4.81. En linjär ökning i minnesanvändning uppvisades även. Experimenten resulterade i slutsatsen att datamängdens storlek var en betydande faktor i hur mycket speedup som uppvisades, delvis på grund av faktumet att de sekventiella delarna tar upp en mindre del av programmet. Den stora minnesåtgången noterades som en nackdel med att använda multiprocessing i kombination med cachning, på grund av duplicerad data.

Detta examensarbete visar att det är möjligt att snabba upp datamängdstransformation genom att använda radgrupper som tasks, även om en relativt låg speedup uppvisades.

Place, publisher, year, edition, pages
2016.
Keyword [en]
parallel computing, multicore, Python
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-189574
OAI: oai:DiVA.org:kth-189574
DiVA: diva2:947130
External cooperation
TriOptima
Educational program
Master of Science in Engineering - Computer Science and Technology
Available from: 2016-07-07 Created: 2016-07-07 Last updated: 2016-07-07
Bibliographically approved

Open Access in DiVA

fulltext (1576 kB), 77 downloads
File information
File name: FULLTEXT01.pdf
File size: 1576 kB
Checksum (SHA-512): 5d9c53f70d80618a6598d9de9db20840b9fb60d59642e74b0541f79507ab6a15292c09866f4ed77d002891f0ec47e323573287e64bbcee0b4d7edb4c95ae15c1
Type: fulltext
Mimetype: application/pdf

Total: 77 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 517 hits