Making Big Data Smaller: Reducing the storage requirements for big data with erasure coding for Hadoop
KTH, School of Information and Communication Technology (ICT).
2014 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]

The amount of data stored in modern data centres is growing rapidly. Large-scale distributed file systems, which maintain the massive data sets in data centres, are designed to work with commodity hardware. Because of the quality and quantity of the hardware components in such systems, failures are considered normal events and, as such, distributed file systems are designed to be highly fault-tolerant. A common approach to achieving fault tolerance is redundancy: storing three copies of a file across different storage nodes, which increases the storage requirements by a factor of three and further aggravates the storage problem.

A concrete implementation of such a file system is the Hadoop Distributed File System (HDFS). This thesis explores the use of RAID-like mechanisms to decrease the storage requirements for big data. We designed and implemented a prototype that extends HDFS with a simple but powerful erasure coding API. In contrast to existing approaches, we decided to locate the erasure-coding management logic in the HDFS NameNode, as this allows us to use internal HDFS APIs and state. Because of that, we can repair failures associated with erasure-coded files more quickly and at lower cost. We evaluate our prototype and show that using erasure coding instead of replication can greatly decrease the storage requirements of big data without sacrificing reliability and availability. Finally, we argue that our API can support a wide range of custom encoding strategies, and that adding the erasure coding logic to the NameNode can significantly improve the management of the encoded files.
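To make the trade-off described in the abstract concrete, the following is a minimal sketch of the simplest RAID-like scheme: a single XOR parity block computed over a stripe of data blocks. It is an illustration only and is not taken from the thesis, which does not state here which codes its prototype uses. The function names (`xor_parity`, `recover_block`, `storage_overhead`) and the sample blocks are hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving exactly the missing block.
    return xor_parity(list(surviving_blocks) + [parity])


def storage_overhead(data_blocks, redundant_blocks):
    """Raw storage consumed per unit of user data."""
    return (data_blocks + redundant_blocks) / data_blocks


# Hypothetical stripe of three data blocks (illustrative values only).
data = [b"big ", b"data", b"sets"]
parity = xor_parity(data)

# Simulate losing the second block and rebuilding it.
lost = 1
survivors = data[:lost] + data[lost + 1:]
restored = recover_block(survivors, parity)
assert restored == data[lost]

# Triple replication: 1 original plus 2 extra copies.
replication_cost = storage_overhead(1, 2)
# Single parity over a stripe of 3 data blocks.
parity_cost = storage_overhead(3, 1)
```

Under these assumptions, triple replication stores 3 units of raw data per unit of user data, while the parity stripe stores 4/3. The comparison is not like for like: one XOR parity block tolerates the loss of only one block per stripe, whereas three replicas tolerate two losses. Codes with more parity blocks, such as Reed-Solomon, close that gap at a somewhat higher overhead.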

Place, publisher, year, edition, pages
2014. 76 p.
Series
TRITA-ICT-EX, 2014:98
National Category
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-177201
OAI: oai:DiVA.org:kth-177201
DiVA: diva2:871985
Examiners
Available from: 2015-12-08 Created: 2015-11-17 Last updated: 2017-08-03 Bibliographically approved

Open Access in DiVA

fulltext (1099 kB), 6 downloads
File information
File name: FULLTEXT01.pdf
File size: 1099 kB
Checksum (SHA-512): 8b5118fa951afc6b5388c01ae272b93af3583a023a13663754937e42741d8ff594aaf8bf37b98774a879d374676a6af45120352ce6c84c1f17b8bca2639a2fa8
Type: fulltext
Mimetype: application/pdf
