A Global Ecosystem for Datasets on Hadoop
KTH, School of Information and Communication Technology (ICT).
2016 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
Abstract [en]

The immense growth of the web has led to the age of Big Data. Companies like Google, Yahoo and Facebook generate massive amounts of data every day. For this data to yield value, it must be stored and processed effectively. Hadoop, a Big Data framework, can store and process Big Data in a scalable and performant fashion. Both Yahoo and Facebook, two major IT companies, deploy Hadoop as their solution to the Big Data problem. Many application areas for Big Data would benefit from the ability to share datasets across cluster boundaries. However, Hadoop does not support searching for datasets, either within a single Hadoop cluster or across many Hadoop clusters. Similarly, there is only limited support for copying datasets between Hadoop clusters (using DistCp). This project presents a solution to this weakness using the Hadoop distribution Hops and its frontend, Hopsworks. Clusters advertise their peer-to-peer and search endpoints to a central server called Hops-Site. The advertised endpoints build a global Hadoop ecosystem and give clusters the ability to participate in public search or peer-to-peer sharing of datasets. Hopsworks users are given the choice to write data into Kafka as it is being downloaded. This opens up new possibilities for data scientists, who can interactively analyse remote datasets without having to download everything in advance. Because data is written into Kafka while it is being downloaded, it can be consumed by systems such as Spark Streaming or Flink.
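
The flow described in the abstract (clusters advertising endpoints to a central server, a search that spans clusters, and data handed onward while it is still downloading) can be illustrated with a minimal in-memory sketch. It is not the Hops-Site or Hopsworks implementation: the names `ClusterEndpoints`, `Registry`, `global_search` and `stream_download` are hypothetical, network calls are replaced by a dictionary lookup, and the `publish` callback only stands in for the role Kafka plays in the thesis.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClusterEndpoints:
    """Endpoints one cluster advertises (field names are hypothetical)."""
    cluster_id: str
    search_endpoint: str
    p2p_endpoint: str


@dataclass
class Registry:
    """In-memory stand-in for a central registry such as Hops-Site."""
    clusters: dict = field(default_factory=dict)

    def advertise(self, endpoints: ClusterEndpoints) -> None:
        # Re-advertising replaces the previous entry for that cluster.
        self.clusters[endpoints.cluster_id] = endpoints

    def search_endpoints(self) -> list:
        return [c.search_endpoint for c in self.clusters.values()]


def global_search(registry, query, local_indexes):
    """Fan a query out to every advertised search endpoint.

    local_indexes maps a search endpoint to a list of dataset names and
    stands in for a real network call to that endpoint.
    """
    hits = []
    for endpoint in registry.search_endpoints():
        for name in local_indexes.get(endpoint, []):
            if query.lower() in name.lower():
                hits.append((endpoint, name))
    return hits


def stream_download(chunks, publish):
    """Hand each chunk to publish() as it arrives; return bytes received."""
    total = 0
    for chunk in chunks:
        publish(chunk)  # assumption: a message-queue producer in a real setup
        total += len(chunk)
    return total
```

In this sketch a consumer can start working on the first chunk before the last one has arrived, which is the property the abstract attributes to writing into Kafka during the download.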

Place, publisher, year, edition, pages
2016. 47 p.
Series
TRITA-ICT-EX, 2016:131
National Category
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-205297
OAI: oai:DiVA.org:kth-205297
DiVA: diva2:1088359
Subject / course
Computer Science
Educational program
Master of Science in Engineering - Information and Communication Technology
Available from: 2017-04-13
Created: 2017-04-12
Last updated: 2017-04-27
Bibliographically approved

Open Access in DiVA

fulltext (715 kB), 0 downloads
File information
File name: FULLTEXT01.pdf
File size: 715 kB
Checksum (SHA-512): 2eedab7fa1d371bb69631411cf95cb17b9377c69279f3a48db46611b551e8349341b5b6d306e2bcfa21900ba6b644280e161556c549c69e9b6daa3d17d55100e
Type: fulltext
Mimetype: application/pdf
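
A downloaded copy of the full text can be checked against the SHA-512 checksum listed in the file information. The following is a minimal sketch using Python's standard `hashlib`; the helper name `sha512_of_file` and the local path are assumptions for illustration.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Comparing the result of `sha512_of_file("FULLTEXT01.pdf")` with the checksum in the record indicates whether the local file matches the published one.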
