Dela - Sharing Large Datasets between Hadoop Clusters
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
2017 (English). In: Proceedings - International Conference on Distributed Computing Systems, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 2533-2536, article id 7980225. Conference paper (Refereed)
Abstract [en]

Big data has, in recent years, revolutionised an ever-growing number of fields, from machine learning to climate science to genomics. The current state of the art for storing large datasets is either object stores or distributed filesystems, with Hadoop being the dominant open-source platform for managing 'Big Data'. Existing large-scale storage platforms, however, lack support for the efficient sharing of large datasets over the Internet. Systems that are widely used for disseminating large files, such as BitTorrent, need to be adapted to handle challenges such as network links with both high latency and high bandwidth, and scalable storage backends that are optimised for streaming rather than random access. In this paper, we introduce Dela, a peer-to-peer data-sharing service integrated into the Hops Hadoop platform that provides an end-to-end solution for dataset sharing. Dela is designed for large-scale storage backends and for data transfers that are non-intrusive to existing TCP network traffic and that provide higher network throughput than TCP on high-latency, high-bandwidth network links, such as transatlantic links. Dela provides a pluggable storage layer that implements two alternative ways for clients to access shared data: stream processing of data as it arrives with Kafka, and traditional offline access to data using the Hadoop Distributed Filesystem. Dela is the first step for the Hadoop platform towards creating an open dataset ecosystem that supports user-friendly publishing, searching, and downloading of large datasets.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2017. p. 2533-2536, article id 7980225
Keywords [en]
Big Data, BitTorrent, Dataset sharing, Hadoop, Peer-to-peer
National Category
Other Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-212447
DOI: 10.1109/ICDCS.2017.199
ISI: 000412759500276
Scopus ID: 2-s2.0-85027245167
ISBN: 9781538617915 (print)
OAI: oai:DiVA.org:kth-212447
DiVA, id: diva2:1135129
Conference
37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, J.W. Marriott Hotel, Atlanta, United States, 5 June 2017 through 8 June 2017
Note

QC 20170822

Available from: 2017-08-22. Created: 2017-08-22. Last updated: 2018-01-13. Bibliographically approved.

By author/editor
Ormenisan, Alexandru-Adrian; Downling, Jim