Scaling Distributed Hierarchical File Systems Using NewSQL Databases
KTH, School of Electrical Engineering and Computer Science (EECS), Software and Computer systems, SCS. ORCID iD: 0000-0002-1672-6899
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

For many years, researchers have investigated the use of database technology to manage file system metadata, with the goal of providing extensible typed metadata and support for fast, rich metadata search. However, earlier attempts failed mainly due to the performance penalty of adding database operations to the file system's critical path. Recent improvements in the performance of distributed in-memory online transaction processing databases (NewSQL databases) led us to re-investigate the possibility of using a database to manage file system metadata, but this time for a distributed, hierarchical file system, the Hadoop Distributed File System (HDFS). The single-host metadata service of HDFS is a well-known bottleneck for both the size and the throughput of HDFS clusters.

In this thesis, we detail the algorithms, techniques, and optimizations used to develop HopsFS, an open-source, next-generation distribution of HDFS that replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a shared-nothing distributed system built on a NewSQL database. In particular, we discuss how we exploit recent high-performance features of NewSQL databases, such as application-defined partitioning, partition-pruned index scans, and distribution-aware transactions, as well as more traditional techniques such as batching and write-ahead caches, to enable a revolution in distributed hierarchical file system performance.

HDFS's design is optimized for the storage of large files, that is, files ranging from megabytes to terabytes in size. However, in many production deployments of HDFS, it has been observed that almost 20% of the files in the system are less than 4 KB in size and as much as 42% of all file system operations are performed on files less than 16 KB in size. HopsFS introduces a tiered storage solution to store files of different sizes more efficiently. The tiers range from the highest tier, where an in-memory NewSQL database stores very small files (<1 KB), through the next tier, where small files (<64 KB) are stored on solid-state drives (SSDs), also using a NewSQL database, down to the largest tier, the existing Hadoop block storage layer for very large files. Our approach extends HopsFS with an inode stuffing technique, in which we embed the contents of small files in the metadata and use database transactions and database replication guarantees to ensure the availability, integrity, and consistency of the small files. HopsFS enables significantly larger cluster sizes, more than an order of magnitude higher throughput, and significantly lower client latencies for large clusters.

Lastly, coordination is an integral part of distributed file system operation protocols. We present a novel leader election protocol for partially synchronous systems that uses NewSQL databases as shared memory. Our work enables HopsFS, which already uses a NewSQL database, to avoid the operational overhead of managing an additional third-party service for leader election, while delivering performance comparable to a leader election implementation that uses a state-of-the-art distributed coordination service, ZooKeeper.
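The tiered placement and inode-stuffing scheme described above can be sketched as follows. This is a minimal illustration, not HopsFS code: the `MetadataStore` class and its dict-backed tables are hypothetical stand-ins for the NewSQL database, and only the size thresholds (<1 KB in memory, <64 KB on SSD) come from the abstract.

```python
# Sketch of tiered placement with inode stuffing. Dicts stand in for the
# database tables; in HopsFS the metadata row and the embedded file
# contents are written in a single database transaction.

IN_MEMORY_LIMIT = 1 * 1024   # <1 KB: stuffed into the in-memory table
SSD_LIMIT = 64 * 1024        # <64 KB: stuffed into the on-disk (SSD) table

def choose_tier(size: int) -> str:
    """Pick the storage tier for a file of the given size in bytes."""
    if size < IN_MEMORY_LIMIT:
        return "memory"
    if size < SSD_LIMIT:
        return "ssd"
    return "blocks"  # regular Hadoop block storage for large files

class MetadataStore:
    def __init__(self):
        # One table per tier; small-file bytes live next to the metadata.
        self.tables = {"memory": {}, "ssd": {}, "blocks": {}}

    def create_file(self, path: str, data: bytes) -> str:
        tier = choose_tier(len(data))
        row = {"path": path, "size": len(data)}
        if tier != "blocks":
            row["data"] = data          # inode stuffing: embed the content
        self.tables[tier][path] = row   # one DB transaction in practice
        return tier
```

Because the small-file bytes ride along in the same transactional row as the metadata, the database's replication guarantees cover the file contents for free, which is the point of the technique.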


Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2018, p. 196
Series
TRITA-EECS-AVL; 2018:79
National Category
Computer Systems
Research subject
Information and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-238605; ISBN: 978-91-7729-987-5 (print); OAI: oai:DiVA.org:kth-238605; DiVA id: diva2:1260852
Public defence
2018-12-07, Sal-B, Electrum, Kungliga Tekniska Högskolan, Kistagången 16, Kista, Stockholm, 13:45 (English)
Opponent
Supervisors
Available from: 2018-11-05. Created: 2018-11-05. Last updated: 2018-11-06. Bibliographically approved.
List of papers
1. HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases
2017 (English). In: 15th USENIX Conference on File and Storage Technologies, FAST 2017, Santa Clara, CA, USA, February 27 - March 2, 2017, USENIX Association, 2017, p. 89-103. Conference paper, Published paper (Refereed)
Abstract [en]

Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS' single-node in-memory metadata service with a distributed metadata service built on a NewSQL database. By removing the metadata bottleneck, HopsFS enables an order of magnitude larger and higher throughput clusters compared to HDFS. Metadata capacity has been increased to at least 37 times HDFS' capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients, and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.
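The application-defined partitioning this abstract relies on can be illustrated with a small sketch: inode rows are routed to a partition derived from their parent directory's id, so listing a directory (an index scan on the parent id) reads exactly one partition, a partition-pruned index scan. The class and schema below are illustrative assumptions, not HopsFS's actual table layout.

```python
# Illustrative application-defined partitioning: each inode row is routed
# to a partition chosen from its parent directory id, so a directory
# listing touches exactly one partition instead of scanning all of them.

NUM_PARTITIONS = 8

def partition_for(parent_id: int) -> int:
    """Application-defined partitioning function (simple modulo here)."""
    return parent_id % NUM_PARTITIONS

class PartitionedInodeTable:
    def __init__(self):
        self.partitions = [[] for _ in range(NUM_PARTITIONS)]

    def insert(self, inode_id: int, parent_id: int, name: str) -> None:
        self.partitions[partition_for(parent_id)].append(
            {"id": inode_id, "parent": parent_id, "name": name}
        )

    def list_dir(self, parent_id: int) -> list:
        # Partition-pruned scan: only the one relevant partition is read.
        rows = self.partitions[partition_for(parent_id)]
        return sorted(r["name"] for r in rows if r["parent"] == parent_id)
```

The design choice being sketched: because all children of a directory land in the same partition, the most common metadata operation (listing) never requires a cross-partition scan.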

Place, publisher, year, edition, pages
USENIX Association, 2017
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-205355 (URN); 000427295900007 (ISI)
Conference
15th USENIX Conference on File and Storage Technologies, FAST 2017, Santa Clara, CA, USA, February 27 - March 2, 2017
Funder
EU, FP7, Seventh Framework Programme, 317871; Swedish Foundation for Strategic Research, E2E-Clouds
Note

QC 20170424

Available from: 2017-04-13. Created: 2017-04-13. Last updated: 2018-11-05. Bibliographically approved.
2. Scaling HDFS to more than 1 million operations per second with HopsFS
2017 (English). In: Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017, Institute of Electrical and Electronics Engineers Inc., 2017, p. 683-688. Conference paper, Published paper (Refereed)
Abstract [en]

HopsFS is an open-source, next generation distribution of the Apache Hadoop Distributed File System (HDFS) that replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a no-shared-state distributed system built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables significantly larger cluster sizes, more than an order of magnitude higher throughput, and significantly lower client latencies for large clusters. In this paper, we detail the techniques and optimizations that enable HopsFS to surpass 1 million file system operations per second, at least 16 times higher throughput than HDFS. In particular, we discuss how we exploit recent high performance features from NewSQL databases, such as application defined partitioning, partition-pruned index scans, and distribution aware transactions. Together with more traditional techniques, such as batching and write-ahead caches, we show how many incremental optimizations have enabled a revolution in distributed hierarchical file system performance.
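One building block behind these optimizations, resolving a path with one primary-key lookup per component plus a cache of previously resolved inodes, can be sketched as follows. The `resolve` function and the dict-based table keyed on `(parent_id, name)` are simplifying assumptions for illustration; the cache here plays a role loosely analogous to HopsFS's inode hints rather than being its exact mechanism.

```python
# Sketch of hierarchical path resolution against an inode table keyed on
# (parent_id, name): each path component costs one primary-key lookup,
# and a cache of already-resolved prefixes lets repeated resolutions of
# hot directories skip those lookups.

ROOT_ID = 0

def resolve(table, path, cache=None):
    """Resolve /a/b/c to an inode id, one PK lookup per component."""
    if cache is not None and path in cache:
        return cache[path]              # whole path served from cache
    inode_id = ROOT_ID
    prefix = ""
    for name in filter(None, path.split("/")):
        prefix += "/" + name
        if cache is not None and prefix in cache:
            inode_id = cache[prefix]    # prefix already resolved
            continue
        inode_id = table[(inode_id, name)]  # KeyError -> no such file
        if cache is not None:
            cache[prefix] = inode_id
    return inode_id
```

A usage example: with `table = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3}`, `resolve(table, "/a/b/c")` walks three lookups the first time, and with a shared cache the second resolution of the same path does no table lookups at all.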

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers Inc., 2017
Keywords
Distributed File System, File System Design, High-performance file systems, NewSQL, Cluster computing, Computer software, Distributed database systems, File organization, Grid computing, Metadata, Open systems, Distributed file systems, Distributed systems, File systems, Hadoop distributed file system (HDFS), Incremental optimization, Metadata services, Traditional techniques, Distributed computer systems
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-216289 (URN); 10.1109/CCGRID.2017.117 (DOI); 000426912900075 (ISI); 2-s2.0-85027463411 (Scopus ID); 9781509066100 (ISBN)
Conference
17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017, 14 May 2017 through 17 May 2017
Note

QC 20171211

Available from: 2017-12-11. Created: 2017-12-11. Last updated: 2018-11-05. Bibliographically approved.
3. Size Matters: Improving the Performance of Small Files in Hadoop
2018 (English). Conference paper, Published paper (Refereed)
Abstract [en]

The Hadoop Distributed File System (HDFS) is designed to handle massive amounts of data, preferably stored in very large files. The poor performance of HDFS in managing small files has long been a bane of the Hadoop community. In many production deployments of HDFS, almost 25% of the files are less than 16 KB in size and as much as 42% of all the file system operations are performed on these small files. We have designed an adaptive tiered storage using in-memory and on-disk tables stored in a high-performance distributed database to efficiently store and improve the performance of the small files in HDFS. Our solution is completely transparent, and it does not require any changes in the HDFS clients or the applications using the Hadoop platform. In experiments, we observed up to 61 times higher throughput in writing files, and for real-world workloads from Spotify our solution reduces the latency of reading and writing small files by a factor of 3.15 and 7.39 respectively.
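The transparency claim above, that clients and applications need no changes, comes down to the server-side read path deciding where the bytes live. A minimal sketch, with two dicts standing in for the database tier and the HDFS block storage layer (both names are illustrative):

```python
def read_file(db_small_files: dict, block_storage: dict, path: str) -> bytes:
    """Transparent read path: serve stuffed small files straight from the
    database tier, otherwise fall back to the block storage layer. The
    caller cannot tell which tier answered, which is why existing HDFS
    clients keep working unmodified."""
    if path in db_small_files:
        return db_small_files[path]
    return block_storage[path]
```

For a small file the whole round trip is a single database read, avoiding the block-location and datanode hops that dominate small-file latency in stock HDFS.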

National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-238597 (URN)
Conference
Middleware '18, ACM, Rennes, France
Note

QC 20181106

Available from: 2018-11-05. Created: 2018-11-05. Last updated: 2018-11-05. Bibliographically approved.
4. Leader Election Using NewSQL Database Systems
2015 (English). In: Distributed Applications and Interoperable Systems: 15th IFIP WG 6.1 International Conference, DAIS 2015, Held as Part of the 10th International Federated Conference on Distributed Computing Techniques, DisCoTec 2015, Grenoble, France, June 2-4, 2015, Proceedings / [ed] Alysson Bessani and Sara Bouchenak, France: Springer, 2015, p. 158-172. Conference paper, Published paper (Refereed)
Abstract [en]

Leader election protocols are a fundamental building block for replicated distributed services. They ease the design of leader-based coordination protocols that tolerate failures. In partially synchronous systems, designing a leader election algorithm that does not permit multiple leaders while the system is unstable is a complex task. As a result, many production systems use third-party distributed coordination services, such as ZooKeeper and Chubby, to provide a reliable leader election service. However, adding a third-party service such as ZooKeeper to a distributed system incurs additional operational costs and complexity. ZooKeeper instances must be kept running on at least three machines to ensure its high availability. In this paper, we present a novel leader election protocol using NewSQL databases for partially synchronous systems, that ensures at most one leader at any given time. The leader election protocol uses the database as distributed shared memory. Our work enables distributed systems that already use NewSQL databases to save the operational overhead of managing an additional third-party service for leader election. Our main contribution is the design, implementation and validation of a practical leader election algorithm, based on NewSQL databases, that has performance comparable to a leader election implementation using a state-of-the-art distributed coordination service, ZooKeeper.
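The core idea, using the database as distributed shared memory for heartbeat counters and treating a process whose counter stops advancing as failed, can be sketched as follows. The `SharedTable` and `Elector` classes and the round-based failure detection are illustrative assumptions, a simplified shape of lease-based election rather than the paper's exact algorithm:

```python
# Sketch of leader election over shared memory (the role the NewSQL
# database plays in the paper). Each process periodically bumps its
# heartbeat counter in a shared table; a process whose counter has not
# changed for `missed_rounds` evaluation rounds is suspected, and the
# live process with the smallest id is taken as leader.

class SharedTable:
    """Stand-in for the database: one heartbeat row per process."""
    def __init__(self):
        self.counters = {}  # process id -> heartbeat counter

    def heartbeat(self, pid: int) -> None:
        self.counters[pid] = self.counters.get(pid, 0) + 1

class Elector:
    def __init__(self, table: SharedTable, missed_rounds: int = 2):
        self.table = table
        self.missed = missed_rounds
        self.last_seen = {}  # pid -> (last counter, rounds unchanged)

    def evaluate(self):
        """Run one evaluation round; return the current leader's pid."""
        alive = []
        for pid, counter in self.table.counters.items():
            prev, stale = self.last_seen.get(pid, (None, 0))
            stale = 0 if counter != prev else stale + 1
            self.last_seen[pid] = (counter, stale)
            if stale < self.missed:
                alive.append(pid)
        return min(alive) if alive else None
```

In this shape, the database's transactional updates are what make the counter reads and writes safe to use as shared memory; a crashed leader is demoted after its counter stays unchanged for enough rounds, and the next-smallest live id takes over.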

Place, publisher, year, edition, pages
France: Springer, 2015
Series
Lecture Notes in Computer Science, ISSN 0302-9743; 9038
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-168266 (URN); 10.1007/978-3-319-19129-4_13 (DOI); 978-3-319-19129-4 (ISBN)
Conference
DisCoTec 2015, The 10th International Federated Conference on Distributed Computing Techniques, June 2-5, 2015, Grenoble, France
Note

QC 20150828

Available from: 2015-05-29. Created: 2015-05-29. Last updated: 2018-11-05. Bibliographically approved.

Open Access in DiVA

fulltext (PDF, 3542 kB)

By author/editor: Niazi, Salman
By organisation: Software and Computer systems, SCS
