Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
The amount of data stored in modern data centres is growing rapidly. Large-scale distributed file systems, which maintain these massive data sets, are designed to work with commodity hardware. Due to the quality and quantity of the hardware components in such systems, failures are considered normal events and, as such, distributed file systems are designed to be highly fault-tolerant. A common approach to achieving fault tolerance is redundancy: storing three copies of a file across different storage nodes. This increases the storage requirements by a factor of three and further aggravates the storage problem.
A concrete implementation of such a file system is the Hadoop Distributed File System (HDFS). This thesis explores the use of RAID-like mechanisms to decrease the storage requirements for big data. We designed and implemented a prototype that extends HDFS with a simple but powerful erasure coding API. In contrast to existing approaches, we locate the erasure-coding management logic in the HDFS NameNode, which allows us to use internal HDFS APIs and state. As a result, we can repair failures of erasure-coded files more quickly and at lower cost. We evaluate our prototype and show that using erasure coding instead of replication can greatly decrease the storage requirements of big data without sacrificing reliability and availability. Finally, we argue that our API can support a large range of custom encoding strategies, while adding the erasure coding logic to the NameNode can significantly improve the management of the encoded files.
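The storage savings claimed above can be sketched with a small overhead calculation. The parameters below (a Reed-Solomon-style (10, 4) code) are illustrative assumptions, not the thesis's actual configuration:

```python
# Illustrative comparison of raw storage overhead: n-way replication
# vs. a (k, m) erasure code. Parameters are hypothetical examples.

def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte under n-way replication."""
    return float(copies)

def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte under a (k, m) erasure code:
    k data blocks plus m parity blocks encode k blocks of user data."""
    return (data_blocks + parity_blocks) / data_blocks

# Triple replication stores 3.0 bytes per logical byte and tolerates
# the loss of any 2 of the 3 copies.
print(replication_overhead(3))      # 3.0

# A (10, 4) erasure code stores 1.4 bytes per logical byte and
# tolerates the loss of any 4 of its 14 blocks.
print(erasure_overhead(10, 4))      # 1.4
```

With comparable or better fault tolerance, the erasure-coded layout here uses less than half the raw storage of triple replication, which is the trade-off the thesis exploits.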
2014, 76 p.