Message-passing concurrency (MPC) is increasingly being used to build systems software that scales well on multi-core hardware. Functional programming implementations of MPC, such as Erlang, have also leveraged their stateless nature to build middleware that is not just scalable, but also dynamically reconfigurable. However, many middleware platforms lend themselves more naturally to a stateful programming model, supporting session and application state. A limitation of existing programming models and frameworks that support dynamic reconfiguration for stateful middleware, such as component frameworks, is that they are not designed for MPC.
In this paper, we present Kompics, a component model and programming framework that supports the construction and composition of dynamically reconfigurable middleware using stateful, concurrent, message-passing components. An added benefit of our approach is that, by decoupling components from their execution model, we can run the same code in both simulation and production environments. We present the architectural patterns and abstractions that Kompics facilitates, and we evaluate them using a case study of a non-trivial key-value store that we built with Kompics. We show how our model enables the systematic development and testing of scalable, dynamically reconfigurable middleware.
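To make the component style concrete, the sketch below shows a plain-Java stateful message-passing component whose handlers run only when an external scheduler asks it to execute pending events. This is not the Kompics API; all names are illustrative. It is meant only to illustrate the decoupling claimed above: the same component can be driven by a deterministic single-threaded loop in simulation or by a thread pool in production.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

// A stateful component holds private state and reacts to events one at a time.
// The component only enqueues work; *when* handlers run is decided by an
// external scheduler, which is what lets the same component code be driven by
// either a deterministic simulation loop or a multi-threaded production executor.
abstract class Component {
    private final Queue<Object> mailbox = new ArrayDeque<>();
    private final Map<Class<?>, Consumer<Object>> handlers = new HashMap<>();

    @SuppressWarnings("unchecked")
    protected <E> void onEvent(Class<E> type, Consumer<E> handler) {
        handlers.put(type, (Consumer<Object>) handler);
    }

    // Called by other components (or the network layer) to deliver a message.
    public synchronized void tell(Object event) {
        mailbox.add(event);
    }

    // Called by the scheduler; executes at most one pending event.
    public synchronized boolean executeOne() {
        Object event = mailbox.poll();
        if (event == null) return false;
        Consumer<Object> h = handlers.get(event.getClass());
        if (h != null) h.accept(event);
        return true;
    }
}

// Example: a counter component whose state is mutated only by its own handlers.
class CounterComponent extends Component {
    private long count = 0;            // session/application state lives here

    CounterComponent() {
        onEvent(Increment.class, inc -> count += inc.amount());
        onEvent(Read.class, read -> read.reply().accept(count));
    }

    record Increment(long amount) {}
    record Read(Consumer<Long> reply) {}
}

public class Demo {
    public static void main(String[] args) {
        CounterComponent counter = new CounterComponent();
        counter.tell(new CounterComponent.Increment(5));
        counter.tell(new CounterComponent.Read(v -> System.out.println("count = " + v)));
        // A trivial "simulation" scheduler: drain events deterministically until quiescence.
        while (counter.executeOne()) { /* keep draining */ }
    }
}
```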
This paper describes an operational, geographically distributed, and heterogeneous cloud infrastructure with services and applications deployed in the Guifi community network. The presented cloud is a particular case of a community cloud, developed according to the specific needs and conditions of community networks. We describe the concept of this community cloud, explain the technical choices made in building it, and report on our experience with its deployment. We review our solutions and experience in offering the different cloud computing service models (IaaS, PaaS, and SaaS) in community networks. The deployed cloud infrastructure aims to provide stable and attractive cloud services that encourage community network users to use, maintain, and extend it with new services and applications.
Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). There is, however, a computational bottleneck, as existing software systems are not scalable and secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop to include support for multi-tenant studies, reduced storage requirements with erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine featuring a workflow definition language focused on the simple integration and chaining of existing tools, adaptive scheduling on Apache YARN, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easily installed and comes with a user-friendly web interface for running and managing workflows and for accessing data sets, all behind secure two-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open source and includes pre-defined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNA-Seq, and analysis of miRNA-Seq and ChIP-Seq data.
Despite the recent appearance of self-organizing distributed systems for Mobile Ad Hoc Networks (MANETs) and Peer-to-Peer (P2P) networks, specific theoretical aspects of both their properties and the mechanisms used to establish those properties have been largely overlooked. This has left many researchers confused as to what constitutes a self-organizing distributed system and without a vocabulary with which to discuss aspects of these systems. This article introduces an agent-based model of self-organizing MANET and P2P systems and shows how it is realised in three existing network systems. The model is based on concepts such as partial views, evaluation functions, system utility, feedback, and decay. We review the three network systems, AntHocNet, SAMPLE, and Freenet, and show how they can achieve high scalability, robustness, and adaptability to unpredictable changes in their environment by using self-organizing mechanisms similar to those found in nature. These systems are designed to improve their operation in a dynamic, heterogeneous environment, often enabling them to outperform state-of-the-art distributed systems. This article is also addressed to researchers interested in gaining a general understanding of the different mechanisms and properties of self-organization in distributed systems.
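As a rough illustration of how the model's ingredients fit together, the following Java sketch combines a bounded partial view, an evaluation function, positive feedback, and decay in a single maintenance round. It is an assumed illustration of how such a mechanism could look, not code from AntHocNet, SAMPLE, or Freenet, and all identifiers are invented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// One entry in a node's partial view: a neighbour plus a locally estimated utility.
class Descriptor {
    final String nodeId;
    double utility;          // reinforced by feedback, reduced by decay

    Descriptor(String nodeId, double utility) {
        this.nodeId = nodeId;
        this.utility = utility;
    }
}

// A bounded partial view updated by feedback and decay (names are illustrative).
class PartialView {
    private final int capacity;
    private final double decayFactor;          // e.g. 0.9: old information fades
    private final List<Descriptor> view = new ArrayList<>();

    PartialView(int capacity, double decayFactor) {
        this.capacity = capacity;
        this.decayFactor = decayFactor;
    }

    // Positive feedback: a successful interaction reinforces the descriptor.
    void reinforce(String nodeId, double reward) {
        for (Descriptor d : view) {
            if (d.nodeId.equals(nodeId)) { d.utility += reward; return; }
        }
        view.add(new Descriptor(nodeId, reward));
    }

    // Periodic round: decay all utilities, merge gossiped descriptors, then keep
    // only the best entries. The evaluation function here is the decayed utility
    // itself; duplicates from gossip are not merged in this simple sketch.
    void round(List<Descriptor> gossiped) {
        for (Descriptor d : view) d.utility *= decayFactor;
        view.addAll(gossiped);
        view.sort(Comparator.comparingDouble((Descriptor d) -> d.utility).reversed());
        while (view.size() > capacity) view.remove(view.size() - 1);
    }

    List<Descriptor> snapshot() { return new ArrayList<>(view); }
}
```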
Across many fields of science, primary data sets like sensor read-outs, time series, and genomic sequences are analyzed by complex chains of specialized tools and scripts exchanging intermediate results in domain-specific file formats. Scientific workflow management systems (SWfMSs) support the development and execution of these tool chains by providing workflow specification languages, graphical editors, fault-tolerant execution engines, etc. However, many SWfMSs are not prepared to handle large data sets because of inadequate support for distributed computing. On the other hand, most SWfMSs that do support distributed computing only allow static task execution orders. We present SAASFEE, a SWfMS which runs arbitrarily complex workflows on Hadoop YARN. Workflows are specified in Cuneiform, a functional workflow language focusing on parallelization and easy integration of existing software. Cuneiform workflows are executed on Hi-WAY, a higher-level scheduler for running workflows on YARN. Distinct features of SAASFEE are the ability to execute iterative workflows, an adaptive task scheduler, re-executable provenance traces, and compatibility with selected other workflow systems. In the demonstration, we present all components of SAASFEE using real-life workflows from the field of genomics.
The DIGHT project addresses the problem of building a scalable and highly available information store for the Electronic Health Records (EHRs) of India's more than one billion citizens.
There has been much recent interest in information services that offer to manage an individual's healthcare records in electronic form, with systems such as Microsoft HealthVault and Google Health receiving widespread media attention. These systems are, however, proprietary and fears have been expressed over how the information stored in them will be used. In relation to these developments, countries with nationalized healthcare systems are also investigating the construction of healthcare information systems that store Electronic Health Records (EHRs) for their citizens.
Interactive Connectivity Establishment (ICE) is becoming increasingly important for P2P systems on the open Internet, as it enables NAT-bound peers to provide accessible services. A problem for P2P systems that provide ICE services is how peers discover good-quality ICE servers for NAT traversal, that is, the TURN and STUN servers that provide relaying and hole-punching services, respectively. Skype provides a P2P-based solution to this problem, in which super-peers provide ICE services. However, experimental analysis of Skype indicates that peers perform a random walk over super-peers to find one with an acceptable round-trip latency. In this paper, we discuss a self-organizing approach to discovering good-quality ICE servers in a P2P system based on the Gradient topology. The Gradient topology uses information about each peer’s ability to provide ICE services (open IP address, available bandwidth, and expected session times) to construct a topology in which the “better” peers for providing ICE services cluster at the center of the topology; this adaptation of the super-peer search space reduces the problem of finding a good-quality ICE server from a random walk to a gradient ascent search.
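The gradient ascent search described above can be sketched as follows; the utility function and all names are illustrative assumptions, not the system's actual metric. Starting from any peer, the search repeatedly moves to the neighbour with the highest ICE utility until no neighbour improves on the current peer, or until the current peer is already good enough to serve as a TURN/STUN server.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative gradient-ascent lookup over a peer-utility landscape.
final class GradientSearch {

    // A hypothetical utility combining the signals mentioned in the abstract:
    // public reachability, spare bandwidth, and expected session time.
    static double iceUtility(boolean openIp, double spareBandwidth, double expectedUptime) {
        return (openIp ? 1.0 : 0.0) * (0.5 * spareBandwidth + 0.5 * expectedUptime);
    }

    // neighbours: each peer's current gradient neighbours; utilities: known utility per peer.
    static String findIceServer(String start,
                                Map<String, List<String>> neighbours,
                                Map<String, Double> utilities,
                                double goodEnough) {
        String current = start;
        while (utilities.getOrDefault(current, 0.0) < goodEnough) {
            String best = neighbours.getOrDefault(current, List.of()).stream()
                    .max(Comparator.comparingDouble((String p) -> utilities.getOrDefault(p, 0.0)))
                    .orElse(current);
            if (utilities.getOrDefault(best, 0.0) <= utilities.getOrDefault(current, 0.0)) {
                break;   // local maximum: no neighbour is better than where we are
            }
            current = best;  // climb the gradient towards the centre of the topology
        }
        return current;
    }
}
```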
Biobanks store genomic material from identifiable individuals. Recently many population-based studies have started sequencing genomic data from biobank samples and cross-linking the genomic data with clinical data, with the goal of discovering new insights into disease and clinical treatments. However, the use of genomic data for research has far-reaching implications for privacy and the relations between individuals and society. In some jurisdictions, primarily in Europe, new laws are being or have been introduced to legislate for the protection of sensitive data relating to individuals, and biobank-specific laws have even been designed to legislate for the handling of genomic data and the clear definition of roles and responsibilities for the owners and processors of genomic data. This paper considers the security questions raised by these developments. We introduce a new threat model that enables the design of cloud-based systems for handling genomic data according to privacy legislation. We also describe the design and implementation of a security framework using our threat model for BiobankCloud, a platform that supports the secure storage and processing of genomic data in cloud computing environments.
The Hadoop Distributed File System (HDFS) scales to store tens of petabytes of data despite the fact that the entire file system's metadata must fit on the heap of a single Java virtual machine. The size of HDFS' metadata is limited to under 100 GB in production, as garbage collection events in bigger clusters cause heartbeats to the metadata server (NameNode) to time out. In this paper, we address the problem of migrating HDFS' metadata to a relational model, so that we can support larger amounts of storage on a shared-nothing, in-memory, distributed database. Our main contribution is to show how to provide consistency semantics at least as strong as those of HDFS while adding support for a multiple-writer, multiple-reader concurrency model. We guarantee freedom from deadlocks by logically organizing inodes (and their constituent blocks and replicas) into a hierarchy and having all metadata operations agree on a global order for acquiring both explicit locks and implicit locks on subtrees in the hierarchy. We use transactions with pessimistic concurrency control to ensure the safety and progress of metadata operations. Finally, we show how to improve the performance of our solution by introducing a snapshotting mechanism at NameNodes that minimizes the number of roundtrips to the database.
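The deadlock-freedom argument rests on all operations acquiring locks in one agreed global order. The Java sketch below illustrates that ordering discipline with in-process read/write locks; the real system takes row-level locks inside database transactions, so treat this purely as an assumed illustration of the idea, with invented names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Every operation sorts the inodes it needs by a single global key (here, the
// inode id) and locks them in that order, so no two operations can wait on each
// other in a cycle.
final class InodeLock {
    final long inodeId;
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    InodeLock(long inodeId) { this.inodeId = inodeId; }
}

final class OrderedLocking {

    static void withLocks(List<InodeLock> needed, boolean write, Runnable operation) {
        List<InodeLock> ordered = new ArrayList<>(needed);
        ordered.sort(Comparator.comparingLong((InodeLock l) -> l.inodeId));  // the global order

        for (InodeLock l : ordered) {
            if (write) l.lock.writeLock().lock(); else l.lock.readLock().lock();
        }
        try {
            operation.run();               // the metadata operation itself
        } finally {
            for (int i = ordered.size() - 1; i >= 0; i--) {        // release in reverse
                InodeLock l = ordered.get(i);
                if (write) l.lock.writeLock().unlock(); else l.lock.readLock().unlock();
            }
        }
    }
}
```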
Currently, the development of overlay network systems typically produces two software artifacts: a simulator to model key protocols and a production system for a WAN environment. However, this methodology requires the maintenance of two implementations, adding both development overhead and the potential for errors through divergence between the two code bases. This paper describes how our message-passing component model, Kompics, is used to build overlay network systems with a P2P component framework, where the same implementation can be simulated or deployed in a production environment. Kompics enables two different modes of simulation: deterministic simulation for reproducible debugging, and an emulation mode for stress-testing systems. We used our P2P component framework to build and evaluate overlay systems, and we show how our model lowers the programming barrier for simulating and deploying overlay network systems.
Hadoop is a popular system for storing, managing, and processing large volumes of data, but it has only bare-bones internal support for metadata, since metadata is a scalability bottleneck and keeping it minimal improves scalability. The result is a scalable platform with rudimentary access control that is neither user- nor developer-friendly. Moreover, metadata services built on Hadoop, such as SQL-on-Hadoop, access control, data provenance, and data governance, are necessarily implemented as eventually consistent services, resulting in increased development effort and more brittle software. In this paper, we present a new project-based multi-tenancy model for Hadoop, built on a new distribution of Hadoop that provides a distributed database backend for the Hadoop Distributed File System's (HDFS) metadata layer. We extend Hadoop's metadata model with projects, datasets, and project-users as new core concepts that enable a user-friendly, UI-driven Hadoop experience. As our metadata service is backed by a transactional database, developers can easily extend the metadata by adding new tables and can ensure the strong consistency of extended metadata using both transactions and foreign keys.
HopsFS is an open-source, next-generation distribution of the Apache Hadoop Distributed File System (HDFS) that replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a no-shared-state distributed system built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables significantly larger cluster sizes, more than an order of magnitude higher throughput, and significantly lower client latencies for large clusters. In this paper, we detail the techniques and optimizations that enable HopsFS to surpass 1 million file system operations per second, at least 16 times higher throughput than HDFS. In particular, we discuss how we exploit recent high-performance features of NewSQL databases, such as application-defined partitioning, partition-pruned index scans, and distribution-aware transactions. Together with more traditional techniques, such as batching and write-ahead caches, we show how many incremental optimizations have enabled a revolution in distributed hierarchical file system performance.
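As one example of the more traditional techniques mentioned above, the sketch below shows a hypothetical per-operation cache in the spirit of a write-ahead cache: within a single metadata operation, each row is read from the database at most once, and writes are buffered until commit. The structure and names are assumptions for illustration, not HopsFS code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative per-operation cache: repeated reads of the same key hit the cache,
// and writes are buffered so they can be flushed to the database as one batch.
final class TxCache<K, V> {
    private final Map<K, V> read = new HashMap<>();
    private final Map<K, V> dirty = new HashMap<>();

    V get(K key, Function<K, V> fetchFromDb) {
        if (dirty.containsKey(key)) return dirty.get(key);    // see our own writes
        return read.computeIfAbsent(key, fetchFromDb);         // at most one roundtrip per key
    }

    void put(K key, V value) { dirty.put(key, value); }

    // At commit, the buffered writes are sent to the database in a single batch.
    Map<K, V> pendingWrites() { return dirty; }
}
```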
Distributed systems are becoming an increasingly important part of systems and applications software, and it is widely accepted that writing correct distributed systems is challenging. Message-passing concurrency models are the dominant programming paradigm, yet, even in statically typed languages, programming frameworks typically offer only limited type-checking support for messages, channels, and ports or mailboxes. In this paper, we present Kola, a language-level implementation of Kompics, a component model with message-passing concurrency. Kola comes with its own compiler and dedicated language constructs that extend Java's type system as necessary to enforce static type checking on messages, channels, and ports. We show that Kola improves the readability of Kompics code and removes opportunities to introduce bugs, at the cost of a small compile-time overhead and no runtime overhead.
Distributed applications deployed in multi-datacenter environments need to deal with network connections of varying quality, including high bandwidth and low latency within a datacenter and, more recently, high bandwidth and high latency between datacenters. In principle, for a given network connection, each message should be sent over the best available network protocol, but existing middleware does not provide this functionality. In this paper, we present KompicsMessaging, a messaging middleware that allows fine-grained control over the network protocol used on a per-message basis. Rather than always requiring application developers to specify the appropriate protocol for each message, we also provide an online reinforcement learner that optimises the selection of the network protocol for the current network environment. In experiments, we show how connection properties, such as varying round-trip time, influence the performance of the application, and we show how throughput and latency can be improved by picking the right protocol at the right time.
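A minimal sketch of online protocol selection is given below, using an epsilon-greedy learner that exploits the protocol with the best observed reward and occasionally explores the alternatives. The learner, reward signal, and protocol labels are illustrative assumptions; the middleware's actual reinforcement learner may differ.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Epsilon-greedy selection of a transport per message, driven by an observed
// reward (e.g. negative latency or achieved throughput).
final class ProtocolSelector {
    private final String[] protocols;                       // e.g. {"TCP", "UDT"}
    private final Map<String, Double> meanReward = new ConcurrentHashMap<>();
    private final Map<String, Long> pulls = new ConcurrentHashMap<>();
    private final double epsilon;
    private final Random random = new Random();

    ProtocolSelector(double epsilon, String... protocols) {
        this.epsilon = epsilon;
        this.protocols = protocols;
    }

    // Called once per message: mostly exploit the best-known protocol,
    // occasionally explore another one.
    String choose() {
        if (random.nextDouble() < epsilon) {
            return protocols[random.nextInt(protocols.length)];
        }
        String best = protocols[0];
        for (String p : protocols) {
            if (meanReward.getOrDefault(p, 0.0) > meanReward.getOrDefault(best, 0.0)) best = p;
        }
        return best;
    }

    // Called when the message has been delivered and its latency/throughput measured.
    void observe(String protocol, double reward) {
        long n = pulls.merge(protocol, 1L, Long::sum);
        double old = meanReward.getOrDefault(protocol, 0.0);
        meanReward.put(protocol, old + (reward - old) / n);   // incremental mean
    }
}
```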
Leader election protocols are a fundamental building block for replicated distributed services. They ease the design of leader-based coordination protocols that tolerate failures. In partially synchronous systems, designing a leader election algorithm that does not permit multiple leaders while the system is unstable is a complex task. As a result, many production systems use third-party distributed coordination services, such as ZooKeeper and Chubby, to provide a reliable leader election service. However, adding a third-party service such as ZooKeeper to a distributed system incurs additional operational costs and complexity: ZooKeeper instances must be kept running on at least three machines to ensure high availability. In this paper, we present a novel leader election protocol for partially synchronous systems that uses NewSQL databases and ensures at most one leader at any given time. The leader election protocol uses the database as distributed shared memory. Our work enables distributed systems that already use NewSQL databases to save the operational overhead of managing an additional third-party service for leader election. Our main contribution is the design, implementation, and validation of a practical leader election algorithm, based on NewSQL databases, that has performance comparable to a leader election implementation using a state-of-the-art distributed coordination service, ZooKeeper.
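To give a flavour of how a transactional database can serve as shared memory for leader election, the sketch below implements a simple lease over a single row using plain JDBC: a node becomes (or remains) leader only if it can atomically update the row while it holds the lease or after the previous lease has expired. The table schema and lease mechanism are assumptions for illustration and are not the paper's protocol.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// A single row (id = 1) in a hypothetical table
//   leader(id INT PRIMARY KEY, holder VARCHAR, lease_expires BIGINT)
// acts as shared memory for the election.
final class DbLeaderElection {
    private final String nodeId;
    private final long leaseMillis;

    DbLeaderElection(String nodeId, long leaseMillis) {
        this.nodeId = nodeId;
        this.leaseMillis = leaseMillis;
    }

    // Returns true if this node holds the leader lease after the attempt.
    boolean tryAcquire(Connection db) throws SQLException {
        db.setAutoCommit(false);
        try {
            long now = System.currentTimeMillis();
            // Take (or renew) the lease only if we already hold it or it has expired.
            String sql = "UPDATE leader SET holder = ?, lease_expires = ? " +
                         "WHERE id = 1 AND (holder = ? OR lease_expires < ?)";
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                ps.setString(1, nodeId);
                ps.setLong(2, now + leaseMillis);
                ps.setString(3, nodeId);
                ps.setLong(4, now);
                boolean acquired = ps.executeUpdate() == 1;
                db.commit();
                return acquired;
            }
        } catch (SQLException e) {
            db.rollback();
            throw e;
        }
    }

    // Any node can check who the current leader is by reading the same row.
    String currentLeader(Connection db) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT holder, lease_expires FROM leader WHERE id = 1");
             ResultSet rs = ps.executeQuery()) {
            if (rs.next() && rs.getLong("lease_expires") > System.currentTimeMillis()) {
                return rs.getString("holder");
            }
            return null;   // no valid lease: a new leader is being elected
        }
    }
}
```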
Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS' single-node in-memory metadata service with a distributed metadata service built on a NewSQL database. By removing the metadata bottleneck, HopsFS enables clusters that are an order of magnitude larger and have higher throughput than HDFS. Metadata capacity has been increased to at least 37 times HDFS' capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.
The Hadoop Distributed File System (HDFS) is designed to handle massive amounts of data, preferably stored in very large files. The poor performance of HDFS in managing small files has long been a bane of the Hadoop community. In many production deployments of HDFS, almost 25% of the files are less than 16 KB in size, and as much as 42% of all file system operations are performed on these small files. We have designed an adaptive tiered storage layer, using in-memory and on-disk tables stored in a high-performance distributed database, to efficiently store small files and improve their performance in HDFS. Our solution is completely transparent and does not require any changes to the HDFS clients or the applications using the Hadoop platform. In experiments, we observed up to 61 times higher throughput when writing files, and for real-world workloads from Spotify our solution reduces the latency of reading and writing small files by factors of 3.15 and 7.39, respectively.
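The tiering decision can be pictured as a simple size-based policy like the sketch below; the thresholds and tier names are invented for illustration, and the real system chooses its own cut-off points between in-memory tables, on-disk tables, and regular HDFS blocks.

```java
// Illustrative routing of file data to storage tiers by size.
enum StorageTier { IN_MEMORY_TABLE, ON_DISK_TABLE, DATANODE_BLOCKS }

final class TierPolicy {
    private static final long IN_MEMORY_LIMIT = 16 * 1024;       // e.g. <= 16 KB
    private static final long ON_DISK_TABLE_LIMIT = 128 * 1024;  // e.g. <= 128 KB

    static StorageTier tierFor(long fileSizeBytes) {
        if (fileSizeBytes <= IN_MEMORY_LIMIT) return StorageTier.IN_MEMORY_TABLE;
        if (fileSizeBytes <= ON_DISK_TABLE_LIMIT) return StorageTier.ON_DISK_TABLE;
        return StorageTier.DATANODE_BLOCKS;      // large files keep the normal HDFS path
    }
}
```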
Big data has, in recent years, revolutionised an ever-growing number of fields, from machine learning to climate science to genomics. The current state of the art for storing large datasets is either object stores or distributed filesystems, with Hadoop being the dominant open-source platform for managing 'Big Data'. Existing large-scale storage platforms, however, lack support for the efficient sharing of large datasets over the Internet. The systems that are widely used for the dissemination of large files, such as BitTorrent, need to be adapted to handle challenges such as network links with both high latency and high bandwidth, and scalable storage backends that are optimised for streaming rather than random access. In this paper, we introduce Dela, a peer-to-peer data-sharing service integrated into the Hops Hadoop platform that provides an end-to-end solution for dataset sharing. Dela is designed for large-scale storage backends and for data transfers that are both non-intrusive to existing TCP network traffic and provide higher network throughput than TCP on high-latency, high-bandwidth network links, such as transatlantic links. Dela provides a pluggable storage layer with two alternative ways for clients to access shared data: stream processing of data as it arrives, using Kafka, and traditional offline access to data using the Hadoop Distributed Filesystem. Dela is the first step for the Hadoop platform towards creating an open dataset ecosystem that supports user-friendly publishing, searching, and downloading of large datasets.
Despite much recent research on peer-to-peer (P2P) protocols for the Internet, there have been relatively few practical protocols designed to explicitly account for Network Address Translation gateways (NATs). Those P2P protocols that do handle NATs circumvent them using relaying and hole-punching techniques to route packets to nodes residing behind NATs. In this paper, we present Croupier, a peer sampling service (PSS) that provides uniform random samples of nodes in the presence of NATs in the network. It is the first NAT-aware PSS that works without the use of relaying or hole-punching. By removing the need for relaying and hole-punching, we decrease the complexity and overhead of our protocol as well as increase its robustness to churn and failure. We evaluated Croupier in simulation and, in comparison with existing NAT-aware PSSs, our results show similar randomness properties but improved robustness in the presence of both high percentages of nodes behind NATs and massive node failures. Croupier also has substantially lower protocol overhead.
Peer-to-peer live media streaming over the Internet is becoming increasingly popular, though it remains a challenging problem. Nodes should receive the stream subject to its intrinsic timing constraints, while the overlay should adapt to changes in the network and nodes should be incentivized to contribute their resources. In this work, we meet these contradictory requirements simultaneously by introducing a distributed market model to build an efficient overlay for live media streaming. Using our market model, we construct two different overlay topologies, tree-based and mesh-based, which are the two dominant approaches to media distribution. First, we build an approximately minimal-height multiple-tree data dissemination overlay, called Sepidar. Next, we extend our model, in GLive, to make it more robust in dynamic networks by replacing the tree structure with a mesh. We show in simulation that the mesh-based overlay outperforms the multiple-tree overlay. We compare the performance of our two systems with the state-of-the-art NewCoolstreaming and observe that they provide better playback continuity and lower playback latency than NewCoolstreaming under a variety of experimental scenarios. Although our distributed market model can be run against a random sample of nodes, we improve its convergence time by executing it against a sample of nodes taken from the Gradient overlay. The evaluations show that the streaming overlays converge faster when our market model runs on top of the Gradient overlay.
State-of-the-art gossip protocols for the Internet are based on the assumption that connection establishment between peers comes at negligible cost. Our experience with commercially deployed P2P systems has shown that this cost is much higher than generally assumed. As such, peer sampling services often cannot provide fresh samples because doing so would require too high a connection establishment rate. In this paper, we present the wormhole-based peer sampling service (WPSS). WPSS overcomes the limitations of existing protocols by executing short random walks over a stable topology and by using shortcuts (wormholes), thereby limiting the rate of connection establishments and guaranteeing the freshness of samples, respectively. We show that our approach can decrease the connection establishment rate by one order of magnitude compared to the state of the art while providing the same level of sample freshness, and without sacrificing the desirable properties of a PSS for the Internet, such as robustness to churn and NAT-friendliness. We support our claims with a thorough measurement study in our deployed commercial system as well as in simulation.
Service-oriented computing is becoming an increasingly popular paradigm for modelling and building distributed systems in open and heterogeneous environments. However, proposed service-oriented architectures are typically based on centralised components, such as service registries or service brokers, that introduce reliability, management, and performance issues. This paper describes an approach to fully decentralise a service-oriented architecture using a self-organising peer-to-peer network maintained by service providers and consumers. The design is based on a gradient peer-to-peer topology, which allows the system to replicate a service registry using a limited number of the most stable and best-performing peers. The paper evaluates the proposed approach through extensive simulation experiments and shows that the decentralised registry and the underlying peer-to-peer infrastructure scale to a large number of peers and can successfully manage high peer churn rates.
Objective: We provide an e-Science perspective on the workflow from risk factor discovery and classification of disease to evaluation of personalized intervention programs. As case studies, we use personalized prostate and breast cancer screenings. Materials and Methods: We describe an e-Science initiative in Sweden, e-Science for Cancer Prevention and Control (eCPC), which supports biomarker discovery and offers decision support for personalized intervention strategies. The generic eCPC contribution is a workflow with 4 nodes applied iteratively, and the concept of e-Science signifies systematic use of tools from the mathematical, statistical, data, and computer sciences. Results: The eCPC workflow is illustrated through 2 case studies. For prostate cancer, an in-house personalized screening tool, the Stockholm-3 model (S3M), is presented as an alternative to prostate-specific antigen testing alone. S3M is evaluated in a trial setting and plans for rollout in the population are discussed. For breast cancer, new biomarkers based on breast density and molecular profiles are developed and the US multicenter Women Informed to Screen Depending on Measures (WISDOM) trial is referred to for evaluation. While current eCPC data management uses a traditional data warehouse model, we discuss eCPC-developed features of a coherent data integration platform. Discussion and Conclusion: E-Science tools are a key part of an evidence-based process for personalized medicine. This paper provides a structured workflow from data and models to evaluation of new personalized intervention strategies. The importance of multidisciplinary collaboration is emphasized. Importantly, the generic concepts of the suggested eCPC workflow are transferrable to other disease domains, although each disease will require tailored solutions.