Thursday, October 4, 2012

Cloud Collaboration: Bioinformatics

In this multi-part series, we discuss several cloud computing projects that we have completed over the past 12 months. They range from experiments in computational science and engineering to full production systems for cloud service providers. This week, we discuss a cloud collaboration solution designed to bring together collaborators at different bioinformatics labs, and to connect them to an elastic cloud computing facility offering FPGA-accelerated servers, a capability that is increasingly important for the exponentially growing compute load of bioinformatics research.


The importance of next-generation sequencing (NGS) to personalized healthcare is well documented [1]. Conventional medical diagnosis relies on the analysis of the patient's personal and family history, symptoms, and samples. The goal of personalized healthcare is to strengthen diagnostics by comparing the patient's genome against a global reference database of known disease markers. Sample testing will also be enhanced through highly sensitive measurement of adaptive immune system parameters, and through molecular-level monitoring of the progression of different cancers. Outside human health, NGS will touch every aspect of our understanding of the living world, and help improve food safety, control pests, reduce pollution, find alternate sources of energy, and more.

At the lowest level, the bioinformatics questions fall into two broad classes: (i) comparing next-gen reads against reference genomes, and (ii) assembling next-gen sequences to build reference genomes. Finding scalable and cost-effective solutions to these questions is a major challenge. Even the best algorithms need large amounts of fast and costly computer memory (50-1024GB), and thousands of processor-hours to complete an analysis of a mammalian-sized genome. Reducing execution time improves the quality of results by allowing the analysis to be rerun with many different parameters.

The computational landscape is changing rapidly with the discovery of new algorithms and the proliferation of reference genomes. This puts tremendous pressure on both data management and compute capacity, and demands a seamless solution to both.

Our client was a start-up designing a new research instrument capable of generating up to 100GBytes of data per day, running continuously for several days for a single study. The raw data had to go through several stages of deep analytics: identifying and correcting data acquisition errors, applying data reduction and reconstruction operators, and tracking historical performance to identify the best algorithms and to steer the research direction and software development. The algorithms required high-throughput compute clusters, and the R&D talent developing them was geographically dispersed throughout the US. The management and procurement of these clusters was beyond the capability and capital resources of the start-up, and on-demand cloud services could solve this capex problem. However, integrating multiple remote laboratories into a productive collaborative space required a robust and cost-effective solution to file replication, so that each instrument's output, and all subsequent analysis results, would be readily available to the distributed team.


The core operation of the instrument was based on acquiring high-frequency RF signals and using chemicals to prepare samples. This demanded that the physical machine reside in a well-controlled laboratory environment, a very different environment from a typical data center room. The compute demands of the research algorithms were roughly in the 50TOPS range and highly latency sensitive. The cost of creating that raw hardware capacity was on the order of $150k per lab, not including operational staff. The utilization of that equipment would have been very low, particularly in the early phases of the project, when the instrument would generate data only once per month. Allocating the compute capacity for the research algorithms in a remote cloud solves the capex and utilization problems, but introduces a data movement problem. What did this trade-off look like?

We evaluated two configurations:
  1. instrument and compute cluster per lab
  2. instrument per lab, compute cluster in a remote cloud

Data retention was not a key requirement, so the cost of backup storage was not a factor in the design. However, link bandwidth was. The labs sat at the end of relatively low-bandwidth links, and increasing bandwidth required a 90-day waiting period, with hard caps due to infrastructure limits. The links in place were limited to 5Mb/s! Not very impressive, but surprisingly common. Increasing the bandwidth would have cost an extra $30k/year, which, combined with the 90-day wait, made that option unattractive.

The capex of option 1 with two labs was about $300k, with about four weeks of turnaround time. That capex would go away in option 2, and the turnaround time to get up and running would shrink to days. However, at 5Mb/s, moving a 50GB file to a remote cloud would take more than a day, and worse, that cost would be paid for every data acquisition. On the other hand, the early research instrument took several days for a single data collection, so the labs' workflow was already accustomed to a long latency between experiment design and data collection. More importantly, if the instrument takes several days to collect a large data set that must then migrate to a remote location, we want to overlap data acquisition with data transfer. Typical web application protocols don't work well in this regard, so HTTP and FTP are not attractive.

The idea for the solution came from our use of Git. Git is a versioning system based on snapshotting a file system, and that model is exactly the right technology: it is seamless and robust. The snapshotting idea led us to Nasuni, which provides a filer that snapshots to Amazon S3 and integrates this with enterprise features such as volume management and volume replication configuration. Nasuni is relatively expensive, starting at $10k/TB/year, but the flexibility and set-and-forget feature set made it very attractive. The open question was whether file system snapshotting would work over very low bandwidth links. If the replication performance was sufficient, then managing the persistent data volume, which determines cost, would be trivial.
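
To put numbers on that bandwidth constraint, here is a quick back-of-the-envelope sketch in Python (the 90% link-efficiency factor is our assumption for protocol overhead; the file size and link rate come from the discussion above):

    def transfer_hours(size_gb, link_mbps, efficiency=0.9):
        """Hours to move size_gb gigabytes over a link_mbps link."""
        megabits = size_gb * 8 * 1000          # decimal GB -> megabits
        return megabits / (link_mbps * efficiency) / 3600.0

    print(transfer_hours(50, 5))               # ~24.7 hours per 50GB acquisition

Even a near-perfect protocol needs about a day per 50GB acquisition at 5Mb/s, so the real question was how much overhead snapshot replication adds on top of that floor; the experiment described below answers it.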


To create a robust and reliable next-generation IT infrastructure, we designed and implemented a real-time distributed collaboration grid, as depicted in the figure below.

Each lab is allocated a Nasuni filer with a file cache large enough to hold the local working set. The Nasuni service joins these filers into a distributed file system: each filer replicates snapshots to, and receives snapshots from, the cloud storage service. The filers encrypt the snapshots and send them to the cloud store, which then propagates them to the other locations. The cloud store lives on Amazon Web Services as an encrypted object store, providing high security and high availability. Given normal AWS S3 replication parameters, this store represents an almost perfect reliability platform. Nasuni guarantees 100% availability, which derives from the eleven nines of durability that Amazon offers on its S3 storage platform. (Amazon uses the term 'durable', which is more specific, since availability also incorporates connectivity, which Amazon cannot guarantee.)
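
The snapshot model is easiest to see in miniature. The sketch below is a Git-style toy, not Nasuni's implementation (which chunks at the block level and encrypts before upload); a plain Python dict stands in for the S3 object store:

    import hashlib
    import os

    object_store = {}   # stands in for the S3 bucket

    def snapshot(root):
        """Hash every file under root; store only content not yet uploaded."""
        manifest = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    data = f.read()
                digest = hashlib.sha256(data).hexdigest()
                if digest not in object_store:   # dedup: only new content moves
                    object_store[digest] = data
                manifest[os.path.relpath(path, root)] = digest
        return manifest   # a snapshot is just a map of paths to content hashes

    def restore(manifest, dest):
        """Materialize a snapshot on another filer from the object store."""
        for relpath, digest in manifest.items():
            target = os.path.join(dest, relpath)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, 'wb') as f:
                f.write(object_store[digest])

Because unchanged files hash to objects the store already holds, a snapshot taken after a small edit moves only the delta, which is what makes the scheme viable on a 5Mb/s link.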

The filers can be configured to snapshot and replicate in near real-time for small delta changes, such as scientists' workspaces. For data sets that see a lot of change, such as the output of data collection instruments, it is more appropriate to snapshot at lower frequencies to keep the replication traffic well behaved. Furthermore, since our client's data sets are generated over multiple days, replicating them in real-time is not useful: the data is incomplete and the overhead is too high.
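
Expressed as code, the policy might look like the following; this is purely illustrative (the volume names and intervals are our assumptions, and in practice this is set through Nasuni's management console, not through code):

    # Hypothetical per-volume snapshot intervals mirroring the policy above.
    SNAPSHOT_INTERVAL_MINUTES = {
        'scientist-workspaces': 5,      # small deltas: near real-time replication
        'instrument-output':    480,    # bulk acquisitions: three snapshots per day
    }

    def due_for_snapshot(volume, minutes_since_last):
        return minutes_since_last >= SNAPSHOT_INTERVAL_MINUTES[volume]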

We also allocated a filer at the high-performance computing cloud provider. Its role differs from that of the laboratory filers: whereas the filers at the laboratories act as buffers for the data generation instruments, the filer at the compute cloud acts as an aggregation and working set manager. At the compute cloud, bioinformatics applications run against all, or large subsets, of the data and produce large new result sets, which must be replicated back to the laboratory filers. The aggregated data can be held in a big data store, such as Hadoop or Riak, so that the CSP filer's snapshot cache is not over-committed during these deep analytics compute cycles.
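
A sketch of that staging step, assuming a Hadoop cluster and a filer mounted on the compute nodes (both paths are illustrative):

    import subprocess

    # Pull final results out of HDFS onto the filer mount so the next
    # snapshot replicates them to the laboratory filers. Intermediate
    # data stays in HDFS and never touches the filer's snapshot cache.
    HDFS_RESULTS = '/user/analytics/results'   # illustrative HDFS path
    FILER_MOUNT  = '/mnt/filer/results'        # illustrative filer mount point

    subprocess.check_call(['hadoop', 'fs', '-get', HDFS_RESULTS, FILER_MOUNT])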


To test the data replication performance, we used a 60GB data set generated by the instrument over a two-day period. We used a volume snapshot rate of three times a day, and we throttled the filers' bandwidth to just 4Mb/s so that we would not severely impact normal Internet traffic. At our Cloud Service Provider, we ran a VMware-hosted filer as the target. The target filer received the complete data set in roughly 35 hours. This experiment demonstrated that even with a high-volume instrument, data movement to a CSP was not a significant source of delay. As the instrument's data acquisition rate improved, the link rate could be increased to match.
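
The measured 35 hours sits close to the theoretical floor of the throttled link, which tells us the snapshot protocol wastes very little bandwidth (quick arithmetic, assuming decimal GB-to-megabit conversion):

    # 60GB over a 4Mb/s link, ignoring protocol overhead.
    theoretical_hours = 60 * 8 * 1000 / 4.0 / 3600   # ~33.3 hours
    link_utilization  = theoretical_hours / 35.0     # ~0.95
    print(theoretical_hours, link_utilization)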

Once we had proven that data movement through the Nasuni filers and storage service performed reasonably even on very low bandwidth links, the additional features of the Nasuni storage service made the solution very attractive. IT personnel can configure the filers and the volume parameters, and manage them against the available IT resources, independently of the research teams. The research teams are presented with what is effectively an infinite storage resource and complete flexibility to compute locally or in the cloud, all the while knowing that whatever results they produce are visible to all collaborators without manual upload/download workflows.

The one attribute of concern with this cloud collaboration architecture is the cost of storage. At higher capacities, the per-TB price of the Nasuni service drops significantly, but the assumption behind the service is that all data is enterprise critical. In research-oriented environments, data sets tend to grow rapidly and without bound, and none of that data is enterprise critical. To keep from breaking the bank, we must introduce active storage management: once data has been analyzed and transformed, it should be copied to local NAS filers and removed from the Nasuni service. This controls cost and, as a positive side effect, keeps the data sets actively managed.
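
A minimal sketch of such a retirement policy (the mount points and the 90-day threshold are our assumptions; in production, this would key off the analysis pipeline's completion records rather than raw file age):

    import os
    import shutil
    import time

    NASUNI_MOUNT = '/mnt/nasuni/datasets'   # illustrative mount points
    LOCAL_NAS    = '/mnt/nas/archive'
    MAX_AGE_DAYS = 90                       # assumed retirement threshold

    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(NASUNI_MOUNT):
        src = os.path.join(NASUNI_MOUNT, name)
        if os.path.getmtime(src) < cutoff:
            # copy to cheap local NAS first, then free the Nasuni volume
            shutil.move(src, os.path.join(LOCAL_NAS, name))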

The end result of this project was that we saved the client $300k in capex, plus the overhead of managing two hardware clusters. Those clusters would have been essential to creating the value proposition of the client's innovation, but once the core analytics algorithms were developed, they would have gone unused after the first year. The instrument's high data expansion rate made a cloud computing solution non-trivial, but the Nasuni storage service provided a set-and-forget solution to the data distribution problem.


  1. A. Kahvejian, J. Quackenbush, and J.F. Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, Vol. 26, pp. 1125-1133, 2008.

Wednesday, October 3, 2012

Cloud Collaboration

Over the past 12 months, we have implemented a handful of global cloud platforms connecting the US, EU, and APAC. The common impetus behind these projects is connecting brain trusts across these geographies. Whether it is supply chains in Asia program-managed from the EU, healthcare cost reductions in the US achieved by using radiologists in India, or high-tech design teams collaborating on a new car or smart phone, all these efforts are trying to build the IT platform for the global village.

The lesson from these implementations is that cloud computing is more or less a solved problem, but cloud collaboration is far from done. From an architecture point of view, cloud collaboration faces constraints similar to those of mobile application platforms, so there is no doubt that in the next couple of years we'll see many nascent solutions to the fundamental problem of mobility and cloud collaboration: data movement.

The data sets in our US-China project measured in the tens to hundreds of TBytes, but data expansion was modest at a couple of GBytes a day. For a medical cloud computing project, the data set was more modest at 35TBytes, but data expansion could reach 100GB per day, fueled by high-volume instruments such as MRI or NGS machines. In the US-China collaboration, the problem was network latency and packet loss; in the medical cloud computing project, it was dealing with multi-site, high-volume data expansion. The cloud computing aspect of all these projects amounted to less than a couple of man-weeks of work. The cloud collaboration aspect required completely new technology development.

In the next few weeks, I'll describe the different projects, their business requirements, their IT architecture manifestation, and the key technologies that we had to develop to deliver their business value.