NEXT GENERATION BIOINFORMATICS
NEXT GENERATION CLOUD INFRASTRUCTURE
- Option 1: instrument and compute cluster per lab
- Option 2: instrument per lab, compute cluster in a remote cloud
Data retention was not a key requirement, so the cost of backup storage was not a factor in the design. Link bandwidth, however, was. The labs sat at the end of relatively low-bandwidth links, and increasing the bandwidth required a 90-day waiting period and was subject to hard caps due to infrastructure limits. The links in place were limited to 5 Mb/s: not very impressive, but surprisingly common. Increasing the bandwidth would have cost an extra $30k/year, and the 90-day wait made it even less attractive.
The capex of option 1 with two labs was about $300k, with a turnaround time of about 4 weeks. Option 2 eliminates that capex and cuts the time to get up and running to days. However, moving a 50 GB file to a remote cloud over a 5 Mb/s link would take roughly a day at the nominal line rate and likely several days at realistic throughput, and worse, this cost would have to be paid on every data acquisition. On the other hand, the early research instrument took several days to complete a data collection, so the labs' workflow was already accustomed to a long latency between experiment design and data collection.
More importantly, if the instrument takes several days to collect a large data set and that data must be migrated to a remote location, we want to overlap data acquisition with data transfer. Typical web application protocols do not work well in this regard, so HTTP and FTP were not attractive. The idea for the solution came from our use of Git. Git is a version control system built on snapshots of a file tree, and that model is a good fit here because it is seamless and robust.
The snapshotting idea led us to Nasuni, which provides a filer that snapshots to Amazon S3 and integrates this with additional enterprise features such as volume management and volume replication configuration. Nasuni is relatively expensive, starting at $10k/TB/year, but its flexibility and set-and-forget feature set made it very attractive. The question was whether file system snapshotting would work over very low-bandwidth links. If replication performance was sufficient, then managing the persistent data volume, which determines cost, would be trivial.
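The bandwidth figures above can be checked with a back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions: decimal units (1 GB = 8000 Mb), a hypothetical `efficiency` parameter for the fraction of the nominal link rate actually achieved, and a 3-day acquisition window standing in for "several days". The function names are illustrative and not part of any tool mentioned here.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_gb gigabytes over a link_mbps megabit/s link."""
    megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 Mb
    return megabits / (link_mbps * efficiency) / 3600


def acquisition_rate_mbps(size_gb: float, acquisition_days: float) -> float:
    """Average rate, in Mb/s, at which the instrument produces data."""
    return size_gb * 8 * 1000 / (acquisition_days * 86400)


if __name__ == "__main__":
    size_gb, link_mbps = 50, 5

    # Bulk transfer after acquisition has finished, at full and half line rate.
    print(f"bulk transfer, 100% of line rate: {transfer_hours(size_gb, link_mbps):.1f} h")
    print(f"bulk transfer,  50% of line rate: {transfer_hours(size_gb, link_mbps, 0.5):.1f} h")

    # Overlapped transfer: the link only has to keep up with the instrument.
    rate = acquisition_rate_mbps(size_gb, acquisition_days=3)  # assumed 3-day run
    print(f"instrument output rate: {rate:.2f} Mb/s on a {link_mbps} Mb/s link")
```

Under these assumptions the instrument produces data more slowly than even a 5 Mb/s link can carry it, which is why overlapping acquisition with transfer is attractive: the transfer can finish shortly after the acquisition does instead of adding a separate bulk-copy step.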
The filers can be configured to snapshot and replicate in real time for volumes that see small incremental changes, such as scientists' work spaces. For data sets that change heavily, such as the output of data collection instruments, it is more appropriate to snapshot at a lower frequency so that replication produces well-behaved traffic. Furthermore, since our client's data sets are generated over multiple days, replicating them in real time is not useful: the data is not yet complete and the overhead is too high.
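To illustrate why snapshot frequency matters, the sketch below models a snapshot as a map from file path to content hash and reports only the files whose hash changed since the previous snapshot. It is a simplified illustration of snapshot-based replication in general, not a description of how Nasuni's filers work internally; `take_snapshot` and `changed_files` are hypothetical names, and deletions are ignored for brevity.

```python
import hashlib
import tempfile
from pathlib import Path


def take_snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 digest of its contents."""
    snapshot = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            snapshot[path.relative_to(root).as_posix()] = digest
    return snapshot


def changed_files(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Files that are new or modified since the previous snapshot."""
    return sorted(name for name, digest in current.items()
                  if previous.get(name) != digest)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "notes.txt").write_text("analysis notes v1")
        (root / "run.dat").write_bytes(b"partial instrument output")
        before = take_snapshot(root)

        # The instrument keeps appending to run.dat; notes.txt is untouched.
        with (root / "run.dat").open("ab") as f:
            f.write(b" ... more reads")
        after = take_snapshot(root)

        # Only the files listed here would need to cross the link.
        print(changed_files(before, after))
```

In this whole-file model, a file that is rewritten many times between two snapshots is still transferred only once, so a lower snapshot frequency suits instrument output that keeps changing until the run is complete.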
We also allocated a filer at the high-performance computing cloud provider. Its role differs from that of the filers at the laboratories. Whereas the laboratory filers act as a buffer for the data-generating instruments, the filer at the compute cloud acts as an aggregation point and working set manager. The bioinformatics applications that run at the compute cloud use all, or large subsets, of the data and produce large new result sets, which need to be replicated back to the laboratory filers. The aggregated data can be held in a big data store, such as Hadoop or Riak, so that the snapshot cache of the filer at the cloud service provider (CSP) is not over-committed during these deep analytics compute cycles.
- A. Kahvejian, J. Quackenbush, and J.F. Thompson, What would you do if you could sequence everything?, Nature Biotechnology, 2008, Vol. 26, pp 1125-1133, http://www.nature.com/nbt/journal/v26/n10/full/nbt1494.html