Friday, January 25, 2013

On-Demand Supercomputing

Except for the folks at Cray, most people are unaware of the unique requirements that set supercomputing infrastructure apart from cloud computing infrastructure. In its simplest form, the difference is between latency and capacity. For business intelligence applications such as optimization and logistics, many servers are required to solve a single problem, and low-latency communication between the servers is instrumental for performance. The intuition is easy to understand: a modern microprocessor executes 4-5 instructions per 250ps clock cycle, so the 5-50usec packet latencies of 10GbE are roughly equivalent to 100k to 1M processor instructions. If a processor depends on results computed by another processor, it has to idle until the data is available. Cumulatively, across a couple hundred servers, this can lead to sustained performance that is only 1-5% of peak.
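
As a back-of-the-envelope illustration, here is a minimal sketch of how network latency translates into lost instruction slots; the clock rate and IPC values are the assumptions stated above, not measurements of a specific processor:

```python
# Back-of-the-envelope: how many instruction slots does one network round trip cost?
# Assumed values: 4 GHz clock (250 ps/cycle) and 4-5 instructions per cycle.

CYCLE_TIME_PS = 250          # assumed 4 GHz core
INSTR_PER_CYCLE = (4, 5)     # assumed sustained IPC range

def instructions_lost(latency_usec: float) -> tuple[int, int]:
    """Instruction slots a core could have executed while waiting on the network."""
    cycles = latency_usec * 1e6 / CYCLE_TIME_PS   # usec -> ps -> cycles
    return tuple(int(cycles * ipc) for ipc in INSTR_PER_CYCLE)

for latency in (5, 50):      # typical 10GbE packet latencies in usec
    low, high = instructions_lost(latency)
    print(f"{latency:>3} usec latency ~ {low:,} to {high:,} instructions")
# ->   5 usec latency ~ 80,000 to 100,000 instructions
# ->  50 usec latency ~ 800,000 to 1,000,000 instructions
```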

Supercomputing applications are defined by these kinds of tightly coupled concurrent processes, which puts more emphasis on the performance of the interconnect, particularly its latency. Running a traditional supercomputing application on an infrastructure designed for elastic applications, such as AWS or Azure, typically yields slowdowns by a factor of 50 to 100. Measured in terms of cost, such applications would cost 50-100 times more to execute on a typical public cloud computing infrastructure.

Most supercomputing applications are associated with very valuable economic activities of the business. As mentioned earlier, production optimization and logistics applications save companies like Exxon Mobil and FedEx billions of dollars per year. These applications are tightly integrated into the business operations and strategic decision making of these organizations and pay for themselves many times over. For the SMB market, supercomputing applications offer the same opportunity for revenue growth and margin improvement. However, their economic value scales with the revenue stream they optimize: a 10% improvement on a $10B revenue stream yields a $1B net benefit, but for a $10M revenue stream the benefit is just $1M, not enough to compensate for the risk and cost of deploying a supercomputer.

Enter On-Demand Supercomputing.

In 2011, we were asked to design, construct, and deploy an On-Demand supercomputing service for a Chinese cloud vendor. The idea was to build an interconnected set of supercomputer centers in China and offer a multi-tenant on-demand service for high-value, high-touch applications such as logistics, digital content creation, and engineering design and optimization. The pilot program consisted of one supercomputer center in Beijing and one in Shanghai. The basic building block was a quad-rack, redundant QDR IB fat-tree architecture with blade chassis at the leaves. The architecture was inspired by the observation that for the SMB market, the granularity of deployment falls in the range of 16 to 32 processors, which can be serviced by a single chassis, keeping all communication traffic local to the chassis. The topology is shown in the following figure:
Redundant QDR IB Network Topology for On-Demand Supercomputing
The chassis structure is easy to spot as the clusters of 20 servers at the leaves of the tree. The redundancy of the IB network is also clearly visible in the pairs of connections between all the layers of the tree. The quad configuration consists of two symmetric rack pairs, each pair holding one side of the redundant IB network, storage, and compute. Half the quad can fall away and the system still has full connectivity between storage and compute nodes. To lower the cost of the system, storage was designed around IB-based storage servers that plug into the same fabric as the compute nodes. QDR throughput is balanced with PCIe gen2, so we were able to deliver ephemeral blades that get their personality from the storage servers and then dynamically connect via iSCSI to whatever storage volumes they require. This is less expensive than designing a separate NAS storage subsystem, and it gives the infrastructure the flexibility to build high-performance storage solutions. It was this system that set a new world record as the first trillion-triple semantic database, leveraging a Lustre file system consisting of 8 storage servers (trillion-triple-semantic-database-record).
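
To make the redundancy argument concrete, here is a minimal toy model of the topology; the chassis count and plane naming are assumptions for illustration, not the actual switch inventory. It checks that every chassis still reaches the fabric when one whole side of the quad fails:

```python
# Minimal model of a redundant two-level fat tree: each chassis (leaf) uplinks to
# two independent spine planes (the 'A' and 'B' sides of the quad). Sizes assumed.
from itertools import product

NUM_CHASSIS = 8          # assumption for illustration
SERVERS_PER_CHASSIS = 20 # matches the clusters of 20 servers at the leaves
PLANES = ("A", "B")      # the two redundant IB planes (one per rack pair)

# Adjacency: every chassis has one uplink into every spine plane.
links = {(chassis, plane) for chassis, plane in product(range(NUM_CHASSIS), PLANES)}

def connected(failed_plane: str) -> bool:
    """True if every chassis still reaches a spine after losing one whole plane."""
    surviving = {(c, p) for (c, p) in links if p != failed_plane}
    return all(any((c, p) in surviving for p in PLANES) for c in range(NUM_CHASSIS))

print(connected("A"))  # True: the B side still connects all chassis to storage
print(SERVERS_PER_CHASSIS * NUM_CHASSIS, "servers total in this toy model")
```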

Provisioning of the on-demand supercomputing infrastructure is bare metal, mostly to avoid the I/O latency degradation that virtualization injects. Given the symmetry between storage and compute and the performance offered by QDR IB, a network boot mechanism can put any personality on the blades without any impact on performance. The blades have local disk for scratch space, but run their OS and data volumes off the storage servers, which avoids the problem of disaster recovery of state on the blades.
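
As an illustration of the ephemeral-blade model, here is a hypothetical sketch of the personality record a provisioning service might maintain; the image paths, IQNs, and tenant names are invented for illustration and are not the deployment's actual configuration:

```python
# Hypothetical blade-personality record: a bare-metal blade network-boots an OS
# image and then attaches its data volumes over iSCSI from the storage servers.
from dataclasses import dataclass, field

@dataclass
class BladePersonality:
    blade_id: str
    boot_image: str                      # OS image served at network boot
    iscsi_targets: list[str] = field(default_factory=list)  # data volumes
    scratch_local: bool = True           # local disk is only used for scratch

def assign(blade_id: str, tenant: str) -> BladePersonality:
    """Build a personality for a blade; all names below are illustrative only."""
    return BladePersonality(
        blade_id=blade_id,
        boot_image=f"images/{tenant}/compute-node.img",
        iscsi_targets=[f"iqn.2011-01.example:{tenant}.data0",
                       f"iqn.2011-01.example:{tenant}.data1"],
    )

print(assign("blade-r1c3-07", "dcc-studio"))
```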

The QDR IB infrastructure was based on Voltaire switches and Mellanox HCAs. Intel helped us tune the infrastructure, using their cluster libraries for the processors we were using, and Mellanox was instrumental in getting the IB switches in shape. Over a three-week period, we went from 60% efficiency to about 94% efficiency. The full quad has a peak performance of 19.2TFlops, and after tuning the infrastructure we were able to consistently deliver 18TFlops of sustained performance.

The total cost of the core system was on the order of $3.6M. The On-Demand Supercomputing service offers a full dual-socket server with 64GB of memory for about $5/hr, providing a cost-effective service for SMBs interested in leveraging high performance computing. For example, a digital content creation firm in Beijing leveraged about 100 servers as burst capacity for post-production; their cost to leverage a state-of-the-art supercomputer was less than $20k per month. Similarly, a chemical manufacturer developed a materials science application to study epitaxial growth. This allowed the manufacturer to optimize the process parameters of a thin-film process, an analysis that would not have been cost-effective on a cloud infrastructure designed for elastic web applications.
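
To reconcile the $5/hr rate with the sub-$20k monthly bill, a quick cost sketch; the burst-hours figure is an assumption used for illustration, not a measured utilization:

```python
# Burst-capacity cost sketch: 100 servers at $5/hr adds up to ~$20k/month only if
# the servers run a limited number of hours; the hours-per-month here is assumed.
RATE_PER_SERVER_HOUR = 5.0
SERVERS = 100
BURST_HOURS_PER_MONTH = 40          # assumption: post-production burst, not 24/7

monthly = RATE_PER_SERVER_HOUR * SERVERS * BURST_HOURS_PER_MONTH
always_on = RATE_PER_SERVER_HOUR * SERVERS * 24 * 30
print(f"burst usage : ${monthly:>10,.0f}/month")    # ~$20,000
print(f"24/7 usage  : ${always_on:>10,.0f}/month")  # $360,000 for comparison
```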

The take-away of this project echoes the findings in the missing-middle reports for digital manufacturing (Digital Manufacturing Report). There is tremendous opportunity for SMBs to improve business operations by leveraging the same techniques as their enterprise brethren. But the cost of commercial software for HPC is not consistent with the value it provides to SMBs. Furthermore, the IT and operational skills required to set up and manage a supercomputing infrastructure are beyond the capabilities of most SMBs. On-demand HPC services, as we have demonstrated with the supers in Beijing and Shanghai, can overcome many of these issues. Most importantly, they enable a new level of innovation by domain experts, such as professors and independent consultants, who have the skills necessary to leverage supercomputing techniques but, up to now, have not had access to public supercomputing capability and services.





Thursday, October 4, 2012

Cloud Collaboration: Bioinformatics


In this multi-part series, we discuss several cloud computing projects that we have completed over the course of the past 12 months. They range from experiments in computational science and engineering to full production systems for cloud service providers. This week, we'll discuss a cloud collaboration solution designed to bring together collaborators at different bioinformatics labs and to connect them to an elastic cloud computing facility that offers FPGA-accelerated servers, a capability that is increasingly important for tackling the exponentially increasing compute load of bioinformatics research.

NEXT GENERATION BIOINFORMATICS


The importance of next-generation sequencers (NGS) to personalized healthcare is well documented [1]. Conventional medical diagnosis relies on the analysis of the patient's personal and family history, symptoms, and samples. The goal of personalized healthcare is to strengthen diagnostics by including comparisons between the patient's genome and a global reference database of known disease markers. Sample testing will also be enhanced through highly sensitive measurement of adaptive immune system parameters and molecular-level monitoring of the progression of different cancers. Outside human health, NGS will touch every aspect of our understanding of the living world, and help improve food safety, control pests, reduce pollution, find alternative sources of energy, etc.

At the lowest level, the bioinformatic questions fall into two broad classes: (i) comparing next-gen reads against reference genomes, and (ii) assembling next-gen sequences to build reference genomes. Finding scalable and cost-effective solutions to these questions is a major challenge. Even the best algorithms need large amounts of fast and costly computer memory (50-1024GB) and thousands of processor-hours to complete an analysis of a mammalian-sized genome. Reducing execution time improves the quality of the results by allowing the analysis to be run with many different parameters.
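
To give a flavor of class (i), here is a minimal k-mer seed-lookup sketch; it is not any production aligner, and real tools use compressed indexes plus extension and scoring, but it illustrates the basic idea of matching reads against an indexed reference:

```python
# Toy seed lookup for read-vs-reference comparison: index the reference by k-mers,
# then report candidate positions for each read seed. Real aligners use compressed
# indexes and extension/scoring stages; this only shows the idea.
from collections import defaultdict

def build_kmer_index(reference: str, k: int = 11) -> dict[str, list[int]]:
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(read: str, index: dict[str, list[int]], k: int = 11) -> set[int]:
    """Reference offsets where some k-mer of the read matches exactly."""
    hits = set()
    for i in range(len(read) - k + 1):
        for pos in index.get(read[i:i + k], []):
            hits.add(pos - i)            # align the hit back to the read start
    return hits

reference = "ACGTACGTTAGCCGATAGGCTTACGATCGATCGGATCCTAG"
index = build_kmer_index(reference)
print(candidate_positions("GCCGATAGGCTTA", index))  # {10} in this toy reference
```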

The computational landscape is changing rapidly with the discovery of new algorithms and the proliferation of reference genomes. This puts tremendous pressure on data management and compute capacity, for which a seamless solution needs to be found.

Our client was a start-up designing a new instrument. This research instrument could generate up to 100GBytes of data per day and would run continuously for several days for a single study. The raw data had to go through several stages of deep analytics to identify and correct data acquisition errors, apply data reduction and reconstruction operators, and keep track of historical performance to identify the best algorithms and to guide the research direction and software development. The algorithms required high-throughput compute clusters, and the R&D talent that developed these algorithms was geographically dispersed throughout the US. The management and procurement of these clusters was beyond the capability and capital resources of this start-up, and on-demand cloud services could solve this CAPEX problem. However, integrating multiple remote laboratories into a productive collaborative space required a robust and cost-effective solution to file replication, so that the data from each instrument, and all subsequent analysis results, would be readily available to the distributed team.

NEXT GENERATION CLOUD INFRASTRUCTURE

The core operation of the instrument was based on data acquisition of high-frequency RF signals and the use of chemicals to prepare samples. This demanded that the physical machine reside in a well-controlled laboratory environment, a very different environment from a typical data center room. The compute demands of the research algorithms were roughly in the 50TOPS range and highly latency sensitive. The cost to create that raw hardware capacity was on the order of $150k per lab, not including operational staff. The utilization of that equipment would have been very low, particularly in the beginning phases of the project when the instrument would only generate data once per month. Allocating the compute capacity for the research algorithms in a remote cloud solves the capex and utilization problem, but introduces a data movement problem. What did this trade-off look like?

We evaluated two organizations:
  1. instrument and compute cluster per lab
  2. instrument per lab, compute cluster in a remote cloud


Data retention was not a key attribute, and thus the cost of data storage for backups was not a factor in the design. Link bandwidth, however, was a key attribute. The labs were at the end of relatively low-bandwidth links, and increasing bandwidth required a 90-day waiting period with hard caps due to infrastructure limits. The links in place were limited to 5Mb/s! Not very impressive, but surprisingly common. Increasing the bandwidth would have cost an extra $30k/year, and the 90-day wait made this unattractive as well.

The capex of option 1 with two labs was about $300k, with about four weeks of turnaround time. That capex would go away in option 2, and the turnaround time to get up and running would shrink to days. However, at 5Mb/s, moving a 50GB file to a remote cloud takes roughly a day, and worse, this cost would have to be paid for every data acquisition. The early research instrument took several days for a data collection, so the labs' workflow was already used to a long latency between experiment design and data collection. More importantly, if the instrument takes several days to collect a large data set and that data has to be migrated to a remote location, we want to overlap data acquisition with data transfer. Typical web application protocols don't work well in this regard, so HTTP and FTP are not attractive. The idea for the solution came from our use of Git. Git is a versioning system based on snapshotting a file system: that model is exactly the right technology, as it is seamless and robust. This snapshotting idea led us to Nasuni, which provides a filer that snapshots to Amazon S3 and integrates this with additional enterprise features such as volume management and volume replication configuration. Nasuni is relatively expensive, starting at $10k/TB/year, but the flexibility and set-and-forget feature set made it very attractive. The open question was whether file system snapshotting would work over very low bandwidth links. If the replication performance was sufficient, then cost would simply be determined by the size of the persistent data volume, which is trivial to manage.
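
The trade-off is easy to quantify. Here is a minimal sketch of the ideal transfer-time arithmetic; protocol overhead pushes the real numbers somewhat higher, as the 35-hour replication test described in the conclusions confirms:

```python
# Transfer time for a data set over a constrained WAN link (ideal, no overhead).
def transfer_hours(dataset_gbytes: float, link_mbps: float) -> float:
    bits = dataset_gbytes * 8e9          # GB -> bits (decimal GB)
    return bits / (link_mbps * 1e6) / 3600

print(f"{transfer_hours(50, 5):.1f} h")  # 50 GB at 5 Mb/s -> ~22 h ideal
print(f"{transfer_hours(60, 4):.1f} h")  # 60 GB at 4 Mb/s -> ~33 h ideal
# The measured replication of the 60 GB set took ~35 h, i.e. overhead was modest.
```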

SYSTEM ARCHITECTURE


 To create a robust and reliable next generation IT infrastructure, we designed and implemented a real-time distributed collaboration grid, as depicted in the figure below.



Each lab is allocated a Nasuni filer with a file cache large enough to hold the local working set. The Nasuni service turns the filers into a distributed file system: each filer replicates its snapshots to, and receives snapshots from, the cloud storage service. The filers encrypt the snapshots and send them to the cloud store, which propagates the snapshots to the other locations. The cloud store lives on Amazon Web Services as an encrypted object store, providing high security and high availability. Given normal AWS S3 replication parameters, this store represents an almost perfect reliability platform. Nasuni guarantees 100% availability, which derives from the eleven nines of durability that Amazon offers on its S3 storage platform. (Amazon uses the term 'durable', which is more specific, since availability also incorporates connectivity, which Amazon cannot guarantee.)

The filers can be configured to snapshot and replicate in near real time for small delta changes, such as scientists' work spaces. For data sets that see a lot of change, such as the output of data collection instruments, it is more appropriate to snapshot at lower frequencies to keep the replication traffic well behaved. Furthermore, since our client's data sets are generated over multiple days, it is not useful to replicate them in real time: the data is not complete and the overhead is too high.
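
A simple way to reason about snapshot frequency is to check whether the link can drain changes faster than they are produced. Here is a small sketch of that check; the change rates are illustrative assumptions, with the instrument figure taken from the ~100GB/day rate mentioned earlier:

```python
# Decide whether a volume can be snapshotted/replicated in near real time:
# feasible only when the link drains data faster than the volume changes.
def link_drain_gb_per_hour(link_mbps: float) -> float:
    return link_mbps * 1e6 * 3600 / 8e9

def realtime_feasible(change_gb_per_day: float, link_mbps: float) -> bool:
    return change_gb_per_day / 24 < link_drain_gb_per_hour(link_mbps)

print(link_drain_gb_per_hour(5))  # ~2.25 GB/h on a 5 Mb/s link
print(realtime_feasible(2, 5))    # scientist workspace deltas: True
print(realtime_feasible(100, 5))  # instrument mid-run (~100 GB/day): False -> batch
```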

We also allocated a filer at the high performance computing cloud provider. The role of this filer is different from that of the filers at the laboratories. Whereas the laboratory filers function as buffers for the data generation instruments, the filer at the compute cloud functions as an aggregation and working set manager. At the compute cloud, bioinformatic applications are run that use all, or large subsets, of the data and produce large new sets of results, which need to be replicated back to the laboratory filers. The aggregated data can be held in a big data store, such as Hadoop or Riak, so that the CSP filer's snapshot cache is not over-committed during these deep analytics compute cycles.

Conclusions

To test the data replication performance, we used a 60GB data set that was generated by the instrument over a two-day period. We used a volume snapshot rate of three times a day, and we throttled the filers' bandwidth to only 4Mb/s so that we would not severely impact normal Internet traffic. At our cloud service provider, we ran a VMware-hosted filer as the target. The Nasuni filer received our data set in roughly 35 hours. This experiment demonstrated that even with a high-volume instrument, data movement to a CSP was not a significant source of delay. When the instrument improves its data acquisition rates, the link rates can be upgraded to match.

Once we had proven that data movement using the Nasuni filers and storage service performed reasonably even on very low bandwidth links, the additional features provided by the Nasuni storage service made this solution very attractive. IT personnel can configure the filers and the volume parameters, and manage these against the IT resources available, independently from the research teams. The research teams are presented with an effectively infinite storage resource and complete flexibility to compute locally or in the cloud, all the while knowing that whatever results they produce are visible to all collaborators without having to manually engage upload/download workflows.

The one attribute of concern with this cloud collaboration architecture is the cost of storage. At higher storage capacities the Nasuni service drops significantly in price, but the assumption behind the service is that all data is enterprise critical. In research-oriented environments, data sets tend to grow rapidly and without bound, but none of it is enterprise critical. To avoid breaking the bank, we must introduce active storage management: once data has been analyzed and transformed, it should be copied to local NAS filers and removed from the Nasuni service. This controls cost and, as a positive side effect, keeps the data sets actively managed.
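
A minimal sketch of what such an archival policy could look like; the mount points, completion marker, and age threshold are assumptions for illustration, not the client's actual setup:

```python
# Hypothetical tiering policy: once a data set has been analyzed and has aged out
# of the active working set, copy it to a local NAS and remove it from the (more
# expensive) cloud-replicated volume. Paths and thresholds are illustrative.
import shutil, time
from pathlib import Path

NASUNI_VOLUME = Path("/mnt/nasuni/research")     # assumed mount point
LOCAL_NAS = Path("/mnt/nas/archive")             # assumed archive target
MAX_AGE_DAYS = 90                                # assumed retention threshold

def archive_analyzed_datasets() -> None:
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for dataset in list(NASUNI_VOLUME.iterdir()):
        marker = dataset / "ANALYSIS_COMPLETE"   # assumed completion marker
        if dataset.is_dir() and marker.exists() and dataset.stat().st_mtime < cutoff:
            shutil.copytree(dataset, LOCAL_NAS / dataset.name, dirs_exist_ok=True)
            shutil.rmtree(dataset)               # free the cloud-replicated tier

if __name__ == "__main__":
    archive_analyzed_datasets()
```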

The end result of this project was that we saved the client $300k in capex and the overhead of managing two hardware clusters. Those clusters would have been essential for creating the value proposition of the client's innovation, but once the core analytics algorithm was developed, they would have gone unused after the first year. The high data volume of the instrument made a cloud computing solution non-trivial, but the Nasuni storage service provided a set-and-forget solution to the data distribution problem.

References

  1. A. Kahvejian, J. Quackenbush, and J.F. Thompson, What would you do if you could sequence everything?, Nature Biotechnology, 2008, Vol. 26, pp 1125-1133, http://www.nature.com/nbt/journal/v26/n10/full/nbt1494.html

Wednesday, October 3, 2012

Cloud Collaboration

Over the past 12 months, we have implemented a handful of global cloud platforms connecting the US, EU, and APAC. The common impetus behind these projects is to connect brain trusts across these geographies. Whether they are supply chains in Asia program-managed from the EU, healthcare cost improvements in the US enabled by radiologists in India, or high-tech design teams collaborating on a new car or smart phone design, all these efforts are trying to build the IT platform that creates the global village.

The lesson from these implementations is that cloud computing is more or less a solved problem, but cloud collaboration is far from done. From an architectural point of view, cloud collaboration faces constraints similar to those of mobile application platforms, so there is no doubt that in the next couple of years we'll see many nascent solutions to the fundamental problem of mobility and cloud collaboration: data movement.

The data sets in our US-China project measured in the range of tens to hundreds of TBytes, but data expansion was modest at a couple of GBytes a day. For a medical cloud computing project, the data set was more modest at 35TBytes, but the data expansion could be as high as 100GB per day, fueled by high-volume instruments such as MRI or NGS machines. In the US-China collaboration, the problem was network latency and packet loss, whereas in the medical cloud computing project the problem was how to deal with multi-site, high-volume data expansion. The cloud computing aspect of all these projects was literally less than a couple of man-weeks' worth of work. The cloud collaboration aspect required completely new technology development.

In the next few weeks, I'll describe the different projects, their business requirements, their IT architecture manifestation, and the key technologies that we had to develop to deliver their business value.


Thursday, September 29, 2011

Amazon Silk: split browser architecture

Amazon Silk

Content Delivery Networks and WAN optimization provide a generic acceleration solution that brings common content closer to the client device, but on mobile devices the delivery performance of the last mile is still a problem. Many websites still do not have mobile-optimized content, and pulling down a 3Mpixel JPG to render it on a 320x240 pixel display is just plain wrong. With the introduction of Amazon Silk, which uses the cloud to aggregate, cache, precompile, and predict, the client-side experience can now be optimized for the device that everybody clamors for: the tablet.

This is going to create an even bigger disconnect between the consumer IT experience and the enterprise IT experience. On the Kindle Fire you will be able to pull up common TV video clips nearly instantaneously and connect to millions of books. But most enterprises will find it difficult to invest in the WAN optimization gear that would replicate that experience on the corporate network for day-to-day work.

Amazon Silk is another example of the power that the cloud provides for doing heavy computes and caching that enables low-capability devices to roam.

Wednesday, September 14, 2011

Trillion Triple Semantic Database

The Semantic Web captures the semantics, or meaning, of data, and enables machines to interact with that metadata. It is an idea of WWW pioneer Tim Berners-Lee, who observed that although search engines index much of the Web's content, keywords can only provide an indirect association with the meaning of an article's content. He foresees a number of ways in which developers and authors can create and use the Semantic Web to help context-understanding programs better serve knowledge discovery.

Tim Berners-Lee originally expressed the vision of the Semantic Web as follows:
I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The intelligent agents people have touted for ages will finally materialize.

The world of semantic databases just got a little bit more interesting with the announcement by Franz, Inc. and Stillwater SC of having reached a trillion triple semantic data store for telecommunication data.

http://www.franz.com/about/press_room/trillion-triples.lhtml

The database was constructed on an HPC on-demand cloud service and occupied 8 compute servers and 8 storage servers. The compute servers contained dual-socket Xeons with 64GB of memory, connecting through a QDR IB network to a 300TB SAN. The trillion-triple data set spanned roughly 100TB of storage. It took roughly two weeks to load the data, but after that the database provided interactive query rates for knowledge discovery and data mining.
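
For readers unfamiliar with triple stores, here is a tiny illustration of what triples and a query look like, using the open-source rdflib package and invented example data rather than the AllegroGraph deployment that produced the record:

```python
# Tiny illustration of a semantic (RDF) triple store: facts are subject-predicate-
# object triples, and queries pattern-match over them. Example data is invented;
# the trillion-triple record itself was built on AllegroGraph, not rdflib.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/telecom/")
g = Graph()
g.add((EX.subscriber42, EX.calls, EX.subscriber77))
g.add((EX.subscriber42, EX.hasPlan, Literal("prepaid")))
g.add((EX.subscriber77, EX.hasPlan, Literal("enterprise")))

# Who does subscriber42 call, and what plan is the callee on?
results = g.query("""
    PREFIX ex: <http://example.org/telecom/>
    SELECT ?callee ?plan WHERE {
        ex:subscriber42 ex:calls ?callee .
        ?callee ex:hasPlan ?plan .
    }""")
for callee, plan in results:
    print(callee, plan)   # http://example.org/telecom/subscriber77 enterprise
```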

The gear on which this result was produced is traditional HPC gear that emphasizes scalability and a low-latency interconnect. As a comparison, a billion-triple version of the database was created on Amazon Web Services, but its performance was roughly 3-5x slower. Creating a trillion-triple semantic database on AWS would have cost $75k and would have taken 6 weeks to complete.

Monday, July 4, 2011

What would you do with infinite computes?

Firing up a 1000-processor deep analytics cluster in the cloud to solve a market segmentation question about your customer orders during Christmas 2010, or to run a sentiment analysis of your company's Facebook fan page, now costs less than having lunch in Palo Alto.

The cloud effectively provides infinite computes, and to some degree infinite storage, although the cost of non-ephemeral storage might muddy that analogy a bit. So what would you do differently now that you have access to a global supercomputer?

When I pose this question to my clients, it quickly reveals that their business processes are ill-prepared to take advantage of this opportunity. We are roughly half a decade into the cloud revolution, and at least a decade into the 'competing on analytics' mind set, but the typical enterprise IT shop is still unable to make a difference in the cloud.

However, change may be near. Given the state of functionality in software stacks like RightScale and Enstratus, we might see a discontinuity in this inability to take advantage of the cloud. These stacks are getting to the point where an IT novice can provision complex applications into the cloud. Supported by solid open-source provisioning stacks like Eucalyptus and Cloud.com, building reliable and adaptive software service stacks in the cloud is becoming child's play.

What I like about these environments is that they are cloud agnostic. For proper DR/BCP, a single cloud provider would be a single point of failure and thus a non-starter. These tools make it possible to run a live application across multiple cloud vendors, thus meeting the productivity and agility requirements that come with the territory of an Internet application.

Saturday, November 27, 2010

Why is there so little innovation in cloud hardware?

The explosion of data, and the need to make sense of it all on a smart phone, is creating an interesting opportunity. Mobile devices need high performance at low power, and Apple seems to be the only one that has figured out that having your own processor team and IP is actually a key advantage. The telcos will need petascale data centers to manage content, knowledge management, and operational intelligence, and the performance per Watt of general-purpose CPUs from IBM, Intel, and AMD is at least two orders of magnitude away from what is possible. So why is there so little innovation in cloud hardware?

The rule of thumb for creating a new chip venture is $50-75M. Clearly, a model where your product is just an aggregation of third-party IP blocks is not a very interesting investment, as it would create no defensible position in the marketplace. So, from a differentiation point of view, early-stage chip companies need to have some unique IP, and this IP needs to be substantial. This creates the people and tool costs that make chip design expensive.

Secondly, to differentiate on performance, power, or space, you have to be at or close to the leading edge. When Intel is at 32nm, don't pick 90nm as a feasible technology. So mask costs are measured in the millions for products that try to compete in high-value silicon.

Thirdly, it takes at least two product cycles to move the value chain. Dell doesn't move until it can sell 100k units a month, and ISVs don't move until there are millions of units of installed base. So the $50M-$75M needed for a fabless semiconductor venture comes from the fact that creating new IP is a $20-25M problem when presented to the market as a chip, it takes two cycles to move the supply chain, and it takes three cycles to move the software.

The market dynamics of IT have created this situation. It used to be the case that the enterprise market drove silicon innovation. However, the enterprise market is now dragging the silicon investment market down. Enterprise hardware and software is no longer the driving force: innovation is now driven by the consumer market, and that game is played and controlled by the high-volume OEMs. Their cost constraints and margins make delivering IP to these OEMs very unattractive: they hold all the cards and attenuate pricing, so continued engineering innovation is hard to sustain for a startup. Moreover, an OEM is not interested in a third party creating unique IP: it would reduce the OEM's leverage. So you end up supplying only the non-differentiable pieces of technology in a race to the bottom.

Personally, however, I believe that there is a third wave of silicon innovation brewing. When I calculate the efficiency that Intel gets out of a square millimeter of silicon and compare that to what is possible, I see a thousandfold difference. So there are tremendous innovation possibilities from an efficiency point of view alone. Combine that with the trend to put intelligence into every widget and connect them wirelessly, and you have an application space where efficient silicon that delivers high performance per Watt can really shine AND has a large potential market. Mixed-signal and new processor architectures will be the differentiators, and the capital markets will at some point recognize the tremendous opportunities to create the next-generation technology behind these intelligent platforms.

Until then, those of us pushing the envelope will continue to refine our technologies so we can be ready when the capital market catches up with the opportunities.