High-productivity Cloud computing

Thursday, September 6, 2018

What makes posits so good?

Computational science and all its applied forms, statistics, Deep Learning, computational finance, computational biology, etc., depend on a discrete model of equations that describe the system under study and some operator to extract information from that model. The vast majority of these operators are expressed in linear algebra form, and the dominant sub-operation in these operators is a dot product between two vectors. With the advent of big data and Deep Learning, tensor expressions have become wide-spread, but the basic operations remains squarely the dot product, as can be observed by the key operator that DL accelerators like the Google TPU implement in hardware.

Let's take a closer look at the numerical properties of that dot product. Let:

vector a = ( 3.2e8, 1.0, -1.0, 8.0e7)
vector b = ( 4.0e7, 1.0, -1.0, -1.6e8)
what would be the value of dot(a, b)? Simple inspection confirms that a[0]b[0] cancels a[3]b[3] and thus the answer is 2.

Interestingly enough, neither 32-bit floats nor 64-bit double IEEE floating point produce the correct answer when calculating this dot product in order. In contrast, even a very small posit, a 16-bit posit with 2 exponent bits, is able to produce the correct result. What gives?

The individual element values of the vector are relatively ordinary numbers. But in dot products it is the dynamic range of products that the number system needs to capture, and this is where IEEE floating point fails us. IEEE floating point rounds after each product and thus the order of the computation can change the result. This is particularly troubling for concurrent systems, such as multi-core and many-core, where the programmer typically gives up control.

In our particular example, the products a[0]b[0] and a[3]b[3] represent 1.28e16 and -1.28e16 respectively, and to be able to represent the sum of 1.28e16 + 1 requires 53bits of fraction, which 64-bit IEEE doubles do not have, creating cancellation and in the end an incorrect sum.

Posits, on the other hand, have access to a fused dot product, which uses a quire, which can be thought of as a super accumulator that enables the computation to defer rounding till the end of the dot product. This makes it possible for a 16-bit posit to beat a 64-bit IEEE double.,

In general, posits with their finer control over precision and dynamic range, present an improvement over IEEE floating point for computational science applications. In particular, business intelligence and decision support systems benefit from posits and their quires, as statistics algorithms tend to aggregate lots of data to render an assessment on which a decision depends. As the example above demonstrates, that assessment can be very wrong when using the wrong approach.

Wednesday, September 5, 2018

Posit Number System: replacing IEEE floating point

The leaders in computing are now profiting from their investments in new number systems initiated half a decade ago. NVIDIA transformed the AI ecosystem with their 16-bit, half-precision float providing a boost over Intel enabling them to gain market share in the valuable data center market. Google designed the Tensor Processing Unit to accelerate AI cloud services; the TPU uses an 8-bit integer format to create a 100x benefit over its competitors. Microsoft is using an 8-bit floating point with 2-bit exponents for its Project Brainwave. And China’s Huawei is using a fixed-point format for its 4G/5G base stations to gain performance per Watt benefit over its US competitors who still use IEEE floating point.

All these companies realized that Moore’s Law and Denning’s scaling having reached a plateau, and the efficiency of computation is now a direct limiter on performance, scaling, and power. For Deep Learning specifically, and High-Performance Computing in general, IEEE floating point has shown its deficiencies in silicon efficiency, information density, and even mathematical accuracy.

The posit number system is positioned as a replacement of IEEE floating point, and offer significant improvements including performance per Watt, information density, and reproducibility. The posit number system is a tapered floating point with very efficient encoding of real numbers. It has only two exceptional values; zero and NaR (not-a-real). The posit encoding improves precision compared to floats of the same bit-width, which leads to higher performance and lower cost for big-data applications. Furthermore, the posit standard defines rules for reproducibility in concurrent environments enabling high-productivity and lower-cost for software application development for multi-core and many-core deployments.

In the following blog posts, we will introduce the posit number system format, and report on experiments that compare IEEE floats to posits in real applications. Here is a reference to a full software environment for you tinkerers: http://stillwater-sc.com/assets/content/stillwater-universal-sw.html

Friday, January 25, 2013

On-Demand Supercomputing

Except for the folks at Cray, most people are unaware of the unique requirements that set apart supercomputing infrastructure from cloud computing infrastructure. In its simplest form the difference is between latency and capacity. For business intelligence applications such as optimization and logistics many servers are required to solve a single problem, and low latency communication between the servers is instrumental for performance. The intuition behind this is easy to understand: a modern microprocessor executes 4-5 instructions per 250ps, and thus packet latencies of 10GbE, (between 5-50usec), are roughly equivalent to 100k to 1M processor instructions. If a processor is dependent on the results computed by another processor, it will have to idle till the data is available. Cumulatively, across a couple hundred servers, this can lead to peak performance that is only 1-5% of peak.

Supercomputing applications are defined by these types of tightly connected concurrent processes, putting more emphasis on the performance of the interconnect, in particularly the latency. Running a traditional supercomputing application on an infrastructure designed for elastic applications, such as AWS or Azure, typically yield slow-downs by a factor 50 to 100. Measured in terms of cost, they would cost 50-100 times more to execute on a typical public cloud computing infrastructure.

Most supercomputing applications are associated with very valuable economic activities of the business. As mentioned earlier, production optimization and logistics applications save companies like Exxon Mobil and Fedex billions of dollars per year. Those applications are tightly integrated in the business operation and strategic decision making of these organizations and pay for themselves many times over. However, for the SMB market these supercomputing applications offer great opportunity for revenue growth and margin improvements as well. However, their economic value is attenuated by the revenue stream they optimize; 10% improvement for a $10B revenue stream yields a $1B net benefit, but for a $10M revenue stream the benefit is just a $1M, not enough to compensate for the risk and cost that deploying a supercomputer would require.

Enter On-Demand Supercomputing.

In 2011, we were asked to design, construct, and deploy an On-Demand supercomputing service for a Chinese cloud vendor. The idea was to build an interconnected set of supercomputer centers in China, and offer a multi-tenant on-demand service for high-value, high-touch applications, such as logistics, digital content creation, and engineering design and optimization. The pilot program consisted of a supercomputer center in Beijing and one in Shanghai. The basic building block that was designed was a quad rack, redundant QDR IB fat-tree architecture with blade chassis at the leaves. The architecture was inspired by the observation that for the SMB market, the granularity of deployment would fall in the range of 16 to 32 processors, which would be serviced by a single chassis, keeping all communication traffic local to the chassis. The topology is shown in the following figure:

Redundant QDR IB Network Topology for On-Demand Supercomputing

The chassis structure is easy to spot as the clusters of 20 servers at the leaves of the tree. The redundancy of the IB network is also clearly visible by the pairs of connections between all the layers in the tree. The quad configuration is a two rack symmetric setup, one pair holding one side of the redundant IB network/storage/computes. So half the quad can fall away, and the system would still have full connectivity between storage and computes. To lower the cost of the system, storage was designed around IB-based storage servers that plugged into the same infrastructure as the compute nodes. QDR throughput is balanced with PCIe gen2 and thus we were able to deliver ephemeral blades that get their personality from the storage servers and then dynamically connect via iSCSI services to whatever storage volumes they require. This is less expensive than designing a separate NAS storage subsystem, and it gives the infrastructure flexibility to build high-performance storage solutions. It was this system that set a new world record by being the first trillion triple semantic database system leveraging a Lustre file system consisting of 8 storage servers (trillion-triple-semantic-database-record).

The provisioning of on-demand supercomputing infrastructure is bare metal, mostly to avoid any of the I/O latency degradation that virtualization injects. Given the symmetry between storage and compute and the performance offered by QDR IB, a network boot mechanism can be used to put any personality on the blades without any impact on performance. The blades have local disk for scratch space, but run their OS and data volumes off the storage servers, thus avoiding the problem of DR of state on the blades.

The QDR IB infrastructure was based on Voltair switches and Mellanox HCAs. Intel helped us tune the infrastructure, using their cluster libraries for the processors we were using, and Mellanox was instrumental in getting the IB switches in shape. Over a three week period, we went from 60% efficiency to about 94% efficiency. The full quad has a peak performance of 19.2TFlops and after tuning the infrastructure we were able to consistently deliver 18TFlops of sustained performance.

The total cost of the core system was of the order of $3.6M. The On-Demand Supercomputing service offers a full dual socket server with 64GB of memory for about $5/hr, providing a cost-effective service for SMBs interested in leveraging high performance computing. For example, a digital content creation firm in Beijing leveraged about 100 servers as burst capacity for post-production. Their monthly cost to leverage a state of the art supercomputer was less than $20k per month. Similarly, a material science application was developed by a chemical manufacturer to study epitaxial growth. This allowed the manufacturer to optimize the process parameters for a thin-film process that would not have been cost-effective on a cloud infrastructure designed for elastic web applications.

The take-away of this project is echoing the findings in the missing middle reports for digital manufacturing (Digital Manufacturing Report). There is tremendous opportunity for SMBs to improve business operations by leveraging the same techniques as their enterprise brethren. But the cost of commercial software for HPC is not consistent with the value provided for SMBs. Furthermore, the IT and operational skills required both to setup and manage a supercomputing infrastructure is beyond the capabilities of most SMBs. On-demand HPC services, as we have demonstrated with the supers in Beijing and Shanghai, can overcome many of these issues. Most importantly, it enables a new level of innovation by domain experts, such as professors and independent consultants, who do have the skills necessary to leverage supercomputing techniques, but up to now have not had access to public supercomputing capability and services.

Thursday, October 4, 2012

Cloud Collaboration: Bioinformatics

In this multi-part series, we discuss several cloud computing projects that we have completed over the course of the past 12 months. They range from experiments in computational science and engineering to full production systems for cloud service providers. This week, we'll discuss a cloud collaboration solution that was designed to bring together collaborators at different bioinformatics labs, and to connect them to an elastic cloud computing facility that offers FPGA accelerated servers, which is a facility that is increasingly important to solve the exponentially increasing compute load for bioinformatics research.

NEXT GENERATION BIOINFORMATICS

The importance of next generation sequencers (NGS) to personalized healthcare opportunities is well documented [1]. Conventional medical diagnosis relies on the analysis of the patient's personal and family history, symptoms, and samples. The goal of personalized health care is to strengthen the diagnostics by including comparisons between the patient's genome and a global reference database of known disease markers. Sample testing will also be enhanced through highly sensitive measurement of adaptive immune system parameters, and molecular-level monitoring of the progression of different cancers. Outside human health, NGS will touch every aspect of our understanding of the living world, and help improve food safety, control pests, reduce pollution, find alternate sources of energy, etc.

At the lowest level, the bioinformatic questions fall into two broad classes: (i) comparing next-gen reads against reference genomes, and (ii) assembling next-gen sequences to build reference genomes. Finding scaleable and cost-effective solutions to these questions is a major challenge. Even the best algorithms need large amounts of fast and costly computer memory (50-1024GB), and thousands of processor-hours to complete analysis on mammalian-sized genomes. Reducing execution time will improve the quality of results by allowing the analysis to be run with many different parameters.

The computational landscape is changing rapidly with discovery of new algorithms, and the proliferation of reference genomes. This puts tremendous pressure on the data management and computational complexity for which a seamless solution needs to be found.

Our client was a start-up that was designing a new instrument. Our client's research instrument was able to generate up to 100GBytes of data per day, and would run continuously for several days for a single study. The raw data had to go through several stages of deep analytics to identify and correct data acquisition errors, apply data reduction and reconstruction operators, and keep track of historical performance to identify the best algorithms and to manage the research direction and software development. The algorithms required high throughput compute clusters, and the R&D talent that developed these algorithms was geographically dispersed throughout the US. The management and procurement of these clusters was beyond the capability or capital resources of this start-up, and on-demand cloud services could provide solutions to this CAPEX problem. However, to integrate multiple remote laboratories into a productive collaborate space required a robust and cost-effective solution to file replication so that each instrument, and all subsequent analysis results, would be readily available to the distributed team.

NEXT GENERATION CLOUD INFRASTRUCTURE

The core operation of the instrument was based on data acquisition of high-frequency RF signals and the use of chemicals to prepare samples. This nature demanded that the physical machine resided in a well-controlled laboratory environment, a very different environment than a typical data center room. The compute demands of the research algorithms were roughly in the 50TOPS range and highly latency sensitive. The cost to create that raw hardware capacity was of the order of $150k per lab, not including operational staff. The utilization of that equipment would have been very low, particularly in the beginning phases of the project when the instrument would only generate data once per month. Allocating the compute capacity for the research algorithms in a remote cloud solves the capex and utilization problem, but we would introduce a data movement problem. What did this trade-off look like?

We evaluated two organizations:

instrument and compute cluster per lab
instrument per lab, compute cluster in a remote cloud

The data retention was not a key attribute, and thus the cost of data storage for backups was not a factor in the design. However, link bandwidth was a key attribute. The labs were at the end of relatively low bandwidth links and increasing bandwidth required a 90 day waiting period with hard caps due to infrastructure limits. The links in place were limited to 5Mb/s!!! Not very impressive, but surprisingly common. Increasing the bandwidth would have cost $30k/year extra and the 90 days waiting also made this unattractive.

The capex of option 1 with two labs was about $300k with about 4 weeks of turn around time. That capex would go away in option 2, and the turn around time was reduced to days to get up and running. However, at 5Mb/s, moving a 50GB file to a remote cloud would take several days, and worse, this cost would have to be paid every data acquisition. However, the early research instrument would take several days for a data collection, so the labs workflow was already used to having to a long latency between experiment design and data collection. But, more importantly, if the instrument takes several days to collect a large data set, if we need to migrate that data to a remote location, we want to overlap data acquisition with data transfer. Typical web application protocols don't work well in this regard, so HTTP and FTP are not attractive. The idea for the solution came from our use of Git. Git is a versioning system that is based on snapshotting a file system: that model is exactly the right technology as it is seamless and robust. This snapshotting idea lead us to Nasuni, which provides a filer that snapshots to Amazon S3 and integrates this with additional enterprise features such as volume management and volume replication configuration. Nasuni is relatively expensive, starting at $10k/TB/year, but the flexibility and set-and-forget feature set made it very attractive. The question was whether or not file system snapshotting would work with very low bandwidth links. If the replication performance was sufficient, then managing the persistent data volume that would determine cost would be trivial.

SYSTEM ARCHITECTURE

To create a robust and reliable next generation IT infrastructure, we designed and implemented a real-time distributed collaboration grid, as depicted in the figure below.

Each lab is allocated a Nasuni filer, with a sufficiently large file cache to support local working set caching. The Nasuni service creates a system where each filer becomes part of a distributed file system that is created by each filer replicating and receiving snapshots to and from the cloud storage service. The filers encrypt the snapshots and send them to the cloud store filer, which replicates the snapshots at the other locations. The cloud store lives on Amazon Web Services as an encrypted object store to provide high security and high availability. Given normal AWS S3 replication parameters, this store represents an almost perfect reliability platform. Nasuni guarantees 100% availability, which derives from the eleven-nines that Amazon offers on its S3 storage platform. (Amazon uses the term 'durable', which is more specific, as availability also incorporates connectivity, which Amazon can't guarantee)

The filers can be configured to snapshot and replicate in a real-time fashion for small delta data changes, such as scientist work spaces. For data sets that see a lot of change, such as the output of data collection instruments, it is more appropriate to snapshot at lower frequencies to create good traffic attributes during replication. Furthermore, since our client's data sets are generated over multiple days, it is not useful to replicate these data sets in real-time as the data is not complete and the overhead is too high.

We also allocated a filer at the high performance computing cloud provider. The role of this filer is different from the filers at the laboratories. Whereas the filers at the laboratories function as a buffer for the data generation instruments, the filer at the compute cloud functions as an aggregation and working set manager. At the compute cloud, bioinformatic applications are run that use all, or large subsets of the data, and produce large, new sets of results. These results need to be replicated to the laboratory filers. The aggregation data can be held in a big data store, such as Hadoop or Riak, so that the CSP filer snapshot cache is not over-committed during these deep analytics compute cycles.

Conclusions

To test the data replication performance, we used a 60GB data set that was generated by the instrument over a two day period. We used a volume snapshot rate of three times a day, and we throttled the filers bandwidth to only 4Mb/s so that we would not severely impact normal Internet traffic. At our Cloud Service Provider, we ran a VMware-hosted filer as the target. The Nasuni filer received our data set in roughly 35 hours. This experiment demonstrated that even with a high volume instrument the data movement delay to a CSP was not a significant source of delay. When the instrument would improve it data acquisition rates the link rates could be improved to match this increased performance.

Once we had proven that data movement using the Nasuni filers and storage service had reasonable performance even on very low bandwidth links, the additional features provided by the Nasuni storage service made this solution very attractive. IT personal can configure the filers and the volume parameters, and manage these against the IT resources available, independent from the research teams. Those research teams are presented with effectively an infinite storage resource and complete flexibility to compute locally, or in the cloud, all the while knowing that whatever results they produce, the data is visible to all collaborators without having to manually engage upload/download workflows.

The one attribute that is of concern with this cloud collaboration architecture is the cost of storage. At higher storage capacity the Nasuni service drops down in cost significantly, but the assumption behind the Nasuni service is that all data is enterprise critical. In research oriented environments, data sets tend to grow rapidly and unbounded, but none of it is enterprise critical. To not break the bank, we must introduce active storage management. Once data has been analyzed and transformed, it should be copied to local NAS filers, and removed from the Nasuni service. This provides a solution that controls cost, and as a positive side effect, it keeps the data sets actively managed.

The end result of this project was that we saved the client $300k in capex and the overhead of managing two hardware clusters. The clusters would have been essential for creating the value-proposition of the client's innovation, but once the core analytics algorithm was developed, would have gone unused after the first year. The high data expansion volume of the instrument made a cloud computing solution non-trivial, but using the Nasuni storage service provided a set-and-forget solution to the data distribution problem.

References

A. Kahvejian, J. Quackenbush, and J.F. Thompson, What would you do if you could sequence everything?, Nature Biotechnology, 2008, Vol. 26, pp 1125-1133, http://www.nature.com/nbt/journal/v26/n10/full/nbt1494.html

Wednesday, October 3, 2012

Cloud Collaboration

The past 12 months, we have implemented a handful of global cloud platforms that connect US, EU, and APAC. The common impetus behind these projects is to connect brain trusts in these geographies. Whether they are supply chains in Asia program managed from the EU, healthcare cost improvements in the US by using radiologists in India, or high-tech design teams that are collaborating on a new car or smart phone design, all these efforts are trying to implement the IT platform to create the global village.

The teachings provided by these implementations are that cloud computing is more or less a solved problem, but cloud collaboration is far from done. Cloud collaboration from an architecture point of view is similar to the constraints faced by mobile application platforms, so there is no doubt that in the next couple of years we'll see lots of nascent solutions to the fundamental problem of mobility and cloud collaboration: data movement.

The data sets in our US-China project measured in the range from tens to hundreds of TBytes, but data expansion was modest at a couple of GBytes a day. For a medical cloud computing project, the data set was more modest at 35TBytes, but the data expansion of these data sets could be as high as 100GB per day, fueled by high volume instruments, such as MRI or NGS machines. In the US-China collaboration, the problem was network latency and packet loss, whereas in the medical cloud computing project, the problem was how to deal with multi-site high-volume data expansions. The cloud computing aspect of all these projects was literally less than a couple of man weeks worth of work. The cloud collaboration aspect of these projects all required completely new technology developments.

In the next few weeks, I'll describe the different projects, their business requirements, their IT architecture manifestation, and the key technologies that we had to develop to deliver their business value.

Thursday, September 29, 2011

Amazon Silk: split browser architecture

Amazon Silk

Content Delivery Networks, and WAN optimization, provided a generic acceleration solution to get common content closer to the client device, but on mobile devices the delivery performance of the last mile was still a problem. Many websites still do not have mobile optimized content, and sucking down a 3Mpixel JPG and render it on a 320x240 pixel display is just plain wrong. With the introduction of Amazon Silk, which uses the cloud to aggregate, cache, precompile, and predict, the client-side experience can now be optimized for the device that everybody glamours for: the tablet.

This is going to create an even bigger disconnect between the consumer IT experience and the enterprise IT experience. On the Amazon Fire you will be able to pull up, nearly instantaneously, common TV video clips and connect to millions of books. But most enterprises will find it difficult to invest in WAN optimization gear that would replicate that experience on the corporate network for your day to day work.

Amazon Silk is another example of the power that the cloud provides for doing heavy computes and caching that enables low-capability devices to roam.

Wednesday, September 14, 2011

Trillion Triple Semantic Database

The Semantic Web captures the semantics, or meaning, of data, and machines are enabled to interact with that meta data. It is an idea of WWW pioneer Tim Berners-Lee who observed that although search engines index much of the Web's content, keywords can only provide an indirect association to the meaning of the article's content. He foresees a number of ways in which developers and authors can create and use the semantic web to help context-understanding programs to better serve knowledge discovery.

Tim Berners-Lee originally expressed the vision of the Semantic Web as follows:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The intelligent agents people have touted for ages will finally materialize.

The world of semantic databases just got a little bit more interesting with the announcement by Franz, Inc. and Stillwater SC of having reached a trillion triple semantic data store for telecommunication data.

http://www.franz.com/about/press_room/trillion-triples.lhtml

The database was constructed with an HPC on-demand cloud service and occupied 8 compute servers and 8 storage servers. The compute servers contained dual socket Xeons with 64GB of memory connecting through an QDR IB network to a 300TB SAN. The trillion triple data set spanned roughly 100TB of storage. It took roughly two weeks to load the data, but after that database provided interactive query rates for knowledge discovery and data mining.

The gear on which this result was produced is traditional HPC gear that emphasizes scalability and low latency interconnect. As a comparison, a billion triple version of the database was created on Amazon Web Services but the performance was roughly 3-5x slower. To create a trillion triple semantic database on AWS would have cost $75k and would have taken 6 weeks to complete.