Tuesday, June 24, 2008

Data is the differentiator

In the previous posts (lingo, federated clouds, smallest Cloud Computer) we simplified the notion of cloud computing to a conceptual web interface behind which raw data and compute create value in the form of information. This implies that there are many different incarnations of cloud computing, such as SaaS, PaaS, HaaS, etc. I believe that understanding the fundamental properties of your raw data is the best guide to figuring out which model of cloud computing suits you.

The fundamental properties of the data that you need to understand are:

1- security and privacy
2- size
3- location
4- format

Security and Privacy


This should be the starting point since it affects your liability. The current innovators of cloud computing (financial institutions, Google, Amazon) are global organizations with geographically dispersed operations. The business operations in one time zone must be visible in other time zones, so these organizations had to solve security and compliance with local privacy laws. Clearly, this has come at a significant cost. However, the nascent market for cloud computing resources, in the form of Amazon Web Services, makes it possible for start-ups to play in this new market. These start-ups clearly play a different game: their services tend to have very low security or privacy needs, which allows them to harbor a very disruptive technology. These start-ups will develop low-cost services that will provide powerful competition to EDS and other high-security, high-privacy outsourcers. They will not compete with them directly, and they will expand the market with a lower-cost alternative: two prime ingredients for disruptive technology.

Data Size


Data size is the next most important attribute. If your data is large, say a historical snapshot of the World Wide Web itself, you need to store and maintain petabytes of data. This is clearly a different requirement than if you just want to provide access to a million-row OLAP database. Size affects economics and algorithms, and it can also complicate the next attribute, location.

Data Location


The location of the data will affect what you can do with it. If the data set is very large, the time or cost of uploading it to, or downloading it from, a commercial cloud resource provider may be prohibitive. In the case of the historical web snapshots, it is much better to generate the data in the cloud itself: that is, the data is created by the compute function you execute in the cloud. For the web index, this would be the set of crawlers that collect the web snapshot. There are readily available AMIs for Hadoop/Lucene/Nutch that enable a modest web indexing service using AWS.
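To get a feel for why moving a very large data set can be prohibitive, here is a rough back-of-the-envelope sketch. The function name, the 100 Mbit/s link speed, and the 80% efficiency factor are all assumptions chosen for illustration, not measurements of any particular provider or network.

```python
def transfer_time_days(data_bytes: float,
                       link_bits_per_second: float,
                       efficiency: float = 0.8) -> float:
    """Estimate the days needed to move data_bytes over a network link.

    efficiency is an assumed fraction of the nominal link speed that is
    actually sustained (protocol overhead, contention, retries).
    """
    effective_bps = link_bits_per_second * efficiency
    seconds = data_bytes * 8 / effective_bps
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # decimal petabyte, in bytes

# Hypothetical scenario: one petabyte over a 100 Mbit/s uplink.
days = transfer_time_days(PETABYTE, 100e6)
print(f"about {days:.0f} days")
```

Under these assumed numbers the upload takes on the order of years, which is why creating the data inside the cloud is often the only practical option at that scale.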

Data Format


The data format affects the details of how you use the data. For example, if you have your data in an OLAP database, you will need to have that OLAP database running in your process. Similarly, if you have complex data, such as product geometry data on which you want to compute stress or vibrational analysis, you will need access to the geometry kernel used to describe the data. Finally, the data format affects the efficiency with which you can access and compute on your data. This is a frequently underestimated aspect of cloud computing, but it can have a significant economic impact if you pay as you go for storage and compute.
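As a small illustration of how format alone changes the storage footprint, the sketch below encodes the same records twice: once as JSON text and once as fixed-width binary. The record layout (an integer id and a floating-point value) is made up for this example. Real savings depend entirely on your data and on the formats you compare.

```python
import json
import struct

# Hypothetical records: (id, value) pairs invented for this example.
records = [(i, i * 0.5) for i in range(1000)]

# Self-describing text encoding: field names are repeated in every record.
as_text = json.dumps(
    [{"id": i, "value": v} for i, v in records]
).encode("utf-8")

# Fixed-width binary encoding: little-endian 4-byte int + 8-byte double,
# with no padding, so each record takes exactly 12 bytes.
as_binary = b"".join(struct.pack("<id", i, v) for i, v in records)

print(f"text:   {len(as_text)} bytes")
print(f"binary: {len(as_binary)} bytes")
print(f"ratio:  {len(as_text) / len(as_binary):.1f}x")
```

When storage and transfer are billed per byte, a ratio like this carries straight through to the monthly bill. The trade-off is that the binary form is no longer self-describing, so the reader must know the layout.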
