Friday, June 27, 2008

Grid versus Cloud Computing

From the end user perspective, the short answer to the question "What is the difference between grid computing and cloud computing?" is that you work with the two systems differently. Grid computing follows the typical batch-oriented workflow of the old mainframe days: a user has a program to run, and the grid lets you launch that program much as you would on your local machine. The key point is that you look at the grid as a means to execute a program.
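
To make the batch workflow concrete, the sketch below shows what "launching a program on the grid" typically amounts to, assuming a PBS-style scheduler whose qsub command is on the path; the resource flags and the my_simulation program are hypothetical.

    import subprocess
    import textwrap

    # A minimal sketch of the grid workflow: wrap the program in a
    # job script and hand it to a PBS-style scheduler. The resource
    # request and the program name are illustrative.
    job_script = textwrap.dedent("""\
        #!/bin/sh
        #PBS -l nodes=4:ppn=8,walltime=02:00:00
        cd $PBS_O_WORKDIR
        ./my_simulation --input data.in --output data.out
        """)

    with open("job.pbs", "w") as f:
        f.write(job_script)

    # From the user's perspective this is still just "run my program";
    # the scheduler queues it and returns a job id.
    result = subprocess.run(["qsub", "job.pbs"], capture_output=True, text=True)
    print("submitted", result.stdout.strip())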

The typical use of a cloud is information driven. Taking Google as the quintessential cloud computing environment: the user is looking for information, and Google's programs have done their job in the past by taking in raw data and organizing it so that the user can find contextual information. Inside Google, scripts organize the schedules for launching the programs that crawl the web, compute the index, and update the production index.
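
As a caricature of that workflow, the toy sketch below chains crawl, index, and publish steps on a fixed schedule; the function bodies are hypothetical stand-ins and do not reflect Google's actual systems.

    import time

    # Toy sketch of an information-driven pipeline: raw data goes in,
    # organized information comes out. Each step is a hypothetical
    # stand-in for a much larger system.
    def crawl():
        print("fetching raw pages...")

    def build_index():
        print("computing the inverted index...")

    def publish_index():
        print("swapping in the new production index...")

    while True:
        crawl()
        build_index()
        publish_index()
        time.sleep(24 * 3600)  # rerun the whole pipeline daily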

I just reread Tom White's post Running Hadoop MapReduce on Amazon EC2 and Amazon S3, which is a great example of all the steps needed to get a service running on a cloud. Once Hadoop is running and we periodically pick up the web log from S3, we have a cloud for that particular task. The actual use case of analyzing a web log would be much simpler when executed on a grid, because the grid would automatically start and stop the needed services on our behalf. However, keeping the services running 24/7 and interacting with them through a web interface is more of a cloud computing workflow, and that is the way start-ups are using AWS.
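
The periodic pick-up step might look like the present-day sketch below, which uses the boto3 AWS SDK for Python; the bucket and key names are hypothetical, and credentials come from the usual AWS configuration.

    import boto3

    # Sketch of the periodic pick-up: copy the latest web log out of
    # S3 so the Hadoop cluster can analyze it. Bucket, key, and local
    # path are placeholders.
    s3 = boto3.client("s3")
    s3.download_file("example-weblog-bucket",
                     "logs/access_log.gz",
                     "/data/incoming/access_log.gz")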

Tuesday, June 24, 2008

Data is the differentiator

In the previous posts (lingo, federated clouds, smallest cloud computer) we simplified the notion of cloud computing as a conceptual web interface behind which raw data and computes create value in the form of information. This implies that there are many different incarnations of cloud computing, such as SaaS, PaaS, HaaS, etc. To figure out which model of cloud computing is best for you, I believe that understanding the fundamental properties of the raw data will guide you.

The fundamental properties of the data that you need to assess are:

1. Security and privacy
2. Size
3. Location
4. Format

Security and Privacy


This should be the starting point since it affects your liability. The current innovators of cloud computing (financial institutions, Google, Amazon) are global organizations with geographically dispersed operations. The business operation of one time zone must be visible to other time zones, so these organizations had to solve security and compliance with local privacy laws. Clearly, this has come at a significant cost. However, a nascent market for cloud computing resources in the form of Amazon Web Services makes it possible for start-ups to play in this new market. These start-ups clearly play a different game: their services tend to have very low security or privacy needs, which allows them to harbor a very disruptive technology. They will develop low-cost services that provide powerful competition to EDS and other high-security, high-privacy outsourcers. They will not compete with them directly; instead, they will expand the market with a lower-cost alternative: two prime ingredients of a disruptive technology.

Data Size


Data size is the next most important attribute. If your data is large, say a historical snapshot of the World Wide Web itself, you need to store and maintain petabytes of data. That is clearly a different requirement than providing access to a million-row OLAP database. Size affects economics and algorithms, and it can also complicate the next attribute: location.

Data Location


The location of the data affects what you can do with it. If the data size is very large, the time or cost of uploading/downloading the data set to a commercial cloud resource provider may be prohibitive. In the case of the historical web snapshots, it is much better to generate the data in the cloud itself: that is, the data is created by the compute function you execute in the cloud. For the web index, this would be the set of crawlers that collect the web snapshot. There are readily available AMIs for Hadoop/Lucene/Nutch that enable a modest web indexing service on AWS.
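
Some back-of-the-envelope arithmetic shows why; the 100 Mbit/s uplink below is an illustrative assumption.

    # Why location matters: moving a petabyte over a 100 Mbit/s uplink
    # (an assumed rate; substitute your own) takes years, not hours.
    data_bytes = 1e15            # 1 PB
    link_bits_per_s = 100e6      # 100 Mbit/s
    seconds = data_bytes * 8 / link_bits_per_s
    print(f"{seconds / (86400 * 365):.1f} years")  # roughly 2.5 years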

Data Format


The data format affects the details of how you use the data. For example, if your data lives in an OLAP database, you will need that OLAP database running in your process. Similarly, if you have complex data, such as product geometry data on which you want to compute stress or vibration analysis, you will need access to the geometry kernel used to describe the data. Finally, the data format affects the efficiency with which you can access and compute on your data. This aspect of cloud computing is frequently underestimated, but it can have a significant economic impact when you pay as you go for storage and computes.
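
A toy calculation illustrates the economics; the prices, sizes, and the factor-of-two efficiency gain below are assumptions, not any provider's actual rates.

    # Illustrative pay-as-you-go arithmetic: a format that stores and
    # scans the same data in half the bytes roughly halves the bill.
    # All numbers here are assumptions.
    def monthly_cost(gb, scan_hours, gb_price=0.10, hour_price=0.10):
        return gb * gb_price + scan_hours * hour_price

    verbose = monthly_cost(gb=10_000, scan_hours=200)
    compact = monthly_cost(gb=5_000, scan_hours=100)
    print(f"verbose format: ${verbose:.2f}, compact format: ${compact:.2f}")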

Friday, June 20, 2008

Web Server as the smallest unit of Cloud Computing

To demystify the concept of cloud computing, I would like to assert that a web server is the smallest unit of functionality that still has all the attributes of cloud computing. A web server with a little bit of CGI or CFML scripting or a few Java applets already constitutes convergence of information, universal access from your CAVE, display wall, desktop, laptop, MID (mobile internet device), or smartphone, and a little bit of computing, even if the computing is limited to just page construction.
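
To make this concrete, here is a minimal sketch of such a smallest cloud using only Python's standard library; its computing is limited to page construction, yet any browser-equipped client can consume the result.

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from datetime import datetime

    # The smallest unit of cloud computing: a web server whose only
    # "compute" is constructing the page it serves.
    class PageBuilder(BaseHTTPRequestHandler):
        def do_GET(self):
            body = f"<html><body>It is now {datetime.now():%H:%M:%S}</body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body.encode())

    HTTPServer(("", 8000), PageBuilder).serve_forever()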

The data managed by the web server is the source of differentiation. As a user I am looking for valuable or entertaining information, and I am willing to part with money, or time, to find it. This is the driving force behind any business value proposition for cloud computing. Interestingly enough, given the vast alternatives available, price elasticity is extraordinarily discrete: we are willing to consume indiscriminately if it is free, but if we need to part with money, consumption suddenly becomes a more emotional/rational activity. This explains the popularity of services supported by advertising: human beings will tolerate some degree of spam as long as it allows them to consume other information for free.

The consumption of information requires some client device, and clearly some computes take place in the client as well. For example, watching a YouTube clip on your smartphone requires decent performance to decompress and decode the video stream. Universal information convergence is therefore not possible in my mind. The characteristics and usage models of a CAVE are fundamentally different from those of a smartphone, and nothing can change that. The clouds that serve up the converged information will therefore have to select the clients that are appropriate for consuming their information.

What is interesting to me is that the original World Wide Web vision of Sir Timothy John Berners-Lee is effectively cloud computing. Universal access to information among geographically dispersed teams was the impetus for the World Wide Web. Driven by business, we are now arriving at a vocabulary that places that concept into the consumer space. Continued innovation by businesses to generate and extract value will push more and more computing behind the generation of information. Concurrently, the marketing departments will continue to obfuscate what is fundamentally a very easy to understand and desirable concept: a flat and universally accessible information world.

Thursday, June 19, 2008

Federated Clouds

Google's CEO Eric Schmidt defines cloud computing as the convergence of information. The cloud aggregates raw data, organizes the data so that it becomes valuable as information, and makes the information accessible from any device anywhere. The organization of the data to transform it into information is what justifies the "computing" in Cloud Computing.

So in a nutshell, clouds organize data to create value in the form of targeted information that can be sold or auctioned, such as advertising keywords or the best price on an airline ticket. However, information truly is unbounded, and thus there will be many specialized clouds, each adding very specific value to specific raw data sets. For example, the data maintained in Amazon's cloud and the computational processes that create a searchable book store are very different from Google's web services or NASDAQ's data store.

Given that not even Google can encapsulate all knowledge, the diversity of information will lead to a commercially motivated federated system of clouds. Each cloud has its own optimized data organization capability to generate valuable information for profit. Amazon and Google will have opportunities for innovation that nobody else has due to their scale, but peripheral innovation will occur through aggregation, or mashups, of data residing in different clouds, thus creating a federation of clouds.
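
A mashup can be as small as the toy sketch below, which aggregates information from two hypothetical clouds (the URLs are placeholders) into a new, more valuable piece of information.

    import json
    from urllib.request import urlopen

    # Toy federation: combine information served by two different
    # clouds. Both endpoints are hypothetical placeholders.
    def fetch(url):
        with urlopen(url) as resp:
            return json.load(resp)

    book = fetch("https://bookstore.example/api/price?isbn=0123456789")
    quote = fetch("https://marketdata.example/api/quote?symbol=AMZN")
    print("list price:", book["price"], "| stock move:", quote["change"])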

As for the starting premise of what makes a cloud a cloud, the convergence of information: a federation of clouds continues this premise, and can thus itself be seen as a cloud.

Riches for SaaS providers

Many consultancy businesses are built upon serving the need for comparative benchmarking of business operations, ranging from strategy and efficiency to more detailed metrics such as employee retention and productivity. Most of these consultancies use old-fashioned surveys and questionnaires to gather this data, which of course is fraught with data quality issues. Well, no more! There is a better way, and it is through SaaS providers.

To generate the necessary economies of scale, SaaS is by necessity multi-tenant. Secondly, the dynamics of business haven't changed, so SaaS providers need to race to critical mass in installed customer base to remain relevant and to generate the free cash flow that drives continued innovation and the roll-out of new functionality. Only the largest SaaS providers in a vertical will survive.

This creates the next monetization option for SaaS providers: business intelligence and benchmarking. The SaaS provider has brought together a wealth of companies all using the same business process, codified in the SaaS functionality. Mining this data for trends and operational business metrics is just a small step. Global competition forced big companies, which had the necessary operational scale, to develop these business intelligence processes internally. Small and medium business operations had to be aggregated before this opportunity could arise, and it may prove many times more valuable than the SaaS functionality the provider started with.
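
The mining step really is small once the tenants share one codified process; here is a sketch with fabricated tenant metrics.

    from statistics import median

    # Because every tenant runs the same codified process, their
    # operational metrics are directly comparable. The numbers below
    # are fabricated for illustration.
    days_to_close = {"tenant_a": 12.5, "tenant_b": 8.1,
                     "tenant_c": 15.0, "tenant_d": 9.7}

    benchmark = median(days_to_close.values())
    for tenant, value in sorted(days_to_close.items()):
        print(f"{tenant}: {value:5.1f} days ({value - benchmark:+.1f} vs. median)")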

Wednesday, June 18, 2008

Cloud Computing Lingo

To understand all the marketing information that bears the label "cloud computing", it is helpful to have a quick glossary of terms.

Grid Computing

Grid Computing is a collaboration model. Locally managed resources are virtualized and aggregated into a larger, more capable resource. Grid computing is concerned with coordinating problem solving in virtual organizations and is typically associated with large and complex "Grand Challenge" problems.

In the scientific community, multi-institutional collaboration is required to have any hope of answering fundamental questions as they arise in high-energy physics, fusion, or climate research. It is in this community that the World Wide Web originated, to fulfill the need for seamless document access among geographically dispersed team members, and it is also the birthplace of the grid. In 1990 the first HTML communication took place at CERN, and in 1995 the first grid was put together around the Supercomputing conference as a mechanism for all participants to share data and models. It was dubbed I-WAY at the time, and it became the starting point of research into solutions for security, resource management, job control, and data caching, the problems central to grid computing.

Examples are: TeraGrid, EUROGRID

Haas, or Hosting as a Service

HaaS is a business service model. There are many activities in a modern business that are not core operational differentiators. These essential but peripheral services are better outsourced to specialists who can leverage economies of scale. Payroll management, shipping, and web presence are three examples of services that tend to be outsourced by most modern businesses, particularly small and medium-sized businesses (SMBs).

Examples are: Startlogic, Hostmonster, Rackspace

SaaS, or Software as a Service

SaaS is a software deployment model. Application functionality is provided to the user through a web interface and the SaaS provider manages hardware and software operation and maintenance.

Examples: Salesforce.com, Webex, Netsuite

SaaS is generally associated with business software and marketed as a service to lower the cost of internally managed software. SaaS allows customers to lower the initial cost of software licenses and of the computer hardware to run them on.

  • Web Site Hosting and Web Application Hosting Services are probably the most ubiquitous instances of the SaaS model
  • Customer Relationship Management, or CRM, has many different instances, for example Salesforce.com, Siebel, or Coghead
  • Completely integrated Enterprise Resource Planning systems are provided by SAP, Oracle, Netsuite, Epicor, or Infor

Storage-as-a-Service and Computing-as-a-Service are slightly different in nature from SaaS: SaaS provides application functionality, whereas the former two sell or rent access to a resource.

Commercially, SaaS has carved out many different useful services. Unfortunately, this has led to a fragmentation of the market, with the associated interoperability and economic lock-in problems. When selecting a SaaS provider, the overriding question should be whether you can move your data to other providers or bring it in-house. SaaS becomes less interesting at larger scale, or if you want to extract business intelligence from your data. Plan for success but manage for failure: if the SaaS provider does not have a productive mechanism to get all the data out of the service, think twice before signing a contract.
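
A portability check can be as simple as the sketch below, which pages every record out of a hypothetical export API; substitute the provider's actual endpoint and authentication.

    import json
    from urllib.request import urlopen

    # "Plan for success but manage for failure": verify, before you
    # sign, that all records can be paged back out. The endpoint and
    # paging scheme are hypothetical.
    page, records = 1, []
    while True:
        with urlopen(f"https://saas.example/api/export?page={page}") as resp:
            batch = json.load(resp)
        if not batch:
            break
        records.extend(batch)
        page += 1

    with open("export.json", "w") as f:
        json.dump(records, f)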


PaaS, or Platform as a Service

PaaS is a software life-cycle model. Applications are developed, tested, deployed, hosted, and maintained on the same integrated platform.

Examples: Bungee Lab Connect, Comrange AppProducer


Web Services

Web Services represent anything that serves data, information, or access through a web browser. This is such a nebulous group of functionality that the term is more confusing than it is helpful. For example, EC2 is billed as part of Amazon Web Services, but really AWS rents you an appliance on which you can install your own machine image. That appliance can then do anything, from web site serving, to application serving, to data mining, to web indexing, to running your OpenOffice spreadsheet model. Google's web services aggregate anything from email to calendaring to picture storage and of course web indexing, but they are distinctly different from Amazon's web services.
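
Renting the appliance amounts to booting your own machine image, as in this sketch using the boto3 SDK; the AMI id and instance type are placeholders for whatever image you built.

    import boto3

    # Sketch of "renting the appliance": boot your own machine image
    # on EC2; what it does afterwards is entirely up to the image.
    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder for your image
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )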


Web 2.0

Web 2.0 refers to the proposed second generation of Internet-based services, in which data services such as social networks, blogs, and wikis are connected and add value to each other. Collaboration is central to this model: users generate the information and police themselves.

Welcome

Welcome to High-productivity Cloud Computing.

Cloud Computing has as many interpretations as there are users, but there is one common thread among all cloud computing models: the convergence of information access. The 'Cloud' holds your information, the 'Cloud' may even compute on your information, and, for real value, the 'Cloud' may combine other sources of information to make your data, or information, more valuable.

This blog takes the distinct position to reason about clouds from the user's productivity perspective. Enjoy.