The "Cloud": Data Warehousing in 2015 (Part 1)


Let’s take a look this week at the benefits of external hosting of our data warehouse.

With prices so affordable from well-known providers like Amazon (AWS), Microsoft (Azure), and Google, there is no business reason to host a data warehouse internally. All three refer to this process by using the phrase “Cloud Computing”.  No offense to the soft and hardware marketing professionals reading this post, but I really think the noun/verb combination “Cloud Computing” is an overloaded phrase. Data warehousing is an object-oriented programming term, and the two cannot be compared.  Not to date myself, but thirty years ago we had a process called “time sharing services”.  This too was available from various vendors.  These services allowed us to run several types of statistical simulations/business analytics.  Cloud Computing is nothing more than what we did thirty years ago, yet on a much larger scale.

Data in a data warehouse is managed in a columnar format based on some kind of key, (unique or non-unique).  This enables analysis to be done in de-normalized rows created from a fact table.  In the world of mainframe days this was called an inverted list.  Today the cost of doing this exact same thing is geometrically lower.

Google, AWS and Azure all offer similar partners providing SAAS in the same variety of business categories, however each has its area of specialization. All leverage their extensive data networks and processing capacities on a worldwide scale.  Years ago I worked at a co-location center in Dallas that was attached to a major telecom provider.  I was installing hardware in a rack one Friday and noticed a large cage and racks being installed next to our location.  Returning the following Monday I found roughly 5,000 servers placed in that new rack space, all humming away.  It was Google installing a regional center for web searching capacity.  I thought that was an enormous economy of scale in 2006. Imagine what it is in 2015!

Google has the edge in web metrics.  Any metric about a web site or usage, user, business, demographic or anything else imaginable about a web site; Google has it remembered.  Not only remembered, but codified and classified as well.    The only drawback from my perspective is that their tools don’t seem user friendly.  Google also has a unique “what if analysis” for digital marketing which neither of the others seem to address.

AWS has created Redshift.  This first-rate product has an excellent architecture built to obtain extremely high query performance on datasets.  These datasets can range from a few hundred gigabytes to a petabyte or more. It uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries.  It also uses parallel processing in its data warehouse architecture, parallelizing and distributing SQL operations to take advantage of all available resources.  Costs are minimal, changing frequently depending upon the competition, but pricing is very affordable.  AWS also competes well in the SAAS market and functions parallel to those options provided by the others.

Azure is the newest participant in the threesome of data warehouse providers. Not only are they becoming competitive in the market, they are the logical choice if you are a Microsoft shop.   One large transition issue Microsoft data warehouse shops usually face is in changing from SQL 2008 R2 to SQL 2012 or 2014. Another consideration is that Azure pricing and configuration is somewhat confusing and their customer service tools may add to the confusion.

All of these services are more than capable of solving a data-warehousing requirement.  It’s just a matter of which one meets the needs of your business. In the following weeks I’ll guide you through implementation of a data warehouse in each of the above vendors and provide specific examples and output.