The "Cloud": Data Warehousing in 2015 (Part 2)

AUTHORED BY DONALD C. GILLETTE, PH.D., DATA CONSULTANT @ GUIDEIT

This week we will explore what is, in my opinion, the best BI product currently on the market: Amazon Redshift, from Amazon Web Services (AWS).

Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and by parallelizing queries across multiple nodes. It uses standard PostgreSQL JDBC and ODBC drivers, allowing you to use a wide range of familiar SQL clients. Data load speed scales linearly with cluster size, with integrations to Amazon S3, Amazon DynamoDB, Amazon Elastic MapReduce, Amazon Kinesis, or any SSH-enabled host.
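Those integrations are typically driven through Redshift's COPY command. Here is a minimal sketch of assembling one for an S3 bulk load; the table name, bucket path, and credentials placeholder are hypothetical, and the exact options depend on your file format:

```python
# A sketch of building a Redshift COPY statement for a bulk load from S3.
# The table, bucket, and credentials string below are hypothetical.

def build_copy_statement(table, s3_path, credentials,
                         options=("GZIP", "DELIMITER ','")):
    """Return a COPY statement that bulk-loads `table` from `s3_path`."""
    return "COPY {} FROM '{}' CREDENTIALS '{}' {};".format(
        table, s3_path, credentials, " ".join(options))

sql = build_copy_statement(
    "sales_facts",                                  # hypothetical table
    "s3://my-bucket/sales/2015/",                   # hypothetical S3 prefix
    "aws_access_key_id=<key>;aws_secret_access_key=<secret>",
)
```

You would then execute the resulting statement through any PostgreSQL client, for example `cursor.execute(sql)` in psycopg2.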

Redshift’s data warehouse architecture automates most of the common administrative tasks associated with provisioning, configuring, and monitoring a cloud data warehouse. Backups to Amazon S3 are continuous, incremental, and automatic. Restores are fast: you can start querying in minutes while the rest of your data is restored in the background. Enabling disaster recovery across regions takes just a few clicks.

Security is built in. Redshift enables you to encrypt data at rest and in transit (using hardware-accelerated AES-256 and SSL), isolate your clusters using Amazon VPC, and even manage your keys using hardware security modules (HSMs). All API calls, connection attempts, queries, and changes to the cluster are logged and auditable.
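Encryption in transit can also be enforced from the client side. A minimal sketch, assuming a hypothetical cluster endpoint and credentials, of building a libpq-style connection string that requires SSL (the format any PostgreSQL client, such as psycopg2, accepts):

```python
# Build a libpq-style connection string for a Redshift cluster.
# The endpoint, database name, and credentials below are hypothetical.

def redshift_dsn(host, dbname, user, password, port=5439, sslmode="require"):
    """sslmode='require' makes the client refuse unencrypted connections."""
    return ("host={} port={} dbname={} user={} "
            "password={} sslmode={}").format(host, port, dbname,
                                             user, password, sslmode)

dsn = redshift_dsn(
    "examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    "dev", "analyst", "s3cret",
)
# A client would then connect with, e.g., psycopg2.connect(dsn)
```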

Redshift uses a variety of innovations to obtain the highest query performance on datasets ranging in size from a hundred gigabytes to a petabyte or more. It uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries. It has a massively parallel processing (MPP) data warehouse architecture: by parallelizing and distributing SQL operations, it takes advantage of all available resources. The underlying hardware is designed for high-performance data processing, using locally attached storage to maximize throughput between the CPUs and drives, and a 10GigE mesh network to maximize throughput between nodes.
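The I/O savings from zone maps are easy to see with a toy model: each storage block records the minimum and maximum of the values it holds, so a range predicate can skip any block whose range cannot possibly overlap. This is a simplified sketch of the idea, not Redshift's actual implementation:

```python
# Toy simulation of zone-map pruning: each block carries (min, max)
# metadata, and a range scan skips blocks that cannot match.

def build_zone_map(values, block_size=4):
    """Split values into blocks and record each block's min and max."""
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    return [(min(b), max(b), b) for b in blocks]

def scan_with_pruning(zone_map, lo, hi):
    """Return values in [lo, hi] and the count of blocks never read."""
    hits, skipped = [], 0
    for bmin, bmax, block in zone_map:
        if bmax < lo or bmin > hi:   # zone map proves no overlap: skip I/O
            skipped += 1
            continue
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, skipped

zm = build_zone_map(list(range(100)), block_size=10)  # 10 sorted blocks
hits, skipped = scan_with_pruning(zm, 42, 45)         # only 1 block read
```

On sorted data, as here, a narrow range predicate touches a single block and the other nine are skipped entirely; this is why sort keys and zone maps work so well together.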

With just a few clicks of the AWS Management Console or a simple API call, you can easily change the number or type of nodes in your cloud data warehouse as your performance or capacity needs change. Amazon Redshift enables you to start with as little as a single 160GB DW2 Large node and scale up all the way to a petabyte or more of compressed user data using 16TB DW1 8XLarge nodes. 
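A quick back-of-the-envelope check of those two endpoints, using the per-node storage figures quoted above (160GB for a DW2 Large node, 16TB for a DW1 8XLarge node):

```python
# Cluster sizing arithmetic using the per-node storage figures from
# the article: 160 GB (DW2 Large) and 16 TB (DW1 8XLarge).

NODE_STORAGE_GB = {
    "dw2.large": 160,
    "dw1.8xlarge": 16 * 1024,   # 16 TB expressed in GB
}

def nodes_needed(compressed_data_gb, node_type):
    """Smallest node count whose total storage covers the data set."""
    per_node = NODE_STORAGE_GB[node_type]
    return -(-compressed_data_gb // per_node)   # ceiling division

petabyte_gb = 1024 * 1024
big_cluster = nodes_needed(petabyte_gb, "dw1.8xlarge")  # 64 nodes
small_start = nodes_needed(100, "dw2.large")            # 1 node
```

So the range the article describes runs from a single 160GB node up to a 64-node cluster of DW1 8XLarge machines holding a full petabyte of compressed data.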

During a resize, Redshift places your existing cluster into read-only mode, provisions a new cluster of your chosen size, and then copies data from your old cluster to your new one in parallel. You can continue running queries against your old cluster while the new one is being provisioned. Once your data has been copied to your new cluster, Redshift will automatically redirect queries to your new cluster and remove the old one.

Redshift allows you to choose On-Demand pricing with no up-front costs or long-term commitments, paying only for the resources you provision, or you can obtain significantly discounted rates with Reserved Instance pricing. With flexible pricing options, you are able to pick the model that best meets your needs.
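The trade-off between the two models comes down to simple arithmetic: Reserved Instances add an up-front payment in exchange for a lower hourly rate. A sketch with illustrative placeholder rates (not actual AWS prices):

```python
# Compare annual cost of On-Demand vs. Reserved Instance pricing.
# The hourly and up-front rates below are illustrative placeholders,
# not actual AWS prices.

HOURS_PER_YEAR = 24 * 365

def annual_cost_on_demand(nodes, hourly_rate):
    return nodes * hourly_rate * HOURS_PER_YEAR

def annual_cost_reserved(nodes, upfront_per_node, hourly_rate):
    return nodes * (upfront_per_node + hourly_rate * HOURS_PER_YEAR)

# A hypothetical 4-node cluster running around the clock:
on_demand = annual_cost_on_demand(4, 0.85)
reserved = annual_cost_reserved(4, 3000, 0.25)
```

For a cluster that runs continuously, the reserved model wins despite the up-front payment; On-Demand makes sense for intermittent or experimental workloads.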

Stay tuned for part 3 next week. In the meantime, what's your view on Redshift or other tools? Any challenges or projects you want to discuss?