Monday, April 6, 2015

Amazon Redshift Architecture and Pros and Cons

Here, I'd like to share my observations on Amazon Redshift: its architecture, pros and cons, comparisons with on-premise DWH, etc.

One of the offerings that has grabbed the industry's attention is Amazon Redshift from Amazon Web Services.

Amazon Redshift is a fast, fully managed, petabyte-scale, cloud-based data warehouse service from Amazon Web Services (AWS). It is a cost-effective and efficient solution for collecting, storing (ETL), and analyzing all your data using your existing business intelligence tools.

Redshift provides a powerful and cost-effective alternative to traditional on-premise data warehouses.

Similar cloud data warehousing has been offered by companies like 1010data, Datameer, and Vertica, but Amazon's basket of cloud and big data services makes it an attractive platform for enterprises of any size.

Amazon Redshift is a relational database management system (RDBMS) in the cloud and provides the same functionality as other RDBMSs, such as inserting and deleting data. However, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.

Redshift is very impressive feature-wise for a 1.0 product. There is much in Redshift's favor as an alternative to on-premises data warehouses, especially if we understand the distinctive characteristics of the service and use them to our advantage.

Architecture:  The components involved in Redshift are Client Applications, Connections, and Clusters (Leader Node, Compute Nodes, Node Slices/Processors, Internal Network, and Databases).




Client Applications:  Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and with business intelligence (BI) reporting, data mining, and analytics tools. Because Amazon Redshift is based on industry-standard PostgreSQL, it works with most existing SQL client applications.

Connections: Amazon Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers.

Clusters: The core infrastructure component of an Amazon Redshift data warehouse is a cluster. A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication.
Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.

Leader Node: The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes.

Compute Nodes: The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Amazon Redshift provides two node types: dense storage nodes and dense compute nodes. Each node type offers two sizes. You can start with a single 160 GB node and scale up to multiple 16 TB nodes to support a petabyte of data or more.

Node slices:  A compute node is partitioned into slices; one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and execute queries efficiently.
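For illustration, here is a minimal sketch of declaring a distribution key at table creation (the sales table and its columns are hypothetical):

-- Hypothetical example: distribute the rows of a fact table by customer_id,
-- so that rows sharing a customer land on the same node slice.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INTEGER,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id);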

Internal Network: Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.

Databases: A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes. Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.
Note: Internal system operation covers query processing: client requests and execution, the explain plan, and the query execution steps.

Amazon Redshift can be recommended in the following scenario:

If the company's objectives are i) saving cost, ii) avoiding maintenance, iii) scalability, and iv) performance.

Amazon Redshift might make more sense for the following customers:
·         Web startups, whose data is all in the Amazon cloud anyway, and who need better analytic SQL performance than they can get from Hadoop.
·         Data mart outsourcers/data sellers, again probably startups, whose whole business is in the cloud.
·         Individual analysts with small budgets, or very small analytic groups within enterprises or other organizations.
·         Many businesses who don’t have I.T. as a core competency will find this a pretty attractive alternative to running servers or an appliance themselves.

It cannot be recommended in the following scenarios:
i. Security is the main concern (though it does appear to be secure over the cloud).
ii. Data uploading is a concern (e.g., if you have data in multiple data sources and at huge volumes).
iii. BI/ETL is a concern (workarounds are available).

Pros: (Features and Benefits):

1) Attractive pricing and licensing terms:
One of the biggest advantages of Redshift is its low cost; the price is hard to beat. Based on Amazon.com's own use of Redshift, Amazon advertises that Redshift can manage data at around $1,000 per terabyte per year, compared to the estimated $19,000 to $25,000 per terabyte per year that companies are used to shelling out for on-premise data warehouses, roughly one tenth the cost of conventional on-premises deployments.

Having a warehouse in the cloud makes a lot of sense in some ways. Pay-as-you-go pricing is what you want when you prefer operational costs over capital costs.

Note:  Microsoft SQL Server, Oracle, and MySQL are not systems designed for petabyte-scale data warehousing. Traditional data warehousing is "expensive and complicated" and "out of reach" for smaller firms.

2) Performance:

Redshift achieves its high performance through extreme parallelism, columnar data storage, smart data compression, query optimization, and compiled code. Redshift's performance is adequate; some third-party users (Airbnb, FlyData, Xplenty, etc.) have confirmed that they saw a 5x performance improvement over applications like Hive.

a) Massively Parallel Processing (MPP) architecture - Redshift has a Massively Parallel Processing (MPP) architecture, which enables it to distribute and parallelize queries across multiple nodes; it also enables parallel operations for data loads, backups, and restores.

The Redshift architecture is inherently parallel; there is no additional tuning or overhead for distributing the load for end users.

b) Columnar Storage: With columnar storage, the (often huge and repetitive) values of each column of a logical table are split into separate physical storage, which optimizes space and queries.
Redshift provides columnar data storage: all values for a particular column are stored contiguously on disk in sequential blocks.
·         Columnar data storage reduces the number of I/O requests made to the disk compared to traditional row-based data storage.
·         It also reduces the amount of data loaded from the disk, improving processing speed, as more memory is available for query execution.

c) Data Compression: Specific columns can be compressed to speed up I/O and shrink memory requirements. Compressing the data further reduces the amount of I/O required for queries.
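As a rough sketch (the table and the encodings shown are illustrative, not a recommendation), a compression encoding can be chosen per column when the table is defined, or suggested by ANALYZE COMPRESSION:

-- Hypothetical example: per-column compression encodings.
CREATE TABLE web_events (
    event_time   TIMESTAMP    ENCODE delta,    -- mostly increasing values compress well with delta
    country      VARCHAR(30)  ENCODE bytedict, -- low-cardinality text
    page_views   INTEGER      ENCODE lzo,
    raw_referrer VARCHAR(500) ENCODE raw       -- left uncompressed
);
ANALYZE COMPRESSION web_events;  -- asks Redshift to recommend encodings based on a sample of the data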

d) Distribution Keys: Redshift achieves high query performance by distributing data evenly across all the nodes of a cluster and the slices within a node.

Note:  A Redshift cluster is made of multiple nodes and each node has multiple slices. The number of slices is equal to the number of processor cores in a node. Each slice is allocated a portion of the node's memory and disk space. During query execution, the data is distributed across the slices, and the slices operate in parallel to execute the queries.

Note: Sort order for the columns allows for faster predicate lookups, merge joins and aggregate grouping.

e) Compiled code: The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries. The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.

f) Query optimizer: The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.

3) Scalability:  Redshift has the ability to scale to handle huge databases, going well beyond the petabyte range. You can increase the number of nodes within your data warehouse cluster via the AWS Management Console or the ModifyCluster API, and your requested changes will be applied immediately.
•             The existing data warehouse cluster remains available for read operations while a new data warehouse cluster gets created during scaling operations.
•             Amazon Redshift moves data in parallel from the compute nodes in your existing data warehouse cluster to the compute nodes in your new cluster. This enables your operation to complete as quickly as possible.
•             Metrics for compute utilization (CPU), memory utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add additional, user-defined metrics via Amazon Cloudwatch's custom metric functionality.

On-Premise DWH (Data Source - Oracle):  Scaling a data warehouse on-premise means buying and installing new hardware, which takes plenty of time. Cloud instances, however, can scale up immediately, so it's easy to upgrade a dedicated DWH service such as Redshift.

4) Security: Security is built in. Amazon provides various security features, such as data encryption, network isolation, access controls, and hardware security modules.

a) Data Encryption:

You can encrypt data at rest and in transit using hardware-accelerated AES-256 and SSL

·         Data encryption/decryption (transit/rest) - Redshift allows data encryption for all data stored in the cluster as well as SSL encryption for data in transit.
·         Cluster encryption — To encrypt the data in all your user-created tables, you can enable cluster encryption when you launch the cluster
·         SSL connections — To encrypt the connection between your SQL client and your cluster, you can use secure sockets layer (SSL) encryption
·         Load data encryption — To encrypt your table load data files when you upload them to Amazon S3, you can use either server-side encryption or client-side encryption. When you load from server-side encrypted data, Amazon S3 handles decryption transparently. When you load from client-side encrypted data, the Amazon Redshift COPY command decrypts the data as it loads the table (see the sketch after this list).
·         Data in transit — To protect your data in transit within the AWS cloud, Amazon Redshift uses hardware accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.
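A rough sketch of the load-data encryption case (the bucket, key placeholders, and table name are made up); COPY decrypts client-side encrypted files as it loads them:

-- Hypothetical sketch: COPY from client-side encrypted files in S3.
COPY sales
FROM 's3://my-bucket/encrypted/sales_'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>;master_symmetric_key=<base64-key>'
ENCRYPTED
DELIMITER '|';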

b) Network Isolation:

You can isolate your clusters using Amazon VPC (Virtual Private Cloud).
Amazon VPC - Redshift can be launched in an Amazon VPC, and you can define VPC security groups to restrict inbound access to your clusters.

c) Access controls:
·         Sign-in Controls
·         Access control through Groups/Users and Roles - Access Control can be maintained at the account level using IAM (Identity and Access management) roles.
·         For database-level access control, you can define Redshift database groups and users and restrict access to specific databases and tables. Users and groups can be created to grant privileges on specific data in tables.
·         To grant other users inbound access to an Amazon Redshift cluster, you define a cluster security group and associate it with a cluster.


d) Hardware Security Modules:
You can manage your keys using hardware security modules (HSMs).

e) Logs/Audit:
All API calls, connection attempts, queries and changes to the cluster are logged and auditable.

5) Control:  With Amazon VPC, you can define a virtual network topology that closely resembles a traditional network that you might operate in your own datacenter. This gives you complete control over who can access your Amazon Redshift data warehouse cluster.

a) Cluster management – The ability to create, configure, and delete clusters is controlled by the permissions given to the user or account associated with your AWS security credentials.
Virtualized environment: Because Redshift is built on PostgreSQL, which can run on commodity hardware and virtualized platforms, managing it in the cloud and supporting disaster recovery plans is inherently taken care of.

b) Database access – Access to resources within an Amazon Redshift database is controlled, and you can only access the database objects, such as tables and views, that your user ID has been granted permission to access. Creating these Amazon Redshift user accounts and managing their permissions is done using the CREATE USER, CREATE GROUP, GRANT, and REVOKE SQL statements.
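A minimal sketch, with made-up group, user, and table names:

-- Hypothetical example: database-level access control.
CREATE GROUP reporting_users;
CREATE USER report_user PASSWORD 'Str0ngPassw0rd' IN GROUP reporting_users;
GRANT SELECT ON TABLE sales TO GROUP reporting_users;
REVOKE ALL ON TABLE sales FROM PUBLIC;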

c) DB-level control - Using the two types of keys (distribution key and sort key/queried key) gives the user strong control over the data layout of a table across the cluster.
·         The distribution key decides how a table's data is distributed across nodes.
·         A sort key can be chosen for a table when its data needs to be stored in sorted order to support range queries.
·         The general guideline is that the most frequently accessed (joined) column of a table becomes the distribution key, and the second most frequently queried column becomes the sort key (see the sketch below).
·         This level of control over sort keys is not provided by any competitor.
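Following that guideline, a sketch (the order_facts table and its columns are hypothetical):

-- Hypothetical example: most frequently joined column as the distribution key,
-- range-filtered column as the sort key.
CREATE TABLE order_facts (
    order_id    BIGINT,
    customer_id INTEGER,       -- joined against most often
    order_date  DATE,          -- commonly filtered by date range
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (order_date);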

d) Others:
Redshift supports clusters from a single node up to 100 nodes, with up to 1.6 PB of storage (depending on the processing and storage capacity required).
Redshift nodes come in two sizes, XL and 8XL.
An XL node comes with 2 TB of attached storage and an 8XL node comes with 16 TB of attached storage.
Clusters can have a maximum of 32 XL nodes (64 TB) or 100 8XL nodes (1.6 PB).

6) Maintenance

a) The setup process is very easy and fast (it took just a few minutes to provision a 16-node cluster), and you can connect to the system via any Postgres-compliant client.
b) Cluster and database configuration is performed through the web UI, or via SQL statements over a JDBC/ODBC connection. You do not have access to the command line on any of the Redshift servers.
c) Web UI: Queries are graphed along with performance metrics (e.g., query duration, a list of all queries executed, the execution plan for every query, etc.).

7) Data Integration and Loading:
a) Bulk load: Flat files can be used to insert data into Redshift tables in bulk and in parallel (see the sketch below).
b) Standard commands and utilities for analyzing query execution plans, monitoring the database, locking tables, and setting transaction isolation levels are available.
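For example (the bucket, credentials, and table name are made up), a parallel bulk load from S3 uses the COPY command:

-- Hypothetical example: parallel bulk load of gzip-compressed, pipe-delimited flat files from S3.
-- COPY loads all files matching the prefix in parallel across the node slices.
COPY sales
FROM 's3://my-bucket/data/sales_'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|'
GZIP;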
8) Consuming data - Amazon claims that Redshift integrates seamlessly with a long list of reporting applications (including Tableau).

Cons:
1) DB Related:

a) Redshift supports round-robin and hash distribution methods. A range distribution approach is not implemented by Redshift.

b) No index support: Partitioning and sort keys are the only mechanisms provided by Redshift to support optimized filters and joins on a table. For filtering data on any other column, the lookups need to span all the Redshift nodes. Also, certain data types, such as timestamp with time zone and arrays, are not supported in Redshift.

c) Redshift does not automatically reclaim space that is freed by deletes or updates. When data is deleted or updated, these rows need to be flushed out explicitly by running the VACUUM command. Databases such as Oracle have a much more efficient mechanism for dealing with deletes and updates, using rollback segments.
Although updates and deletes are rare in a data warehousing environment, corner cases such as handling late-arriving facts need to be analyzed with respect to this limitation.
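For instance (the table name is hypothetical):

-- Reclaim space and re-sort rows after heavy deletes/updates, then refresh planner statistics.
VACUUM sales;
ANALYZE sales;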

d) Dummy foreign and unique keys: Redshift stores only the metadata for primary, foreign, and unique keys; other constraints, such as check constraints, are not supported. Redshift cannot enforce foreign and unique key constraints, but they are effectively used by the optimizer to generate good execution plans.
Note: Redshift also does not support the creation of secondary indexes on columns.

e) The distribution key cannot be changed once a table is created. To change a distribution key, the only workaround is to create a new table with the updated distribution key, load data into this table, and rename it to the original table name after deleting the original table (see the sketch after this list).
Note: You can only specify one distribution key, and if you are joining against multiple columns at large scale, you might notice performance degradation.
f) Other columnar databases provide the ability to have multiple projections of a single table, each with a different encoding and sort order, but Redshift requires you to do this in a separate table.
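A sketch of the workaround from point e), using a hypothetical sales table:

-- Hypothetical sketch: "changing" the distribution key via a new table, a deep copy, and a rename.
CREATE TABLE sales_new (
    sale_id     BIGINT,
    customer_id INTEGER,
    amount      DECIMAL(12,2)
)
DISTKEY (sale_id);                          -- the new distribution key
INSERT INTO sales_new SELECT * FROM sales;
DROP TABLE sales;
ALTER TABLE sales_new RENAME TO sales;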
2) MapReduce integration with SQL: Redshift does not support the MapReduce programming paradigm through SQL operators, which would enable interoperability between SQL and other programming languages while providing scalability.
3) Redshift does not support multi-AZ deployments (it supports only single-AZ deployments), but you can always have a cluster running in a different AZ, accessing the same set of data from S3.
Note:  An Availability Zone (AZ) runs on its own physically distinct, independent infrastructure and is engineered to be highly reliable.
4) Data loading: Data migration and integration carry real costs; in many instances, you'll need huge amounts of bandwidth to transmit the data from internal systems to the cloud-hosted Redshift, or you'll be shipping USB drives via FedEx to Amazon Web Services.
a) Parallel uploads to Redshift are supported only for data stored in Amazon S3 and DynamoDB.
Redshift does not support loading data in parallel from other sources. You will either have to use INSERT statements or write scripts to first load data into S3 and then into Redshift. This can be a complex process, depending on the size and format of your data.
b) In order to load your data into Redshift, it has to be in either S3 or DynamoDB already. The default data loading is single-threaded and could take a long time to load all your data.
c) Redshift only supports one null string when loading data. If you have multiple 'NULL' representations in your data, you need to load them as strings into a temp table and cast them back to NULL (see the sketch after this list).
d) Data in JSON format needs to be converted into flat files, since JSON is not supported by Redshift.
e) Not having direct access to the server back end means that bulk data loads cannot be performed from the local file system. Redshift handles this by providing direct access to files stored in S3.
f) Uploading data from one source isn't complicated, but the delays and complexities of data movement multiply as the number of sources increases.

g) If data is not already in the Amazon cloud, getting it there remains a pain. Locating your analytic RDBMS on the same premises where the data is created makes life simpler.
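A sketch of the NULL workaround from point c), with made-up staging and target tables:

-- Hypothetical sketch: load mixed NULL markers as plain text, then cast them back to NULL.
CREATE TEMP TABLE stage_customers (name VARCHAR(100), city VARCHAR(100));
COPY stage_customers
FROM 's3://my-bucket/customers/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER ',';
INSERT INTO customers
SELECT NULLIF(name, 'NULL'), NULLIF(NULLIF(city, 'NULL'), 'N/A')
FROM stage_customers;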
5) Query concurrency in a cluster - Redshift enforces a query concurrency limit of 15 on a cluster and a maximum of 8 queues; the number of concurrent queries for a cluster across all queues is limited to a maximum of 15, and users cannot raise this limit.
Note: Queries are executed in a queue. By default there is one queue per cluster, which can run up to five concurrent queries. Users can modify the configuration to allow up to 15 queries per queue and a maximum of 8 queues.
6) Schema migration - Even though Redshift is based on Postgres 8.0, the "subtle" differences are big enough to force you into the Redshift way of doing things.
7) Risks:
a) Amazon's new Redshift data warehousing service is billed as a game changer, but the supplier of its underlying database says lots of customers won't go for the cloud service.
b) On security, Amazon insists it has all the corporate concerns covered with Redshift and its other cloud services. But in the financial services industry, where ParAccel counts Goldman Sachs and Royal Bank of Scotland as marquee customers, Berger said many companies "just aren't going to move their data or their applications into a public cloud."
The biggest issues holding back Redshift are the concerns that come along with using the public cloud in general, as some just don't feel comfortable putting sensitive financial or personally identifiable data in anyone's public cloud.
8) On performance - when it comes to getting data into and out of the database, it's obviously faster to load data in your own data center than to send it across even the highest-performance connectivity solutions Amazon has available, and latency will be minimal compared to Amazon Redshift.
Another performance differentiator is the availability of high-speed connectors for Oracle and Teradata databases and for Hadoop as data sources to ParAccel. These connectors have not been licensed by Amazon -- not yet.
9) ETL and BI:
a) ETL is non-trivial - it took us quite a while to figure out how to import data from Hive, especially for big tables, and to resolve data consistency issues.

b) BI systems will also have to operate in the cloud in order to avoid the potentially time-consuming step of moving data back and forth between on-premises systems and the cloud. Running BI in the cloud will add to the total cost, and not all industry-leading BI tools are supported (exceptions include MicroStrategy, Jaspersoft, Informatica, etc.).
If you stick with running BI in your data center, you have to remember that bringing data back on premises for analysis will introduce additional delays over and above the time required to load data into the cloud.
Limits:
•             There are fixed limits on the number of nodes that you can provision for a cluster, even if you request a quota increase. These limits differ by node type.
•             The number of permanent tables that you can create per cluster is 9990.
•             The number of user-defined databases you can create per cluster is 60.
•             The number of schemas you can create per database is 256.
•             The number of concurrent user connections that can be made to a cluster is 500.
•             The number of AWS accounts you can authorize to restore a snapshot is 20.

Other Observations:

Amazon Redshift is fully managed, so you no longer need to worry about provisioning hardware or about installation, configuration, or patching of system or database software. Because your business depends on your data, Amazon Redshift takes care to protect it by replicating all data within the cluster as well as in S3.
Despite the potential for big data analysis, Amazon seems intent on highlighting the potential for small and midsize companies to get into data warehousing at a very low cost.

Comparing with Hadoop:

1) Scalability: If your big data goes beyond 1.6 PB, Redshift will not do. Also, when scaling Amazon's clusters, the data needs to be reshuffled amongst the machines. This can take several days and plenty of CPU power, slowing your system for regular operations.
Hadoop scales to as many petabytes as you want, all the more so on the cloud. Scaling Hadoop doesn’t require reshuffling since new data will simply be saved on the new machines. In case you do want to balance the data, there is a rebalancer utility available.

2) Performance: According to several performance tests by the FlyData and Airbnb teams, a 16-node Redshift cluster performed a lot faster than a 44-node Hive/Elastic MapReduce cluster. Unlike Redshift, Hadoop doesn't have hard resource limitations for maintenance tasks.

3) Pricing:
Redshift - The $1,000/TB/year price only applies to a 3-year reserved XL node with 2 TB of storage in US East (Northern Virginia). Working with the same node and the same region on-demand costs $3,723/TB/year, more than triple the price. Choosing the Asia Pacific region costs even more.
On-premise Hadoop is definitely more expensive, at about $5,040/TB/year including maintenance and everything else.

4) Ease of Use: Redshift has automated tasks for data warehouse administration and automatic backups to Amazon S3, so transitioning to Redshift should be easy. Whereas for Hadoop, your system administrators will need to learn the Hadoop architecture and tools, and your developers will need to learn coding in Pig or MapReduce.

5) Data Format:
Redshift only accepts flat text files in a fixed format such as CSV.
Redshift only supports certain data types. The serial data type, arrays, and XML are unsupported at the moment.
Redshift doesn't support multiple NULL representations in your data either.
Hadoop accepts every data format and data type imaginable.

6) Data Storage:
Data for Redshift can only be loaded from Amazon S3 or DynamoDB, so you'll need to spend extra time preparing and uploading the data.
Redshift loads data via a single thread by default, so it could take some time to load.
Amazon suggests S3 best practices to speed up the process such as splitting the data into multiple files, compressing them, using a manifest file, etc. Moving the data to DynamoDB is of course a bigger headache, unless it’s already there.
Life is more flexible with Hadoop. You can store data on local drives, in a relational database, or in the cloud (S3 included), and then import them straight into the Hadoop cluster.
As Airbnb concluded in their benchmark: "We don’t think Redshift is a replacement of the Hadoop family due to its limitations, but rather it is a very good complement to Hadoop for interactive analytics."

Clarifications from Amazon regarding the Security Concerns:

i. Amazon Redshift starts with the security foundation underlying all AWS services. There are physical controls for datacenter access.
ii. Machines are wiped before they are provisioned.
iii. Physical media is destroyed before leaving our facilities.
iv. Drivers, BIOS, and NICs are protected.
v. Access by our staff is monitored, logged, and audited.
vi. Multiple firewalls, known as security groups, are configured automatically to control access to customer data warehouse clusters.
vii. Customers can explicitly set up ingress and egress rules or place their SQL endpoint inside their own VPC, isolating it from the rest of the AWS cloud. For multi-node clusters, the nodes storing customer data are isolated in their own security group, preventing direct access by the customer or the rest of the AWS network.
viii. Customers can require SSL for their own access to their cluster, while the AWS operations to monitor and manage the cluster are always secured by SSL.
ix. By using unique keys per block and per cluster, Amazon Redshift dramatically reduces the cost of key rotation and helps prevent unauthorized splicing of data from different blocks or clusters.

Gray Areas (Amazon Redshift)

a) Performance
i. No benchmarking statistics are available for data uploads.
ii. BI/query benchmarks may not guarantee the same results with other BI tools.

b) Data Uploading/Integration
iii. Parallel operations are available for data loads from S3 to Redshift, but not from the customer's source to S3.
iv. No proven customer use cases with respect to end-to-end (E2E) implementation.

c) Cost
v. Cost details for third-party BI tools over the cloud.
vi. Data retention/storage costs (beyond the provisioned storage).