Here, I'd like to share my observations on Amazon Redshift: its architecture, pros and cons, comparisons with on-premise DWHs, and more.
One of the good sales offers that has grabbed the industry's attention is Amazon Redshift.
Amazon Redshift is a cloud-based, fast, fully managed, petabyte-scale data warehouse service from Amazon Web Services (AWS). It is a cost-effective and efficient solution to collect, store (ETL), and analyze all your data using your existing business intelligence tools.
Redshift provides a powerful and cost-effective alternative to traditional on-premise data warehouses.
Similar cloud data warehousing has been offered by companies like 1010data, Datameer, and Vertica, but Amazon's basket of cloud and big data services makes it a lucrative platform for enterprises of any size.
Amazon Redshift is a relational database management system (RDBMS) in the cloud, and it provides the same functionality as other RDBMSs, such as inserting and deleting data. However, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.
Redshift is very impressive feature-wise for a 1.0 product. There is much in Redshift's favor as an alternative to on-premises data warehouses, especially if we understand the distinctive characteristics of the service and how to use them to our advantage.
Architecture: The components involved in Redshift are client applications, connections, clusters (leader node, compute nodes, node slices/processors), the internal network, and databases.
Client Applications: Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Because Amazon Redshift is based on industry-standard PostgreSQL, it works with most existing SQL client applications.
Connections: Amazon Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers.
Clusters: The core infrastructure component of an Amazon Redshift data warehouse is a cluster. A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication.
Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.
Leader Node: The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes.
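For example, CURRENT_SCHEMA() is one such leader-node-only function; a minimal sketch based on the AWS documentation's description of this behavior (the table name is illustrative):

-- Runs fine: executes entirely on the leader node.
SELECT CURRENT_SCHEMA();

-- Returns an error: mixes a leader-node-only function with a table
-- whose data resides on the compute nodes.
SELECT CURRENT_SCHEMA(), userid
FROM users;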
Compute Nodes: The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Amazon Redshift provides two node types: dense storage nodes and dense compute nodes, each available in two sizes. You can start with a single 160 GB node and scale up to multiple 16 TB nodes to support a petabyte of data or more.
Node slices: A compute node is partitioned into slices; one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for the table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and execute queries efficiently.
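As a minimal sketch (the table and column names are hypothetical), a distribution key is declared at table creation time:

-- Rows with the same customer_id are routed to the same node slice,
-- so joins and aggregations on customer_id avoid cross-node data movement.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INTEGER,
    amount      DECIMAL(12,2),
    sale_date   DATE
)
DISTKEY (customer_id);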
Internal Network: Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.
Databases: A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes. Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.
Note: Internal system operation covers query processing: client requests and execution, the explain plan, and the query execution steps.
Amazon Redshift can be recommended in the following scenario: when the company's objectives are i) saving cost, ii) avoiding maintenance, iii) scalability, and iv) performance.
Amazon Redshift might make more sense for the following customers:
· Web startups, whose data is all in the Amazon cloud anyway, and who need better analytic SQL performance than they can get from Hadoop.
· Data mart outsourcers/data sellers, again probably startups, whose whole business is in the cloud.
· Individual analysts with small budgets, or very small analytic groups within enterprises or other organizations.
· Many businesses that don't have I.T. as a core competency and will find this a pretty attractive alternative to running servers or an appliance themselves.
It cannot be recommended in the following scenarios:
i. Security is the main concern (though it seems to be secure over the cloud).
ii. Data uploading concerns (e.g., if you have data in multiple data sources and huge volume).
iii. BI/ETL concerns (work-arounds are available).
Pros (Features and Benefits):
1) Attractive pricing and licensing terms:
One of the biggest advantages of Redshift is its low cost; the price is unbeatable. Based on Amazon.com's own use of Redshift, it can manage data at around $1,000 per terabyte of data per year, compared to $19,000 to $25,000 per terabyte of data per year for an on-premise data warehouse.
Amazon advertises less than $1,000 per terabyte per year, roughly one tenth the cost of conventional on-premises data warehouse deployments.
Having a warehouse in the cloud makes a lot of sense in some ways. Pay-as-you-go pricing is what you want when you prefer operational costs over capital costs.
Note: Smaller firms typically run Microsoft SQL Server, Oracle, or MySQL, and those aren't systems designed for petabyte-scale data warehousing; traditional data warehousing is "expensive and complicated" and "out of reach" for them.
2) Performance:
Redshift achieves its high performance through extreme parallelism, columnar data storage, smart data compression, query optimization, and compiled code. Redshift's performance is adequate: some third-party vendors (Airbnb, FlyData, Xplenty, etc.) confirmed that they saw a 5x performance improvement over other applications like Hive.
a) Massively Parallel Processing (MPP) Architecture - Redshift has a Massively Parallel Processing (MPP) architecture, which enables Redshift to distribute and parallelize queries across multiple nodes and also enables parallel operations for data loads, backups, and restores.
Redshift's architecture is inherently parallel; there is no additional tuning or overhead for distribution of loads for the end users.
b) Columnar Storage: Columnar storage splits the columns of a logical table into separate physical storage, which optimizes space (huge/repetitive column values compress well) and queries.
Redshift provides columnar data storage: all values for a particular column are stored contiguously on disk in sequential blocks.
· Columnar data storage helps reduce the I/O requests made to the disk compared to traditional row-based data storage.
· It also reduces the amount of data loaded from the disk, improving processing speed, as more memory is available for query execution.
c) Data Compression: Specific columns can be compressed to speed up I/O and shrink memory requirements. Compression of data further reduces the amount of I/O required for queries.
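A hedged sketch of per-column encodings (the table, columns, and encoding choices are illustrative):

CREATE TABLE page_views (
    view_date DATE         ENCODE runlength,  -- long runs of repeated dates
    user_id   INTEGER      ENCODE delta,      -- small differences between values
    url       VARCHAR(300) ENCODE lzo         -- general-purpose compression for text
);

-- Ask Redshift to recommend encodings based on a sample of existing data.
ANALYZE COMPRESSION page_views;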
d) Distribution Keys: Redshift achieves high query performance by distributing data evenly on all the nodes of a cluster and the slices within a node.
Note: A Redshift cluster is made of multiple nodes and each node has multiple slices. The number of slices is equal to the number of processor cores in a node. Each slice is allocated a portion of node’s memory and disk space. During query execution the data is distributed across slices, the slices operate in parallel to execute the queries.
Note: Sort order for the columns allows for faster predicate lookups, merge joins and aggregate grouping.
e) Compiled code: The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries. The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.
f) Query optimizer: The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.
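For instance, you can inspect the optimizer's plan before running a query; a small sketch reusing the hypothetical sales table from earlier:

EXPLAIN
SELECT s.customer_id, SUM(s.amount)
FROM sales s
GROUP BY s.customer_id;
-- The plan output shows the chosen aggregation strategy and whether
-- data must be redistributed across nodes to answer the query.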
3) Scalability: Redshift has the ability to scale to handle huge databases, perhaps well beyond the petabyte range. You can increase the number of nodes within your data warehouse cluster via the AWS Management Console or the Modify Cluster API, and your requested changes will be applied immediately.
• The existing data warehouse cluster remains available for read operations while a new data warehouse cluster gets created during scaling operations.
• Amazon Redshift moves data in parallel from the compute nodes in your existing data warehouse cluster to the compute nodes in your new cluster. This enables your operation to complete as quickly as possible.
• Metrics for compute utilization (CPU), memory utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add additional, user-defined metrics via Amazon CloudWatch's custom metric functionality.
On-Premise DWH (Data Source - Oracle): Scaling a data warehouse on-premise means buying and installing new hardware, which takes plenty of time. Cloud instances, however, can scale up immediately, so it's easy to upgrade dedicated DWH services such as Redshift.
4) Security: Security is built in. Amazon provides various security features to handle security, such as data encryption, network isolation, access controls, and hardware security modules.
a) Data Encryption: You can encrypt data at rest and in transit using hardware-accelerated AES-256 and SSL.
· Data encryption/decryption (in transit/at rest) - Redshift allows encryption of all data stored in the cluster as well as SSL encryption for data in transit.
· Cluster encryption — To encrypt the data in all your user-created tables, you can enable cluster encryption when you launch the cluster.
· SSL connections — To encrypt the connection between your SQL client and your cluster, you can use secure sockets layer (SSL) encryption.
· Load data encryption — To encrypt your table load data files when you upload them to Amazon S3, you can use either server-side encryption or client-side encryption. When you load from server-side encrypted data, Amazon S3 handles decryption transparently. When you load from client-side encrypted data, the Amazon Redshift COPY command decrypts the data as it loads the table (see the sketch below).
· Data in transit — To protect your data in transit within the AWS cloud, Amazon Redshift uses hardware-accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.
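A hedged sketch of loading client-side-encrypted files (the bucket, table, and credential values are placeholders):

-- COPY decrypts the files using the supplied symmetric key as it loads the table.
COPY sales
FROM 's3://my-bucket/encrypted-loads/sales_part_'
CREDENTIALS 'aws_access_key_id=<id>;aws_secret_access_key=<secret>;master_symmetric_key=<base64-key>'
ENCRYPTED
DELIMITER ',';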
b) Network Isolation: You can isolate your clusters using Amazon VPC (Virtual Private Cloud).
Amazon VPC - Redshift can be launched in an Amazon VPC. You can define VPC security groups to restrict inbound access to your clusters.
c) Access controls:
· Sign-in controls.
· Access control through groups/users and roles - access control can be maintained at the account level using IAM (Identity and Access Management) roles.
· For database-level access control, you can define Redshift database groups and users and restrict access to specific databases and tables. Users and groups can be created to grant privileges to specific data in tables.
· To grant other users inbound access to an Amazon Redshift cluster, you define a cluster security group and associate it with a cluster.
d) HW Security Modules: You can manage your keys using hardware security modules (HSMs).
e) Logs/Audit: All API calls, connection attempts, queries, and changes to the cluster are logged and auditable.
5) Control: With Amazon VPC, you can define a virtual network topology that closely resembles a traditional network that you might operate in your own datacenter. This gives you complete control over who can access your Amazon Redshift data warehouse cluster.
a) Cluster management – The ability to create, configure, and delete clusters is controlled by the permissions given to the user or account associated with your AWS security credentials.
Virtualized environment: As Redshift is built on PostgreSQL, which can run on commodity hardware and virtualized platforms, managing it in the cloud and supporting disaster-recovery plans is inherently taken care of.
b) Database access – Access to resources within an Amazon Redshift database is controlled, and you can only access the database objects, such as tables and views, that your user ID has been granted permission to access. Creating these Amazon Redshift user accounts and managing their permissions is done using the CREATE USER, CREATE GROUP, GRANT, and REVOKE SQL statements.
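A minimal sketch of these statements (all names are illustrative):

CREATE GROUP analysts;
CREATE USER report_user PASSWORD 'Str0ngPassw0rd' IN GROUP analysts;

-- Grant the group read access to one schema and one table only.
GRANT USAGE  ON SCHEMA sales_mart       TO GROUP analysts;
GRANT SELECT ON TABLE  sales_mart.sales TO GROUP analysts;

-- Withdraw any access previously granted on a sensitive table.
REVOKE ALL ON TABLE sales_mart.customers FROM GROUP analysts;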
c) DB-level control - Using the two types of keys (distribution key and sort key/queried key) gives the user complete control of a table's data layout across the cluster.
· The distribution key decides how to distribute a table's data across nodes.
· A sort key can be chosen for a table when its data needs to be stored in sorted order to support range queries.
· The general guideline is that the most frequently accessed key of a table becomes the distribution key, and the second most queried key is set as the sort key (see the sketch below).
· This control over sort keys is not provided by any competitor.
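Following that guideline, a hypothetical orders table might look like this:

-- customer_id: most frequently joined column   -> distribution key
-- order_date:  supports range-restricted scans -> sort key
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id INTEGER,
    order_date  DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (order_date);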
d) Others: Redshift supports clusters from a single node up to 100 nodes, with up to 1.6 PB of storage (depending on the processing and storage capacity required). Redshift nodes come in two sizes, XL and 8XL. An XL node comes with 2 TB of attached storage and an 8XL node with 16 TB. Clusters can have a maximum of 32 XL nodes (64 TB) or 100 8XL nodes (1.6 PB).
6) Maintenance:
a) The setup process is very easy and fast (it took just a few minutes to provision a 16-node cluster), and you can connect to the system via any Postgres-compliant client.
b) Cluster and database configuration is performed through the web UI, or via SQL statements over a JDBC/ODBC connection. You do not have access to the command line on any of the Redshift servers.
c) Web UI: Queries are graphed along with performance metrics (e.g., query duration, a list of all queries executed, the execution plan for every query, etc.).
7) Data Integration and Loading:
a) Bulk Load: Flat files can be used to insert data in bulk and parallel mode into Redshift tables (see the sketch at the end of this section).
b) Standard commands and utilities for analyzing query execution plans, monitoring the database, locking tables, and setting transaction isolation levels are available.
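A hedged sketch of such a bulk load from flat files staged in S3 (the bucket, prefix, and credentials are placeholders):

-- Splitting the input into multiple compressed files lets the slices
-- load the files in parallel.
COPY sales
FROM 's3://my-bucket/loads/sales_part_'
CREDENTIALS 'aws_access_key_id=<id>;aws_secret_access_key=<secret>'
DELIMITER ','
GZIP;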
8) Consuming data - Amazon claims Redshift integrates seamlessly with a list of reporting apps (including Tableau).
Cons:
1) DB Related:
a) Redshift supports round-robin and hash distribution methods; the range distribution approach is not implemented.
b) No Index Support: Partitioning and sort keys are the only mechanisms provided by Redshift to support optimized filters and joins on a table. For filtering data on any other column, the lookups need to span all the Redshift nodes. Also, the timestamp type and arrays are not supported in Redshift.
c) Redshift does not automatically reclaim space that is freed by deletes or updates. When data is deleted or updated, these rows need to be flushed out explicitly by issuing the VACUUM command. Databases such as Oracle have a much more efficient mechanism for dealing with deletes and updates by using ROLLBACK segments.
Although updates/deletes are rare in a data warehousing environment, corner cases for handling late-arriving facts need to be analyzed with respect to this limitation.
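A minimal sketch of the explicit reclamation step (the table name is illustrative):

DELETE FROM sales WHERE sale_date < '2012-01-01';

VACUUM sales;   -- reclaims space freed by the delete and restores sort order
ANALYZE sales;  -- refreshes the optimizer's statistics after the change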
d) Dummy Foreign and Unique Keys: Redshift supports only the metadata for primary, foreign, and unique keys; other constraints, such as check constraints, are not supported. It cannot enforce foreign and unique key constraints, but they are effectively used by the optimizer to generate good execution plans.
Note: Redshift also does not support the creation of secondary indexes on columns.
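To illustrate the unenforced constraints, a hypothetical sketch:

CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,  -- recorded as metadata, used by the planner
    name        VARCHAR(100)
);

INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO customers VALUES (1, 'Alice again');  -- succeeds: the key is not enforced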
e) The distribution key cannot be changed once a table is created. To change a distribution key, the only work-around is to create a new table with the updated distribution key, load the data into this table, and rename it to the original table name after deleting the original table (see the sketch after the note below).
Note: You can only specify one distribution key, and if you are joining against multiple columns on a large scale, you might notice performance degradation.
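The rename work-around, sketched with the hypothetical sales table from earlier:

-- Recreate the table with the new distribution key...
CREATE TABLE sales_new (
    sale_id     BIGINT,
    customer_id INTEGER,
    amount      DECIMAL(12,2),
    sale_date   DATE
)
DISTKEY (sale_id);

-- ...reload the data, then swap the tables.
INSERT INTO sales_new SELECT * FROM sales;
DROP TABLE sales;
ALTER TABLE sales_new RENAME TO sales;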
f) Other columnar databases provide the ability to have multiple projections of a single table, each with different encodings and sort orders, but Redshift requires you to do this in a separate table.
2) MR integration with SQL: Redshift does not support the Map-Reduce programming paradigm as SQL operators to enable interoperability between SQL and other programming languages while providing scalability.
3) Redshift does not support multi-AZ deployments (only single-AZ deployments), but you can always have a cluster running in a different AZ, accessing the same set of data from S3.
Note: An Availability Zone (AZ) runs on its own physically distinct, independent infrastructure and is engineered to be highly reliable.
4) Data loading: The costs of data migration and integration are significant; in many instances, you'll need huge amounts of bandwidth to transmit the data from internal systems to the cloud-hosted Redshift, or you'll be shipping USB drives via FedEx to Amazon Web Services.
a) Parallel uploads to Redshift are supported only for data stored in Amazon S3 and DynamoDB. Redshift does not support loading data in parallel from other sources; you will either have to use INSERT statements or write scripts to first load the data into S3 and then into Redshift. This can sometimes be a complex process, depending on the size and format of your data.
b) In order to load your data into Redshift, it has to be in either S3 or DynamoDB already. Data loading is single-threaded by default and can take a long time.
c) Redshift only supports one null string when loading data. If you have multiple 'NULL' representations in your data, you need to load them as strings into a temp table and cast them back to NULL (see the sketch below).
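A hedged sketch of that temp-table work-around (the names and null strings are illustrative; the COPY credentials are placeholders):

-- Stage everything as text so the load cannot fail on the null strings.
CREATE TEMP TABLE stage_events (event_id INTEGER, payload VARCHAR(200));

COPY stage_events
FROM 's3://my-bucket/events/'
CREDENTIALS 'aws_access_key_id=<id>;aws_secret_access_key=<secret>'
DELIMITER ',';

-- Map each null representation back to a real NULL on the way in.
INSERT INTO events
SELECT event_id,
       CASE WHEN payload IN ('NULL', 'null', 'N/A') THEN NULL ELSE payload END
FROM stage_events;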
d) Data in JSON format needs to be converted into flat files, since JSON is not supported by Redshift.
e) Not having direct access to the server back end means that bulk data loads cannot be performed from the local file system. Redshift handles this by providing direct access to files stored in S3.
f) Uploading data from one source isn't complicated, but the delays and complexities of data movement multiply as the number of sources increases.
g) If data is not already in the Amazon cloud, getting it there remains a pain. Locating your analytic RDBMS on the same premises where the data is created makes life simpler.
5) Query concurrency in a cluster - Redshift enforces a query concurrency limit of 15 on a cluster and a maximum of 8 queues. The concurrent queries for a cluster across all queues are limited to a maximum of 15, and users cannot modify this overall limit.
Note: Queries are executed in a queue; by default there is one queue per cluster, which can run up to five concurrent queries. Users can modify the configuration to allow up to 15 queries per queue and a maximum of 8 queues.
6) Schema migration - Even though Redshift is based on Postgres 8.0, the "subtle" differences are big enough to force you into the Redshift way of doing things.
7) Risks:
a) Amazon's new Redshift data warehousing service is billed as a game changer, but the supplier of its underlying database says lots of customers won't go for the cloud service.
b) On security, Amazon insists it has all the corporate concerns covered with Redshift and its other cloud services. But in the financial services industry, where ParAccel counts Goldman Sachs and Royal Bank of Scotland as marquee customers, Berger said many companies "just aren't going to move their data or their applications into a public cloud."
The biggest issues holding back Redshift are the concerns that come along with using the public cloud in general, as some just don't feel comfortable putting sensitive financial or personally identifiable data in anyone's public cloud.
8) On performance - when it comes to getting data into and out of the database, it's obviously faster to load data in your own data center than to send it across even the highest-performance connectivity solutions that Amazon has available, and there will be minimal latency compared to Amazon Redshift.
Another performance differentiator is the availability of high-speed connectors for Oracle and Teradata databases and for Hadoop as a data source to ParAccel. These connectors have not been licensed by Amazon -- not yet.
9) ETL and BI:
a) ETL is non-trivial - it took us quite a while to figure out how to import data from Hive, especially for big tables, and to deal with data consistency issues.
b) BI systems will also have to operate in the cloud in order to avoid the potentially time-consuming step of moving data back and forth between on-premises systems and the cloud. Running BI in the cloud will add to total cost, and not all industry-leading BI tools are supported (exceptions include MicroStrategy, Jaspersoft, Informatica, etc.).
If you stick with running BI in your data center, you have to remember that bringing data back on premises for analysis will introduce additional delays over and above the time required to load data into the cloud.
Limits:
• There are fixed limits on the number of nodes that you can provision for a cluster, even if you request a quota increase. These limits differ by node type.
• The number of permanent tables that you can create per cluster is 9,990.
• The number of user-defined databases you can create per cluster is 60.
• The number of schemas you can create per database is 256.
• The number of concurrent user connections that can be made to a cluster is 500.
• The number of AWS accounts you can authorize to restore a snapshot is 20.
Other Observations:
Amazon Redshift is fully managed, so you no longer need to worry about provisioning hardware or installing, configuring, and patching system or database software. Because your business depends on your data, Amazon Redshift takes care to protect it by replicating all data within the cluster as well as in S3.
Despite the potential for big data analysis, Amazon seemed intent to highlight the potential for small and midsize companies to get into data warehousing at a very low cost.
Comparing with Hadoop:
1) Scalability: If your Big Data goes beyond 1.6PB, Redshift will not do. Also, when scaling Amazon’s clusters, the data needs to be reshuffled amongst the machines. It could take several days and plenty of CPU power, thus slowing your system for regular operations.
Hadoop scales to as many petabytes as you want, all the more so on the cloud. Scaling Hadoop doesn't require reshuffling, since new data will simply be saved on the new machines. In case you do want to balance the data, there is a rebalancer utility available.
2) Performance: According to several performance tests by the FlyData and Airbnb nerds, a Redshift 16-node cluster performed a lot faster than a Hive/Elastic MapReduce 44-node cluster. Unlike Redshift, Hadoop doesn't have hard resource limitations for maintenance tasks.
3) Pricing: Redshift - paying $1,000/TB/year only applies to 3 years of a reserved XL node with 2 TB of storage in US East (North Virginia). Working with the same node in the same region on-demand costs $3,723/TB/year, more than triple the price. Choosing the Asia Pacific region costs even more.
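For illustration, assuming the then-current on-demand rate of about $0.85 per node-hour for the 2 TB XL node (an assumption; check current AWS pricing): $0.85 × 8,760 hours ≈ $7,446 per node per year, and $7,446 ÷ 2 TB ≈ $3,723/TB/year, which matches the figure above.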
On-premise Hadoop is definitely more expensive: about $5,040/TB/year including maintenance and everything.
4) Ease of Use: Redshift has automated tasks for data warehouse administration and automatic backups to Amazon S3. Transitioning to Redshift should be easy. Whereas for Hadoop, your system administrators will need to learn the Hadoop architecture and tools, and your developers will need to learn coding in Pig or MapReduce.
5) Data Format:
Redshift only accepts flat text files in a fixed format such as CSV.
Redshift only supports certain data types; the serial data type, arrays, and XML are unsupported at the moment.
Redshift doesn't support multiple NULL representations in your data either.
Hadoop accepts every data format and data type imaginable.
6) Data Storage:
Redshift data can only be loaded from Amazon S3 or DynamoDB, so you'll need to spend extra time preparing and uploading the data.
Redshift loads data via a single thread by default, so it can take some time to load.
Amazon suggests S3 best practices to speed up the process, such as splitting the data into multiple files, compressing them, using a manifest file, etc. Moving the data to DynamoDB is of course a bigger headache, unless it's already there.
Life is more flexible with Hadoop. You can store data on local drives, in a relational database, or in the cloud (S3 included), and then import it straight into the Hadoop cluster.
As Airbnb concluded in their benchmark: "We don't think Redshift is a replacement of the Hadoop family due to its limitations, but rather it is a very good complement to Hadoop for interactive analytics."
Clarifications from Amazon regarding the Security Concerns:
i. Amazon Redshift starts with the security foundation underlying all AWS services. There are physical controls for datacenter access.
ii. Machines are wiped before they are provisioned.
iii. Physical media is destroyed before leaving our facilities.
iv. Drivers, BIOS, and NICs are protected.
v. Access by our staff is monitored, logged, and audited.
vi. Multiple firewalls, known as security groups, are configured automatically to control access to customer data warehouse clusters.
vii. Customers can explicitly set up ingress and egress rules or place their SQL endpoint inside their own VPC, isolating it from the rest of the AWS cloud. For multi-node clusters, the nodes storing customer data are isolated in their own security group, preventing direct access by the customer or the rest of the AWS network.
viii. Customers can require SSL for their own accesses to their cluster, while the AWS operations to monitor and manage their cluster are always secured by SSL.
ix. By using unique keys per block and per cluster, Amazon Redshift dramatically reduces the cost of key rotation and helps prevent unauthorized splicing of data from different blocks or clusters.
Gray Areas (Amazon Redshift)
a) Performance
i. No benchmarking statistics are available for data upload.
ii. BI/query benchmarks may not guarantee the same results with other BI tools.
b) Data Uploading/Integration
iii. Parallel operations are available for data loads from S3 to Redshift, but not from the customer source to S3.
iv. No proven customer use cases w.r.t. E2E implementation.
c) Cost
v. Cost details for third-party BI tools over the cloud.
vi. Data retention/storage costs (beyond provisioned storage).