Wednesday, April 2, 2014

Amazon Redshift Experience

The purpose is to analyse a big SQL table. The table I am going to query contains 90 million rows, each with 17 columns. The raw data sits in 20K+ CSV files and takes about 13GB of disk space.

I checked online benchmarks comparing Redshift with several well-known analytic frameworks. All the comparison results suggest Redshift is my best choice, and performance improves as the number of cluster nodes increases.

Important:
When loading data from Amazon S3 into Redshift, the S3 bucket and the Redshift cluster MUST be in the same region, or you will get an error.


Load data from local to S3
Before uploading to S3, I split my 13GB file into 20K+ files and compressed each one with gzip (the reason: you can import gzip files into the Redshift database directly, I like this!). I wrote a Python script to gzip all the files and upload them to S3 with the boto library; a rough sketch follows the list of upload options below.

  • s3cmd
  • boto python library
  • Amazon S3 upload option
  • Amazon S3 CLI
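
Here is a minimal sketch of what my upload script does, assuming the boto 2.x library, a hypothetical local directory and a hypothetical bucket name (not the real ones I used):

import glob
import gzip
import os
import shutil

import boto
from boto.s3.key import Key

# Hypothetical names -- replace with your own directory and bucket.
LOCAL_DIR = '/data/splits'
BUCKET_NAME = 'my-redshift-load-bucket'

conn = boto.connect_s3()            # picks up AWS credentials from the environment / boto config
bucket = conn.get_bucket(BUCKET_NAME)

for path in glob.glob(os.path.join(LOCAL_DIR, '*.csv')):
    gz_path = path + '.gz'
    # Compress each split file so Redshift's COPY can load it with the GZIP option.
    with open(path, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)

    # Upload the compressed file to S3.
    key = Key(bucket)
    key.key = os.path.basename(gz_path)
    key.set_contents_from_filename(gz_path)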


Load data from S3 to Redshift
I installed PostgreSQL 8.4 as a client to communicate with the Redshift database.
The following command connects to the database: -h is the Redshift endpoint, -U is the master user name, -d is the database name and -p is the port number.
$ psql -h ... -U ... -d ... -p ...

An alternative is SQL Workbench/J, which did not work on my Ubuntu 12.04.

I wrote another Python script to load the data from S3 into Redshift. One thing you need to know: there is no way to supply the password on the command line;
for example, it is not possible to pass a password to the command "psql -h ... -U ... -d ... -p ...".
I solved this problem in a hacky way:

$ export PGPASSWORD=yourpassword
$ psql -h ... -U ... -d ... -p ... -c "copy ....delimiter... gzip;"

Then loop over the COPY commands to transfer the files; psql will not prompt for the password anymore.
-c means run the quoted string as a SQL command. If you import a CSV file whose field separator is a comma, you have to provide "delimiter ','"; if the CSV file is gzip-compressed, you have to add GZIP at the end. You also need to provide the access key id and secret key in your COPY command. A sketch of the loading loop follows.
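
The loading loop looked roughly like the sketch below. It is a minimal example, not my exact script: the endpoint, database, table, bucket, key names and credentials are all hypothetical placeholders.

import os
import subprocess

# Hypothetical values -- substitute your own endpoint, database, table and keys.
HOST = 'mycluster.abc123.us-east-1.redshift.amazonaws.com'
PORT = '5439'
DATABASE = 'mydb'
USER = 'master'
TABLE = 'my_table'
ACCESS_KEY = 'AKIA...'
SECRET_KEY = '...'

env = dict(os.environ)
env['PGPASSWORD'] = 'yourpassword'   # psql reads this, so it will not prompt

# One COPY per uploaded file; in my script the key names came from listing the bucket.
keys = ['part-00001.csv.gz', 'part-00002.csv.gz']

for key in keys:
    copy_sql = (
        "COPY %s FROM 's3://my-redshift-load-bucket/%s' "
        "CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' "
        "DELIMITER ',' GZIP;" % (TABLE, key, ACCESS_KEY, SECRET_KEY)
    )
    subprocess.check_call(
        ['psql', '-h', HOST, '-U', USER, '-d', DATABASE, '-p', PORT, '-c', copy_sql],
        env=env,
    )

Note that COPY can also take an S3 key prefix and load all matching files in one command, which can be simpler (and faster, since Redshift loads them in parallel) than looping file by file.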

Performance:
1. At the beginning, on a single-node cluster (dw2.large), the queries were:
select count(*) from my_table;
select count(distinct columnX) from my_table;
select count(distinct columnY) from my_table where data_date like '2013-%';
They took 5 seconds, 18 seconds and 1 minute 55 seconds respectively.

2. After resizing to a 2-node cluster (both dw2.large), the third query dropped to 46 seconds.

Amazon Redshift provides a UI to monitor the cluster; you can easily check status, performance, query history and load history.


2 comments:

  1. Hi Haifeng, could you share your dataset?
    It could be interesting to perform some cross-test comparison.

    Replies
    1. Thanks for your message. Unfortunately I cannot provide the dataset; it belongs to my employer. Why not create some dummy data that better fits your needs :) In addition, I also tested MariaDB Galera Cluster and it works very well: with the same data I fed to Redshift, the query response time is slightly slower but still acceptable :)
