
Setting up Your Analytics Stack with Jupyter Notebook & AWS Redshift

 

From querying your data and visualizing it all in one place, to documenting your work and building interactive charts and dashboards, to running machine learning algorithms on top of your data and sharing the results with your team, there are very few limits to what one can do with the Jupyter + Redshift stack. However, setting everything up and resolving all the package dependencies can be a painful experience.

In this blog post I will walk you through the exact steps needed to set up Jupyter Notebook to connect to your private data warehouse in AWS Redshift.

 

What is Jupyter Notebook?


Jupyter Notebook is an open-source data science tool used by many data scientists and data analysts at some of the most data-driven organizations in the world, including Google, Microsoft, IBM, Bloomberg, O’Reilly and NASA.

An extension of the IPython project, Jupyter Notebook is an application that runs directly in your browser and allows you to create and share documents containing live code in over 40 different languages. You can also supplement these documents with explanatory text, equations, and rich visualizations.

Working with Jupyter is much like writing in a lab notebook: you can document your work, make it accessible to other people, and easily reproduce it in the future. Once the notebook is set up, you can readily share it with others. For example, GitHub natively renders notebook files as soon as they are committed to a repository. Alternatively, you can export your notebook into a number of different formats, including PDF, HTML, Markdown, and reST.

Thanks to its open-source nature, big data integration capabilities (e.g., Apache Spark via Python, R, or Scala), and the ability to execute code in virtually any language of your choice, Jupyter Notebook is particularly well-suited for exploring data, experimenting with it, and capturing entire data-driven workflows (see also our Data Science Roundup #11 discussing the recent rise of notebooks).

Installing Jupyter and Python

 

If you have never worked with Jupyter before or if you are not sure whether it is installed on your machine, please take a look at this brief installation article from the developers of Jupyter. Several Python distributions (e.g., Anaconda) conveniently ship with both Python and Jupyter.

What is AWS Redshift?

 

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse. With Redshift, it has never been easier or more cost-effective to securely store your data in the cloud. On-demand pricing means that you can get started for as little as 25 cents an hour and you don’t have to worry about backup, durability, availability, security, monitoring, or maintenance: AWS takes care of all of that for you.

Because of its ease of use, Redshift is experiencing rapid adoption among light quants, growth hackers, seasoned analysts, and data scientists. It integrates seamlessly with the languages you may already be familiar with, including SQL, R, and Python. For the purposes of this post, I will focus on Python.

Setting Up Your AWS Redshift Instance

 

Setting up your Redshift instance is easy. AWS provides an excellent Getting Started with Amazon Redshift guide and the whole process takes only a few minutes.

Once you have your Redshift instance up and running, I would encourage you to check out RJMetrics Pipeline. Pipeline will connect to all of the disparate data sources you might have, including databases and SaaS products, and stream your data into one single place in Redshift. With Pipeline, you can start analyzing all of your data literally in minutes and not pay a penny if you stream less than 5 million rows/events. Here is a four-minute tutorial that will walk you through setting up your data warehouse to work with Pipeline.

1. Install Requirements

 

We will need to install a few packages in order to connect to and work with AWS Redshift. Note that depending on your Python distribution, you may need to install additional packages. I recommend the completely free Anaconda distribution, as it already comes with many of the required packages installed, and those packages are tested and guaranteed to work together. If you are working with Anaconda, you can use its built-in package management system, conda. Alternatively, you can use Python's pip package management system or Homebrew (OS X only).

 

I will use a combination of Homebrew and pip in the steps that follow. Note that you may need to use pip3 if you are working with Python 3.3 or later.

 

  1. Install PostgreSQL

    First, since Amazon Redshift is based on PostgreSQL 8.0.2, we will need a PostgreSQL client library.

    If using Mac OS X, simply open up your terminal and type brew install postgresql

    For other operating systems, please see the installation instructions here.

  2. Install Python Requirements

    Second, we will need a few Python packages to connect to and get data from Redshift. You can simply download the redshift_requirements.txt file, navigate to the folder containing it, and type pip install -r redshift_requirements.txt (a sketch of the file's contents appears after the package list below).

    (Pro tip: you may want to create a virtual Python environment to keep the package dependencies required by different projects in separate locations. See the Python and conda documentation for more details.)

    Here is a brief description of the four packages included in redshift_requirements.txt:

    1. Psycopg2: PostgreSQL database adapter for Python, needed to connect to Redshift.

    2. SQLAlchemy: Python SQL toolkit and Object Relational Mapper that provides an efficient way to access a database.

    3. ipython-sql: provides a straightforward way to write SQL and get data back. Introduces the magic commands %sql and %%sql so that you can write plain SQL and get back your query results in the form of a dataframe.

    4. SimpleJSON: simple and fast JSON encoder and decoder. We will store Redshift credentials in a JSON-formatted file and will make use of the package to read that file.
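
    For reference, a bare-bones redshift_requirements.txt covering these four packages could look like the sketch below; the downloadable file may pin specific versions, so treat this only as an illustration:

        psycopg2
        sqlalchemy
        ipython-sql
        simplejson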

 

2. Save your Redshift Credentials

 

To access your Redshift instance, you will need to know your (i) user name, (ii) password, (iii) host name, (iv) port and (v) database name. If you personally set up your Redshift instance, you will already know what these are. If not, please ask someone who did or take a look at the AWS documentation on how to obtain these credentials. Note that the default port number for AWS Redshift is 5439.

 

So that you don’t expose any of the sensitive information above in your Notebook (e.g., when sharing your notebook with other people or when storing your notebook in a GitHub repository), please save your Redshift credentials inside a JSON-formatted file called redshift_creds.json.nogit. You can download the template for that file here. The contents of the file are pretty self-explanatory:


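The linked template is the authoritative version; as a rough sketch (the field names below are my own placeholders, so match them to whatever the downloaded template actually uses), the file simply holds the five pieces of information listed above:

    {
        "user_name": "your_user_name",
        "password": "your_password",
        "host_name": "your-cluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com",
        "port": 5439,
        "db_name": "your_database_name"
    }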

 

I would suggest saving the redshift_creds.json.nogit file in a folder that is separate from the one containing your Jupyter notebook so that you don’t accidentally store the credentials on GitHub. For extra security, you can specify the *.nogit pattern in the .gitignore file of your repository.

3. Whitelist your IP address in the AWS Panel

 

By default, your Redshift cluster is locked down so that nobody has access to it. You can grant people inbound access to your cluster via the AWS panel by creating a security group and specifying which IP addresses are allowed to connect to Redshift. Make sure that your IP address is included in one of these rules. Please see the Amazon Redshift Cluster Security Groups documentation from AWS for more details.

 

4. Connect to Redshift!

 

You are now all set to connect to your Redshift instance!

For further instructions and a copy of the Jupyter notebook that you can run to connect to Redshift, execute a query and convert the query result into a pandas dataframe, please follow this link.
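
If you just want a sense of what that notebook boils down to, here is a minimal sketch of the connection code. It assumes the placeholder credential field names from the sketch in step 2, a made-up path to the credentials file, and a hypothetical table called my_table; the notebook behind the link above remains the definitive version.

    import simplejson as json
    import pandas as pd
    from sqlalchemy import create_engine

    # Read the credentials saved in step 2 (adjust the path and the
    # field names to match your redshift_creds.json.nogit file).
    with open("/path/to/redshift_creds.json.nogit") as f:
        creds = json.load(f)

    # Redshift speaks the PostgreSQL protocol, so we connect through
    # the psycopg2 driver via SQLAlchemy.
    connect_string = "postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}".format(
        user=creds["user_name"],
        password=creds["password"],
        host=creds["host_name"],
        port=creds["port"],
        dbname=creds["db_name"],
    )
    engine = create_engine(connect_string)

    # Run a query and pull the result back as a pandas dataframe.
    df = pd.read_sql_query("SELECT * FROM my_table LIMIT 10;", engine)
    print(df.head())

Inside the notebook you can also lean on the ipython-sql package installed earlier: load it with %load_ext sql, point it at the same connection string with %sql, and then write plain SQL directly in %sql or %%sql cells.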