
Getting started with PySpark development using Jupyter on a Big Data cluster

It is no secret that data science tools like Jupyter, Apache Zeppelin, and the more recently launched Cloud Datalab and JupyterLab are must-know tools for day-to-day work. So how can the ease of developing models be combined with the computing power of a Big Data cluster? In this article I will share some very simple steps to start using Jupyter notebooks for PySpark on a Dataproc cluster in GCP.


Final goal


Figure: Jupyter notebooks running PySpark on a Spark cluster

Prerequisites

1. Have a Google Cloud account (just sign in with your Gmail account and you automatically get $300 of credit for one year) [1]

2. Create a new project with your favorite name



Steps

  1. To make the deployment easier, I’m going to use a beta feature that can only be applied when creating a Dataproc cluster through Google Cloud Shell. For our cluster, we need to define many settings, such as the number of workers, the master’s high availability, and the amount of RAM and hard drive space. To make this easier, I recommend simulating the creation of the cluster through the UI. First, we need to enable the Dataproc API (figures 1 and 2; a command-line alternative is sketched after them).

Figure 1 Enable Dataproc API I


Figure 2 Enable Dataproc API II
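
If you prefer the command line, the API can also be enabled directly from Cloud Shell; a minimal sketch, assuming your Cloud Shell session is already pointed at your project:

gcloud services enable dataproc.googleapis.com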

2. Get the equivalent command line by simulating the creation process with your own cluster’s size. I’m going to set basic specs:

  • Region: global
  • Cluster mode: Standard
  • Master node: 2 vCPUs, 7.5 GB memory, and 300 GB disk size
  • Worker nodes: 2 vCPUs, 7.5 GB memory, and 200 GB disk size

Simulate creating a cluster through UI


Basic specs

Important: You should click Advanced options and change the image to 1.3 Debian 9 to make the beta parameters work.

To access, click Advanced options


Change to 1.3 Debian 9

3. Get the equivalent command line


Click on command line


Copy the gcloud command

4. Close the simulation and click Activate Cloud Shell


Activate Cloud Shell

5. Modify your command by adding the following flag and run it (this can take several minutes):

--optional-components=ANACONDA,JUPYTER

Change

gcloud dataproc clusters to gcloud beta dataproc clusters

Run

gcloud beta dataproc clusters create cluster-jupyter --subnet default --zone europe-west1-d --master-machine-type n1-standard-2 --master-boot-disk-size 300 --num-workers 2 --worker-machine-type n1-standard-2 --worker-boot-disk-size 200 --optional-components=ANACONDA,JUPYTER --image-version 1.3-deb9 --project jupyter-cluster-223203


running in shell


cluster created
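
You can also confirm from Cloud Shell that the cluster came up; a quick sketch, assuming the global region chosen above:

gcloud dataproc clusters list --region global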

6. Allow incoming traffic to the Jupyter port: search for the firewall rules on the landing page and create a rule.


search Firewall rules in VPC network


click on create a rule

7. Define the firewall rule to open port 8123 and save it (a command-line sketch follows the figures below).


parameters


Rule working
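
If you prefer Cloud Shell over the UI, an equivalent rule can be created with one command. This is only a sketch: the rule name allow-jupyter is mine, and it assumes the cluster runs on the default network with Jupyter listening on its default port 8123:

gcloud compute firewall-rules create allow-jupyter --network default --allow tcp:8123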

8. Enter your Jupyter notebook! You need the master’s external IP plus the Jupyter default port, e.g. http://30.195.xxx.xx:8123 (see the sketch below for a way to look the IP up).


get master’s IP
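
The master’s external IP can also be fetched from Cloud Shell; a sketch, assuming the cluster name and zone from step 5 (Dataproc names the master node <cluster-name>-m):

gcloud compute instances describe cluster-jupyter-m --zone europe-west1-d --format='value(networkInterfaces[0].accessConfigs[0].natIP)'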

9. Let’s create our first PySpark notebook


create the first PySpark notebook

10. Validate that it is running well; a minimal check is sketched below.
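
As a quick sanity check, you can run a cell like the one below. It is only a sketch, and it assumes the Dataproc PySpark kernel exposes a ready-made SparkContext as sc (as the stock PySpark notebook does):

# distribute the numbers 0..999 across the workers and sum them
rdd = sc.parallelize(range(1000))
print(rdd.sum())    # should print 499500
print(sc.master)    # should report yarn, i.e. we are on a real cluster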



Bonus: Check Spark UI

  • To access the Spark UI you need to add another firewall rule, like in step 7. Open ports 8088, 4040, 4041, and 9870 (a Cloud Shell sketch follows the figure below).

Create Spark UI rule
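
Again, a hedged Cloud Shell equivalent (the rule name allow-spark-ui is mine; it assumes the default network as before):

gcloud compute firewall-rules create allow-spark-ui --network default --allow tcp:8088,tcp:4040,tcp:4041,tcp:9870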

  • Click on the Spark UI link obtained in our first notebook. You will get an ERR_NAME_NOT_RESOLVED error; just replace the hostname in the URL with the master’s IP,
e.g. http://3x.xxx.xx.x:8088/proxy/application_1542773664669_0001

Spark UI

Conclusion

In this article, I showed how to deploy Jupyter on a Dataproc cluster, making it friendlier to use PySpark on a real cluster. Please feel free to reach out if you have questions or suggestions for future articles.

See you in the next article! Happy Learning!
