It's no secret that Data Science tools like Jupyter, Apache Zeppelin, and the more recently launched Cloud Datalab and JupyterLab are must-knows for day-to-day work. So how can we combine the ease of developing models in a notebook with the computing power of a Big Data cluster? In this article I will share the very simple steps to start using Jupyter notebooks for PySpark on a Dataproc cluster in GCP.
1. Have a Google Cloud account (just log in with your Gmail account and you automatically get $300 of credit for one year)
2. Create a new project with your favorite name
3. Simulate the creation process in the Cloud Console with your own cluster size. I'm going to set basic specs:
Important: You should click Advanced options and change the image to 1.3 (Debian 9) so that the beta parameters work.
4. Get the equivalent command line
5. Close the simulation and click Activate Cloud Shell
6. Modify your command, replacing gcloud dataproc clusters with gcloud beta dataproc clusters, and run it (remember to use your own project ID). It can take several minutes to finish.
gcloud beta dataproc clusters create cluster-jupyter --subnet default --zone europe-west1-d --master-machine-type n1-standard-2 --master-boot-disk-size 300 --num-workers 2 --worker-machine-type n1-standard-2 --worker-boot-disk-size 200 --optional-components=ANACONDA,JUPYTER --image-version 1.3-deb9 --project jupyter-cluster-223203
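Once the command finishes, you can confirm from Cloud Shell that the cluster is running before moving on (the region below is an assumption based on the europe-west1-d zone used above):

gcloud dataproc clusters list --region europe-west1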
7. Allow incoming traffic on the Jupyter port: search for the firewall rules in the Console landing page and create a new rule.
8. Define the firewall rule opening port 8123 and save it.
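If you prefer creating the rule from Cloud Shell instead of the Console, a command along these lines should work; the rule name allow-jupyter and the wide-open source range are my own illustrative choices, so adapt them to your security needs:

gcloud compute firewall-rules create allow-jupyter --allow=tcp:8123 --source-ranges=0.0.0.0/0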
9. Open your Jupyter notebook! (You need your master node's external IP and the default Jupyter port, e.g. http://30.195.xxx.xx:8123 )
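If you don't know the master's external IP, you can find it in the Compute Engine VM instances page, or query it from Cloud Shell (assuming the master VM follows the default <cluster-name>-m naming):

gcloud compute instances describe cluster-jupyter-m --zone europe-west1-d --format='get(networkInterfaces[0].accessConfigs[0].natIP)'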
10. Let's create our first PySpark notebook
11. Validate that it is running correctly
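As a quick sanity check, a tiny PySpark job like the one below should run in the notebook. The Dataproc Jupyter kernel normally gives you a Spark session already, but I create it explicitly here just in case, so this sketch is self-contained:

# Minimal PySpark sanity check: parallelize a small range and count it
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jupyter-test").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000))
print(rdd.count())    # should print 1000
print(spark.version)  # confirms the Spark version the cluster runs

If the count comes back and the version matches the cluster image, Jupyter is talking to Spark correctly.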
Bonus: Check Spark UI
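On a default Dataproc image the Spark UI is typically reached through the YARN ResourceManager on port 8088 of the master node (this port is an assumption based on YARN defaults, not something configured in this tutorial). Open that port the same way you opened 8123, then browse to e.g. http://30.195.xxx.xx:8088 and click your running application to reach the Spark UI.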
In this article I showed how to deploy Jupyter on a Dataproc cluster, making it friendlier to use PySpark on a real cluster. Please feel free to reach out if you have questions or suggestions for the next articles.
See you in the next article! Happy Learning!