This article was written by Manish Saraswat.
Source for picture: click here
This article is meant to help R users enhance their set of skills and learn Python for data science (from scratch). After all, R and Python are the most important programming languages a data scientist must know.
Python is a supremely powerful and a multi-purpose programming language. It has grown phenomenally in the last few years. It is used for web development, game development, and now data analysis / machine learning. Data analysis and machine learning is a relatively new branch in python.
For a beginner in data science, learning python for data analysis can be really painful. Why ? You try Googling "learn python," and you'll get tons of tutorials only meant for learning python for web development. How can you find a way then ?
In this tutorial, we'll be exploring the basics of python for performing data manipulation tasks. Alongside, we'll also look how you do it in R. This parallel comparison will help you relate the set of tasks you do in R to how you do it in python! And in the end, we'll take up a data set and practice our newly acquired python skills.
Note: This article is best suited for people who have a basic knowledge of R language.
Table of Contents:
1. Why learn Python (even if you already know R)
No doubt, R is tremendously great at what it does. In fact, it was originally designed for doing statistical computing and manipulations. Its incredible community support allows a beginner to learn R quickly.
But, python is catching up fast. Established companies and startups have embraced python at a much larger scale compared to R.
According to indeed.com (from Jan 2016 to November 2016), the number of job postings seeking "machine learning python" increased much faster (approx. 123%) than "machine learning in R" jobs. Do you know why ? It is because
2. Understanding Data Types and Structures in Python vs. R:
These programming languages understand the complexity of a data set based on its variables and data types. Yes! Let's say you have a data set with one million rows and 50 columns. How would these programming languages understand the data ?
Basically, both R and Python have pre-defined data types. The dependent and independent variables get classified among these data types. And, based on the data type, the interpreter allots memory for use. Python supports the following data types:
bit64package to read hexadecimal values.
factortype or a
charactertype. There exists a tiny difference between Boolean values in R and python. In R, Boolean are stored as TRUE and FALSE. In python, they are stored as True and False. There's a difference in the letter case.
Since R is a statistical computing language, all the functions to manipulate data and reading variables are available inherently. On the other hand, python hails all the data analysis / manipulation / visualization functions from external libraries. Python has several libraries for data manipulation and machine learning. The most important ones are:
In a way, python for a data scientist is largely about mastering the libraries stated above. However, there are many more advanced libraries which people have started using. Therefore, for practical purposes you should remember the following things:
list. It can be multidimensional. It can contain data of the same or multiple classes. In case of multiple classes, the coercion effect takes place.
data.frameand python uses the
Dataframefunction from the pandas library.
matrixfunction. In python, we use the
Until here, I hope you've understood the basics of data types and data structures in R and Python. Now, let's start working with them!