Who should read this blog: someone who is new to linear regression, or someone who wants to understand the jargon around it.

Code repository: https://github.com/DhruvilKarani/Linear-Regression-Experiments

Linear regression is generally the first step in anyone's data science journey. When you hear the words Linear and Regression, a model of this form comes to mind:

Y' = W1*X1 + W2*X2 + ... + Wn*Xn

X1, X2, ..., Xn are the independent variables, or features. W1, W2, ..., Wn are the weights, learned by the model from the data. Y' is the model's prediction. For a set of, say, 1000 points, we have a table with 1000 rows, n columns for the Xs, and 1 column for Y. Our model learns the weights W from these 1000 points so that it can predict the dependent variable Y for an unseen point (a point for which the Xs are available but not Y).

For these seen points (1000 in this case), the actual value of the dependent variable Y and the model's prediction Y' are related as:

Y = Y' + e

Epsilon (denoted by e henceforth) is the residual error. It captures the difference between the model's prediction and the actual value of Y. One thing to remember about e is that it is assumed to follow a normal distribution with zero mean.

The weights are learned using OLS, or Ordinary Least Squares, fitting. That is, the cumulative squared error, defined as

SSE = Σ (Yi − Yi')²

is minimized. Why do we square the errors before adding them? Two reasons:

1. If the residual error for one point is −1 and for another it is 1, merely adding them gives 0 error, which would suggest the line fits the points perfectly. This is not true.
2. Squaring gives more importance to larger errors and less to smaller ones. Intuitively, the model weights update quickly to reduce larger errors more than smaller ones.

Quite often, in the excitement of learning new and advanced models, we do not fully explore this one. In this blog, we'll look at how to analyze and diagnose a linear regression model, as intuitively as possible. Let's talk about the model.
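The pieces above, the weights, the prediction Y', and the squared-error objective, can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not code from the linked repository:

```python
import numpy as np

# Toy data: one feature, known relationship Y = 2X + 1 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 100)
Y = 2 * X + 1 + rng.normal(0, 0.1, 100)

# OLS closed form for a single feature with an intercept:
# stack a column of ones so the intercept is learned as a weight.
A = np.column_stack([X, np.ones_like(X)])
w, *_ = np.linalg.lstsq(A, Y, rcond=None)
slope, intercept = w

Y_pred = A @ w           # the model's predictions Y'
residuals = Y - Y_pred   # the residual errors e

print(round(slope, 2), round(intercept, 2))  # close to the true 2 and 1
print(residuals.mean())                      # ~0: residuals average out
```

With an intercept in the model, OLS guarantees the residuals sum to zero, which is why e has zero mean on the training data.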
There are three main assumptions linear regression makes:

1. The independent variables have a linear relationship with the dependent variable.
2. The variance of the dependent variable is uniform across all combinations of Xs.
3. The error term e associated with Y and Y' is independent and identically distributed.

Do not worry if you don't fully understand the lines above. I am yet to simplify them.

Linear Relationship

This seems quite intuitive. If the independent variables do not have a linear relationship with the dependent variable, there's no point modeling them using LINEAR regression. So what should one do if that is not the case? Consider a dataset with one independent variable X and dependent variable Y, where Y varies as the square of X (some random noise is added to Y to make it look more realistic). If we fit and plot a linear regression line, we can see that it isn't a good fit. The MSE (mean squared error) is 0.0102.

What we do here is transform X such that Y and this transformed X follow a linear relationship. Y and X might not have a linear relation, but Y and X² do. Next, we build the model on the transformed feature, generate the predictions, and reverse the transformation. The MSE here is 0.0053, almost half the previous one. Isn't it evident which one fits better? I hope it is now clearer why linear relationships are needed. Let's move on to the next assumption.

The variance of the dependent variable is uniform across all combinations of Xs

Formally speaking, we need something called homoscedasticity. In simple terms, it means that the residuals must have constant variance. Let's visualize this; later I'll explain why it is essential. If you look carefully, the variance among the Y values increases from left to right like a trumpet: the Y values for lower Xs do not vary much around the regression line, unlike the ones to the right.
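As a sketch of the transformation idea (synthetic data, not the blog's own dataset, so the MSE values will differ from the 0.0102 and 0.0053 quoted above), we can fit Y against X and against X² and compare the errors:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 200)
Y = X**2 + rng.normal(0, 0.05, 200)  # Y is quadratic in X, plus noise

def fit_mse(features, y):
    # OLS fit with an intercept; returns the mean squared error.
    A = np.column_stack([features, np.ones_like(features)])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ w) ** 2)

mse_linear = fit_mse(X, Y)          # fit Y on X directly: poor fit
mse_transformed = fit_mse(X**2, Y)  # fit Y on the transformed X
print(mse_transformed < mse_linear)  # True: the transform helps
```

The model is still linear in its weights; only the feature was transformed, which is why ordinary OLS machinery still applies.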
We call this heteroscedasticity, and it is something you want to avoid. Why? If there is a pattern among the residuals, as in the plot above, it generally means that the model is too simple for the data: it is unable to capture all the patterns present. Another reason to avoid heteroscedasticity is that it biases the results of significance tests, which we'll look at in detail later. When we achieve homoscedasticity, the residuals are entirely random, and one can hardly see any pattern.

Now comes the last and final assumption.

The error term e associated with Y and Y' is independent and identically distributed

Sounds similar to the previous one? It kind of does, but it is a little different. Previously, our residuals had growing variance but were still independent: one residual did not have anything to do with another. Here, we analyze what happens if one residual error has some dependency on another. Consider data generated by Y = X + noise (a random number), where the noise accumulates over different values of X: the noise for an X is a random number plus the noise of the previous point. We deliberately introduce this additive noise for the sake of our experiment.

A linear fit seems a good choice, but the residual errors do not look entirely random. Is there some metric we can compute to validate this claim? It turns out we can calculate something called autocorrelation. We know that correlation measures the degree of linear relationship between two variables, say A and B. Autocorrelation measures the correlation of a variable with itself: for example, how a particular value of A correlates with the value of A some t steps back.
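One rough numerical check for the trumpet shape, meant as a sanity check rather than a formal test (a formal alternative would be something like the Breusch-Pagan test), is to compare the residual spread on the two halves of the X range. This is a sketch on synthetic heteroscedastic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(0.1, 1.0, 500)
# Noise whose standard deviation grows with X: the "trumpet" shape.
Y = 2 * X + rng.normal(0, 0.3 * X)

# Fit a line and inspect the residuals.
A = np.column_stack([X, np.ones_like(X)])
w, *_ = np.linalg.lstsq(A, Y, rcond=None)
resid = Y - A @ w

# Compare residual spread on the left half vs the right half of X.
left_std = resid[: len(X) // 2].std()
right_std = resid[len(X) // 2:].std()
print(right_std > left_std)  # True: residual variance grows with X
```

Under homoscedasticity the two spreads would be roughly equal; a large gap is the numeric counterpart of the funnel you see in the residual plot.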
More on autocorrelation: https://medium.com/@karanidhruvil/time-series-analysis-3-different-ways-bb52ab1a15b2

In our example, the autocorrelation turns out to be 0.945, which indicates some dependency. Now, why do we need the errors to be independent? Again, dependency among residuals means the linear model fails to capture complex patterns in the data. Such patterns frequently occur in time series data (where X is time and Y is a property that varies with time, stock prices for instance). The unaccounted patterns here could be seasonality or trends.

I hope the three assumptions are a bit clearer now. But how do we evaluate our model? Let's take a look at some metrics.

Evaluating a Model

Previously, we used MSE to measure the errors committed by the model. However, if I tell you that for some data and some model the MSE is 23.223, is this information alone enough to say something about the quality of our fit? How do we know if it's the best our model can do? We need a benchmark to evaluate our model against. Hence we have a metric called R squared (R²):

R² = 1 − MSE / TSE

We know MSE. But what is TSE, or Total Squared Error? Suppose we had no X, only Y, and we were asked to fit a line to these Y values such that the MSE is minimized. Since we have no X, our line would be of the form Y' = a, where a is a constant. If we substitute a for Y' in the MSE equation, differentiate with respect to a, and set the derivative to zero, it turns out that a = mean(Y) gives the least error. TSE is the error of this line. Think about it: the line Y' = mean(Y) is the baseline model for our data. The addition of any independent variable X can only improve our model; our model cannot be worse than this baseline. If an X didn't help improve the model, its weight or coefficient would simply be driven to 0 during MSE minimization. This baseline model provides a reference point. Now come back to R squared and look at the expression.
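Here is a minimal sketch of the accumulated-noise experiment described above, followed by a lag-1 autocorrelation of the residuals. The data is synthetic, so the value will differ from the blog's 0.945, but it shows the same strong dependency:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.arange(200, dtype=float)
noise = np.cumsum(rng.normal(0, 1, 200))  # noise accumulates over X
Y = X + noise

# Fit a line; the residuals inherit the accumulated noise.
A = np.column_stack([X, np.ones_like(X)])
w, *_ = np.linalg.lstsq(A, Y, rcond=None)
resid = Y - A @ w

# Lag-1 autocorrelation: correlation of the residual series with
# itself shifted by one step.
autocorr = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(round(autocorr, 3))  # close to 1 for accumulated noise
```

For truly independent residuals this value would hover near 0; values near 1 are the signature of the unmodeled trend or seasonality the text describes.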
If our model, with all the Xs and Ys, produces an error equal to that of the baseline model (MSE = TSE), then R squared = 1 − 1 = 0. This is the worst case. At the other extreme, if MSE = 0, R squared = 1, which is the best-case scenario.

Now let's take a step back and think about what happens when we add more independent variables to our data. How would the model respond? Suppose we are trying to predict house prices. If we add the area of the house as an independent variable, our R squared could increase; obviously, the variable does affect house prices. Now suppose we add a garbage variable, say random numbers. Can our R squared increase? Can it decrease?

If this garbage variable happens to help minimize the MSE, its weight or coefficient will be non-zero. If it doesn't, the weight will be zero and we get back the previous model. We can conclude that adding a new independent variable at worst does nothing: it won't degrade the model's R squared. So if I keep adding new variables, I should get a better and better R squared. And I will. However, it doesn't make sense, because those features aren't reliable. If that set of random numbers were some other set of random numbers, our weights would change; it is all up to chance. Remember that we build the model on a sample of data points, and it needs to be robust to new points outside the sample.

That's why we introduce something called adjusted R squared, which penalizes any addition of independent variables that does not significantly improve the model. You usually use this metric to compare models after the addition of new features:

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)

where n is the number of points and k is the number of independent variables. If you add features without a significant increase in R squared, the adjusted R squared decreases.

So now we know something about linear regression. We dive deeper in the second part of the blog, where we look at regularization and the assessment of coefficients. Read more here.
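The garbage-feature experiment can be sketched directly with the two formulas above. The data is synthetic and the exact values depend on the random draw, but R squared itself can never decrease when a feature is added:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = rng.uniform(0, 1, n)
Y = 3 * X + rng.normal(0, 0.1, n)
garbage = rng.normal(size=n)  # a feature with no real signal

def r2_adjusted(features, y):
    # OLS fit with intercept, then R squared and adjusted R squared.
    A = np.column_stack([features, np.ones(len(y))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = np.mean((y - A @ w) ** 2)
    tse = np.mean((y - y.mean()) ** 2)  # baseline model Y' = mean(Y)
    r2 = 1 - mse / tse
    k = features.shape[1] if features.ndim > 1 else 1
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

r2_one, adj_one = r2_adjusted(X, Y)
r2_two, adj_two = r2_adjusted(np.column_stack([X, garbage]), Y)

# R squared never decreases with an added feature;
# adjusted R squared can, which is the whole point of the penalty.
print(round(r2_one, 3), round(r2_two, 3))
print(round(adj_one, 3), round(adj_two, 3))
```

Note that adjusted R squared is always at or below plain R squared, since the (n − 1)/(n − k − 1) factor inflates the unexplained fraction.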

In addition to being called the sexiest job of the twenty-first century, Data Science has been described by Andrew Ng as the new electricity. A lot of professionals from various disciplines and domains are looking to make a transition into analytics and use Data Science to solve problems across multiple channels. Being an interdisciplinary field, it lets one mine data for various operations and help decision-makers draw relevant conclusions to achieve sustainable growth.

The field of Data Science comprises various components such as Data Analysis, Machine Learning, Deep Learning, Business Intelligence, and so on. The application differs according to the business needs and workflow. In a corporate firm, a Data Science project is always staffed by people with diverse skillsets, as its various details need to be taken care of by different people.

Now the question may arise: what is Data Science? Data Science is a way to use several tools and techniques to mine relevant data for a business, derive insights, and take appropriate decisions. Analytics can be divided into descriptive and predictive analytics. Descriptive analytics deals with cleaning, munging, wrangling, and presenting the data in the form of charts and graphs to the stakeholders, while predictive analytics is about building robust models that predict future scenarios.

In this blog, we talk about exploratory data analysis, which is in one sense a descriptive analysis process and one of the most important parts of a Data Science project. Before you start building models, your data should be accurate, with no anomalies, duplicates, missing values, and so on. It should also be properly analysed to find the relevant features that would make the best prediction.

Exploratory Data Analysis in Python

Python is one of the most flexible programming languages, with a plethora of uses.
There is a debate between Python and R as to which one is best for Data Science. In my opinion, there is no single right language; it completely depends on the individual. I personally prefer Python because of its ease of use and broad range of features. Certainly, for exploring data, Python provides a lot of intuitive libraries to work with and analyse the data from all directions.

To perform exploratory data analysis, we will use a house pricing dataset, which is a regression problem. The dataset can be downloaded from here. Below is a description of the columns in the data.

SalePrice – The target variable we need to predict based on the rest of the features; the price of each property in dollars.
MSSubClass – The class of the building.
MSZoning – The zoning classification.
LotFrontage – Linear feet of street connected to the property.
LotArea – The size of the lot in square feet.
Street – The road access type.
Alley – The alley access type.
LotShape – The general shape of the property.
LandContour – The flatness of the property.
Utilities – The type of utilities available.
LotConfig – The lot configuration.
LandSlope – The slope of the property.
Neighborhood – Physical locations within the Ames city limits.
Condition1 – Proximity to a main road or railroad.
Condition2 – Proximity to a second main road or railroad, if present.
BldgType – The dwelling type.
HouseStyle – The dwelling style.
OverallQual – The overall material and finish quality.
OverallCond – The overall condition rating.
YearBuilt – The original construction date.
YearRemodAdd – The remodel date.
RoofStyle – The roof type.
RoofMatl – The roof material.
Exterior1st – The exterior covering on the house.
Exterior2nd – The exterior covering on the house, if more than one material is present.
MasVnrType – The masonry veneer type.
MasVnrArea – The masonry veneer area.
ExterQual – The exterior material quality.
ExterCond – The present condition of the material on the exterior.
Foundation – The foundation type.
BsmtQual – The height of the basement.
BsmtCond – The general condition of the basement.
BsmtExposure – Walkout or garden-level basement walls.
BsmtFinType1 – The quality of the basement finished area.
BsmtFinSF1 – Type 1 finished area in square feet.
BsmtFinType2 – The quality of the second finished area, if present.
BsmtFinSF2 – Type 2 finished area in square feet.
BsmtUnfSF – Unfinished basement area in square feet.
TotalBsmtSF – Total basement area in square feet.
Heating – The heating type.
HeatingQC – The heating quality and condition.
CentralAir – Central air conditioning.
Electrical – The electrical system.
1stFlrSF – First floor area in square feet.
2ndFlrSF – Second floor area in square feet.
LowQualFinSF – Low-quality finished area in square feet.
GrLivArea – Above-grade (ground) living area in square feet.
BsmtFullBath – Basement full bathrooms.
BsmtHalfBath – Basement half bathrooms.
FullBath – Full bathrooms above grade.
HalfBath – Half bathrooms above grade.
Bedroom – The number of bedrooms above basement level.
Kitchen – The number of kitchens.
KitchenQual – The kitchen quality.
TotRmsAbvGrd – Total rooms above grade, not including bathrooms.
Functional – The home functionality rating.
Fireplaces – The number of fireplaces.
FireplaceQu – The fireplace quality.
GarageType – The garage location.
GarageYrBlt – The year the garage was built.
GarageFinish – The interior finish of the garage.
GarageCars – The size of the garage in car capacity.
GarageArea – The area of the garage.
GarageQual – The garage quality.
GarageCond – The garage condition.
WoodDeckSF – The wood deck area in square feet.
OpenPorchSF – The open porch area in square feet.
EnclosedPorch – The enclosed porch area in square feet.
3SsnPorch – The three-season porch area in square feet.
ScreenPorch – The screen porch area in square feet.
PoolArea – The pool area.
PoolQC – The pool quality.
Fence – The fence quality.
MiscFeature – Miscellaneous features.
MiscVal – The value of the miscellaneous features in dollars.
MoSold – The month sold.
YrSold – The year sold.
SaleType – The type of sale.
SaleCondition – The condition of the sale.

As you can see, it is a high-dimensional dataset with a lot of variables, but not all of these columns will be used in our prediction, because the model could then suffer from multicollinearity. Below are some of the basic exploratory data analysis steps we could perform on this dataset (source: Cambridge Spark).

The libraries are imported using the following commands:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

To read the dataset, which is a CSV (comma-separated values) file, we use the read_csv function of pandas and load it into a DataFrame:

df = pd.read_csv('../input/train.csv')

The head() command displays the first five rows of the dataset. The info() command gives the number of non-null values in each column along with their datatypes. To drop irrelevant features and columns with more than 70 percent missing values, the code below keeps only the columns that are at least 30 percent populated:

df2 = df[[column for column in df if df[column].count() / len(df) >= 0.3]]
del df2['Id']
print("List of dropped columns:", end=" ")
for c in df.columns:
    if c not in df2.columns:
        print(c, end=", ")
print('\n')
df = df2

The describe() command gives a statistical summary of all the numeric features: count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum. dtypes gives the datatypes of all the columns. To find the correlation between the features, the corr() command is used.
The correlation matrix not only helps identify which feature columns are most strongly related to the target, but also helps observe and avoid multicollinearity. There are other operations as well, such as value_counts(), which gives the count of every unique value in a feature, and fillna, which we can use to fill missing values. The entire notebook is available here.

For efficient analysis of data, beyond the skills to use tools and techniques, what matters most is your intuition about the data. Understanding the problem statement is the first step of any Data Science project, followed by the necessary questions that can be formulated from it. Exploratory data analysis can be performed well only when you know what questions need to be answered, and hence can validate the relevance of the data.

I have seen professionals jumping into Machine Learning and Deep Learning and focusing on state-of-the-art models, while forgetting or skipping the most rigorous and time-consuming part, which is exploratory data analysis. Without proper EDA it is difficult to get good predictions, and your model could suffer from underfitting or overfitting. A model underfits when it is too simple and has high bias, resulting in high errors on both the training and test sets, while an overfit model has high variance and fails to generalize to unseen data. If you want to read more about data science, you can read our blogs here.
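As a small illustration of value_counts() and fillna() (a toy DataFrame standing in for the housing data; the column names are borrowed from the dataset but the values are made up):

```python
import numpy as np
import pandas as pd

# A tiny frame with a categorical column and a numeric column
# containing missing values.
df = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave", None],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0, np.nan],
})

# Count every unique value in a single column (NaN/None excluded).
counts = df["Street"].value_counts()
print(counts["Pave"])  # 3

# Fill missing numeric values with the column median.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
print(df["LotFrontage"].isna().sum())  # 0
```

Median imputation is just one common choice; for a real project the fill strategy should follow from what the column means and why values are missing.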



A Hearty Welcome to You!

I am so thrilled to welcome you to the absolutely awesome world of data science. It is an interesting subject, sometimes difficult, sometimes a struggle, but always hugely rewarding at the end of your work. While data science is not as tough as, say, quantum mechanics, it is not high-school algebra either. It requires knowledge of statistics, some mathematics (linear algebra, multivariable calculus, vector algebra, and of course discrete mathematics), operations research (linear and non-linear optimization and some more topics, including Markov processes), Python, R, Tableau, and basic analytical and logical programming skills.

Now, if you are new to data science, that last sentence might seem more like pure Greek than plain English. Don't worry about it. If you are studying the Data Science course at Dimensionless Technologies, you are in the right place. This course covers practical working knowledge of all the topics given above, distilled and extracted into a beginner-friendly form by the talented course material preparation team. This course has turned ordinary people into skilled data scientists and landed them excellent placements, so my basic message is: don't worry. You are in the right place, with the right people, at the right time.

What is Data Science?

To quote Wikipedia: Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Data science is the same concept as data mining and big data: "use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems."

More Greek again, you might say. Hence my definition: Data Science is the art of extracting critical knowledge from raw data that provides significant increases in profits for your organization.

We are surrounded by data (Google "data deluge" and you'll see what I mean). More data has been created in the last two years than in the previous 5,000 years of human existence. The companies that use all this data to gain insights into their business and optimize their processes will come out on top with the maximum profits in their market. Companies like Facebook, Amazon, Microsoft, Google, and Apple (FAMGA), and every serious IT enterprise, have realized this fact. Hence the demand for talented data scientists.

I have much more to share with you on this topic, but to keep this article short, I'll just share the links below, which you can go through in your free time (everyone's time is valuable because it is a strictly finite resource). You can refer to An Introduction to Data Science.

Article Organization

Now, as I was planning this article, a number of ideas came to my mind. I thought I could write a textbook-like reference to the field, with Python examples. But then I realized that true competence in data science doesn't come from reading an article. True competence in data science begins when you take the programming concepts you have learned, type them into a computer, and run them on your machine. And then, of course, modify the code, play with it, experiment, run single lines by themselves, and see for yourself how Python and R work. That is how you fall in love with coding in data science.

At least, that's how I fell in love with simple C coding, back in my undergraduate days in 2003. And then C++. And then Java. And then .NET. And then SQL and Oracle.
And then... and then... and so on. If you want to know, I first started working on back-propagation neural networks in 2006, long before the concept of data science came along. Back then, we called it artificial intelligence and soft computing, and my final-year project was coded by hand in Java.

Having come so far, what have I learned? That it's a vast, massive, uncharted ocean out there. The more you learn, the more you become aware of how little you know and how vast the ocean is. But we digress! To get back to my point, my final decision was to construct a beginner project, explain it inside out, and give you source code that you can experiment with, play with, enjoy running, and modify here and there, referring to the documentation and seeing what everything in the code actually does.

Kaggle – Your Home for Data Science

If you are in the data science field, this site should be on your browser bookmark bar, even in multiple folders if you have them. Kaggle is the go-to site for every serious machine learning practitioner. It holds data science competitions with massive participation, has fantastic tutorials for beginners, and offers free source code open-sourced under the Apache license (see this link for more on the Apache open source software license; don't skip reading it, because as a data scientist this is something about software products that you must know).

As I was browsing the site the other day, a kernel that was attracting a lot of attention and upvotes caught my eye. This kernel is by a professional data scientist named Fatma Kurçun from Istanbul (the funny-looking ç symbol is called c with cedilla and is pronounced with an s sound). It was quickly clear why it was so popular: it is well written, has excellent visualizations, and follows a clear, logical train of thought.
Her professionalism at her art is obvious. Since the code is open-source software released under the Apache license, I have modified it quite a lot (a diff tool reports over 100 changes) to come up with the following Python classification example. But before we dive into that, we need to know what a data science project entails and what classification means. Let's explore that next.

Classification and Data Science

Supervised classification means mapping data values to a category defined in advance. In the image above, we have a set of customers, each described by certain data values (records). One dot corresponds to one customer with around 10-20 fields.

Now, how do we ascertain which customers are likely to default on a loan and which are likely to be non-defaulters? This is an incredibly important question in finance! The word "classification" is apt here: we classify a customer into a defaulter (red dot) class (category) or a non-defaulter (green dot) class.

This problem is not solvable by standard methods: you cannot derive a closed-form solution to it with classical analysis. But with data science, we can approximate the function that models this problem and give a solution with an accuracy in the range of 90-95%.
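The idea of supervised classification described above can be sketched in a few lines of scikit-learn. The loan-default features, values, and labels here are purely hypothetical, chosen only to make the mapping from records to categories concrete:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical customer records: [income (in $1000s), outstanding debt (in $1000s)]
X = np.array([[50, 5], [20, 30], [80, 10], [15, 40], [60, 8], [25, 35]])
# Categories defined in advance: 1 = defaulter (red dot), 0 = non-defaulter (green dot)
y = np.array([0, 1, 0, 1, 0, 1])

# The model learns (approximates) the mapping from records to categories
clf = LogisticRegression().fit(X, y)

# Classify an unseen customer: high income, low debt
print(clf.predict([[70, 6]])[0])  # expected: 0 (non-defaulter)
```

With real data, of course, the records have many more fields and the boundary between classes is far less clean, which is why accuracy is measured on held-out data rather than assumed.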
Quite remarkable!

Again, we could have a whole blog article on classification alone, but to keep this one short, I'll refer you to the following excellent articles as references: Link 1 and Link 2.

Steps Involved in a Data Science Project

A data science project is typically composed of the following steps:

1. Defining the Problem
2. Collecting Data from Sources
3. Data Preprocessing
4. Feature Engineering
5. Algorithm Selection
6. Hyperparameter Tuning
7. Repeating steps 4-6 until error levels are low enough
8. Data Visualization
9. Interpretation of Results

I could explain each of these terms, but for the sake of brevity I'll ask you to refer to the following articles, including: Steps to perform data science with Python (Medium). At some point in your machine learning career, you will need to go through the article above to understand what a machine learning project entails (the bread and butter of every data scientist).

Jupyter Notebooks

(From Wikipedia)

To run the exercises in this section, we use a Jupyter notebook. Jupyter is short for Julia, Python, and R. This environment uses kernels for any of these languages and has an interactive format. It is commonly used by data science professionals and is also good for collaboration and sharing work. To know more about Jupyter notebooks, I can suggest the following article (read it when you are curious or have the time).

Data Science Libraries in Python

The scikit-learn library is the foundation of the standard Python data science stack and the library most commonly used in Python for data science. Along with numpy, pandas, matplotlib, and sometimes seaborn, this toolset is known as the standard Python data science stack. To learn more, I can direct you to the documentation for scikit-learn, which is excellent: the text is lucid and clear, and every page contains a working live example as source code.
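As a minimal illustration of this stack and the project steps working together, here is a sketch that uses a dataset bundled with scikit-learn (so it runs without any external files; the dataset choice is mine, not part of the Titanic example that follows):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 2: collect data (a bundled dataset stands in for a real source)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Steps 3-5: preprocessing is minimal here; we pick an algorithm and split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 6 onwards: fit the model, then interpret the held-out accuracy
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print(f'Held-out accuracy: {model.score(X_test, y_test):.2f}')
```

The Titanic walkthrough below follows the same skeleton, just with real-world messiness (missing values, text columns) added in.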
Refer to the following links for more: Link 1 and Link 2. This last link is like a bible for machine learning in Python. And yes, it belongs on your browser bookmarks bar. Reading and applying these concepts, and running and modifying the source code, can help you go a long way towards becoming a data scientist.

Our Problem Definition

This is the standard beginner classification problem in data science. To quote Kaggle.com:

"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy." – From Kaggle

We'll be trying to predict a person's category as a binary classification problem: survived or died after the Titanic sank. So now, we go through the popular source code, explaining every step.

Import Libraries

The lines given below:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

are standard for nearly every Python data stack problem. pandas is the data frame manipulation library. NumPy is a vectorized implementation of matrix manipulation operations, optimized to run at high speed. Matplotlib is the visualization library typically used in this context.
Seaborn is another visualization library, at a slightly higher level of abstraction than matplotlib.

The Problem Data Set

We read the CSV file:

```python
train = pd.read_csv('../input/train.csv')
```

Exploratory Data Analysis

If you've gone through the links in the 'Steps Involved in a Data Science Project' section, you'll know that real-world data is messy, has missing values, and often needs normalization to suit the requirements of the various scikit-learn algorithms. This CSV file is no different, as we see below.

Missing Data

This line uses seaborn to create a heatmap of our data set that shows the missing values:

```python
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```

Interpretation: the yellow bars indicate missing data. From the figure, we can see that about a fifth of the Age data is missing, while the Cabin column has so many missing values that we should drop it.

Graphing the survived vs. the deceased in the Titanic shipwreck:

```python
sns.set_style('whitegrid')
sns.countplot(x='Survived', data=train, palette='RdBu_r')
```

As we can see in this sample of the data (the contents of train.csv), more than 500 people lost their lives and fewer than 350 survived.

When we graph the gender ratio, this is the result:

```python
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=train, palette='RdBu_r')
```

Over 400 men died and around 100 survived, while fewer than 100 women died and around 230 survived. Clearly, there is an imbalance here, as we expect.

Data Cleaning

The missing age data can be filled in with the average age of a suitable category of the dataset, here the passenger class.
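A quick numeric complement to the missing-values heatmap is to count the gaps per column directly. This is a sketch using a small stand-in DataFrame (the column names match the Titanic data, but the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A tiny stand-in frame with the same kinds of gaps as the Titanic data
train = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Fraction of missing values per column, worst first
missing = train.isnull().mean().sort_values(ascending=False)
print(missing)
```

On the real dataset, this confirms what the heatmap shows: Cabin is missing far too often to be useful, while Age is sparse enough to impute.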
This has to be done because the classification algorithm cannot handle missing values and will error out if the data is not complete.

```python
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass', y='Age', data=train, palette='winter')
```

We use these per-class average values to impute the missing values (impute: a fancy word for filling in missing data values with values that allow the algorithm to run without distorting its performance).

```python
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)
```

Checking the missing-values heatmap again:

```python
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```

We drop the Cabin column, since it's mostly empty:

```python
train.drop('Cabin', axis=1, inplace=True)
```

We convert categorical features like Sex and Embarked to dummy variables using pandas, so that the algorithm runs properly (it requires the data to be numeric).

```python
train.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Name           889 non-null object
Sex            889 non-null object
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Ticket         889 non-null object
Fare           889 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB
```

More Data Preprocessing

We use one-hot encoding to convert the categorical attributes to numerical equivalents. One-hot encoding is yet another data preprocessing method that comes in various forms.
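The hard-coded class averages (37, 29, 24) come from reading the boxplot. A sketch that computes them directly from the data with a groupby-transform avoids the magic numbers entirely (the toy frame below stands in for the real Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic training data
train = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3],
    'Age':    [40.0, np.nan, 30.0, 28.0, 24.0, np.nan, 26.0],
})

# Fill each missing Age with the median Age of that passenger's class
train['Age'] = train['Age'].fillna(
    train.groupby('Pclass')['Age'].transform('median')
)
print(train['Age'].tolist())
```

The transform produces a Series aligned with the original index (one median per row, taken from that row's class), so fillna can use it directly. This also keeps the imputation correct if the data changes.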
For more information on it, see the link.

```python
sex = pd.get_dummies(train['Sex'], drop_first=True)
embark = pd.get_dummies(train['Embarked'], drop_first=True)
train.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)
train = pd.concat([train, sex, embark], axis=1)
```

(Note that the original text column Sex must be dropped along with Embarked, Name, and Ticket, since the classifier cannot train on string columns.)

Finally, we check the heatmap of features again:

```python
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```

No missing data, and all text converted to a numeric representation, means we can now build our classification model.

Building a Gradient Boosted Classifier Model

Gradient boosted classification trees are a type of ensemble model that performs consistently well across many dataset distributions. I could write another blog article on how they work, but for brevity I'll just provide the link here and link 2 here.

We split our data into a training set and a test set:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train.drop('Survived', axis=1), train['Survived'],
    test_size=0.10, random_state=0)
```

Training:

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train, y_train)
```

Output:

```
GradientBoostingClassifier(criterion='friedman_mse', init=None,
    learning_rate=0.1, loss='deviance', max_depth=3, max_features=None,
    max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0,
    n_estimators=100, n_iter_no_change=None, presort='auto',
    random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1,
    verbose=0, warm_start=False)
```

Predicting:

```python
predictions = model.predict(X_test)
predictions
```

Output:

```
array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0])
```

Performance

The performance of a classifier can be measured in a number of ways. Again, to keep this article short, I'll link to pages that explain the confusion matrix, the classification report function of scikit-learn, and classification evaluation in general:

Confusion Matrix

Predictive Model Evaluation – a wonderful article by one of our most talented writers. Skip to the section on the confusion matrix and classification accuracy to understand what the numbers below mean. For a more concise, mathematical, and formulaic description, read here.

```python
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, predictions))
```

Output:

```
[[89 16]
 [29 44]]
```

So as not to make this article too disjointed, let me explain at least the confusion matrix to you. With 'survived' (label 1) as the positive class, scikit-learn lays the confusion matrix out with actual classes as rows and predicted classes as columns:

[[ TN  FP ]
 [ FN  TP ]]

The abbreviations mean:

TN – True Negative – the model correctly classified this person as deceased.
FP – False Positive – the model incorrectly classified this person as a survivor.
FN – False Negative – the model incorrectly classified this person as deceased.
TP – True Positive – the model correctly classified this person as a survivor.

So, in this model published on Kaggle, there were: 89 true negatives, 16 false positives, 29 false negatives, and 44 true positives.

Classification Report

You can refer to the link here to learn everything you need to know about the classification report.

```python
print(classification_report(y_test, predictions))
```

Output:

```
              precision    recall  f1-score   support

           0       0.75      0.85      0.80       105
           1       0.73      0.60      0.66        73

   micro avg       0.75      0.75      0.75       178
   macro avg       0.74      0.73      0.73       178
weighted avg       0.75      0.75      0.74       178
```

So the model, using gradient boosted classification trees, has an overall accuracy of 75% (the original kernel used logistic regression).

Wrap-Up

I have attached the dataset and the Python program to this document; you can download them by clicking on these links.
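The numbers in the report can be reproduced by hand from the four confusion-matrix cells. This sketch recomputes precision and recall for the 'survived' class, plus overall accuracy, from the matrix shown above (using scikit-learn's [[TN, FP], [FN, TP]] layout with label 1 as positive):

```python
# Cells of the confusion matrix [[TN, FP], [FN, TP]],
# with 'survived' (label 1) as the positive class
tn, fp, fn, tp = 89, 16, 29, 44

precision = tp / (tp + fp)                   # 44 / 60
recall    = tp / (tp + fn)                   # 44 / 73
accuracy  = (tp + tn) / (tn + fp + fn + tp)  # 133 / 178

print(round(precision, 2), round(recall, 2), round(accuracy, 2))
# matches the 0.73 / 0.60 row for class 1 and the 0.75 overall figure
```

Working through these by hand once is the quickest way to internalize what the classification report is actually telling you.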
Run the code, play with it, manipulate it, and browse the scikit-learn documentation. As a starting point, you should at least:

- Use other algorithms (say LogisticRegression or RandomForestClassifier, at the very least). Refer to the following link for classifiers to use: from Section 1.1 onwards, every algorithm whose name ends in 'Classifier' can be used – that's 30-50 odd models!
- Try to compare the performance of the different algorithms.
- Try to combine the performance comparison into one single program, but keep it modular. Make a list of the names of the classifiers you wish to use, apply them all, and tabulate the results. Refer to the following link.
- Use XGBoost instead of gradient boosting.

Titanic training dataset (here used for both training and testing): titanic.csv – Download

Address of my GitHub public repo with the notebook and code used in this article: Github Code. Clone it with Git (use TortoiseGit for simplicity rather than the command line) and enjoy. To use Git, take the help of a software engineer or developer who has worked with it before. I'll try to cover the relevance of Git for data science in a future article; for now, refer to the following article here. You can install Git from Git-SCM, and TortoiseGit likewise. To clone:

1. Install Git and TortoiseGit (the latter only if necessary).
2. Open the command line with Run… cmd.exe.
3. Create an empty directory.
4. Paste the following string into the command prompt and watch the magic after pressing Enter: git clone https://github.com/thomascherickal/datasciencewithpython-article-src.git
5. Use Anaconda (a common data science development environment with Python, R, Jupyter, and much more) for best results.

Cheers! All the best in your wonderful new adventure of beginning and exploring data science! Learning done right can be awesome fun!

If you want to read more about data science, read our Data Science Blogs.
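The modular performance-comparison exercise suggested above can be sketched as follows. To keep the snippet self-contained, it uses a synthetic dataset from make_classification in place of the Titanic frame; swap in your own X and y in practice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features and labels
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)

# A list of named classifiers, applied in turn and tabulated
models = [
    ('LogisticRegression', LogisticRegression(max_iter=1000)),
    ('RandomForest',       RandomForestClassifier(random_state=0)),
    ('GradientBoosting',   GradientBoostingClassifier(random_state=0)),
]
results = {name: clf.fit(X_train, y_train).score(X_test, y_test)
           for name, clf in models}
for name, acc in results.items():
    print(f'{name}: {acc:.2f}')
```

Because every scikit-learn classifier shares the same fit/score interface, adding another model to the comparison is a one-line change to the list.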