I have mentioned that every Machine Learning process is built from several steps like:
Let’s review them one by one. I should mention that other sources on the internet may list a different number of steps than you see on this blog. For example, you can treat building and testing a model as separate steps, but in the end you need to do both, no matter whether you count them as one step or more. The same can be said about testing and evaluating your model.
Well, this is the most important part of the process! At least from the business (or problem-solving) point of view. Please do not be angry that I am using the word “business”: someone finally has to pay for your work as a data scientist.
Think about it. Imagine you are a car driver and you see a traffic light. There are three colors – red, yellow and green. When the light is red you should wait. When the light is yellow you should be careful and not enter the crossroads. When the light is green you can drive (safely). There is a fourth state of the light as well – the red and yellow lights are on together, which means that green will come in a moment. The yellow light can also pulse, which is an error state. You also know the sequence: green -> yellow -> red -> red and yellow -> green…
Now try to implement a system that, knowing the light color, tells you whether you can go, prepare to go or wait. Do you need to create a Machine Learning model? Or maybe a neural network? No – the system is just a pretty simple algorithm based on a few rules.
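To show how little is needed, here is a minimal sketch of such a rule-based system in Python. The state names, the advice strings and the function names are my own assumptions made up for this illustration; the rules themselves follow the description above.

```python
# Rule-based traffic light advice: a plain lookup table, no learning involved.
# State names and advice wording are illustrative assumptions.
RULES = {
    "red": "wait",
    "red+yellow": "prepare to go",
    "green": "go",
    "yellow": "wait",            # be careful and do not enter the crossroads
    "pulsing yellow": "error",   # the light is in its error state
}

# The fixed cycle from the text: green -> yellow -> red -> red+yellow -> green
SEQUENCE = ["green", "yellow", "red", "red+yellow"]


def advise(light: str) -> str:
    """Return the advice for a light state, or raise for an unknown state."""
    try:
        return RULES[light]
    except KeyError:
        raise ValueError(f"unknown light state: {light!r}") from None


def next_state(light: str) -> str:
    """Return the state that follows `light` in the normal cycle."""
    position = SEQUENCE.index(light)
    return SEQUENCE[(position + 1) % len(SEQUENCE)]


for state in SEQUENCE:
    print(f"{state}: {advise(state)}, next is {next_state(state)}")
```

Every possible input is known in advance and every answer is fixed, so a dictionary does the whole job. Machine Learning starts to pay off only when you cannot write the rules down like this.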
Now you know that you would like to solve some problem. The better it is defined, the greater the chances of success. It can be a simple question like:
The problem can, however, be more complicated. Let’s look at the picture below. Guess which one is a chihuahua and which one is a blueberry muffin.
We as humans can see the difference, but for a program this can be tricky and extremely challenging. What about a task that asks: “Is this body cell malignant or healthy?” Machine Learning helps here.
Wait! What data? Do I have the data already? You should! Someone has already defined the problem, and based on this knowledge the data sets should have been identified.
You can have files, relational databases, NoSQL databases, graph data… whatever it can be!
Believe it or not, this is where the problems really start! The first question should be: what is the data quality? I have data from my company, but can I trust it?
What about public data sources like the ones you can find on the internet? Take a look at the one-minute video I made for you. It is all about public data.
I did not shuffle the deck. The cards were there all the time. It was just as you saw it. Sometimes you need good data and think that a public data set can provide it. With public data you can easily get cheated, and the quality of the entire process can end up very low. Not necessarily, but… you know. It can be.
You have successfully gathered the data and now need to prepare it. I will post not one but many articles about data preparation techniques. There are lots of methods here, but first you should know your data set: what is its origin, what information does it contain, which attributes are important? If you do not know the data set very well, how can you get to know it better (how do you perform exploratory analysis)? Can we reduce the number of attributes (PCA)? Can we remove some data without creating a data skew? Can we introduce new features by combining the existing ones? How do we map string data to numerical data (One Hot Encoding)? Should we perform some normalization or data standardization?
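To make two of these questions less abstract, here is a small sketch of One Hot Encoding and standardization written with the Python standard library only. The function names and the toy values are invented for this illustration; in practice you would most likely reach for a framework instead of hand-written helpers.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each string to a 0/1 vector with one column per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All values are identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Hypothetical toy data, made up for this example.
colors = ["white", "black", "white", "red"]
ages = [3, 5, 1, 7]

categories, encoded = one_hot(colors)
scaled = standardize(ages)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
print(scaled)
```

Each encoded row contains exactly one 1, so the model sees a category without any artificial ordering between colors. The standardized ages keep their relative distances but no longer depend on the original unit.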
Oh boy, so many topics to cover!
Based on the question you have been given (the goal), you should consider not one but several algorithms. There are dozens of them, so how do you pick a good set of algorithms? The simplest approach is to know whether you are doing classification or regression.
Classification is when you assign the output to one of the groups. In our example, an email can be spam or not spam. The classification process takes into account all input features and decides whether a new email (never seen before) is spam or not.
A regression algorithm can predict (estimate, or guesstimate if you will) a number based on the values of the input features. For example: how much will my car be worth next year if it is now 3 years old, has a 6.8-liter diesel engine and is white (and many more…)?
Let me name some classification algorithms here so we can play with them later:
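The difference is easiest to see in what comes out at the end: a label or a number. Below is a minimal sketch with a 1-nearest-neighbour classifier and an ordinary least squares line. I picked these two techniques only because they fit in a few lines, and all the data (counts of suspicious words, car ages and prices) is hypothetical.

```python
def nearest_neighbor_label(train_points, train_labels, x):
    """Classification: copy the label of the closest training point."""
    closest = min(range(len(train_points)),
                  key=lambda i: abs(train_points[i] - x))
    return train_labels[closest]


def fit_line(xs, ys):
    """Regression: ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical feature: number of suspicious words found in an email.
word_counts = [0, 1, 5, 8]
labels = ["not spam", "not spam", "spam", "spam"]

# Hypothetical car ages (years) and prices, invented for this example.
car_ages = [1, 2, 3, 4, 5]
car_prices = [30000, 27000, 24500, 22000, 20000]

intercept, slope = fit_line(car_ages, car_prices)


def predict_price(age):
    """Estimate a price from the fitted line."""
    return intercept + slope * age


print(nearest_neighbor_label(word_counts, labels, 6))  # a label
print(predict_price(4))                                # a number
```

The classifier can only ever answer with one of the labels it has seen, while the regression line can return any number, including ones that never appeared in the training data.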
Here you are – some regression algorithms:
But how do you pick the right one?
You have a data set that contains input features and the information about an outcome (an output feature). Having this in mind let’s build a model.
To do so you need to split your data set into two parts, called the training and testing data sets. Typically the training data set contains 70% of your data and the testing set has the remaining part. Of course it is not always like this, and you can assign less data to the training data set, especially if you have a lot of data.
Now you ask yourself: how do I divide the data set correctly? Not in terms of numbers (a 70%-30% split) but in terms of data quality. The good thing is that existing frameworks like scikit-learn help us in many aspects. I will concentrate on this part in a later post.
Once you have the training and testing data sets, you can choose a model you would like to build. This is really the easy part once you have chosen an algorithm and you know a framework like scikit-learn a bit.
Building a model means creating an object of a specific type and feeding it with data from the training data set. Sometimes a model is trained just once and sometimes it is done iteratively, like in the k-fold cross-validation process (more on this later).
Once the model has been trained, you need to test it on data that the model has never seen. It is analogous to an exam. You can prepare for an exam by studying books or doing research. Then you go to the exam and your knowledge is tested. The result of the test shows how good your knowledge is. The higher the score, the better an expert you are. But if your score is not so good, you need to learn more or change the approach.
You should apply the same process to your model. You need to evaluate it and see whether it can really answer the question you have from the initial step of this process.
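Before we get to the frameworks, here is what a basic split looks like when written by hand. This is only a sketch using the standard library: the function name, the seed and the stand-in records are my assumptions. scikit-learn offers `train_test_split` for the same job, including a `stratify` option that keeps class proportions similar in both parts.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows, then cut it into training and testing parts."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed makes the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 records
train_rows, test_rows = split_rows(rows)

print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters: if the records are sorted (by date or by class, for example), taking the first 70% as-is would give the model a training set that does not look like the testing set.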
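Here is a minimal sketch of what "create an object and feed it with data" can look like in scikit-learn. The choice of the bundled iris data set, the decision tree and the fixed `random_state` values are my assumptions for the illustration, not a recommendation for your problem.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small example data set that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# 70% for training, 30% for testing; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Building the model: create the object, then feed it the training data.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Trained once: score on data the model has never seen.
test_accuracy = model.score(X_test, y_test)

# Trained iteratively: 5-fold cross-validation on the training data only.
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_train, y_train, cv=5
)

print(test_accuracy)
print(cv_scores.mean())
```

Swapping in a different classifier usually means changing only the import and the constructor line, because scikit-learn estimators share the same `fit` and `score` methods.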
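The "exam score" of a classifier can be as simple as the fraction of correct answers. Below is a small sketch of that, plus a count of which mistakes were made. The labels and predictions are hypothetical values I made up; they do not come from a real model.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty test set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Hypothetical test-set labels and model predictions.
y_true = ["spam", "spam", "not spam", "not spam", "not spam"]
y_pred = ["spam", "not spam", "not spam", "not spam", "spam"]

score = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)

print(score)
for (true_label, predicted_label), n in sorted(counts.items()):
    print(f"true={true_label!r} predicted={predicted_label!r}: {n}")
```

The single score tells you how often the model is right, while the pair counts tell you how it is wrong. Letting spam through and blocking a real email are different mistakes, and the business may care about one much more than the other.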
But what if the model is not working as expected? Then you have two options:
In one of the next articles I will show you how to automate this process in Python.
Are you overwhelmed by this post? I will explain all the main steps in the next articles, so do not get confused! For now you can relax, as we will be using existing frameworks that speed up the Machine Learning & AI steps I have described.
There will be even more new tasks to cover so please stay tuned. For example I will be discussing (apart from many other things I have mentioned above):