In the first article of this series, I had touched upon key concepts and processes of Data Science. In this article, I will dive in a bit deeper. First, I will define what is Statistical learning. Then, we will dive into key concepts in Statistical learning. Believe me; it is simple.
As per Wikipedia, Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis.
Machine learning is a manifestation of statistical learning techniques implemented through software applications.
What does this mean in practice? Statistical learning refers to tools and techniques that enable us to understand data better. Let’s take a step back here. What do we mean by understanding the data?
In the context of Statistical learning, there are two types of data:
The data that can’t be controlled i.e. dependent variables need to predicted or estimated.
Understanding the data better is to figure out more about the dependent variable in terms of independent variables. Let me illustrate it with an example:
Say that I want to measure sales based on the advertising budget I allocate for TV, Radio, and Print. I can control the budget that I can assign to TV, Radio, and Print. What I cannot control is how they will impact the sales. I want to express data that I cannot control (sales) as a function of data that I can control (advertising budget). I want to uncover this hidden relationship.
Statistical learning reveals hidden data relationships. Relationships between the dependent and the independent data.
One of the famous business models in operations management is the ITO model. It stands for Input-Transformation-Output model. It is simple. There are inputs. These inputs undergo some transformations. An output is created.
Statistical learning also applies a similar concept. There are input data. Input data is transformed. The output, something that needs to be predicted or estimated, is generated.
The transformation engine is called as a Model. These are functions that estimate the output.
The transformation is mathematical. Mathematical ingredients are added to the input data to estimate the output. These ingredients are called as the Parameters.
Let’s walk through an example:
What determines someone’s income? Say that income is a determined by one’s year’s education and years of experience. A model that estimate is the income can be something like this:
income = c + β0 x education + β1 x experience
β0 and β1 are parameters that express income as a function of education and experience.
Education and experience are controllable variables. These controllable variables have different synonyms. They are called as dependent variables. They are also called as features.
Income is uncontrollable variable. They are called as targets.
What do we do when we have to prepare for an examination? Study. Learn. Imbibe. Take notes. Practice mock papers. These are the tools to learn and to prepare for the unseen test.
Machine Learning also uses a similar concept for learning. Data is finite. The available data needs to be used prudently. The model built needs to be validated. The way to validate it is the following:
In a competitive examination, if the preparation is adequate, if the learning is sound then the performance on the test is expected to be satisfactory as well. Similarly, in a machine learning, if the model has learned well from the training data, it will perform well on the test data.
Similarly, in machine learning, once the model is tested on the test dataset, the performance of the model is evaluated. It is assessed based on how close it has estimated the output to the actual value.
George Box, a famous British statistician, once quoted:
“All models are wrong; some are useful.”
No model is 100% accurate. All models have errors. These errors emanate from two sources:
Let me attempt to explain this using an analogy.
Raj, a seven-year-old kid, was just introduced to the concept of multiplication. He had mastered the tables of 1 and 2. His next challenge was to learn the table of 3. He was very excited and started to practice the multiplication table of 3. His table is something like this:
Bob, Raj’s classmate, was on the same boat. His table looked something like this:
Let us examine the multiplication models created by Bob and Raj from a machine learning perspective.
The above example is a crude explanation of the important concept of variance and bias.
I have a school friend who was an amazing student. He was relatively weaker in Mathematics. The way he used to study Mathematics was through rote learning. He learned and memorized the Mathematical problems. He could “recite” them very well.
The challenge was the following: The examination problems were not the same problem that he memorized. The problems were a generalized application of the Mathematical concept. Obviously, he had a tough time clearing the exams.
Machine learning problems follow the same pattern. If the model learns too much about a particular data set and tries to apply the same model to the unseen data, it will have a high error. Learning too much from a given dataset is called as overfitting. It has not generalized the learning to be usefully applied to unseen data. On the other end of the spectrum, learning too little results in underfitting. The model is so poor that it can’t even learn from the given data.
Albert Einstein summarizes this concept succinctly. He said:
“Everything should be made as simple as possible, but no simpler.”
A constant endeavor in Machine learning problem is to strike a right balance. Create a model that is not too complex and not too simple. Create a generalized model. Create a model that is relatively inaccurate but useful.inaccurate but useful.
This balancing act is called bias-variance trade-off.
Statistical learning is the building block of complex machine learning applications. This article introduces some of the fundamental and essential concepts of Statistical learning. Top 5 takeaways from this article are:
We will delve deeper into specifics of machine learning models in subsequent articles of this series.
Comment
It's clear!
Thank you for your valuable information.
Is the following excerpt correct: 'Education and experience are controllable variables. These controllable variables have different synonyms. They are called as dependent variables. They are also called as features' ?
© 2017 Data Science Central Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central