Machine learning (ML) is the motor that drives data science. Each ML method (also called an algorithm) takes in data, turns it over, and spits out an answer. ML algorithms do the part of data science that is the trickiest to explain and the most fun to work with. That’s where the mathematical magic happens.
ML algorithms can be grouped into families based on the type of question they answer. These families can help guide your thinking as you formulate your razor-sharp question.
This family is formally known as two-class classification. It’s useful for any question that has just two possible answers: yes or no, on or off, smoking or non-smoking, purchased or not. Lots of data science questions sound like this or can be re-phrased to fit this form. It’s the simplest and most commonly asked data science question. Here are a few typical examples.
Will this customer renew their subscription?
Is this an image of a cat or a dog?
Will this customer click on the top link?
Will this tire fail in the next thousand miles?
Does the $5 coupon or the 25% off coupon result in more return customers?
This algorithm family is called multi-class classification. As its name implies, it answers a question that has several (or even many) possible answers: which flavor, which person, which part, which company, which candidate. Most multi-class classification algorithms are just extensions of two-class classification algorithms. Here are a few typical examples.
Which animal is in this image?
Which aircraft is causing this radar signature?
What is the topic of this news article?
What is the mood of this tweet?
Who is the speaker in this recording?
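To make "extensions of two-class classification" concrete, here is a minimal sketch of one common approach, often called one-vs-rest: ask a separate "is it this class or not?" question for every class and keep the most confident yes. The keyword-counting scorers and the sample article are hypothetical stand-ins for trained two-class classifiers.

```python
def one_vs_rest(binary_scorers, example):
    """Pick the label whose two-class scorer is most confident about the example."""
    return max(binary_scorers, key=lambda label: binary_scorers[label](example))

# Hypothetical scorers: each answers "how strongly does this look like my topic?"
scorers = {
    "sports": lambda text: text.count("game") + text.count("score"),
    "finance": lambda text: text.count("stock") + text.count("market"),
    "weather": lambda text: text.count("rain") + text.count("storm"),
}

article = "the stock market fell as traders watched the storm"
print(one_vs_rest(scorers, article))  # finance
```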
This family of algorithms performs anomaly detection. They identify data points that are not normal. If you are paying close attention, you noticed that this looks like a binary classification question. It can be answered yes or no. The difference is that binary classification assumes you have a collection of examples of both yes and no cases. Anomaly detection doesn’t. This is particularly useful when what you are looking for occurs so rarely that you haven’t had a chance to collect many examples of it, like equipment failures. It’s also very helpful when there is a lot of variety in what constitutes “not normal,” as there is in credit card fraud detection. Here are some typical anomaly detection questions.
Is this pressure reading unusual?
Is this internet message typical?
Is this combination of purchases very different from what this customer has made in the past?
Are these voltages normal for this season and time of day?
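One minimal way to sketch this, assuming you only have examples of normal behavior and that the readings are roughly bell-shaped: learn the typical value and spread from the normal readings, then flag anything far outside them. The readings, the name `is_unusual`, and the cutoff of three standard deviations are all illustrative assumptions.

```python
import statistics

# Hypothetical pressure readings, all collected during normal operation.
normal_readings = [101.2, 100.8, 101.5, 100.9, 101.1, 101.3, 100.7, 101.0]

typical = statistics.mean(normal_readings)
spread = statistics.stdev(normal_readings)

def is_unusual(reading, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from typical."""
    return abs(reading - typical) / spread > threshold

print(is_unusual(101.1))  # False: well within the normal range
print(is_unusual(108.4))  # True: far outside anything seen so far
```

Notice that no examples of failures were needed; the sketch only ever saw what "normal" looks like.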
When you are looking for a number instead of a class or category, the algorithm family to use is regression.
What will the temperature be next Tuesday?
What will my fourth quarter sales in Portugal be?
How many kilowatts will be demanded from my wind farm 30 minutes from now?
How many new followers will I get next week?
Out of a thousand units, how many of this model of bearings will survive 10,000 hours of use?
Usually, regression algorithms give a real-valued answer; the answers can have lots of decimal places or even be negative. For some questions, especially questions beginning “How many…”, negative answers may have to be re-interpreted as zero and fractional values re-interpreted as the nearest whole number.
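That re-interpretation can be sketched in a couple of lines. The helper name `to_count` is hypothetical, and the sample predictions are made up for illustration.

```python
def to_count(prediction):
    """Turn a real-valued regression output into a usable count.

    Negative values become zero and fractions go to the nearest whole number.
    (Python's round() sends exact halves to the nearest even number.)
    """
    return max(0, round(prediction))

print(to_count(-2.3))  # 0
print(to_count(41.7))  # 42
```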
Sometimes questions that look like multi-class classification questions are actually better suited to regression. For instance, “Which news story is the most interesting to this reader?” appears to ask for a category, a single item from the list of news stories. However, you can reformulate it to “How interesting is each story on this list to this reader?” and give each article a numerical score. Then it is a simple thing to identify the highest-scoring article. Questions of this type often occur as rankings or comparisons.
“Which van in my fleet needs servicing the most?” can be rephrased as “How badly does each van in my fleet need servicing?”
“Which 5% of my customers will leave my business for a competitor in the next year?” can be rephrased as “How likely is each of my customers to leave my business for a competitor in the next year?”
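Once every item has a score, answering the original "which" question is just a matter of sorting. The sketch below assumes the scores have already been produced by some regression model; the van names, customer names, and numbers are invented for illustration.

```python
# Hypothetical "how badly does it need servicing?" scores for each van.
van_scores = {"van_a": 0.31, "van_b": 0.87, "van_c": 0.55}
neediest_van = max(van_scores, key=van_scores.get)
print(neediest_van)  # van_b

def top_fraction(scores, fraction):
    """Return the highest-scoring `fraction` of items (at least one item)."""
    how_many = max(1, round(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:how_many]

# Hypothetical "how likely to leave?" scores for twenty customers.
churn_scores = {f"customer_{i}": i / 20 for i in range(20)}
print(top_fraction(churn_scores, 0.05))  # ['customer_19']
```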
It may not come as a surprise that binary classification problems can also be reformulated as regression. (In fact, under the hood some algorithms reformulate every binary classification as regression.) This is especially helpful when an example can belong partly to A and partly to B, or have a chance of going either way. When an answer can be partly yes and partly no, probably on but possibly off, then regression can reflect that. Questions of this type often begin “How likely…” or “What fraction…”
How likely is this user to click on my ad?
What fraction of pulls on this slot machine result in payout?
How likely is this employee to be an insider security threat?
What fraction of today’s flights will depart on time?
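As a rough sketch of how a yes-or-no question can ride on top of a number: logistic regression, one well-known example, produces a real-valued score, squashes it into the range 0 to 1, and only then applies a cutoff. The function names and the 0.5 cutoff below are illustrative assumptions, and the scores are made up rather than learned from data.

```python
import math

def likelihood(score):
    """Squash any real-valued score into the range 0 to 1 (the logistic function)."""
    return 1.0 / (1.0 + math.exp(-score))

def yes_or_no(score, cutoff=0.5):
    """Collapse the likelihood back into a two-class answer."""
    return likelihood(score) >= cutoff

print(likelihood(0.0))   # 0.5: could go either way
print(yes_or_no(2.0))    # True: probably yes
print(yes_or_no(-2.0))   # False: probably no
```

Keeping the likelihood around, rather than only the yes or no, is what lets you answer the “How likely…” version of the question.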
As you may have gathered, the families of two-class classification, multi-class classification, anomaly detection, and regression are all closely related. They all belong to the same extended family, supervised learning. They have a lot in common, and often questions can be modified and posed in more than one of them. What they all share is that they are built using a set of labeled examples (a process called training), after which they can assign a value or category to unlabeled examples (a process called scoring).
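The two steps, training and scoring, can be sketched with about the simplest supervised method there is: a nearest-neighbor classifier, which labels a new example with the label of the most similar example it has already seen. This is only one of many possible methods, and the feature values and labels below are invented for illustration.

```python
def train(labeled_examples):
    """Training: for nearest neighbor, this is just remembering the labeled examples."""
    return list(labeled_examples)

def score(model, features):
    """Scoring: give an unlabeled example the label of its closest remembered example."""
    def squared_distance(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], features))
    nearest = min(model, key=squared_distance)
    return nearest[1]

# Hypothetical labeled examples: (features, label).
labeled = [
    ((1.0, 1.2), "cat"),
    ((0.9, 1.0), "cat"),
    ((3.1, 3.0), "dog"),
    ((2.9, 3.3), "dog"),
]

model = train(labeled)
print(score(model, (1.1, 0.9)))  # cat
print(score(model, (3.0, 3.1)))  # dog
```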
Entirely different sets of data science questions belong in the extended algorithm families of unsupervised and reinforcement learning.
Questions about how data is organized belong to unsupervised learning. There are a wide variety of techniques that try to tease out the structure of data. One family of these performs clustering, a.k.a. chunking, grouping, bunching, or segmentation. They seek to separate out a data set into intuitive chunks. What makes clustering different from supervised learning is that there is no number or name that tells you what group each point belongs to, what the groups represent, or even how many groups there should be. If supervised learning is picking out planets from among the stars in the night sky, then clustering is inventing constellations. Clustering tries to separate out data into natural “clumps,” so that a human analyst can more easily interpret it and explain it to others.
Clustering always relies on a definition of closeness or similarity, called a distance metric. The distance metric can be any measurable quantity, such as difference in IQ, number of shared genetic base pairs, or miles-as-the-crow-flies. Clustering questions all try to break data into more nearly uniform groups.
Which shoppers have similar tastes in produce?
Which viewers like the same kind of movies?
Which printer models fail the same way?
During which days of the week does this electrical substation have similar electrical power demands?
What is a natural way to break these documents into five topic groups?
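Here is a minimal sketch of clustering built around a distance metric, using the k-means idea on one-dimensional data: assign each point to its nearest group center, move each center to the average of its group, and repeat. The demand figures are hypothetical, and the choice of two groups and their starting centers is an assumption made for the example, not something the data provides.

```python
def distance(a, b):
    """The distance metric: here, simply the difference in megawatts."""
    return abs(a - b)

def cluster(points, centers, rounds=10):
    """Group points around centers, nudging each center to its group's average."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its old center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

# Hypothetical peak demand (MW) for Monday through Sunday.
daily_demand = [41, 43, 42, 44, 40, 25, 24]
groups = cluster(daily_demand, centers=[min(daily_demand), max(daily_demand)])
print(groups)  # [[25, 24], [41, 43, 42, 44, 40]]
```

The algorithm never hears the words "weekday" or "weekend". It only finds the two clumps; naming them is left to the analyst.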
Another family of unsupervised learning algorithms is known as dimensionality reduction. Dimensionality reduction is another way to simplify the data, making it easier to communicate, faster to compute with, and easier to store.
At its core, dimensionality reduction is all about creating a shorthand for describing data points. A simple example is GPA. A college student’s academic strength is measured in dozens of classes by hundreds of exams and thousands of assignments. Each assignment says something about how well that student understands the course material, but a full listing of them would be way too much for any recruiter to digest. Luckily, you can create a shorthand just by averaging all the scores together. You can get away with this massive simplification because students who do very well on one assignment or in one class typically do well in others. By using GPA rather than the full portfolio, you do lose richness. For instance, you wouldn’t know it if the student is stronger in math than English, or if she scored better on take-home programming assignments than on in-class quizzes. But what you gain is simplicity, which makes it a lot easier to talk about and compare students’ strength.
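The GPA shorthand is simple enough to sketch directly. The students and grades below are made up; the point is only that several numbers per student collapse into one.

```python
from statistics import mean

# Hypothetical grade points, one per course, for two students.
course_grades = {
    "student_a": [4.0, 3.5, 4.0, 3.5],
    "student_b": [3.0, 2.5, 4.0, 2.5],
}

# The shorthand: one number per student instead of one per course.
gpa = {name: mean(grades) for name, grades in course_grades.items()}
print(gpa)  # {'student_a': 3.75, 'student_b': 3.0}
```

The summary makes the two students easy to compare, but student_b's single 4.0 is no longer visible. That is the richness you give up.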
Dimensionality reduction-related questions are usually about factors that tend to vary together.
Which groups of sensors in this jet engine tend to vary with (and against) each other?
What leadership practices do successful CEOs have in common?
What are the most common patterns in gasoline price changes across the US?
What groups of words tend to occur together in this set of documents? (What are the topics they cover?)
If your goal is to summarize, simplify, condense, or distill a collection of data, dimensionality reduction and clustering are your tools of choice.
A third extended family of ML algorithms focuses on taking actions. These are called reinforcement learning (RL) algorithms. They are a little different from the supervised and unsupervised learning algorithms. A regression algorithm might predict that the high temperature will be 98 degrees tomorrow, but it doesn’t decide what to do about it. An RL algorithm goes the next step and chooses an action, such as pre-refrigerating the upper floors of the office building while the day is still cool.
RL algorithms were originally inspired by how the brains of rats and humans respond to punishment and rewards. They choose actions, trying very hard to choose the action that will earn the greatest reward. You have to provide them with a set of possible actions, and they need to get feedback after each action on whether it was good, neutral, or a huge mistake.
Typically RL algorithms are a good fit for automated systems that have to make a lot of small decisions without a human’s guidance. Elevators, heating, cooling, and lighting systems are excellent candidates. RL was originally developed to control robots, so anything that moves on its own, from inspection drones to vacuum cleaners, is fair game. Questions that RL answers are always about what action should be taken, although the action is usually taken by a machine.
Where should I place this ad on the webpage so that the viewer is most likely to click it?
Should I adjust the temperature higher, lower, or leave it where it is?
Should I vacuum the living room again or stay plugged in to my charging station?
How many shares of this stock should I buy right now?
Should I continue driving at the same speed, brake, or accelerate in response to that yellow light?
RL usually requires more effort to get working than other algorithm types because it’s so tightly integrated with the rest of the system. The upside is that most RL algorithms can start working without any data. They gather data as they go, learning from trial and error.
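Trial and error can be sketched with one of the simplest RL strategies, often called epsilon-greedy: usually take the action that has paid off best so far, but occasionally try a random one in case something better has been overlooked. Everything below is an illustrative assumption, including the action names, the made-up feedback function standing in for the real system, and the 10% exploration rate.

```python
import random

def learn_best_action(actions, get_reward, trials=1000, explore=0.1, seed=0):
    """Start with no data, act, collect feedback, and favor what has worked."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}

    def average(action):
        # Actions that have never been tried are treated as neutral.
        return totals[action] / counts[action] if counts[action] else 0.0

    for _ in range(trials):
        if rng.random() < explore:
            action = rng.choice(actions)        # occasionally explore
        else:
            action = max(actions, key=average)  # otherwise exploit
        reward = get_reward(action, rng)
        totals[action] += reward
        counts[action] += 1
    return max(actions, key=average)

def thermostat_feedback(action, rng):
    """Hypothetical environment: on a hot day, lowering the temperature pays off."""
    typical = {"higher": -1.0, "lower": 1.0, "leave": 0.0}
    return typical[action] + rng.gauss(0, 0.5)

best = learn_best_action(["higher", "lower", "leave"], thermostat_feedback)
print(best)
```

In a real system the feedback would come from the building itself rather than a function, which is exactly the tight integration that makes RL more work to set up.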
The next and final post in this series will give lots of specific examples of sharp data science questions and the algorithm family best suited to each. Stay tuned.
About the Author
Brandon works at Microsoft on cloud-based APIs for commercial customers with data cleaning, machine learning, and visualization, and manages a team of data scientists. He also teaches basic and advanced data science concepts to employees and the global community, and has written tutorial examples and documentation in support of the Cortana Analytics Suite of cloud analytics tools. Prior to Microsoft, Brandon worked for DuPont Pioneer and Sandia National Laboratories. Brandon graduated from MIT and lives in Boston.