How to Approach a Data Intensive Problem

“It is a capital mistake to theorise before one has data.”

― Arthur Conan Doyle, The Adventures of Sherlock Holmes

Are you stuck with a problem? Great! Previously I wrote a general introduction to predictive functions and where you might use them to provide “killer” features in your applications. I argued that data analytics will become a part of modern software engineering. In both of these disciplines problem solving is an essential skill, and the harder the problems you can crack, the more unique the applications you will get. Luckily, many of the same good old principles apply. One is test-driven design, which can also guide you through the dark alleys of data modeling mysteries.

A solution to a typical prediction problem can be separated into two parts. The first is a prediction model that contains the best current knowledge about the subject area, or problem domain. The second is an inference algorithm that allows you to query that gained knowledge. The prediction model is then a potentially reusable part of the software architecture, while there is typically just one inference query for every predictive function.

Obviously, you need to start gaining knowledge before you can ask questions, but before looking at, or even collecting, the real data, it is essential to understand your problem domain well and to formulate good questions. For that, I’ll advise you, against Sherlock’s quote above, to theorise before having the data.

The reason for this precaution is that without a good understanding of the domain, the data can be made to explain anything. The more data you have, the greater the false conclusions it allows you to draw!

The traditional statistical approach is to have a predefined hypothesis that is either supported or rejected by the evidence from the data. The machine learning camp is a bit more relaxed, but you still need separate sets of data for training and testing the model. So, what can you do if you only have a faint idea about the problem and no real data at all?

Take Out Your Lab Coat!

I am currently working on a problem where we want to statistically study how the human body responds to different kinds of food and how to estimate that response for each person. You can create some seriously beneficial applications with this knowledge, but in addition to the data-related problems listed above, the nutritional diaries and the laboratory results needed for the baseline modeling are highly personal and sensitive, and are also hard to collect in a clinically competent way. This means that you really want to be sure about your methods before you start, so that the data is collected correctly the first time.

This is where the test-driven approach has helped me a lot. The idea is to make a best guess about the features of your data, generate test data, and then test how well your modeling technique can capture those features. After a successful capture, it is good to tweak the data until you reach the limits of the model and it breaks down. Then you adapt and iterate again. This way you can gain intimate insights into your problem and into your modeling algorithm, be it a proprietary implementation or one of your own.

Next I will walk you through how I started this process. I will also try to show it visually, so if you cannot follow all the long-winded theoretical underpinnings, just look at the images and see how things click together.

Setting the Scene

Here I took some of the actual parameters that are studied and connected them. To justify the use of directed arrows, I assumed that the nutrition parameters in the top row affect the lab results at the bottom, and not the other way around. Note that the size of the actual effects, if any, is unknown at this point, and that is what we would like to find out.

Let’s look at one of these relationships more closely. For my applications it is enough to approximate the relation as linear. In other words, it is enough to know whether a specific nutrient increases or decreases a given blood characteristic that is connected to your well-being. The complexity comes when we have a large system of these relationships.

Linear Relations

In a linear relationship the result can be obtained from the input just by multiplying it by a coefficient and adding a constant. For the sake of example, let’s assume that we can estimate HDL-cholesterol by multiplying the measured amount of MUFA-fat by a coefficient and adding a constant.

Mathematically we can denote this true relationship by a linear equation

FSHDL = a + b × MUFA

Our goal is then to generate some test data that follows this relationship so that we can try to estimate the coefficients a and b. For that, let’s take some totally unrelated data and start shaping it like clay.

This looks like it has been done with a shotgun, and it usually means that all hopes of finding anything interesting have just been shot down in flames. The only notable thing is that the data is drawn at random from a normal distribution and spreads around the center with a typical distance of one unit. In other words, it has a variance of one.

For testing, we would like to introduce a predefined relationship between a predictor and a response, for example MUFA-fat and HDL-cholesterol, that we can try to capture with our model. In statistics, the measure of this linear relationship is called covariance, and it connects to the previous linear equation through

b = covariance / variance

We can use a method from linear algebra called eigendecomposition, or its faster twin, Cholesky decomposition, to shape normally distributed data with unit variance to any given covariance. From the equation above you may also notice that when the variance of the predictor is 1, we can expect b in our linear equation to approach the value of the covariance as the amount of data grows.

Here I have used the R function mvrnorm, which uses eigendecomposition, to create a linear relationship between MUFA and FSHDL with a covariance of 0.8. You can try it yourself in R-Fiddle, where I have a code snippet.
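For readers without R at hand, the same shaping step can be sketched in Python with NumPy. This is only an illustration, not the R snippet mentioned above: the seed, the sample size of 5000 and the variable names are assumptions, and it uses the Cholesky route rather than the eigendecomposition that mvrnorm uses.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily

# Unit variances on the diagonal, covariance 0.8 off the diagonal.
sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Start from unrelated standard-normal "shotgun" data ...
n = 5000  # assumption: larger than the article's 50, so the estimate settles
z = rng.standard_normal((n, 2))

# ... and shape it with the Cholesky factor, where sigma = chol @ chol.T.
chol = np.linalg.cholesky(sigma)
data = z @ chol.T
mufa, fshdl = data[:, 0], data[:, 1]

# Slope estimate: b = covariance / variance of the predictor.
cov = np.cov(mufa, fshdl)
b = cov[0, 1] / cov[0, 0]
print(f"sample covariance: {cov[0, 1]:.3f}, estimated b: {b:.3f}")
```

With unit variance in the predictor, both printed numbers should land close to 0.8, and they drift further from it if you lower n towards 50.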

Creating a Network of Relations

Now we can zoom out for the bigger picture and repeat this procedure for all the pairs of predictors and responses, that is, nutritional items and lab results. For testing I have made up this covariance matrix that contains a variance of 1.00 on the diagonal and symmetric pairs of covariances off the diagonal.

Manually coming up with a working covariance matrix for a large set of variables can be a tedious task, as it has to be positive semidefinite. Roughly, this means that the pairwise covariances have to be consistent with each other, and variables that are strongly dependent on each other (multicollinear) easily break that.
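One way to check a hand-made matrix before using it is to look at its eigenvalues: a symmetric matrix is positive semidefinite exactly when none of them is negative. The sketch below assumes NumPy, and both example matrices are made up for illustration, not taken from the study.

```python
import numpy as np

def is_positive_semidefinite(matrix, tol=1e-10):
    """Return True if a symmetric matrix has no eigenvalue below -tol."""
    eigenvalues = np.linalg.eigvalsh(matrix)
    return bool(eigenvalues.min() >= -tol)

# Hypothetical values: the first two variables both move strongly with the
# third, yet are claimed to move strongly against each other.
inconsistent = np.array([[ 1.0, -0.9, 0.9],
                         [-0.9,  1.0, 0.9],
                         [ 0.9,  0.9, 1.0]])

# Hypothetical values with moderate, mutually compatible covariances.
consistent = np.array([[1.0, 0.3, 0.5],
                       [0.3, 1.0, 0.4],
                       [0.5, 0.4, 1.0]])

print(is_positive_semidefinite(inconsistent))
print(is_positive_semidefinite(consistent))
```

The first matrix fails the check because its three pairwise claims cannot all hold at once, while the second one passes.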

We can now use a similar procedure to generate data with a whole set of relations. For our test case to work we need to generate maybe a thousand observations instead of the 50 used in the previous examples. As the number of observations increases, the covariances in the data approach those in the covariance matrix.

Let us finally check the data with a correlation plot where we have measured the actual correlations from the generated data. As you see, the measured correlations closely resemble the covariance matrix that was used to generate the data, which is what we expect when all the variances are 1.
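The generate-and-check step can be sketched numerically as well. The following NumPy example is an assumption-laden stand-in for the R workflow: the three-variable covariance matrix is invented for illustration, and the comparison is a plain number instead of a correlation plot.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed, chosen arbitrarily

names = ["MUFA", "PUFA", "FSHDL"]
# Hypothetical covariance matrix (not the article's values):
# two independent predictors that both relate to one response.
sigma = np.array([[1.0, 0.0, 0.6],
                  [0.0, 1.0, 0.4],
                  [0.6, 0.4, 1.0]])

# Draw a thousand observations with the requested covariance structure.
samples = rng.multivariate_normal(mean=np.zeros(3), cov=sigma, size=1000)

# Measure the correlations actually present in the generated data.
measured = np.corrcoef(samples, rowvar=False)

# Largest absolute difference between measured and requested values.
max_gap = np.abs(measured - sigma).max()
print(np.round(measured, 2))
print(f"largest deviation from the target matrix: {max_gap:.3f}")
```

Because every variance is 1, the target covariance matrix is also the target correlation matrix, so the two can be compared directly.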

Great! Now we have some data to play with. Let us now engineer our first model.

Modeling is Tough Business

As we assumed that the relationship between any two variables is linear, a simple linear regression model might be a good starting point.

FSHDL = â + b̂ × MUFA + ε

As you see, it resembles the previous linear equation, but the hats above the coefficients mean that we are estimating them, â and b̂, from the data, while the earlier equation described the true relationship, which may never be captured exactly. In addition, the regression equation contains an epsilon term that denotes the residual error: how much the model differs from each observation.

Perhaps the simplest method for learning this kind of model is Ordinary Least Squares (OLS), which tries to minimize the sum of the squared residual errors. The R function lm implements OLS, and you can see the fitted black line in the image, with red lines denoting the residuals.
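To make the fitting step concrete without R, here is a minimal least-squares sketch in NumPy. It is not the lm call itself; the true coefficients (a = 0, b = 0.8), the noise level and the seed are assumptions picked for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen arbitrarily

# Synthetic data from a known "true" relationship (assumed a = 0, b = 0.8).
n = 50
mufa = rng.standard_normal(n)
fshdl = 0.0 + 0.8 * mufa + rng.normal(scale=0.6, size=n)

# Design matrix with an intercept column; lstsq minimizes the squared residuals.
X = np.column_stack([np.ones(n), mufa])
coef, *_ = np.linalg.lstsq(X, fshdl, rcond=None)
a_hat, b_hat = coef

# Residuals are the vertical distances between observations and the fitted line.
residuals = fshdl - X @ coef
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
print(f"sum of squared residuals = {np.sum(residuals ** 2):.3f}")
```

With only 50 observations the estimates wobble noticeably around the true values, which is exactly the gap between the hatted and the true coefficients.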

As you remember, our data consists of a network of these linear relations, and so must our model. Before we go on, let’s dive into theory once more and see what assumptions make a simple linear regression so simple.

Assumptions of Linear Regression

It is essential to know the theoretical assumptions of your modeling technique. You should think beforehand about whether your actual data is going to break these assumptions. The classical linear regression model that is trained with the Ordinary Least Squares method makes a few quite strict assumptions. Let’s see if our newly generated data complies with them:

Linear relationship: Yes, that was our assumption when we generated the correlation in the data.
Multivariate normality: Yes, the data points are sampled from a normal distribution.
No or little multicollinearity: Yes, this was the point of the covariance matrix that didn’t allow dependent predictive variables.
Homoscedasticity: This difficult word means that those red lines denoting residual errors are expected to have the same size on average. With a visual inspection we cannot see any rising or falling trends, so I guess this is also satisfied.
No auto-correlation: Now this is a bit tricky. It assumes that successive residuals are independent, in the sense that we cannot predict the next residual error from the previous ones. If we assume that the observations are collected from individual and independent persons, one from each, and that they are not a time series of repeated measurements, then this holds too.

The reality is even more complex for this last assumption, since the data will consist of repeated measurements from hundreds of people. We are definitely going to relax this rule by enhancing the model of those residual errors, but let’s stick with this simple case first.
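The no-auto-correlation assumption can also be checked with a number instead of by eye. A common choice is the Durbin-Watson statistic, which is close to 2 when successive residuals are uncorrelated and drops towards 0 when each residual resembles the previous one. The NumPy sketch below uses simulated residuals; the carry-over factor of 0.8 and the seed are assumptions.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 when successive residuals are uncorrelated."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))

rng = np.random.default_rng(3)  # fixed seed, chosen arbitrarily

# Residuals as if every observation came from a different, independent person.
independent = rng.standard_normal(1000)

# Residuals as if each one carried over 0.8 of the previous one,
# which is what repeated measurements in sequence might look like.
correlated = np.empty(1000)
correlated[0] = independent[0]
for t in range(1, 1000):
    correlated[t] = 0.8 * correlated[t - 1] + independent[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(f"independent residuals: {dw_independent:.2f}")
print(f"correlated residuals:  {dw_correlated:.2f}")
```

The statistic depends on the order of the observations, so it is only meaningful when the rows are sorted in the order in which the measurements were taken.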

Modeling a Network of Relations

The structure of our problem, with many inputs, many outputs, and possibly intermediate variables, reminds me of a graph. So, my logical choice for a modeling technique is a probabilistic graphical model (PGM). And because the relationships between variables are directed, the model should be a Bayesian network rather than a Markov network. It is essential that the structure of the model resembles the structure of the problem.

PGMs have been successful in modeling complex systems, like gene interactions, and their strength lies in modeling indirect relationships that jump over one or several nodes. Bayesian networks have a nice property that allows me to combine manually engineered expert knowledge from journals and interviews and enrich it with data mining.

Bayesian networks differ from neural networks in the level of abstraction at which they operate. While neural networks try to mimic the function of neurons in a brain, Bayesian networks store knowledge about concepts and their relationships. The probabilistic relationships between concepts can be described very flexibly, and reasoning with the concepts is efficient because the acyclic network structure helps to prune out the irrelevant concepts.

In this example all the relationships between variables are linear and the data is known to come from a normal (Gaussian) distribution. This makes the network specifically a Gaussian Bayesian Network (GBN).

Executing the Test

Training a Gaussian Bayesian Network consists of two parts. First, the structure of the network is explored. I have given the training algorithm the prior knowledge that only the relationships between nutritional items and blood characteristics are relevant. After that the training algorithm has to decide which relationships are significant. The Incremental Association Markov Blanket (IAMB) algorithm uses the Pearson correlation as a measure of the linear association between the variables in the data. This information is then used to include only the significant relationships in the graph.
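The idea of keeping only significant relationships can be illustrated with a much simpler stand-in than IAMB. The NumPy sketch below is not IAMB: it only tests each allowed predictor-response pair on its own with a Fisher z-test on the Pearson correlation, and it ignores the conditional tests a real structure learning algorithm performs. The function name, the UNRELATED variable, the effect sizes and the 1.96 threshold (roughly a 5% two-sided test) are all assumptions for illustration.

```python
import numpy as np

def significant_links(data, names, predictors, responses, z_crit=1.96):
    """Return (predictor, response) pairs whose Pearson correlation passes a Fisher z-test."""
    n = data.shape[0]
    corr = np.corrcoef(data, rowvar=False)
    links = []
    # Only predictor -> response pairs are considered, mirroring the prior knowledge.
    for p in predictors:
        for r in responses:
            rho = corr[names.index(p), names.index(r)]
            # Fisher transform: approximately standard normal when the true correlation is 0.
            z = np.arctanh(rho) * np.sqrt(n - 3)
            if abs(z) > z_crit:
                links.append((p, r))
    return links

rng = np.random.default_rng(5)  # fixed seed, chosen arbitrarily
n = 1000
mufa, pufa, unrelated = rng.standard_normal((3, n))
# Hypothetical planted effects: FSHDL depends on MUFA and PUFA only.
fshdl = 0.6 * mufa + 0.4 * pufa + rng.normal(scale=np.sqrt(0.48), size=n)

names = ["MUFA", "PUFA", "UNRELATED", "FSHDL"]
data = np.column_stack([mufa, pufa, unrelated, fshdl])
links = significant_links(data, names, ["MUFA", "PUFA", "UNRELATED"], ["FSHDL"])
print(links)
```

The two planted links should be reported reliably, while a variable with no real effect still slips through in about one run out of twenty at this threshold, which is one reason to test on generated data where the truth is known.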

You can think of the resulting GBN graph as a set of the linear equations we looked at earlier. The data mining, in a sense, discovers the correct set of equations for you. You can see that it has discovered that the level of HDL-cholesterol depends on both MUFA and PUFA-fats. This is a feature that I planted in the test data through the covariance matrix.

After the structure learning, the next step is to discover how big an impact the variables, or nodes in the graph, have on each other. In the simplest setting, this is again done with the OLS algorithm that we discussed earlier. Now, instead of one predictor and one response, it is applied over the whole graph, and you can see how well the estimated b values match the earlier covariance matrix and the correlation plot.
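Applying OLS over a graph amounts to one regression per node, with the node's parents as predictors. The NumPy sketch below assumes a tiny hand-written structure and invented effect sizes; the helper fit_node and the dictionary layout are hypothetical and do not come from any particular Bayesian network library.

```python
import numpy as np

def fit_node(data, names, child, parents):
    """OLS of one child node on its parents; returns the intercept and per-parent slopes."""
    y = data[:, names.index(child)]
    columns = [np.ones(len(y))] + [data[:, names.index(p)] for p in parents]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[0]), {p: float(b) for p, b in zip(parents, coef[1:])}

rng = np.random.default_rng(9)  # fixed seed, chosen arbitrarily
n = 2000
mufa, pufa = rng.standard_normal((2, n))
# Hypothetical planted effects, giving covariances of 0.6 and 0.4 with FSHDL.
fshdl = 0.6 * mufa + 0.4 * pufa + rng.normal(scale=np.sqrt(0.48), size=n)

names = ["MUFA", "PUFA", "FSHDL"]
data = np.column_stack([mufa, pufa, fshdl])

# Assumed result of structure learning: each child node maps to its parents.
structure = {"FSHDL": ["MUFA", "PUFA"]}

fitted = {child: fit_node(data, names, child, parents)
          for child, parents in structure.items()}
intercept, slopes = fitted["FSHDL"]
print(f"intercept = {intercept:.3f}, slopes = {slopes}")
```

The slopes line up with the planted covariances here only because the two predictors are independent and have unit variance; with correlated parents the regression coefficients and the covariances would differ.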

Breaking the Rules

If the above was the happy path of the testing, let us now consider what happens if we break the theoretical assumptions of our OLS modeling method. The Gauss-Markov theorem proves that under those assumptions Ordinary Least Squares gives the best linear unbiased estimator you can get. But if we don’t take their word for it and introduce some kind of structure into the data, things go haywire. If I add measurements from dozens of different people to the same dataset and repeat the test, we can see that the estimated values are off from the values we used to generate the data. The measurements are not independent anymore, but grouped by person.
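One hypothetical way in which grouping by person can throw a pooled OLS estimate off is sketched below with NumPy. The data-generating assumptions are invented for illustration and are not the study's: every person has a habitual intake level that also shifts their personal baseline response, on top of a within-person slope of 0.8.

```python
import numpy as np

rng = np.random.default_rng(11)  # fixed seed, chosen arbitrarily

n_persons, n_obs, b_true = 30, 40, 0.8
person = np.repeat(np.arange(n_persons), n_obs)

# Assumption: a person-level habit drives both the intake and the baseline response.
habit = rng.normal(scale=2.0, size=n_persons)
x = habit[person] + rng.standard_normal(person.size)
y = b_true * x + 2.0 * habit[person] + rng.normal(scale=0.5, size=person.size)

def slope(predictor, response):
    """OLS slope of a one-predictor regression: covariance / variance."""
    cov = np.cov(predictor, response)
    return cov[0, 1] / cov[0, 0]

# Pooled OLS treats all measurements as independent observations.
pooled = slope(x, y)

# Centering every person's measurements removes the person-level offsets.
x_mean = np.bincount(person, weights=x) / n_obs
y_mean = np.bincount(person, weights=y) / n_obs
within = slope(x - x_mean[person], y - y_mean[person])

print(f"true slope: {b_true}, pooled OLS: {pooled:.2f}, within-person: {within:.2f}")
```

In this setup the pooled estimate overshoots because it mixes differences between people with effects within a person, while the per-person centering recovers a value near the planted slope. Mixed-effects models handle the same separation in a more principled way.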

To fix this, we would need to use something like the Generalized Least Squares (GLS) method, which allows us to relax the assumptions about autocorrelation. Better yet, we could use mixed-effects modeling, which allows us to combine population-level estimates and personal estimates. Actually, the very latest research in the field is focusing on combining mixed-effects models with Bayesian networks. I think I’ll leave that beefy part for you to read in my forthcoming scientific article.

The Art of the Problem Solving

How does all this relate to problem solving? What we really want to achieve is to understand the phenomenon behind the measurements in the data, for example, how the human body responds to different kinds of nutrition. Once we have successfully captured the essence of this process and formed a model, we can simply query this model of knowledge for answers to our problems. Now that would make even George Pólya proud! Of course, this inference can also be an (NP-)hard task in itself, and it deserves a closer look in another article.

I hope that this example of mine gives you some ideas of how to get started with a data intensive problem, even if you don’t have the actual data yet. Generating close-enough data for testing can help you gain confidence in your modeling implementations and new insight into the problem itself.

This should be elementary, even for Watson ™