Topological data analysis has been very successful in discovering information in many large and complex data sets. In this post, I would like to discuss the reasons why it is an effective methodology.
One of the key messages around topological data analysis is that data has shape and the shape matters. Although it may appear to be a new message, in fact it describes something very familiar.
The example above is a regression line, obtained by fitting a straight line to the data points using a natural measure of fit. A straight line is certainly a shape, and in the above example, we find that a straight line fits the given data quite well. That piece of information is extremely important for a number of reasons. One is that it gives us the qualitative information that the y-variable varies directly with the x-variable (i.e. that y increases as x increases). Another is that it permits us to predict with reasonable accuracy one of the variables if we know the value of the other variable. The idea is that the shape of a line is a useful organizing principle for the data set, which permits us to extract useful information from it.
Unfortunately, the data does not always cooperate and fit along a line. Consider, for example, the data set below.
It is easy to see that no straight line faithfully represents this data.
The reason is that this data set breaks into a set of three tightly concentrated clusters. One might not initially think of this as having anything to do with shape, but after a moment’s reflection, we realize that the most fundamental aspect of any shape is the number of connected pieces it breaks into. So, in this case, we see that the shape of this data set is of fundamental importance, and that its shape is not that of a line.
At this point, we might think that we could now proceed by assuming that any data set is well approximated by a line, a family of clusters, or perhaps a family of lines. Here is another data set that demonstrates that this is not the case.
Notice that this shape also does not fit along a line, and does not break into clusters, but rather has a “loopy” behavior. This kind of structure is often associated with periodic or recurrent behavior in the data set. Here is another example.
The shape is in this case that of a capital letter “Y”. This is another kind of shape, which also occurs frequently. Note that it has a central core and three “flares” extending from it. This might represent a situation where the core represents the most frequently occurring behaviors, and the tips of the flares represent the extreme behaviors in the data. It is clearly distinct from the three other shapes we have already discussed.
One might now say that a way to understand data would be to take each of these types, and attempt to fit a template for each to the data to determine which type one is in. This fitting process is what is done in linear regression, which is the first example above. The problem with this approach is that there are an infinite variety of different possible shapes, a large number of which occur in real data sets. All four that we have shown certainly do, but many others do as well, as demonstrated in the image below.
The immense variety possible among shapes suggests that we should not attempt to enumerate all the possible shapes, and create templates for each, but rather find a flexible way of representing all shapes.
That is one of the problems topological data analysis deals with.
To give an idea of why Topological Data Analysis often works better than other methods of displaying data sets, such as scatterplots based on principal component analysis or multidimensional scaling, it is useful to consider another example.
This is a data set concentrated along an ellipse, where one axis is much longer than the other. If one applies principal component analysis or multidimensional scaling to this data set, the distances in the horizontal direction may be too small to recognize, and the data set will appear to be a line segment rather than an ellipse. One aspect of TDA is the construction of network representations of data sets. To see how this works for this data set, consider the first principal component projection for this data set.
The projection is visualized as moving from left to right. What is then done in the Ayasdi Core software application is to bin the data using overlapping intervals in the projected space on the right. Each bin is then clustered to produce nodes for a network, and the edges are inserted whenever two of the clusters contain a data point in common. This is possible because the bins overlap. The effect of this on a single bin is pictured as below.
We will then obtain two clusters for each bin except at the top and bottom, and a single cluster for each of those, yielding a circular network. The reason this gives more sensitivity is that the clustering only depends on the relative distances within the bins, and not on its absolute magnitude. So, as long as the horizontal distances between points are significantly larger than the vertical interpoint distances, we will capture the two distinct clusters, even if the absolute magnitude of the horizontal distances is very small.
This procedure gives additional resolution as well as additional flexibility in the representation of the data.
That is the power of topological data analysis.
We’ll finish by giving two examples of complex networks that arise in actual data sets.
The network above is built on data concerning health care providers. The data includes features such as utilization, payment, as well as further rate variables such as units per claim, and many others. The networks exhibit complexity in that a Y-junction appears, as well as other separated components, some of which have the shape of a line, and others which have internal structure such as loops and smaller Y junctions.
One can then color the network by outcomes of interest, such as amount of confirmed fraud located within a given node, or by average values of various scores for fraud models used by payers. In this case, the network is colored by known fraudulent providers which are localized in certain parts of the network, which is very useful information for the development of new fraud models.
Here is a second network, from data gathered by a large retailer.
The network above is constructed from purchases made by the retailer’s customers. The nodes are colored by the percentage of customers who are returning customers. Information about returning customers was not used in constructing the network. In fact, the network is built using only the customer’s first purchase information. Moreover, there is clear localization of the purchase patterns associated to different countries, as is indicated by the circled regions on the left. This permits a priori prediction about the likelihood of a given customer becoming a return customer.
In both cases, we obtained complex networks that yield information that can be used to make fraud detection more sensitive, reduce false positives, and to predict customer behavior.
Hopefully this post enhances your understanding of why Topological Data Analysis is such a powerful tool for detecting patterns and uncovering insights from complex data. To understand more about any of the examples or to inquire about our work in other industries, please don’t hesitate to reach out to us at social[@]Ayasdi[dot]com.