One of my favorite things about Topological Data Analysis (TDA) is how malleable it is, because its methods are both general and precise.
These properties might sound incompatible, but perhaps I can explain it better describing how I used to approach data analysis problems before I started working on TDA.
Naturally the first step (then and now) was to try to understand the data and what is wanted from it.
Next I would reflect on the methods I had previously used and hope one of them would fit well enough to try. If that didn’t work, I would try to make a parameterized mathematical model of the data that seemed plausible and see how to reliably estimate the parameters from the data. I would then check the data against the model predictions and if they were okay, study the model until I could see how to formulate and solve the problem, which would give me the answer I wanted.
In a lot fewer words, if the solutions I used before wouldn’t work, I would start over from scratch.
It is certainly a general process, but it is very time-consuming.
By contrast, the general TDA method for a new problem begins with understanding the data enough to come up with a metric on it. As a reminder, when we say metric what we mean is a notion of similarity or distance between rows of data. There are multiple ways to measure (or define) such similarity – Hamming is one, Euclidean distance is another.
Determining a good metric is a more constrained task than making a mathematical model of the data. Further, with Ayasdi’s suite of existing statistical tools and visualization support, it is also fairly straightforward to get some idea if the metric is a good one.
I contend choosing a metric for a data set is a universally applicable process: for any data set and any concrete problem on that data, if there is an operational solution to the problem there is, at least implicitly, a metric embodied in that solution. That is what I mean which I say a metric is ‘general’.
At the same time a metric is quite specific: it is a function on pairs of data points satisfying certain properties, and if that function is implemented according to an API used in our software, you can add that metric and all of the code in our system written to apply to ‘metric spaces’ (which is almost all of the code) will just work. That is what I mean when I say a metric is ‘precise’.
If, instead, you used a panel of experts to decide on the similarity of pairs of data points, that would not be something you could use with our code – it would be ‘imprecise’ in the sense I am using the word it here, though you might argue that it would be a process which could be generally applied.
There is another sense in which a metric is precise and that is mathematical precision. This type of precision means the idea can be reasoned about, and sufficiently to create books full of theorems on metric spaces (which are themselves generalizations of a wide range of concepts from many branches of mathematics). And I contend that this sort of precision means we can also reason about the abstraction of ‘metric’ as easily as we might reason about Euclidean distance, and come up with solutions to problems in a specific metric space which apply to ALL metric spaces.
This an illustration of what I mean when I say TDA combines generality and precision.
Once you have defined a metric, the rest of TDA applies instantly. This has broad implications, for, as we noted a moment ago, there are many ways to measure similarity and each of those becomes a candidate to be consumed by TDA. Which approach you choose (or the software chooses for you, if you enable that) does not otherwise alter the process.
These characteristics of TDA are as much philosophical as they are about our specific software implementation. The fact that any metric will work with the methodology can be difficult to get one’s head around.
While it may be challenging at first, thinking about data sets as finite metric spaces is a very powerful way of framing problems. Once it becomes familiar, it can guide the search for solutions to problems that might otherwise be intractable.
This is key.
Teams that use TDA in their work see the “art of the possible” more broadly and can attack problems that might otherwise be “too hard” using traditional techniques.
To see how this works, I will describe the process we used in developing the underlying machinery for Ayasdi Care while illustrating some of the points using a much simpler, open source data set. Ayasdi Care has won awards and accolades for its ability to rapidly solve a problem considered to be very complex.
The problem was how to help medical professionals generate the ‘best practices’ among their colleagues for a range of surgical procedures.
Even a relatively routine surgical procedure, such as a total knee replacement, involves a complex series of interactions of many highly trained people with the patient over days or weeks.
Even abstracting away many of the details (e.g., when going in to administer an antibiotic the nurse also tells the patient a joke – this is recorded as ‘administer antibiotic X, dose Y’ ), the records for a procedure can list hundreds of ‘events’ which could occur in a staggering number of possible sequences.
After consultation with the data experts and experimentation, we devised a metric to describe “event sequences” that led to confirmatory findings as well as new insights.
The process of constructing a metric on these sequences is outside the scope of this blog post, and the confidentiality of the patient records involved would make it difficult or impossible to illustrate the points I wish to make here, so instead I will described those using data from the MNIST data set (see https://en.wikipedia.org/wiki/MNIST_database).
The particular version of the data I use here was 5,000 samples chosen at random from the full data set, and the images were encoded as 28 by 28 pixel gray-scale images. It is possible to get decent resolution on the data set simply by using the Euclidean metric on the 784 integer coordinates of each image. Here is a picture of the space (a scatter-plot of the values of the two neighborhood lenses) where individual points are colored by the numeral from 0 to 9 the point represents:
The association between colors and classes is arbitrary, but the labels in the middle of the concentrations for each class should make it clear which is which. Taking the average of the images in each class we get recognizable numerals:
Since it is the product of human activity, this data set has some interesting features despite its relative simplicity. If we look at the averages of the 1’s on the left, respectively right, side of the main body of 1’s, we see the following:
Another way to say this is that even though there is an overall consensus as to how a ‘1’ should be written in this sample, there are subgroups within these samples with their own ‘sub-consensus’. That variation can be seen in the average of all the ‘1’ images – the blurs at the top and bottom correspond to these two distinct slopes. (Perhaps the two versions correspond to right and left-handed people?)
For the other numerals the same kinds of variations of occur, although they are not as easy to characterize. Here are the averages across three vertical stripes in the large group of ‘7’ images:
and for two stripes across the 5’s:
Other numbers display overlap, making them difficult to discern. Notice that in the picture the ‘4’ and ‘9’ groups overlap a great deal.
This led me to expect that the samples had lots of 4’s with the loops almost closed and 9’s with them barely closed, which I thought was confirmed by this average of everything on the left side of the 4/9 blob.
However, after I split the samples by classes I suspected this was not so, and looking at a number of individual points let me to conclude that these pictures are more accurate representations of what is happening in that region:
The individual points are perfectly distinguishable – the problem is the Euclidean metric is too simple-minded.
There are, however, TDA methods for generating features, which indicate the presence of loops in noisy data and adding those features would disambiguate almost all of the points in the 4/9 blob that I examined.
This particular example illustrates an important point: when constructing consensus examples it is necessary to be careful – averaging isn’t necessarily the best approach even when it is feasible. For instance, if some clustering mechanism selected a circular or spherical group in a high-dimensional space, an averaging procedure on such a set could compute, as a consensus, a point which was ‘in the hole’ (and which might well be a point excluded by some external constraint).
Whether or not this is okay would depend on the application, but it is worth noting. It was precisely such a consideration which led us to build our consensus recommendations in Ayasdi Care by starting with a most typical sequence to which we added and removed events.
However, in the case of these images, averaging seems a perfectly good method, provided we choose good points to average. Consider the average of all the ‘0’ images again:
This is fairly fuzzy, and it’s easy to see why if we look at just the ‘0’ images:
There is fairly well-defined core, but there are quite a few outliers, such as this one:
What we would like to do is to eliminate the atypical points from the set.
The question becomes what is a TDA way to do so? Or, conversely, what is a TDA way to find the typical points in a set?
Here is a very simple one: define ecc(x,S) to be the sum over all s in S of distance(x,s). Intuitively, the closer x is to ‘the middle of S’, the more distances will be smaller. If we take s0 to be a point in S which minimizes ecc(x,S) for x in S, then we could think of that as a most typical point. In simple cases this corresponds to our intuition. For example, if S is a set of real numbers and distance(x,y) = |x-y|, then a median point of S will minimize ecc(x,S). (Note that this is far from the only possible such function.)
For this use we could simply take the image of a ‘most typical’ as the consensus,
but this example is a bit ‘rough’, and certainly in the more complicated case of Ayasdi Care, taking a consensus from one sample leaves out far too much information. What we can do here is take the 20% ‘most typical’ points, which we can call the ‘20% core’, and average those. Doing so gives us this image:
It is recognizably like our most typical point, but smoother, and we can have more faith in it since it was created from 101 samples. And as a further sanity check, we can look at the set of points we averaged in the picture of the space:
Any two-dimensional representation of a 784-dimensional space is bound to be imperfect, but it is reassuring to see that the core, which was computed using only the metric and without reference to the picture, lies in the middle of the biggest blob of points.
What we have done is describe a process for consensus construction for images of handwritten digits. But note that, except for the method for merging a core set into a consensus, the method applies to any finite metric space.
This is another example of the malleability of TDA – we have come up with an intuitively appealing, principled method for consensus construction that is quite general. And this is just what we used in Ayasdi Care, with a suitable method for finding a consensus in a core set.
One thing that we have not dealt with is how we might recover consistent variations – that is, how to detect when there is more than one consensus. It can also be done in a natural way, but we’ll leave that to a subsequent blog post.