Three parts dynamite, with a nitroglycerin cap. It's perfect for small homes, carports and tool sheds.
I am a co-founder of a small data science consultancy. For some time, I have wanted to do a series on the role of control and optimization in data science. This is the first post in that series. Here is an outline for this series:
The role control plays in data science is a bit like the role Uncle Fester plays in the Addams Family. It's kept in the back room and dragged out only when you want to generate some electricity.
Data Science is young, and so there are many views of what the pieces of the field are and how they fit together. Most of those views are a hodgepodge that lacks cohesion: a "bag of tools" approach for the most part. Among other issues, this makes it difficult to describe to the consumers of data science what we, as data scientists, simulation scientists, and computational scientists, do.
My team has spent considerable effort developing a more holistic picture of data science, one that clients and consumers not only understand and appreciate, but can also play an essential role in and contribute to. We are far from finished with this taxonomy, but we find it extremely useful, and thought this blog would be a good venue to offer that picture to readers.
In about 1980, the "Knowledge Pyramid", also known as the "Knowledge Hierarchy" or "DIKW Pyramid", began making its way around business and education. We show a version of it below:
The pyramid makes good sense to managers who see business as a refinement or ordering process. If they are manufacturers, they refine raw materials into useful materials. Military commanders have a similar refinement that goes from situation, to situational awareness, to situational understanding, to situational control. Even lawyers refine evidence into persuasive argument. This motif of increasing order in a system is quite universal, and we may be beginning to understand why. In his 2015 book, Why Information Grows: The Evolution of Order, from Atoms to Economies, MIT researcher César Hidalgo says:
"The evolution of information cuts across all boundaries, extending even to the information begotten by our economy and society. Information, when understood in its broad meaning as physical order, is what our economy produces. It is the only thing we produce, whether we are biological cells or manufacturing plants."
The distinction between data fusion, analytics, and control is important because:
The analysis-level UML class diagram that follows captures the interactions from the data science pyramid. Using standard UML symbology, the "entities" are drawn as a circle with a line beneath and the "processes" are drawn as a circle with an arrow.
The entities in the bottom tier of the figure (enterprise variables, enterprise questions, and particularly enterprise goals) are executive level: the enterprise needs to decide why it exists and what is important to it. Deciding what those goals are is the executive’s job and not the business of the data scientist. However, the decisions that support those goals are influenced by data science.
The middle, operational tier is the realm of the operations manager, whose job is to realize enterprise goals given what can be seen in the world. The operations manager uses the top tier, the information processing tier, to do that. There may be other things in the information processing tier, but for data science, the components are data fusion, analytics, and control. Since these components are our focus, they are colored orange in the diagram to distinguish them from the entities. The operations manager:
Estimates the values of variables of interest to the enterprise. The operations manager needs to make the stuff from the environment relevant, turning situation into situational awareness.
Answers enterprise questions. Situational awareness is not enough. The operations manager needs to elevate that awareness into situational understanding.
Makes decisions that produce outcomes furthering enterprise goals. Understanding the situation is good, but the payoff comes only when that understanding is used to make effective decisions.
The operations manager's job maps directly to the Knowledge Pyramid, so the data science tools (data fusion, analytics, and control) make that job tractable. Without this operational imperative, data science plays no role! Data science has value by delivering outcomes consistent with the executive objectives, and that is a control-theoretic role. Data fusion and analytics support that role. In some cases the control problem is easier than the data fusion and analytics problems, and in some cases it is harder, but it’s always a part of the data science.
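As a rough illustration of the three steps above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the machine-temperature scenario, the function names fuse, analyze, and decide, the inverse-variance weighting used for fusion, and the threshold rule used for control. It is one simple way to instantiate the pattern, not a prescribed method.

```python
def fuse(readings):
    """Data fusion: estimate a variable of interest from noisy readings.

    Each reading is a (value, variance) pair; the estimate is an
    inverse-variance weighted average (an assumed, simple fusion rule).
    """
    weights = [1.0 / variance for _, variance in readings]
    weighted_sum = sum(w * value for w, (value, _) in zip(weights, readings))
    return weighted_sum / sum(weights)


def analyze(estimate, threshold):
    """Analytics: answer an enterprise question ("is the machine too hot?")."""
    return estimate > threshold


def decide(too_hot):
    """Control: choose the action that furthers the enterprise goal."""
    return "reduce load" if too_hot else "continue"


# Hypothetical temperature readings (value, variance) from three sensors.
readings = [(71.0, 4.0), (75.0, 1.0), (73.0, 4.0)]

estimate = fuse(readings)                     # situation -> awareness
too_hot = analyze(estimate, threshold=72.0)   # awareness -> understanding
action = decide(too_hot)                      # understanding -> decision
print(estimate, too_hot, action)
```

The point of the sketch is only the shape of the pipeline: an estimate feeds an answer, and the answer feeds a decision tied to a goal.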
To reiterate, this is a data-centric view of the enterprise. For most enterprises there are other views, and the Knowledge Pyramids associated with those other views will be different. An R&D enterprise will position itself differently in its surroundings. Its inputs will be models: for instance, a new academic paper on how to use Bragg scattering in laser wave mixing to detect trace amounts of explosives at a safe distance. The model-driven enterprise would implement the model described in the paper and then simulate it at progressively higher levels of fidelity. Ultimately it would use the simulation to decide on a strategy for acting on positive signals (e.g. a roadside bomb). The model-driven discipline that does this is called simulation science, and the Knowledge Pyramid for simulation science is shown below:
There is still a component to "make it relevant", another to "make sense of it", and a third to "act on it", but data fusion has been replaced by descriptive modeling (or just modeling) and analytics by simulation, and the artifacts within the pyramid layers are slightly different, reflecting the domain the models represent. But there are two key points.
As for the latter, over many years of experience, we have found that data science and simulation science are almost inseparable, and real-world problems invariably require both. As a trivial example, if you have test data that you use to exercise your analytics code, you are employing simulation science to validate your data science. Mock objects in a test regime are simulation science entities.
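The trivial example above can be made concrete with a short Python sketch. The names simulate and fit_line, the linear model, and the chosen parameter values are all assumptions for illustration: a small simulation generates data from a known model, and the analytics code is then checked against the truth the simulation was built from.

```python
import random


def simulate(slope, intercept, n, noise_sd, seed=0):
    """Simulation: generate n points from a known line plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Analytics under test: ordinary least-squares fit of a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Because the simulation's true parameters are known, the fit can be validated.
xs, ys = simulate(slope=2.0, intercept=1.0, n=200, noise_sd=0.5)
slope_hat, intercept_hat = fit_line(xs, ys)
print(slope_hat, intercept_hat)
```

With real data the true slope and intercept are unknown, so there is nothing to check the fit against; the simulated data supplies that ground truth.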
We have, in the past, worked on some fairly large problems where this use of simulation science to support data science was indispensable. As an example, some years ago, near the end of the Cold War, we were engaged to find spectral discriminators that would allow us to distinguish ballistic missiles from space debris. We needed to build sensors, launch rockets, then collect and process telemetry data. This is typical data science, but rocket launches are extremely expensive and time-consuming. We had one shot and we needed to make sure that we collected the right data. We did a lot of simulation before designing our payload and sensors. The simulation gave us an appreciation of some spectral signatures that we otherwise wouldn't have configured sensors to detect on the real launch. One of our favorite diagrams, shown below, captures this kinship between data science and simulation science.
We have described this figure as the "Computational Science Knowledge Pyramid", emphasizing the multi-disciplinary, multi-dimensional nature of computational science. The figure shows even more clearly the unifying role control and optimization play.
Up to this point, we’ve been rather loose, talking about optimization and control as if they were the same thing. They are indeed very closely related. But the goal of this blog is to describe each and clarify some of the differences. The next post will focus on optimization, and the subsequent post will tackle control theory and describe the subtle differences between control problems and optimization problems. As a teaser, in the next post, we will show that data fusion, analytics, and control are really the same problem, though parameterized differently. Readers interested in learning more about the interplay between data science and simulation science can visit our web site, which has greater detail about the subject.