Control: The "Uncle Fester" of the Data Science Family (part 1--The Knowledge Pyramid)

Three parts dynamite, with a nitroglycerin cap. It's perfect for small homes, carports and tool sheds.

Fester Addams

I am a co-founder of a small data science consultancy. For some time, I have wanted to do a series on the role of control and optimization in data science. This is the first post in that series. Here is an outline for this series:

  • Part 1 (this part): Deals with general taxonomies of data science and simulation science, and how the two complement one another. It also identifies control as a key component of both data science and simulation science.
  • Part 2: Deals with a taxonomy of optimization. Identifies optimization and data fitting (e.g. regression analysis) as the same problem.
  • Part 3: Moves into the theory of control.
  • Part 4 (pending): Deals with difficult issues in control in the real world; things like selecting the best controls in the absence of important information, or in the presence of deceptions in the operating environment.

The role control plays in data science is a bit like the role Uncle Fester plays in the Addams Family. It's kept in the back room and dragged out only when you want to generate some electricity.

Data Science is young, and so there are many views of what the pieces of the field are and how they fit together. Most of those views are hodgepodge, lacking cohesion: a "bag of tools" approach for the most part. Amongst other issues, this makes it difficult to describe to the consumers of data science what we--as data scientists, simulation scientists, and computational scientists--do.

My team has spent considerable effort developing a more holistic picture of data science; one that clients and consumers not only understand and appreciate, but can also play an essential role in and make important contributions to. We are far from finished with this taxonomy, but we find it extremely useful, and thought this blog would be a good venue to offer that picture to readers.

In about 1980, the "Knowledge Pyramid", also known as the "Knowledge Hierarchy" or "DIKW Pyramid" began making its way around business and education.  We show a version of it below:

The pyramid makes good sense to managers who see business as a refinement or ordering process. If they are manufacturers, they refine raw materials into useful materials. Military commanders have a similar refinement that goes from situation, to situational awareness, to situational understanding, to situational control. Even lawyers refine evidence into persuasive argument. This motif of increasing order in a system is quite universal, and we may be beginning to understand why. In his 2015 book, Why Information Grows: The Evolution of Order, from Atoms to Economies, MIT researcher Cesar Hidalgo says:

"The evolution of information cuts across all boundaries, extending even to the information begotten by our economy and society. Information, when understood in its broad meaning as physical order, is what our economy produces. It is the only thing we produce, whether we are biological cells or manufacturing plants." 


So the Pyramid basically provides a road map for what an enterprise does, with some milestones in that ordering process. We gather stuff from our surroundings; we make it relevant; we make sense of it; we act on it. And if the enterprise is rational, those actions further some goal which is likely to be embodied in the enterprise's mission statement. Different enterprises distinguish themselves from one another by where they are in the surroundings (what data is available to them: what things they can gather from the environment) and the processes they use to refine that data. 

The things inside the pyramid are "entities" in the language of UML analysis modeling: facts, observations, rules, etc. These are artifacts of the refinement. But the pyramid itself gives no guidance on how to refine those artifacts and elevate them to the next level. Those procedures are how the enterprise creates value. In the UML analysis modeling language, those things are officially called "control classes" but the term "control" has a very specific meaning in this discussion so we will call these "processes" here, to avoid confusion.

Where your enterprise is positioned in the surroundings (what you gather from the environment) determines the nature of the refinement processes and the specific nature of the artifacts at each level of the pyramid. A data-driven enterprise will gather data from the environment, and that data will be limited to what is observable; the observables seldom exactly specify the values of variables of interest to the enterprise. Take a weather forecasting system as an example, simplified greatly for the sake of illustration. The enterprise wants a weather map that provides a temperature and wind speed for any location at any time. The observables are temperature and wind speed at specific times and places. The observables need to be "fused" into a model of the variables of enterprise interest. Sometimes the observables are closely related to the variables of enterprise interest (as in the weather example) and sometimes they are not. For Google's Flu Trends system, the observables were search queries, and these needed to be fused into influenza indicators. The fact that this worked at all shows how powerful data fusion can be.
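The weather example can be sketched concretely. One minimal data fusion step, under the simplifying assumption that temperature varies smoothly across the map, is to interpolate station readings to any query location. The `idw_estimate` function and the station data below are invented purely for illustration, not taken from any real forecasting system:

```python
import math

def idw_estimate(observations, query, power=2.0):
    """Estimate a value at `query` (x, y) from scattered observations
    [((x, y), value), ...] using inverse-distance weighting."""
    num, den = 0.0, 0.0
    for (x, y), value in observations:
        d = math.hypot(query[0] - x, query[1] - y)
        if d == 0.0:
            return value  # query coincides with an observation
        w = d ** -power
        num += w * value
        den += w
    return num / den

# Hypothetical temperature readings (deg C) at known station coordinates.
stations = [((0.0, 0.0), 10.0), ((10.0, 0.0), 14.0), ((0.0, 10.0), 12.0)]
print(idw_estimate(stations, (1.0, 1.0)))
```

The observables (point readings) become a model of the enterprise variable (temperature anywhere on the map); a production system would of course use far richer physics and statistics.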

We occasionally see puzzlement when we use the term "data fusion". Suffice it to say that it's not jargon that we invented. The term "data fusion" has been around for decades. There are technical books on the subject by scholars like Yaakov Bar-Shalom and James Llinas, conferences and proceedings, and government-funded programs specifically intended to further the practices of data fusion. There is a lot of formalism to data fusion, and data science benefits from that formalism. In fact, many people in the data fusion community believe that data fusion is data science. We think data science is more. We think that data science is the set of processes in purple at the left of the figure below. It starts, at the bottom, with observations of the environment and ends at the top by making observable changes to the environment.

At each layer, there are artifacts, and between layers are processes that transform the artifacts at one layer into the artifacts at the next layer up. This can be summarized as follows:

  • Data Fusion: Transforms observations of the world into models of variables,
  • Analytics: Uses those variables to answer questions,
  • Control: Uses those answers to produce effective actions.
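The three layers above can be sketched as a toy pipeline. Everything in it (the readings, the question "is it freezing?", and the goal of protecting the pipes) is a hypothetical example, not a fragment of any real system:

```python
def fuse(observations):
    """Data fusion: turn raw readings into a model of a variable
    (here, simply the mean temperature)."""
    return sum(observations) / len(observations)

def analyze(mean_temp):
    """Analytics: use the variable to answer an enterprise question:
    'Is it freezing?'"""
    return mean_temp <= 0.0

def control(is_freezing):
    """Control: turn the answer into an action that furthers an
    enterprise goal (keep the pipes from bursting)."""
    return "turn heating on" if is_freezing else "do nothing"

readings = [-1.2, 0.4, -0.8]  # observations of the world
action = control(analyze(fuse(readings)))
print(action)  # -> turn heating on
```

Note that value is only delivered at the last step: the fused variable and the answered question are intermediate artifacts until control converts them into an action.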

The distinction between data fusion, analytics, and control is important because:

  1. It helps us to state a problem clearly. If we are trying to produce an estimator then it's probably a data fusion problem because estimators are models of variables; if we are trying to make a decision it's probably a control problem.
  2. Once we've got a crisp definition of the problem, we can decide on techniques for solving it. Applying data fusion techniques to an analytics problem is destined to produce poor results, just as applying analytics techniques to a control problem is.
  3. It helps the client understand what they are dealing with and break a big problem into manageable parts.  

The analysis level UML class diagram which follows captures the interactions from the data science pyramid. Using standard UML symbology, the "entities" are drawn as a circle with a line beneath and the "processes" are drawn as a circle with an arrow.


The entities in the bottom tier of the figure (enterprise variables, enterprise questions, and particularly enterprise goals) are executive level: the enterprise needs to decide why it exists and what is important to it. Deciding what those goals are is the executive's job and not the business of the data scientist. However, the decisions that support those goals are influenced by data science.

The middle, operational tier is the realm of the operations manager, whose job is to realize enterprise goals given what can be seen in the world. The operations manager uses the top tier, the information processing tier, to do that. There may be other things in the information processing tier, but for data science, the components are data fusion, analytics, and control. Because these are processes rather than entities, they are colored orange in the diagram. The operations manager:


  • Estimates the values of variables of interest to the enterprise. The operations manager needs to make the stuff gathered from the environment relevant, turning situation into situational awareness.

  • Answers enterprise questions. Situational awareness is not enough. The operations manager needs to elevate that awareness into situational understanding.

  • Makes decisions that produce outcomes furthering enterprise goals. Understanding the situation is good, but the payoff comes only when that understanding is used to make effective decisions.

The operations manager's job maps directly to the Knowledge Pyramid, and the data science tools (data fusion, analytics, and control) make that job tractable. Without this operational imperative, data science plays no role! Data science delivers value by producing outcomes consistent with the executive objectives, and that is a control-theoretic role. Data fusion and analytics support that role. In some cases the control problem is easier than the data fusion and analytics problems, and in some cases it is harder, but it is always a part of the data science.

To reiterate, this is a data-centric view of the enterprise. For most enterprises there are other views, and the Knowledge Pyramids associated with those other views will be different. An R&D enterprise will position itself differently in its surroundings: its inputs will be models. Consider, for instance, a new academic paper on using Bragg scattering in laser wave mixing to detect trace amounts of explosives at a safe distance. The model-driven enterprise would implement the model described in the paper and then simulate it at progressively higher levels of fidelity. Ultimately it would use the simulation to decide on a strategy for acting on positive signals (e.g. a roadside bomb). The model-driven discipline that does this is called simulation science, and the Knowledge Pyramid for simulation science is shown below:

There is still a component to "make it relevant", another to "make sense of it", and a third to "act on it", but data fusion has been replaced by descriptive modeling (or just modeling), analytics by simulation, and the artifacts within the pyramid layers are slightly different, reflecting the domain the models represent. There are two key points:

  1. Control has not been replaced. It is key to delivering value in both data science and simulation science problems, and
  2. Data science and simulation science, while different in their parts, are similar in the way the parts go together.

As for the latter, over many years of experience, we have found that data science and simulation science are almost inseparable, and real world problems invariably require both. As a trivial example, if you have test data that you use to exercise your analytics code, you are employing simulation science to validate your data science. Mock objects in a test regime are simulation science entities.
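That idea can be made concrete in a few lines. Below, simulated data generated from known "ground truth" parameters exercises a least-squares routine standing in for the analytics code under test; the model, parameters, and noise level are all arbitrary choices for illustration:

```python
import random

def fit_line(xs, ys):
    """Analytics code under test: ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Simulation science as a test harness: synthesize data from known
# parameters, then check that the analytics recovers them.
random.seed(0)
true_a, true_b = 2.0, -1.0
xs = [i / 10.0 for i in range(100)]
ys = [true_a * x + true_b + random.gauss(0, 0.05) for x in xs]

a, b = fit_line(xs, ys)
assert abs(a - true_a) < 0.1 and abs(b - true_b) < 0.1
```

The simulated data plays exactly the role of a mock object: a stand-in for the world with behavior we control, so the data science can be validated before it ever touches real observations.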

We have, in the past, worked on some fairly large problems where this use of simulation science to support data science was indispensable. As an example, some years ago, near the end of the Cold War, we were engaged to find spectral discriminators that would allow us to distinguish ballistic missiles from space debris. We needed to build sensors, launch rockets, then collect and process telemetry data. This is typical data science, but rocket launches are extremely expensive and time consuming. We had one shot, and we needed to make sure that we collected the right data. We did a lot of simulation before designing our payload and sensors. The simulation gave us an appreciation of spectral signatures that we would otherwise not have configured our sensors to detect on the real launch. One of our favorite diagrams, shown below, captures this kinship between data science and simulation science.

We have described this figure as the "Computational Science Knowledge Pyramid", emphasizing the multi-disciplinary, multi-dimensional nature of computational science. The figure shows even more clearly the unifying role control and optimization play. 

Up to this point, we’ve been rather loose, talking about optimization and control as if they were the same thing. They are indeed very closely related. But the goal of this blog is to describe each and clarify some of the differences. The next post will focus on optimization, and the subsequent post will tackle control theory and describe the subtle differences between control problems and optimization problems. As a teaser, in the next post, we will show that data fusion, analytics, and control are really the same problem, though parameterized differently. Readers interested in learning more about the interplay between data science and simulation science can visit our web site, which has greater detail about the subject.



