Astrophysicist and data scientist Kirk Borne, Ph.D., was among the first to comprehend the importance of vast increases in data. He was a NASA scientist for almost two decades and is now a professor of Astrophysics and Computational Science at George Mason University. He’s among the top “influencers” on matters relating to “big data,” and IBM this year named him a Big Data and Analytics Hero as part of a recognition program that acknowledges top thought leaders. In this blog, which will appear weekly over the next three weeks, he speaks with Pelin Thorogood, Anametrix CEO, on a range of topics relating to the uses and abuses of big data and analytics in science, government and business.
Pelin: Tell us about your career, in particular your move from astrophysics to data science. How did that take place? Give us some insight.
Kirk: That’s a long story, but let me give you a short answer. During the first 18 years of my career, I worked on various NASA projects as an astrophysicist. I was seeing more and more growth in the volume of data we were taking in, handling, and serving to the NASA scientific user community. About 15 years ago, I had an epiphany, so to speak. A colleague of mine had been working on a large astronomy project at a Department of Energy lab. While there was no national security issue associated with the data, the lab couldn’t serve the data publicly from that facility, and so my colleague asked me if we could serve the data through the astrophysics data facility at NASA. We were talking about just two terabytes of data, which isn’t much today. I have two terabytes on my laptop. But in 1998, that was quite a lot of data. NASA managers told me that the agency had done about 15,000 space science experiments to that date, and data from all of those experiments combined amounted to less than two terabytes. They couldn’t imagine taking in just one experiment’s data that would double the resource demands on the data facility. That was very revealing. I realized at that moment that the data volumes and what we could do with them were drastically changing. At the same time, as part of my discussion with my Department of Energy colleague, I heard about the concept of data mining and realized it could be important to the future of my own and others’ scientific research: doing discovery from NASA data collections, finding the patterns, trends, correlations, surprising things in the data stack. At that point, I launched my career into data science. My later years at NASA were a combination of astrophysics and data science, before I moved to George Mason University in 2003.
Pelin: Since you are teaching now, what changes have you seen in interest in data science careers?
Kirk: One of my goals when I came to the university was to help create an undergraduate degree program in data science. George Mason University has had a Ph.D. graduate program in what we call Computational Science and Informatics since 1991. It’s clear that the people who created that program were 20 years ahead of their time. It’s been very successful, having now graduated more than 200 Ph.D.s. I worked with my colleagues at George Mason to develop a plan and curriculum for undergraduate data science, a Bachelor’s degree program. And in 2007 we accepted our first students, about five years before awareness of big data went mainstream. In March 2012, for example, the White House National Big Data initiative was announced. And it was in October of that year that the Harvard Business Review called data science “the sexiest job of the 21st Century.” Now many more students have discovered data science. In our Ph.D. program we’ve seen a significant shift in the interests of our graduate applicants toward data science.
Pelin: Let’s move to your TED talk, where you talk about what we can learn from data in three categories: “Known Knowns,” “Known Unknowns,” and “Unknown Unknowns.” Of these, which will lead to the biggest discoveries and why?
Kirk: This concept came to me when I was doing that initial data mining research at NASA in 1998 and 1999. I realized much of what we could discover in data was already known to the scientific research team offering it. I called this category the known “knowns”: we confirm what we already know inside the data. The second category is called the known “unknowns.” You expect certain behaviors, for example, but no one has found them because the signal was too weak or it had not been explored in enough depth. But the biggest potential for discovery by far lies in the things that you never expected to find in the data. These are the unknown “unknowns.” I wrote a couple of articles on the unknown “unknowns” and put together a website at NASA on scientific data mining resources. That gave me a bit of a reputation for being a data mining guy at NASA, so much so that I got a call from the Executive Office of the President after 9/11 to talk about data mining. I didn’t end up briefing the president, but did spend time with his staff. And these three concepts percolated over to the Secretary of Defense, where the notion of unknown “unknowns” was applied to unforeseen terrorist threats. In science, of course, the big discoveries are those you could never imagine until the data reveal them.
Pelin: Thank you, Dr. Borne. In Part II of our blog next week, we’ll learn more about the top areas where business will benefit from big data in the next few years. So stay tuned for more from this astrophysicist and data scientist.