What Does A Data Scientist Really Do?

The world of big data and data science can often seem complex or even arcane from the outside looking in. In business, a lot of people by now probably understand the basics of what Big Data analysis involves – collecting the ever-growing amount of data we are generating, and using it to come up with meaningful insights. But what does this actually involve on a day-to-day level for the professionals who get their hands dirty with the nuts and bolts?

To have a look under the hood of a job that some describe as the ‘Sexiest Job Of The 21st Century’, I spoke to leading data scientist Dr Steve Hanks to get an overview of what the work of a data scientist actually involves, and what sort of person is likely to be successful in the field.

Dr Hanks gained a PhD in computer science at Yale University, has spent 15 years as a professor of computer science and has worked at companies including Amazon, Yahoo! and Microsoft. Today he is chief data scientist at Whitepages.com where he is responsible for overseeing the Contact Graph – a database containing contact information for over 200 million people. The database is searched around two billion times every month and is the company’s primary business asset.

This database has driven Whitepages’ business since it was launched in 1997, and more recently the company has diversified into app development. Caller ID, its replacement mobile user interface, queries the main Whitepages database to give more complete information on who is calling, and to help cut nuisance and spam calls. The company also generates another revenue stream by providing its data to other companies to use in fraud prevention.

Key capabilities of a data scientist

The term “data scientist” can cover many roles across many industries and organizations, from academia to finance or government. Hanks leads a team of 12 to 15 members responsible for all of the analytics at Whitepages, and their skillsets and duties vary. However, he tells me, there are three key capabilities which every data scientist has to have.

1. You have to understand that data has meaning

Hanks makes the point that we often overlook the fact that data means something, and that it is important to understand that meaning. We have to look beyond the numbers and understand what they stand for if we are to gain any valid insights from them. Hanks points out: “It doesn’t have anything to do with algorithms or engineering or anything like that. Understanding data is really an art, and it’s really important.”

2. You have to understand the problem that you need to solve, and how the data relates to that

Here is where you open your tool-kit to find the right analytics approaches and algorithms to work with your data. Hanks talks about machine learning, which is very popular right now, but makes the point that there are hundreds of techniques for using data to solve problems – operations research, decision theory, game theory, control theory – which have all been around for a very long time. Hanks says: “Once you understand the data and you understand the problem you’re trying to solve, that’s when you can match the algorithm and get a meaningful solution.”

3. You have to understand the engineering

The third capability is about understanding and delivering the infrastructure required to perform any analysis. In Hanks’ words: “It doesn’t do any good to solve the problem if you don’t have the infrastructure in place to deliver the solution effectively, accurately and at the right time and place.”

Being a good data scientist is really about paying attention to all three of those capabilities. You have to pay attention to the data and what it means, understand the problems and know about matching algorithms to those problems, and you have to understand the engineering to come up with solutions.

At the same time, this doesn’t mean there’s no room for specialization. Hanks makes the point that it is virtually impossible to be an expert in all three of those areas, not to mention all the sub-divisions of each of them. It is okay to specialize in one of these areas as long as you have an appreciation of all of them. Hanks tells me: “Even if you’re primarily an algorithm person or primarily an engineer, if you don’t understand the problem you’re solving and what your data is, you’re going to make bad decisions.”

Key qualities of a data scientist

In terms of personal qualities, a curiosity about data is essential, as are communication skills, says Hanks. “People on my team spend a lot of time talking to customers to figure out what problems they need to solve, or talking to data vendors to find out what they can provide. So you become a middle man and communication is very important.”

Lots of different types of people go into data science, and Hanks explained to me that he was probably not a very typical example. However, in my experience there is no such thing as a typical data scientist. The key capabilities Hanks mentioned cover a broad range of skills, and people of different personality types and mindsets are attracted to the profession.

“I just really loved the interplay,” Hanks says. “From the beginning I was just totally fascinated. My first exposure to data science was probably in operations research, and I just loved the idea that you could take big data sets and use them to learn things, and improve things, and I found out that you really could use them to make a difference. I’ve found that fascinating for over 30 years now.”

Even after all that time in the business, though, problems still come up which have him scratching his head, and these serve as a great example of the sort of challenges data scientists find themselves struggling with on a day-to-day basis.

“Just this morning I was working on something and one of the algorithms just wasn’t doing what it was supposed to do – basically it was showing us a link between a particular person and a particular phone number which we just knew was incorrect. These problems can be very intermittent and very hard to diagnose.

“We have very specific algorithms that are supposed to do very specific things, and when they don’t we just have to take them apart and find out why not. The problem is these days they are very complex and have a lot of working pieces! I can be completely mystified, like I am right now … but we will get there – we always do! That’s really the sort of challenge we face day to day – systems which just don’t behave the way they are supposed to according to our schematics.”

In the time that Hanks has been working with data he has seen huge changes in the field, from working on structured databases on mainframes, to distributed Hadoop clusters, to the cloud-based, real-time data processing world of today. So where does he see the future taking analytics and Big Data?

The future of data science

Hanks sees a future of increased data streaming and real-time data processing, as opposed to huge batch processing of data. He believes that in this new world Hadoop MapReduce is less appropriate, and in his work he is starting to use other technologies such as Scala and Akka.
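
To make the batch versus streaming distinction concrete, here is a minimal Python sketch. It is not taken from Whitepages' systems, and the function names and sample numbers are hypothetical, chosen only for illustration. The batch version waits until all the data is available and computes an average once; the streaming version updates the average as each record arrives, without keeping the history.

```python
from typing import Iterable, Iterator


def batch_mean(values: Iterable[float]) -> float:
    """Batch style: collect the full data set, then compute once."""
    collected = list(values)
    return sum(collected) / len(collected)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated mean after every new record."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Incremental update: only the running mean and count are stored.
        mean += (value - mean) / count
        yield mean


if __name__ == "__main__":
    # Hypothetical per-minute lookup counts, made up for this example.
    lookups_per_minute = [120.0, 80.0, 100.0, 140.0]
    print(batch_mean(lookups_per_minute))
    print(list(streaming_mean(lookups_per_minute)))
```

Both approaches arrive at the same final figure, but the streaming version has a usable answer after every record, which is roughly the property that matters when results are needed in real time.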

One of the biggest challenges Hanks sees is keeping up with the fast development of new technologies and new algorithms. He believes that in order to be an effective data scientist you have to be holistic. It is relatively easy, he says, to become a specialist in MapReduce or a particular machine learning algorithm, but the challenge is keeping up with the general speed of development in data science. “It’s a field that is just stunningly big and complex, and has incredible breadth and depth,” Hanks tells me. “You have to understand all of the pieces but the field is getting so vast – that’s going to be the challenge facing data scientists going into the future.”

To read other Bernard Marr articles, click here.