Having thoroughly enjoyed the debate around Bernard Marr's post, Why so many fake data scientists?, I realized that "data scientist" is not the only problematic term in our industry. Many of the most common data-related terms and concepts are also ambiguous or poorly defined.
Here are some of the terms that cause me frustration.
Why is data science the only field whose practitioners are called "scientists"?
In every other field, a "scientist" is a researcher, often in an academic setting. To take an example from a related field, a computer scientist is someone who works on the theoretical aspects of programming and computation. The practitioners who apply computer science to design algorithms and build software are called engineers.
By analogy, shouldn't data scientists actually be called data engineers? And that leads to the next question.
Why is someone who designs data infrastructure called a data engineer?
An engineer who designs physical infrastructure such as bridges and roads is not called a traffic engineer. He or she is a civil engineer.
Why is the infrastructure that controls the flow of data any different? By the same logic, shouldn't a data engineer really be called a data systems engineer, or something similar?
What is the definition of big data?
Does it mean:
Here are two businesses that should know the definition better than anyone, and even they do not agree.
"Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently."
"Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured."
"Big data" is far too commonly used to be so poorly defined.
What does data mining mean?
Since you're reading a post on Data Science Central, chances are you know what data mining is. It involves the extraction of knowledge/insight from raw data. However, when I see someone use the term online, what they usually mean is "extracting raw data from the web."
There is no question which definition is technically right. But in practical terms, unless you are speaking with someone well versed in the subject, using the term "data mining" is likely to lead to confusion.
If the purpose of data visualization is to make data science more accessible, why is the term so technical-sounding?
The results of data science are only as valuable as what the practitioner is able to communicate, which is why data visualization is such an important piece. It has the ability to take complex concepts and present them in a way that is accessible to anyone.
However, the term "data visualization" itself is about as inviting and digestible as a computer science textbook. I like the idea of making data accessible to people who are scared of data, and I run a website dedicated to that purpose. But when I describe it as "data visualization," people's eyes glaze over.
If the entire purpose of data visualization is to remove technical barriers from communication, shouldn't it have a less technical-sounding name?
Can a room full of data scientists agree on what data is (or is it "data are")?
In this case, IBM is referring to the entire digital universe, which would include all the cat videos ever uploaded to YouTube. When discussing the importance of business intelligence, would anyone feel comfortable pointing to that and calling it data?
I would love to hear other opinions on this topic. Do you find that data terminology gets in the way of communication?
I am a [questionably "fake"?] data scientist, financial/insurance modeler, software engineer, and entrepreneur. I also write about data and data visualization at Metrocosm and as a contributor for the Huffington Post.
Connect with me on Twitter at @galka_max