Interesting to read what some statisticians write about data science, on the American Statistical Association (ASA) blog. Most of us don't care about our job title - there are so many breeds of statisticians and data scientists after all - and they do overlap to some extent. While I was once a statistician, I now call myself data scientist or business scientist. Anyway, below are some extracts from very lively and interesting discussions taking place on the ASA blog.
Tommy Jones posted The Identity of Statistics in Data Science on the American Statistical Association (ASA) website in December 2015. In his long and very interesting article, he wrote (this is just a tiny extract):
Judging by current statistics curricula, statistics is more closely tied to the mathematics of probability than to fundamentals of data management.[...] As models have become more accurate, they have also become more complex.
Dogling Yan commented:
In that data analyst job, I barely used any statistical models because people don’t really care about p-values. Also, with the size of current datasets, p-values are always very small. The models and analysis methods that most people learned at school are not very useful, since simple models and more valid, complex models tend to give the same conclusion when the sample size is large.
My comment:
As a data scientist, I work on making models (actually, the absence of models - data-driven systems instead) simpler, not more sophisticated, and fit for black-box processing of big data in production mode. That is, robustness is more important than 100% accuracy, especially if your data is 70% accurate. I also work on designing a new statistical framework that is free of mathematics, traditional probability theory, random variables, and so on - so that anyone who knows Excel can learn it, even to compute confidence intervals or more elaborate forecasting systems. It will be published in my upcoming book, Data Science 2.0.
Jennifer Lewis Priestley also posted on ASA, in January 2016: Data Science: The Evolution or the Extinction of Statistics?
In this article, she wrote:
While data scientists can do a great many things I can’t do—mainly in the areas of coding, API development, web scraping, and machine learning—they would be hard pressed to compete with a PhD student in statistics in supervised modeling techniques or variable reduction methods.
My comment:
Read my article about a fast, efficient, combinatorial algorithm for feature selection that uses predictive power to jointly select variables. It is the data science approach to variable reduction and variable generation. Likewise, supervised modeling - which also belongs to machine learning - is not foreign to data scientists. Read about my automated indexation/tagging algorithm, used for taxonomy creation/maintenance or cataloguing: it performs clustering of n data points in O(n), and can cluster billions of web pages in very little time. It is also used to turn unstructured data into structured data.
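The article does not spell out how clustering in O(n) is achieved. One standard way to cluster n points in a single pass is hash- or grid-based assignment: each point is mapped to a grid cell in constant time, and points sharing a cell form a cluster. The sketch below is a generic illustration of that idea under this assumption, not the author's actual algorithm; `grid_cluster` and its cell size are hypothetical names invented for the example.

```python
from collections import defaultdict

def grid_cluster(points, cell_size):
    """Single-pass, O(n) clustering: each 2-D point is hashed to a grid
    cell, and points sharing a cell form a cluster. Generic sketch only -
    not the proprietary algorithm mentioned in the article."""
    clusters = defaultdict(list)
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))  # O(1) per point
        clusters[key].append((x, y))
    return dict(clusters)

# Two tight groups of points, far apart:
points = [(0.1, 0.2), (0.3, 0.1), (5.2, 5.1), (5.4, 5.3)]
clusters = grid_cluster(points, cell_size=1.0)
# yields two clusters, one per occupied grid cell
```

The trade-off is that cluster shape is constrained by the grid, which is exactly the kind of robustness-over-precision compromise the surrounding text advocates.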
And my reply to someone (Peter) who commented on LinkedIn, saying that "the feature selection method mentioned in the blog is still a heuristic method i.e. no guarantee to find the optimal subset of variables."
Peter, data scientists are usually interested in local optima, which are easy to detect and provide almost the same yield as the global optimum. The global optimum has two drawbacks: (1) it could be an unstable optimum, and (2) it might take far more time to compute if the data set is immense.
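The trade-off described here - accepting a cheap local optimum instead of an expensive, possibly unstable global one - is what greedy forward feature selection does. The sketch below is a generic illustration of that heuristic, not the combinatorial algorithm mentioned in the article; `greedy_select` and `r_squared` are names invented for the example.

```python
import numpy as np

def r_squared(X, y):
    # Joint predictive power of the selected features: R^2 of a least-squares fit.
    X1 = np.column_stack([X, np.ones(len(y))])  # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def greedy_select(X, y, max_features):
    """Greedy forward selection: at each step, add the feature that most
    improves joint R^2. Stops at a local optimum - no guarantee the chosen
    subset is globally optimal, which is exactly the trade-off above."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        best = max(remaining, key=lambda j: r_squared(X[:, selected + [j]], y))
        gain = r_squared(X[:, selected + [best]], y) - (
            r_squared(X[:, selected], y) if selected else 0.0)
        if gain <= 1e-9:          # no improvement: local optimum reached
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data where only features 1 and 3 matter:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 1] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=200)
selected = greedy_select(X, y, max_features=2)
```

Each step costs one regression per candidate feature, so the whole search is polynomial, whereas the globally optimal subset requires examining exponentially many combinations.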
About the author: Vincent Granville has worked for Visa, eBay, Microsoft, Wells Fargo, NBC, a few startups, and various other organizations, optimizing business problems, boosting ROI, and developing ROI attribution models, along with new techniques and systems to leverage modern big data and deliver added value. Vincent owns several patents, has published in top scientific journals, raised VC funding, and founded a few startups. The most recent one - Data Science Central - is growing exponentially and delivers a substantial profit margin. Vincent also manages his own self-funded research lab, focused on simplifying, unifying, modernizing, automating, scaling, and dramatically optimizing statistical techniques. His focus is on producing robust, automatable tools, APIs, and algorithms that can be used and understood by the layman, yet adapted to modern big, fast-flowing, unstructured data. Vincent is a postgraduate of Cambridge University.
To process data correctly, a data scientist needs to know how to compensate, in later steps, for the errors left by earlier steps. Each step does not have to be perfect, but the final result has to meet the requirements. The process has to be convergent.
Jennifer Priestly said: "While data scientists can do a great many things I can’t do—mainly in the areas of coding, API development, web scraping, and machine learning—they would be hard pressed to compete with a PhD student in statistics in supervised modeling techniques or variable reduction methods".
I assume that comment was made in ignorance of stochastic gradient boosting (as implemented in TreeNet or GBM), or Bayesian networks.
Tom.
I started as an applied mathematician and called myself a Mathematical Modeller, to distinguish myself from a Statistician. Later I also became a computer science and machine learning academic.
My view, which hasn't changed much over the years, is that many statisticians follow recipes and organise data to suit the recipes they know (sometimes referred to as experimental design).
A mathematical modeller works outside the box, inside the box, in the world, and in the constraints of mathematics (including probability theory). Then add a dash of modern machine learning. This is closer to what a Data Scientist does (leaving the Hadoop/DWH management to Data Engineers, +/-).
Data Science uses a scientific approach on several levels:
Tom.
Data Science comprises many recognized fields of study - mathematics, statistics, computer science, etc. It is like a newborn baby: his or her parents can decide to give it any name, and the given name does not in any way change the baby's genetic make-up.
The major contention statisticians have is that while their traditional methods have been heavily borrowed by data science, it is the so-called data scientist who claims to have the greater impact, or, put another way, who predicts the death of the almighty statistician. Time will tell!
I do not see a lot that is NEW in the "NEW Data Science"; it is just an opportunity to solve problems by combining different standardized methods and approaches borrowed from many other well-established fields, of which statistical science is the one most heavily affected.
The springboard for the development and popularity of Data Science is the advance in computing power. For example, neural networks predate data science by decades, but their applications were limited. Computing power has allowed us to apply existing mathematical/statistical methods to large/big data sets.
The greatest deception I see is the common naming of "algorithms" instead of methods or approaches. An algorithm is the set of computing commands for solving a particular problem; we solve the problem by applying a method, such as logistic regression, to an appropriate data set. So writing a logistic regression algorithm does not in any way change the assumptions underlying the logistic method/approach to solving problems. For example, you cannot use logistic regression with a continuous dependent variable; that will not work whether you call it an algorithm, an approach, or a method!
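The point about assumptions can be made concrete: logistic regression models a categorical (here binary) response, and renaming the method an "algorithm" changes nothing. Below is a minimal, generic gradient-descent sketch (not taken from any library or article mentioned here; `fit_logistic` is a name invented for the example) that enforces the binary-response assumption explicitly.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimal logistic regression via gradient descent on the log-loss.
    The method's core assumption is checked up front: y must be binary
    (0/1); a continuous response violates the model itself."""
    if not set(np.unique(y)) <= {0, 1}:
        raise ValueError("logistic regression needs a binary response")
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
        w -= lr * X.T @ (p - y) / len(y)          # gradient of log-loss in w
        b -= lr * np.mean(p - y)                  # gradient in the intercept
    return w, b

# Binary response: the model fits.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
# A continuous response, e.g. y = [0.1, 0.5, 1.2, 2.3], raises ValueError.
```

Whether the same check lives in a textbook "method" or a production "algorithm", the underlying assumption is identical, which is exactly the commenter's point.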
The field of Data Science is a great addition to the development of science and knowledge discovery, but practitioners must recognize and acknowledge the fields it has borrowed from, with due respect.
Sione, the news you mention about gravitational waves is quite pertinent.
If one reads the news in a specialised source (Physical Review Letters, if I recall), you will find that the researchers applied the model I mentioned below (from Terry Speed's talk):
question + data -> model -> answer + uncertainty.
However, closer scrutiny reveals that they actually used two models. In Speed's terminology, the "final" model was
question + data -> physical_model + random_model -> answer + measure of uncertainty.
This is the one they said "confirmed" gravitational waves.
But prior to this one, they used the following:
question + data -> random_model -> measure of uncertainty.
That is, they looked at the data without any model in particular, just at its own intrinsic randomness. Which area of mathematics deals with randomness alone? Statistics, as I said!
Yes, they used pure statistics, nothing else, to claim that "there was something happening there" that deserved attention before using the physical model. The physical model, if valid, had to reduce the level of uncertainty relative to a pure random model. And it did.
This further clarifies the relevance of Statistics (within the broad area of mathematics) for data science.
Very relevant piece of news indeed!
Quote: "Empiricism is a part of philosophy based on induction, not deduction."
So, what? The unprecedented success of today's modern physics is largely deduction.
Quote : "Where is the Scientific Method in Maths? "
What an ignorant question.
See my quote above. Physics' success in the last 100 years was mainly deduction. Isn't that scientific?
Einstein developed his general theory of relativity in 1916, and gravitational waves pop out as a prediction of that theory. Guess what: gravitational waves have just been experimentally observed. So Einstein foresaw the existence of gravitational waves in 1916, not using big data, but by mathematical deduction in his general theory of relativity.
"Gravitational waves have been detected for the first time"
http://www.economist.com/news/science-and-technology/21692851-gravi...
Guess what: Einstein used tensor calculus in the development of his general theory of relativity. He brought tensor maths to the attention of the mathematics and physics communities, and to date it has also captured the attention of other communities in statistics, information theory, machine learning, and bioinformatics.
"A tensor higher-order singular value decomposition for integrative analysis of DNA microarray data from different studies"
http://www.ncbi.nlm.nih.gov/pubmed/18003902
So, I guess the use of tensor maths in bioinformatics is not scientific, is it?
> All right Carlos, I agree. We simply have different perspectives
Meaning "I agree but I don't" ?
> For you, mathematical axioms that are true by themselves are everything.
Sad that you put in my mouth things I have never said.
> "You are a ..."
You don't know who I am. I don't know either :)
> ... in some special circumstances mathematical assumptions fail
You don't say that a hammer failed with a screw. You say that it wasn't made for this.
> We humans invented mathematics.
I am aware of that :)
> ... A more realistic way of thinking indicates that thanks to the structure of matter, mathematics is as it is.
I disagree, please read next.
> Even though, only one general set of axioms exists and may be used to explain this universe, because it is the universe's structure what makes 1+1 be 2
There are many sets of axioms. The "explanation of the universe" is a human activity, as is maths. "1 + 1 = 2" holds in a particular set of axioms, with certain definitions of what 1, +, = and 2 are.
This particular set has been useful for that activity, the "explanation of the universe".
> ... in some physical subatomic and quantic situations, unexpected things occur, not predicted by maths.
Again, maths is just a tool. You should say "not predicted by physics", as it would be said in any decent physics, chemistry, or biology text worth its ink.
> ... be plagued with "add-ons" to continue being a good
I am sensing strong pejorative terms here. Short clarification: mathematics has been growing thanks to the problems faced by the sciences that have used them.
But there is another way: the what-if of mathematicians. Many of the what-ifs have been proved useful later - which makes mathematicians proud. But there are many that are not. Are those what-ifs the "add-ons" you are not happy with? Please let history be the judge.
> the origin of the word Bit - it is really interesting
Yes, indeed! Invented by a statistician, Tukey, not Shannon. See
https://en.wikipedia.org/wiki/Bit, and this
http://www.nytimes.com/2000/07/28/us/john-tukey-85-statistician-coi...
> ... you lose precisely all the interesting information. To be a true data scientist you cannot fall into this error ...
Your loss is a choice; it is called abstraction: you "abstract" (take away) what you are interested in, and the rest is discarded. If you think you are losing "interesting information", that is fine: change your abstraction; it is not maths' fault. And I am pretty sure maths can help there too; after all, it is a toolbox (i.e. many tools inside).
> you had to invent special maths ... to reach this knowledge.
Again, maths leads to knowledge because it is a tool. By the way, I don't have to invent them; statistical methods to deal with data in unstructured forms already exist - and are being developed as we speak.
- - -
From your last post
> Maths are based on deduction and not induction.
Again, maths is a tool. Deduction and induction are part of the scientific method. If you want to learn about the foundations of maths (which I believe is what you are referring to), I kindly suggest reading about the Hilbert program, Russell and Whitehead, and then the conclusions of Gödel and Turing.
> you can do Data Science without maths
I would love to see that!
> using semantics, linguistics, semiotics and other products of our brain that are not necessarily related to maths
Well, Carlos, all sciences more or less have used maths.
Semantics and Semiotics: I consider them part of social sciences, even philosophy. That is a longer discussion.
> In fact, computer algorithms are not maths ... can be programmed against mathematical laws....
Yes, there is the study of algorithms (a word that originated to honour a mathematician, by the way) from a mathematical perspective. It started with the foundational work I mentioned above. If your "antiMathSum" is useful in your program, by all means, use it! (On this, I believe you are confusing levels of discussion here, but well...)
> ... using only data analysis) you will stop in the step 1 of Data Science
I think all I said about abstraction applies in this part of your argument. I won't repeat myself.
However, you have to listen to a statistician. Try this:
http://www.chalmers.se/en/areas-of-advance/ict/calendar/Pages/Terry...
Let me summarise what I find most interesting in his talk. In short:
questions + data -> models -> answer + (measure of) uncertainty
Who brings the questions and the data? Scientists, business people, anyone.
Models? Yep, mathematical models, built in collaboration (!) and nowadays including randomness.
Answer? It becomes an answer if it satisfies whoever posed the problem.
Uncertainty? Statistics is the part of maths that took that as the challenge.
(I could elaborate more on this model of Speed's but this is too long already)
> A mathematician is only expert in maths, sadly. I hope you find courage to start this trip!!!
I have, many times; I have collaborated on all sorts of things - a lot more modestly than Tukey, I am pretty sure, but still.
"The best thing about being a statistician,'' Mr. Tukey once told a colleague, ''is that you get to play in everyone's backyard.'' (from the post I linked above)
- - -
Carlos, I sense a lot of negativity (if you allow me the word) in Mr. Granville's post as a whole and in some of your words. I know it is not fair to extract a few sentences here and there from what you said, but I tried to take the important points to clarify my view. Your reply was quite long.
Is Mr Granville's rant justified?
I don't know. Just let me tell you that, yes, mathematicians (and statisticians) can be obnoxious and too proud at times; after all, it takes 2000+ years of development to master even a tiny bit of the field, and you have to acknowledge their hard work.
But any mathematician or statistician worth their title should be humble before the challenges brought to them by science and now by business ("big data").
If they aren't, kindly remind them of Gödel's incompleteness result :) Although, bear in mind, there are a lot of people looking for appreciation, or wanting power, or something - think of it as a little bit of insecurity then; we are all human after all.
But please, do not apply that to everybody!! History has shown many times that this kind of "abstractions" are dangerous, truly.
Sorry for flooding! I think I can reduce the last to something simpler: Where is the Scientific Method in Maths? Maths is based on deduction, not induction. The Greeks spent many years deducing and doing simple maths before Aristotle. Statistics has something pretty similar to the scientific method, but it is not exactly the same. Imagine what happens if you use the scientific method to analyze data: Data Science. It is the "Science" word in Data Science that matters for understanding what Data Science is. Data Science is not simply the mathematical or statistical analysis of data; it is more than that. In fact, you can do Data Science without maths, using semantics, linguistics, semiotics, and other products of our brain that are not necessarily related to maths.

We need computers, that's a fact, to process massive amounts of data. In fact, computer algorithms are not maths; they are logic, and can be programmed against mathematical laws, see: "antiMathSum(a,b){a+b+1}". And in the end, computers and algorithms are what we use to process data in data science. Another typical error is to pretend that data is something physical or mathematical that can be managed automatically, without human intervention, to produce knowledge. Data is created by humans (or machines programmed by humans). It is something more complicated.

When I say "linearizing" at the end of my last comment, I mean that you cannot simplify data as coordinates, sets, matrices, or other mathematical representations without risking losing the most important information. While simple data analysis was enough before today, today it isn't. If you approach big data only this way (using only data analysis) you will stop at step 1 of Data Science (the statistical/mathematical analysis), which consists in characterizing the data and finding the most trivial signals in it (if any are detectable this way...).
To acquire true knowledge, you need to apply the scientific method and use your knowledge as an expert in the field. A mathematician is only an expert in maths, sadly. I hope you find the courage to start this trip!!!
All right Carlos, I agree. We simply have different perspectives. You are a scientist focused on mathematical theory. I am an informatician and natural sciences PhD. I am a scientific-method-focused scientist and only see mathematics as a tool we humans developed to easily explain the world and increase our knowledge about the universe in which we live. From my way of thinking, mathematics is the human representation of the physical properties of our universe, not the reverse. It is possible that in another universe, or under different conditions, mathematics fails. For you, mathematical axioms that are true by themselves are everything. I don't believe that. In fact, if you have enough knowledge about physics (which you probably do), you will know that in some special circumstances mathematical assumptions fail. We humans invented mathematics. Never forget that. We invented it to explain with precision the universe that we observe, not the reverse (even though it is possible to use it to predict things using the scientific method). The scientific method does not pretend to discover the truth, but to approximate it well enough to reach conclusions that lead to useful predictions, while the truth may be completely different from what we think it is. It is a positive way of thinking: if it is useful to explain the world, it is OK. Mathematics is a human tool. The scientific method (which uses our best tool, maths, but also others) is our best empirical approximation to the truth. Empiricism is a part of philosophy based on induction, not deduction. A hydrogen atom's nucleus contains a proton and can have different numbers of neutrons (its isotopes), and the electron in this atom is located in a high-probability region (the 1s orbital), not because mathematics or statistics force the universe to be as it is. A more realistic way of thinking indicates that, thanks to the structure of matter, mathematics is as it is.
In fact, you, as a mathematician, probably know that you can create different maths (imaginary maths) without any problem, and that some of them have been necessary to explain certain physical phenomena. Even so, only one general set of axioms exists and may be used to explain this universe, because it is the universe's structure that makes 1+1 be 2, not the reverse. And as I said, in some subatomic and quantum physical situations, unexpected things occur, not predicted by maths, and maths had to be plagued with "add-ons" to continue being a good "tool". I think you want to call the mathematics of "data" Data Science; call it Data Maths then. If we use semantics and the information sciences to analyze what Data Science means, even from a humanistic point of view, it is the science that uses data as the niche from which to acquire knowledge. Like natural science uses nature. Like information science uses information (Shannon's laws etc., if you don't know about them - the origin of the word "bit" - it is really interesting). But, to finish, of course you are right too. If you restrict yourself to pure maths, then you are right. But let me tell you something: if you reduce the dynamics of a living cell to linearized models, you lose precisely all the interesting information. To be a true data scientist you cannot fall into this error, as you are supposed to find the knowledge in the data, even if you have to invent special maths (even imaginary ones) to reach this knowledge.
Carlos, mathematics has been enriched over 2000 years by looking at data from real problems. Topology is just an abstraction that has proved useful. Is it possible to find new concepts in topology (or statistics, or maths) by looking at "application data coming from users all around the world"? Absolutely! For example, I came across a paper describing wavelet approximations on network graphs (to name one example), and there are countless others.
To put it simply: mathematics is the true data science... it has been, always. That is why physics, chemistry, building, engineering, mechanics, computer science and everything today is based on it.
It is a huge huge toolset, and is still growing... "Nothing new"? Really? You have to read a little bit more, although I understand it is overwhelming even for full time professionals.