(This paper was presented at a local meet of statisticians; it is a summary of various discussions.)
Data analysis, one of the main requirements of research, has transformed into ‘Data Science’, which is considered one of the most important concepts in the current internet-enabled scenario. The transformation may have served a different purpose, since the demand for manpower in data analysis is now huge. Business decisions have moved towards data-aided decisions, and the availability of data and information infrastructure has created a situation where the data scientist's role is termed the sexiest job of the new century (Davenport, T.H. and Patil, D.J., Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review). An attempt is made here to provide a concise account of the evolution of the concept ‘Data Science’ over the last few years.
With expanding computing facilities and research efforts in various disciplines, the role of Statistics in data analysis, whether for experiment-based data, primary sample data or secondary data, gained enormous importance. The inclusion of ‘Management’ as a separate discipline of study started producing management professionals, and data analysis found a new place as more and more business decisions came to depend on statistical evidence. With better computing facilities, data analysis grew into Management Information Systems (MIS), which led to the emergence of Decision Support Systems (DSS). Statistical methods play a primary role in DSS. These concepts evolved into Business Intelligence (BI), where analytical solutions leading to knowledge of the existing state of business systems were the main attraction. When the knowledge so obtained was used to decide the future course of action in a business system, a broader concept emerged, namely Business Analytics (BA), which includes BI. Currently, BA is an important area where professional statisticians are also in demand.
BA needed tools drawn from Mathematics, Statistics, Commerce, Economics, Data Mining, Data Visualisation and other fields, and large business operations started outsourcing these requirements. The demand for manpower in data analysis is increasing every day, which supports the claim of the article by Davenport and Patil referred to earlier. During the last few years, data storage has developed manifold with better Data Warehousing methods. This, in turn, led to a situation where large quantities of data were stored and new methods were required to handle them, giving rise to the term ‘Big Data’. The potential is enormous, while trained manpower is still scarce.
This situation was further supported by more powerful infrastructure for both computing and communication, which made it possible for computers to deliver solutions without human intervention. This resulted in the concepts of Machine Learning as part of data analysis, an area where algorithms and high-end probability models, including Bayesian principles, are used in various ways.
A separate discipline, called Data Science, came up. It includes Mathematics, Statistics, Probability, Business Analytics, Predictive Analytics, Data Acquisition, Data Warehousing, Data Communications, Programming facilities and Machine Learning, among others. Concise accounts of these concepts are provided below. These efforts started with John W. Tukey publishing the paper ‘The Future of Data Analysis’ in 1962. As stored-program electronic computers developed further, data analysis took on a different dimension. Tukey also coined the term ‘bit’, which Shannon used in ‘A Mathematical Theory of Communication’ (1948). With the publication of ‘Exploratory Data Analysis’ by Tukey in 1977, statistical methods started moving from academic requirements to the exploration of data organized under Data Warehousing principles. This may be taken as the starting point in the evolution of Data Analysis into Data Science. With the publication of ‘Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics’ by William S. Cleveland in 2001, the new discipline was placed in the computer framework, and relevant methods such as data mining came to be used in exploratory data analysis. Currently, this seems to be one of the attractions for both academics and users.
Early Stages in Statistical Data Analysis:
The use of statistics for data analysis seems to have its origin in the work of four British statisticians: Francis Galton, Karl Pearson, William Sealy Gosset and Ronald Fisher. Sir Francis Galton, a half-cousin of Charles Darwin, was the exponent of ‘eugenics’, a term he coined for the study of the betterment of the human race by selective breeding. He made some important contributions to Mathematics and Statistics. He developed the ideas of correlation and regression, taking the clue from a geologist by the name of Baver. His partial work on these concepts was put on a mathematical footing by his junior Karl Pearson, who is considered one of the founders of Mathematical Statistics. Next in line was W.S. Gosset, who had expertise in Mathematics and Chemistry. The Guinness Brewery in Ireland sought his help to bring more consistency to the quality of the beer it produced. However, brewing is a time-consuming process, and Gosset had to settle for small samples to arrive at decisions on quality. Thus he developed small-sample techniques in statistics and published his findings under the pseudonym ‘Student’, as his employers did not permit such publications. Then came Ronald Fisher, who studied Biology and Genetics apart from Mathematics and who had to find employment as he was relatively poor. Thus Rothamsted Experimental Station gained his expertise, and there he developed the principles of experimental design and one of the most used statistical tools, the Analysis of Variance. He went on to bring out his major contributions, the paper ‘On the Mathematical Foundations of Theoretical Statistics’ in 1922 and the book ‘Statistical Methods for Research Workers’ in 1925. Thus, the initial phase in the development of Statistics had data analysis as its core element.
Impact of Computers on Statistics:
Formal education in Statistics in the early days required the student to be trained in the practical application of statistical methods. Initially, paper and pencil were the only computing facility, perhaps with the assistance of Clarke’s Tables. Subsequently, manual calculators, electrical calculators, the slide rule, nomography and other tools were taught as part of training in Statistics. Electronic calculators arrived only in the early seventies. With the development of the ‘Stored Program Concept’, computers started to influence the use of statistical methods, in India mainly for academic research. One needed to know a programming language to make the computer solve statistical problems. This situation continued until the emergence of personal computers and statistical software. Now a host of programming environments, along with a large collection of software, have made statistical applications very user-friendly. The programming environments currently expected for data analysis under Data Science principles are Python and R.
With computer infrastructure expanding and networks making communication easier, data analysis started to support business decisions with evidence gained from data on past behaviour. Tukey’s publications ‘The Future of Data Analysis’ and ‘Exploratory Data Analysis’ provided direction for non-statistical professionals in business to move forward in data-aided decision making. ‘Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics’ by William S. Cleveland provided the framework for putting the decision-making process on the computer network, surpassing existing principles such as MIS, DSS and others. The addition of the concept of ‘Big Data’ has turned the professional side of data handling, whether by mathematicians, statisticians, professional managers or computer experts, into an exciting career.
Some of these principles are expanded below to help education managers see what is required and what is not yet available. Features of Business Analytics, Big Data and Data Science are explained in order to locate the place for personnel trained in both the theoretical and practical aspects of Mathematics, Statistics and Computer Science.
Definition: Business Analytics refers to the methodology employed by an organization to enhance its business and make optimized decisions through the use of statistical techniques, i.e. collecting, assembling and analyzing data to improve its products, services, marketing, etc.
Terms associated with Business Analytics include MIS, ERP, DSS, BI, Data Mining, Data Warehouse and others. The term "business analytics" is used loosely by some to describe a set of different procedures, including data mining and other analytic methods. However, it covers more concepts and procedures than these.
Business Analytics can be considered from three different perspectives: historical-data-based analytics, experiment-based analytics and real-time analytics.
Historical-data-based analytics uses accrued data in an organized form to identify patterns in concepts important to the problem, such as customer behavior and employee satisfaction in marketing problems, and tries to relate them to the required functions, such as customer selection, loyalty and service.
Experiment-based analytics uses data analytics to measure the overall impact of possible interventions to boost revenue within a continuous experimental framework. For example, to understand the drivers of financial performance, experimental situations have to be created in which the impact of different types of intervention on the existing process can be observed. Capital One, for instance, is reported to conduct more than 30,000 experiments a year with different interest rates, incentives and other variables, in a directed effort to identify the variables responsible for maximizing the number of good customers. Such methods are also called the “Enterprise Approach”.
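An experiment of this kind is commonly evaluated by comparing response rates between two variants. The following is a minimal Python sketch, assuming that a pooled two-proportion z-test is an acceptable way to make the comparison; the function name and the customer counts are hypothetical and are not taken from Capital One or any other source.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both offers have the same response rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled response rate under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: offer A (current interest rate) vs offer B (reduced rate).
z, p_value = two_proportion_z_test(successes_a=120, n_a=2000,
                                   successes_b=165, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests that the difference in response rates is unlikely to be due to chance alone. An organization running thousands of such experiments would also need to account for multiple comparisons, which this sketch does not attempt.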
Real-time analytics is concerned with high-priority requirements in business operations. It is a continuous activity that requires the capability to use a continuous flow of data and take appropriate decisions to optimize operations. Some examples illustrate this.
* A logistics operator may need to monitor transport network availability in relation to its quantum of operations across the country. With a real-time view of business events, operators are better informed and can make alternate plans to keep deliveries on time and customers satisfied.
* A manufacturer may need to track the availability of power in relation to the consumption of electricity at its plants, using tools that provide notifications. With real-time visibility into power availability and demand, the manufacturer can ensure a cost-effective production schedule.
BA activities include:
* Exploring data to find new patterns and relationships (data mining)
* Explaining why a certain result occurred (statistical analysis, quantitative analysis)
* Experimenting to test previous decisions (multivariate studies)
* Forecasting future results (predictive modeling, predictive analytics)
Questions arising out of these activities have to be answered using tools based on new principles and concepts such as Reporting, Monitoring, Alerting, Dashboards, Scorecards, OLAP and ad hoc queries, supported by Statistical Analysis, Data Mining, Predictive Modeling, Multivariate methods and others.
Initially, academic research used to face the problem of insufficient data, owing either to the cost of experiments or to other difficulties in acquiring data. The problem now seems to be reversed, with enormous data availability along with cheap storage and user-friendly data analysis tools. Conventional Data Mining tools have to be reinvented to accommodate the enormity of data. Technical issues in computing with such large quantities of data have given rise to new software. Hadoop, with concepts like MapReduce, has come to assist statistical computing. RapidMiner is another facility for mining enormous data. In short, Big Data is characterized by the three Vs: Volume, Velocity and Variety. As an example, one can look at the kind of data added to a telecom system every time a call is made over a mobile network.
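The MapReduce idea can be illustrated on a small scale with telecom-style records. The sketch below runs in plain Python on a single machine and does not use Hadoop's own API; the record layout, tower identifiers and function names are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, tower_id, duration_in_seconds).
call_records = [
    ("A", "T1", 120),
    ("B", "T2", 45),
    ("A", "T1", 300),
    ("C", "T2", 60),
    ("B", "T1", 15),
]

def map_phase(record):
    """Emit one (key, value) pair per record: tower -> call duration."""
    _caller, tower, duration = record
    yield tower, duration

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single summary."""
    return key, {"calls": len(values), "total_seconds": sum(values)}

mapped = [pair for record in call_records for pair in map_phase(record)]
summary = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(summary)
```

Because each record is mapped independently and each key is reduced independently, both phases can be distributed over many machines, which is what makes the pattern suitable for very large volumes of data.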
Data Science emerged as a discipline that brings together all the requirements of BA and Big Data in an automated environment, with facilities that include Data Acquisition, Data Warehousing, Data Communication, Mathematics, Statistics, Machine Learning and others. The role of Statistics and Mathematics needs no elaboration, except to note that high-end probability and statistical techniques such as Bayesian decision processes, Markov decision processes and the like are used within different algorithms, where Mathematics has a primary role. A brief description of Machine Learning, one of these components, is provided here.
Machine Learning is a branch of Artificial Intelligence concerned with the design of systems that can learn from data. Arthur Samuel (1959) defined machine learning as a “field of study that gives computers the ability to learn without being explicitly programmed”. In this activity, Data Mining is mainly used to obtain knowledge from data, and that knowledge is then used for prediction. Machine Learning algorithms are used to achieve this goal. Depending upon the requirement, such algorithms may be classified as Supervised learning, Unsupervised learning, Semi-supervised learning, Transduction, Reinforcement learning, Learning to learn, Developmental learning and the like. These are to be developed on the basis of Computational Learning Theory, a branch of Computer Science. There seem to be many similarities between machine learning theory and statistical inference. The classes of algorithms listed above may use any of the available approaches as appropriate: Decision Trees, Association Rules, Artificial Neural Networks, Inductive Logic Programming, Cluster Analysis, Bayesian Networks, and similarity- and dissimilarity-based methods are some of them. It is not difficult to see the complexity of the mathematical and statistical methods involved. Unless a person fully understands the theory, such activities cannot be carried out in practice.
Learning is all about generalizing regularities in observed data to yet unobserved data. Good generalization depends upon how well one balances prior information with information from the data. One can observe something ‘Bayesian’ here. The required computational activity involves good classification of data. Nearest Neighbours classification and Naïve Bayes classification hold a dominant place in such tasks. These principles are seen to be used with discrete data. However, methods based on the Logistic Regression classifier and its relative, the Perceptron, are also used in classification.
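Nearest-neighbour classification, mentioned above, is simple enough to sketch in a few lines of Python. The version below assumes Euclidean distance and a majority vote among the k closest training points; the data points, the labels and the function name are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query,
    # then sort so that the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature observations with known classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.5), (7.0, 6.0), (6.5, 7.5)]
train_labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_points, train_labels, (1.8, 1.2)))
print(knn_predict(train_points, train_labels, (6.4, 6.8)))
```

The method stores the observed data and generalizes to an unobserved point purely through its similarity to known cases, so the choice of k and of the distance measure acts as the prior assumption about how smooth the class boundaries are.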
Kernel methods are what is used at the implementation level. In this context, a kernel is a function that measures the similarity between two observations, which allows linear methods to be applied to data with non-linear structure. It should not be confused with the kernel in computing, which is a computer program that manages input/output requests from software and translates them into data processing instructions for the CPU and other electronic components of a computer (Wikipedia). Algorithms for statistical methods need to be defined appropriately in terms of kernel functions, leading to Kernel Regression, Kernel Clustering, Kernel PCA, Kernel Discriminant Function, Kernel Canonical Correlation Analysis and others.
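In Kernel Regression, the kernel is a weighting function: a prediction at a new point is a weighted average of the observed responses, with nearby observations weighted more heavily. The following minimal Python sketch assumes the Nadaraya-Watson form with a Gaussian kernel; the data values, the bandwidth and the function names are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian weight; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)

def kernel_regression(x_train, y_train, x_query, bandwidth=1.0):
    """Nadaraya-Watson estimate: kernel-weighted average of observed responses."""
    weights = [gaussian_kernel((x_query - x) / bandwidth) for x in x_train]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: the query is too far from every observation.
        raise ValueError("query point is too far from the training data")
    return sum(w * y for w, y in zip(weights, y_train)) / total

# Hypothetical observations following a roughly linear trend.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [0.1, 0.9, 2.1, 2.9, 4.2]

prediction = kernel_regression(x_train, y_train, x_query=2.0, bandwidth=0.5)
print(f"estimate at x = 2.0: {prediction:.3f}")
```

The bandwidth controls how local the average is: a small value follows the data closely, while a large value smooths the estimate towards the overall mean of the responses.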
Categories of work related to Data Science:
To summarize the requirements that may characterize Data Scientists, a classification is provided by Vincent Granville, a forerunner in this field. The categories of Data Scientists include:
1. ‘Statisticians’, who may develop new theories related to Big Data and who are experts in Data Modeling, Sampling, Experimental Design, Clustering, Data Reduction, Prediction and others.
2. ‘Mathematicians’, who are experts in astronomy, geometry, operations research, optimization and the like, as well as in algorithms.
3. ‘Data Engineers’, who are proficient with Hadoop, databases, file system architecture, data flows and others.
4. ‘Machine Learning Experts’, who can handle complex computer systems and algorithms.
5. ‘Professional Managers / Business Experts’, who are good at ROI optimization and related tools such as dashboards and the design of performance metrics.
6. ‘Software Engineers’, who produce code for computer implementation.
7. ‘Visualization Experts’, who have the insight to present the knowledge generated in visual form.
8. ‘Spatial Data Experts’, who generate graphs and graph databases.
Educational managers in the fields of Mathematics, Statistics, Computer Science, Visual Media and Management should aim to sit together and organize a training program to create employable data scientists, who are in great demand.
(All papers referred to in this article can be downloaded free of cost from websites; a Google search with the title as the key will locate them.)