Do you want to become a ‘Data Scientist’? If yes, then the first step is to understand the basic terms and their usage.
Data Science is not a new field: statisticians were doing this work even before the invention of the computer. The evolution of modern computing technologies, however, empowered statisticians to solve a wide variety of practical problems that require heavy number crunching and massive data storage. The terms ‘knowledge discovery’ and ‘data mining’ came into wide use in the late 1980s, after the invention of the database management system and the relational database management system. Later, in 1997, the term ‘Big Data’ appeared in a publication in the ACM Digital Library, after the database industry noticed the explosion of business data. In the late 1990s, the term ‘Data Science’ caught on among researchers and professionals, and ‘data scientist’ came to be used interchangeably with ‘statistician’.
I- Big Data, Data Science & Machine Learning
Any data characterized by the three V’s, i.e. Volume, Variety and Velocity, is considered Big Data. Big Data can’t be handled with conventional methods of data analysis and processing. Data Science deals with Big Data and extracts meaningful insights from it. Because of the large scale of the data, Data Science now depends on algorithms that try numerous possibilities to find the best solution; this is where Machine Learning comes in.
II- Data Mining & Data Analytics
Machine Learning acts as a tool for identifying previously unknown patterns in Big Data; this process is called Data Mining. Data Analytics, in contrast, starts with a specific hypothesis.
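To make the idea of finding unknown patterns concrete, here is a minimal sketch that groups numbers into two clusters without being told in advance where the groups are. It uses a tiny one-dimensional k-means, which is only one of many possible techniques and is chosen here purely for illustration; the function name `two_means_1d` and the purchase amounts are made up.

```python
from statistics import mean


def two_means_1d(values, iterations=10):
    """Group non-empty 1-D values into two clusters with a few rounds of k-means."""
    low, high = min(values), max(values)  # initial centroids: the two extremes
    near_low, near_high = list(values), []
    for _ in range(iterations):
        # Assign every value to the closer of the two centroids.
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_high:  # all values identical: only one group exists
            break
        # Move each centroid to the mean of the values assigned to it.
        low, high = mean(near_low), mean(near_high)
    return sorted(near_low), sorted(near_high)


# Hypothetical purchase amounts; no hypothesis about the groups is given up front.
purchase_amounts = [12, 15, 14, 13, 95, 102, 98, 11, 99]
small, large = two_means_1d(purchase_amounts)
```

Here the algorithm separates the small purchases from the large ones on its own. A data analytics approach would instead start from a stated hypothesis (for example, "customers fall into two spending tiers") and then test it.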
III- Big Data Analytics
The approach of breaking a task down into smaller pieces and assigning them to different processors, which may be geographically dispersed, is called ‘Distributed Computing’. Big Data analytics leverages distributed computing technologies to overcome computational challenges.
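The split-and-combine idea can be sketched on a single machine. In the example below, a local thread pool stands in for the separate processors; a real distributed system would send the pieces to different machines, which this sketch does not attempt. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of near-equal size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum(chunk):
    """Work done independently by one worker on its own piece of the data."""
    return sum(chunk)


def distributed_sum(values, n_workers=4):
    """Split the task, process the pieces in parallel, then combine the results."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


data = list(range(1, 1001))
total = distributed_sum(data)
```

Each worker only ever sees its own chunk, and the final answer is assembled from the partial results. That is the property that lets the same pattern scale out to many machines.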
– Data Infrastructure: It supports data sharing, processing, and consumption. Distributed computing and cloud computing are the most popular options these days.
– Data Management: A DBMS plays an important role in storing structured and unstructured data sets. Since the majority of business-related data is structured, SQL knowledge is still invaluable.
– Visualization: Newly acquired insights must be communicated to the leadership and the rest of the organization, so data visualization technologies play an equally important role.
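As a small illustration of the SQL point made under Data Management, the sketch below runs an aggregate query over structured data using Python's built-in `sqlite3` module. The `sales` table, its columns, and its rows are invented for this example.

```python
import sqlite3

# In-memory database; the "sales" table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 60.0), ("south", 40.0)],
)

# Aggregate structured data with plain SQL: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

The same `GROUP BY` query would work largely unchanged on most relational databases, which is part of why SQL remains useful to learn.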
Data Science can be applied wherever ‘Big Data’ is involved. The following are only a few examples:
A reasonable understanding of statistics is a must for a data scientist, as it lays the foundation of data science. At a minimum, a data scientist needs to be proficient with concepts such as probability, correlation, variables, distributions, regression, null hypothesis significance tests, confidence intervals, the t-test, ANOVA, and the chi-square test. At an advanced stage, a data scientist needs concepts and algorithms such as logistic regression, support vector machines (SVMs) and Bayesian methods. Common statistical analysis tools such as Excel, R and SAS are very popular among data scientists.
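Two of the basic concepts listed above, correlation and the t-test, can be computed by hand in a few lines. The sketch below uses only the Python standard library; the sample numbers are invented, and the t-test shown is Welch's variant (which does not assume equal variances), picked here as one reasonable choice among several.

```python
from math import sqrt
from statistics import mean, variance


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def welch_t_statistic(a, b):
    """Welch's two-sample t statistic, built from sample means and variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical data: hours studied versus exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_correlation(hours, scores)

# Hypothetical measurements from two groups.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.0, 5.7, 6.1, 5.9]
t = welch_t_statistic(group_a, group_b)
```

With these numbers, `r` comes out close to 1 (a strong positive relationship), and `t` is large in magnitude and negative, because the mean of `group_a` is well below that of `group_b` relative to the spread. Turning `t` into a p-value requires the t distribution, which tools such as R provide directly.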
II- Data mining
A data scientist can work in any organization that has data and is willing to analyze it to understand its performance and predict future outcomes. The role is that of a generalist rather than a specialist. A data scientist works with other data science specialists, such as machine learning specialists.
It’s a highly creative and independent role in which you need the discipline to follow through and meet deadlines. Paying attention to detail and quality is critical. Math and IT skills are essential, as they form the foundation of a machine learning specialist’s work. Deep knowledge of statistics and probability, the ability to develop and validate a mathematical model and translate it into an algorithm, proficiency in a programming language (Python, C++, Java, R, etc.), and an understanding of distributed computing are essential skills for a Machine Learning Specialist.
For reference and details visit: First Step To Become a Data Scientist