Do you want to become a ‘Data Scientist’? If yes, then the first step is to understand the basic terms and their usage.

Data Science is not a new field: statisticians were doing this work even before the invention of the computer. The evolution of modern computing technologies, however, has empowered statisticians to solve a wide variety of practical problems that require heavy number crunching and massive data storage. The terms ‘knowledge discovery’ and ‘data mining’ came into wide use in the late 1980s, after the invention of the database management system and the relational database management system. Later, the term ‘Big Data’ appeared in a publication in the ACM Digital Library in 1997, after the database industry noticed the explosion of business data. In the late 1990s, the term ‘Data Science’ inspired researchers and professionals, and ‘data scientist’ began to be used interchangeably with ‘statistician’.

**I- Big Data, Data Science & Machine Learning**

Any data characterized by the three V’s, i.e. Volume, Variety, and Velocity, is considered Big Data. Big Data can’t be handled with conventional methods of data analysis and processing. Data Science deals with Big Data and draws meaningful insights from it. Because of the scale involved, Data Science now depends on algorithms that try numerous possibilities to find the best solution; this is where Machine Learning comes in.

**II- Data Mining & Data Analytics**

Machine Learning acts as a tool for identifying unknown patterns in Big Data, and that process is called Data Mining. Data Analytics, by contrast, starts with a specific hypothesis.

**III- Big Data Analytics**

The approach of breaking a task down into smaller pieces and assigning them to different processors, which may be geographically dispersed, is called ‘Distributed Computing’. Big Data analytics leverages distributed computing technologies to overcome computational challenges.
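The idea of splitting a task into pieces and combining the partial results can be sketched on a single machine. The example below is only an illustration: the function names are hypothetical, and a local thread pool stands in for the separate (possibly remote) processors that a real distributed system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of roughly equal size."""
    size, remainder = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """The work one worker performs on its own piece of the data."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(values, n_workers=4):
    """Hand each chunk to a worker, then combine the partial results."""
    chunks = split_into_chunks(values, n_workers)
    # Assumption: threads stand in for separate processors or machines.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


data = list(range(1, 1001))
total = distributed_sum_of_squares(data)
print(total)
```

The result matches what a single loop over `data` would produce; the point of the pattern is that each chunk can be processed independently, which is what allows the work to be spread across many processors.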

**– Data Infrastructure:** It supports data sharing, processing, and consumption. Distributed computing and cloud computing are the most popular approaches these days.

**– Data Management:** A DBMS plays an important role in storing structured and unstructured data sets. Since the majority of business-related data is structured, SQL knowledge is still invaluable.

**– Visualization:** Newly acquired insights must be communicated to the leadership and the rest of the organization, so data visualization technologies play an equally important role.
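As a small illustration of why SQL remains useful for structured business data, the sketch below runs an aggregate query against an in-memory SQLite database using Python's built-in `sqlite3` module. The `orders` table and its rows are made up for the example.

```python
import sqlite3

# Hypothetical structured business data: one row per order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 35.5), ("alice", 80.0), ("carol", 42.0)],
)

# Total spend per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```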

Data Science can be applied wherever ‘Big Data’ is involved. The following are only a few examples:

- Fraud detection
- Social Media Analytics
- Online matchmaking or dating services
- Weather forecast
- Simulation
- Network Security…etc.

**I- Statistics**

Developing a reasonable understanding of statistics is a must for a data scientist, as it lays the foundation of data science. At a minimum, a data scientist needs to be proficient with concepts such as probability, correlation, variables, distributions, regression, null hypothesis significance tests, confidence intervals, the t-test, ANOVA, and the chi-square test. At an advanced stage, a data scientist needs concepts and algorithms such as logistic regression, support vector machines (SVMs), and Bayesian methods. Common statistical analysis tools such as Excel, R, and SAS are very popular among data scientists.
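Two of the basic concepts listed above, correlation and confidence intervals, can be computed in a few lines. This is a minimal sketch using only the Python standard library; the sample data is invented, and the interval uses the normal critical value 1.96, which is a large-sample approximation (for a sample this small, a t critical value would be more appropriate).

```python
import math
import statistics


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_confidence_interval(xs, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    m = statistics.mean(xs)
    standard_error = statistics.stdev(xs) / math.sqrt(len(xs))
    return m - z * standard_error, m + z * standard_error


# Hypothetical data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson_correlation(hours, scores)
low, high = mean_confidence_interval(scores)
print(f"correlation: {r:.3f}")
print(f"approx. 95% CI for the mean score: ({low:.1f}, {high:.1f})")
```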

**II- Data mining**

- **Classification** – Labelling a group of data objects as belonging to a specific category.
- **Prediction** – Building a model that produces continuous or ordered values that form a trend.
- **Clustering** – Grouping similar data objects into a class…etc.
- **Natural Language Processing** – NLP refers to the different ways a computer can interact with humans through natural language. It sits at the intersection of computer science, Artificial Intelligence (AI), computational linguistics, and human-computer interaction (HCI). The NLP tasks most relevant to Data Science are tokenization, parsing, sentence segmentation, and named entity recognition. Python is a very popular and recommended language because of its well-developed NLP tools.
  - Tokenization and parsing: Isolates each symbol in a text and conducts a grammatical analysis.
  - Sentence segmentation: Separates one sentence from another in a text.
  - Named entity recognition: Identifies which text symbols map to which types of proper names.
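Sentence segmentation and tokenization, as described above, can be illustrated with two regular expressions. This is a deliberately naive sketch with hypothetical function names: it breaks on abbreviations such as "Dr." and does no grammatical parsing, which is why real projects usually rely on a dedicated NLP library such as NLTK or spaCy.

```python
import re


def segment_sentences(text):
    """Split a text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Return word tokens and standalone punctuation marks, in order."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Data science is fun. Is Python popular? Yes!"
sentences = segment_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, "->", sentence_tokens)
```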

- **Machine Learning** (Supervised & Unsupervised)
- **Visualization** – Software that offers comprehensive visualization tools for data scientists, such as Tableau, is already available on the market. But it is important to remember that a data scientist always acts as a middleman between the pile of data and the decision makers.
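To make the supervised/unsupervised distinction concrete, the sketch below applies both styles to the same toy points. The supervised part (a nearest-centroid classifier) learns from labels that are given; the unsupervised part (a basic k-means loop) has to discover the groups itself. The data, function names, and fixed starting centers are assumptions made for the illustration; practical k-means implementations usually choose starting centers at random.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


# Supervised: classification with known labels.
def fit_nearest_centroid(points, labels):
    by_label = {}
    for point, label in zip(points, labels):
        by_label.setdefault(label, []).append(point)
    return {label: centroid(group) for label, group in by_label.items()}


def predict(model, point):
    return min(model, key=lambda label: sq_dist(point, model[label]))


# Unsupervised: clustering without labels.
def k_means(points, initial_centers, iterations=10):
    centers = list(initial_centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
            groups[nearest].append(point)
        # Keep the previous center if no point was assigned to it.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    assignments = [
        min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))
        for point in points
    ]
    return centers, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["low", "low", "low", "high", "high", "high"]

model = fit_nearest_centroid(points, labels)
print(predict(model, (2.0, 2.0)))

centers, assignments = k_means(points, initial_centers=[points[0], points[3]])
print(assignments)
```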

A data scientist can work in any organization that has data and wants to analyze its performance and predict future outcomes. The role is more that of a generalist than a specialist. A data scientist works with other data science specialists, such as machine learning specialists.

It’s a highly creative and independent role in which you need the discipline to follow through and meet deadlines. Paying attention to detail and quality is critical. Math and IT skills are essential, as they form the foundation of a machine learning scientist’s work. Deep knowledge of statistics and probability, the ability to develop and validate a mathematical model and translate it into an algorithm, proficiency in a programming language (Python, C++, Java, R…etc.), and an understanding of distributed computing are essential skills for a Machine Learning Specialist.

- MCSE Business Intelligence Certification
- Cloudera Certified Professional or CCP data scientist
- Cloudera Certified Developer for Apache Hadoop or CCDH
- Cloudera Certified Administrator for Apache Hadoop or CCAH
- Cloudera Certified Specialist in Apache HBase or CCSHB
- EMC Data Science Associate (EMCDSA)
- EMC Data Center Architect or EMCDCA
- EMC Cloud Architect or EMCCA
- Oracle BI Implementation Specialist …etc.

- Data scientists must keep refreshing their knowledge to stay up to date. Attending conferences and workshops, networking with peers, and continuing education are ways to stay current.
- Cloud vendors such as Amazon, IBM, and Google make it cheaper for companies to use cloud computing facilities instead of private in-house resources, which in turn increases the demand for data scientists. Data scientists also no longer need to worry about data infrastructure and management problems, thanks to emerging online services.
- The importance of Machine Learning is growing; deep learning, which takes advantage of neural networks, is gaining particular traction.

For reference and details visit: First Step To Become a Data Scientist

I am a professional engineer, enthusiast programmer, passionate data scientist and machine learning student. You can contact me through [email protected] or visit https://engmrk.com

© 2019 Data Science Central ®
