Do you want to become a ‘Data Scientist’? If yes, then the first step is to understand the basic terms and their usage.
A – Brief History
Data Science is not an entirely new field: statisticians were doing much of the same work even before the invention of the computer. The evolution of modern computing technologies, however, empowered statisticians to solve a wide variety of practical problems that require heavy number crunching and massive data storage. The terms ‘knowledge discovery’ and ‘data mining’ came into wide use in the late 1980’s, after the invention of the database management system and the relational database management system. Later, in 1997, the term ‘Big Data’ appeared in a publication in the ACM Digital Library, after the database industry noticed the explosion of business data. In the late 1990’s, the term ‘Data Science’ gained popularity among researchers and professionals, and ‘data scientist’ began to be used interchangeably with ‘statistician’.
B- Basic Concept
I- Big Data, Data Science & Machine Learning
Any data characterized by the three V’s, i.e. Volume, Variety and Velocity, is considered Big Data. Big Data can’t be handled with conventional methods of data analysis and processing. Data Science deals with Big Data and draws meaningful insights from it. Because of the large scale involved, Data Science now depends on algorithms that try numerous possibilities to find the best solution; this is where Machine Learning comes in.
II- Data Mining & Data Analytics
Machine Learning acts as a tool for identifying previously unknown patterns in Big Data; this process is called Data Mining. Data Analytics, by contrast, starts with a specific hypothesis and examines the data to test it.
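To make the idea of finding unknown patterns concrete, here is a minimal sketch of clustering with a plain k-means loop in Python. The function name `kmeans`, the sample points and the choice of the first k points as starting centroids are all assumptions made for illustration; a real project would more likely rely on an established library.

```python
import math

def kmeans(points, k, iterations=20):
    """Group 2-D points into k clusters with a basic k-means loop."""
    # Assumption: the first k points are used as starting centroids.
    # This keeps the sketch deterministic but is not a robust choice.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignments, centroids

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.9, 8.3), (0.9, 1.1), (8.2, 7.8)]
labels, centers = kmeans(points, k=2)
print(labels)
```

No labels are supplied up front: the grouping comes out of the data itself, which is the sense in which the pattern was "unknown" beforehand.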
III- Big Data Analytics
The approach of breaking a task down into smaller pieces and assigning them to different processors, which may be geographically dispersed, is called ‘Distributed Computing’. Big Data analytics leverages distributed computing technologies to overcome computational challenges.
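The split, process and combine pattern described above can be sketched on a single machine. In the Python example below, local threads stand in for the separate processors of a real cluster, which is a simplification; the function names and the sum-of-squares task are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Break one large task into roughly equal pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def partial_sum_of_squares(chunk):
    """Work done independently by one worker on its own piece."""
    return sum(x * x for x in chunk)

def distributed_sum_of_squares(items, n_workers=4):
    chunks = split_into_chunks(items, n_workers)
    # Assumption: local threads stand in for machines in a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    # Combine the partial results into the final answer.
    return sum(partials)

data = list(range(1, 1001))
total = distributed_sum_of_squares(data)
print(total)
```

Each worker only ever sees its own chunk, so the same structure carries over when the workers are separate machines and the chunks are too large for any one of them.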
C- Technologies that Turn Data Science Into Reality
– Data Infrastructure: It supports data sharing, processing, and consumption. Distributed computing and cloud computing are the most popular options these days.
– Data Management: A DBMS plays an important role in storing structured and unstructured data sets. Since the majority of business-related data is structured, SQL knowledge is still invaluable.
– Visualization: Newly acquired insights must be communicated to the leadership and the rest of the organization, so data visualization technologies play an equally important role.
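As a small illustration of why SQL remains useful for structured business data, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table, its columns and the sample rows are made up for this example.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 35.5), ("alice", 80.0), ("carol", 15.0)],
)

# Total spend per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total_spent in rows:
    print(customer, total_spent)
```

A single declarative query does the grouping and sorting that would otherwise need explicit loops, which is much of the appeal of SQL for tabular data.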
D- Data Science Applications
Data Science can be applied wherever ‘Big Data’ is involved. The following are only a few examples:
- Fraud detection
- Social Media Analytics
- Online matchmaking or dating services
- Weather forecast
- Network Security…etc.
E- Must-Have Skills for a Data Scientist
I- Statistics
Developing a reasonable understanding of statistics is a must for a data scientist, as statistics lays the foundation of data science. At a minimum, a data scientist needs to be proficient with concepts such as probability, correlation, variables, distributions, regression, null hypothesis significance tests, confidence intervals, the t-test, ANOVA, and the chi-square test. At an advanced stage, a data scientist needs concepts and algorithms such as logistic regression, support vector machines (SVMs) and Bayesian methods. Common statistical analysis tools such as Excel, R and SAS are very popular among data scientists.
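A few of the concepts listed above (correlation, regression and confidence intervals) can be sketched with nothing more than Python's standard library. The function names and the sample data are hypothetical, and the confidence interval uses a normal approximation (z = 1.96), which is an assumption that is only reasonable for larger samples.

```python
import math
from statistics import mean, stdev

def pearson_correlation(xs, ys):
    """Strength of the linear relationship between two variables (-1 to 1)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return slope, my - slope * mx

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% interval for the mean (normal approximation)."""
    m = mean(sample)
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    return m - half_width, m + half_width

# Hypothetical data: hours studied versus test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
print(pearson_correlation(hours, scores))
print(simple_linear_regression(hours, scores))
print(mean_confidence_interval(scores))
```

Tools such as R or SAS provide these calculations ready-made; writing them out once is mainly a way to check that the underlying concepts are understood.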
II- Data mining
- Classification – Labelling a group of data objects into a specific category.
- Prediction – Building a model that produces continuous or ordered values that form a trend.
- Clustering – Grouping similar data objects into a class…etc.
- Natural Language Processing – NLP refers to the different ways in which a computer can interact with humans through natural language. NLP draws on computer science, Artificial Intelligence (AI), computational linguistics and human-computer interaction (HCI). The aspects of NLP most closely related to Data Science are tokenization, parsing, sentence segmentation and named entity recognition. The Python programming language is very popular and is recommended for its well-developed NLP tools.
- Tokenization and Parsing: Isolates each symbol in a text and conducts a grammatical analysis.
- Sentence segmentation: Separates one sentence from another in a text.
- Named entity recognition: Identifies which text symbols map to which types of proper names.
- Machine Learning (Supervised & Unsupervised)
- Visualization – Software that offers comprehensive visualization tools for data scientists, such as Tableau, is already available in the market. But it is important to remember that a data scientist always acts as a middleman between the pile of data and the decision makers.
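Two of the NLP steps listed above, sentence segmentation and tokenization, can be roughly sketched with Python's built-in `re` module. The rules below are deliberately naive assumptions (for example, any full stop followed by a space ends a sentence, so abbreviations such as "Dr." would be split incorrectly); dedicated NLP libraries handle such cases far better. Parsing and named entity recognition are not covered here.

```python
import re

def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Isolate words (with optional apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

# Hypothetical sample text.
text = "Data science is growing fast. Is Python popular? Yes, it is!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, sentence_tokens)
```

Segmenting first and tokenizing second mirrors the usual order of an NLP pipeline, where later steps such as parsing work on one sentence at a time.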
F- Roles and Responsibilities
Data Scientist or Engineer
A data scientist can work in any organization that has data and wants to analyze its performance and predict future outcomes. The role is that of a generalist rather than a specialist. A data scientist works with other data science specialists, such as machine learning specialists.
Machine Learning Specialist
It’s a highly creative and independent role in which you need the discipline to follow through and meet deadlines. Paying attention to detail and quality is critical. Math and IT skills form the foundation for a machine learning specialist. Essential skills include deep knowledge of statistics and probability, the ability to develop and validate a mathematical model and translate it into an algorithm, proficiency in a programming language (Python, C++, Java, R…etc.), and an understanding of distributed computing.
G- Related Certifications
- MCSE Business Intelligence Certification
- Cloudera Certified Professional or CCP data scientist
- Cloudera Certified Developer for Apache Hadoop or CCDH
- Cloudera Certified Administrator for Apache Hadoop or CCAH
- Cloudera Certified Specialist in Apache HBase or CCSHB
- EMC Data Science Associate (EMCDSA)
- EMC Data Center Architect or EMCDCA
- EMC Cloud Architect or EMCCA
- Oracle BI Implementation Specialist …etc.
H- Final Words
- Data Scientists must keep refreshing their knowledge to stay up to date. Attending conferences and workshops, peer networking and continuing education are ways to do so.
- Cloud vendors like Amazon, IBM, and Google …etc. make it cheaper for companies to use cloud computing facilities instead of private in-house resources, which in turn increases the demand for Data Scientists. Thanks to emerging online services, Data Scientists also no longer need to worry about data infrastructure and management problems.
- The importance of Machine Learning is growing; deep learning, which takes advantage of neural networks, is gaining particular traction.
For reference and details visit: First Step To Become a Data Scientist