Subscribe to DSC Newsletter

Analytics has taken world by storm & It it the powerhouse for all the digital transformation happening in every industry.

Today everybody is generating tons of data – we as consumers leaving digital footprints on social media,IoT generating millions of records from sensors, Mobile phones are used from morning till we sleep. All these variety of data formats are stored in Big Data platform. But only storing this data is not going to take us anywhere unless analytics is applied on it. Hence it is extremely important to close the loop with Analytics insights.

Here is my version of A to Z for Analytics:

  • Artificial Intelligence:: AI is the capability of a machine to imitate intelligent human behavior. BMW, Tesla, Google are using AI for self-driving cars. AI should be used to solve real world tough problems like climate modeling to disease analysis and betterment of humanity.
  • Boosting and Bagging: it is the technique used to generate more accurate models by ensembling multiple models together
  • Crisp-DM: is the cross industry standard process for data mining.  It was developed by a consortium of companies like SPSS, Teradata, Daimler and NCR Corporation in 1997 to bring the order in developing analytics models. Major 6 steps involved are business understanding, data understanding, data preparation, modeling, evaluation and deployment.
  • Data preparation: In analytics deployments more than 60% time is spent on data preparation. As a normal rule is garbage in garbage out. Hence it is important to cleanse and normalize the data and make it available for consumption by model.
  • Ensembling: is the technique of combining two or more algorithms to get more robust predictions. It is like combining all the marks we obtain in exams to arrive at final overall score. Random Forest is one such example combining multiple decision trees.
  • Feature selection: Simply put this means selecting only those feature or variables from the data which really makes sense and remove non relevant variables. This uplifts the model accuracy.
  • Gini Coefficient: it is used to measure the predictive power of the model typically used in credit scoring tools to find out who will repay and who will default on a loan.
  • Histogram: This is a graphical representation of the distribution of a set of numeric data, usually a vertical bar graph used for exploratory analytics and data preparation step.
  • Independent Variable: is the variable that is changed or controlled in a scientific experiment to test the effects on the dependent variable like effect of increasing the price on Sales.
  • Jubatus: This is online Machine Learning Library covering Classification, Regression, Recommendation (Nearest Neighbor Search), Graph Mining, Anomaly Detection, Clustering
  • KNN: K nearest neighbor algorithm in Machine Learning used for classification problems based on distance or similarity between data points.
  • Lift Chart: These are widely used in campaign targeting problems, to determine which decile can we target customers for a specific campaign. Also, it tells you how much response you can expect from the new target base.
  • Model: There are more than 50+ modeling techniques like regressions, decision trees, SVM, GLM, Neural networks etc present in any technology platform like SAS Enterprise miner, IBM SPSS or R. They are broadly categorized under supervised and unsupervised methods into classification, clustering, association rules.
  • Neural Networks: These are typically organized in layers made up by nodes and mimic the learning like brain does. Today Deep Learning is emerging field based on deep neural networks.
  • Optimization: It the Use of simulations techniques to identify scenarios which will produce best results within available constraints e.g. Sale price optimization, identifying optimal Inventory for maximum fulfillment & avoid stock outs
  • PMML: this is xml base file format developed by data mining group to transfer models between various technology platforms and it stands for predictive model markup language.
  • Quartile: It is dividing the sorted output of model into 4 groups for further action.
  • R: Today every university and even corporates are using R for statistical model building. It is freely available and there are licensed versions like Microsoft R. more than 7000 packages are now available at disposal to data scientists.
  • Sentiment Analytics: Is the process of determining whether an information or service provided by business leads to positive, negative or neutral human feelings or opinions. All the consumer product companies are measuring the sentiments 24/7 and adjusting there marketing strategies.
  • Text Analytics: It is used to discover & extract meaningful patterns and relationships from the text collection from social media site such as Facebook, Twitter, Linked-in, Blogs, Call center scripts.
  • Unsupervised Learning: These are algorithms where there is only input data and expected to find some patterns. Clustering & Association algorithms like k-menas & apriori are best examples.
  • Visualization: It is the method of enhanced exploratory data analysis & showing output of modeling results with highly interactive statistical graphics. Any model output has to be presented to senior management in most compelling way. Tableau, Qlikview, Spotfire are leading visualization tools.
  • What-If analysis: It is the method to simulate various business scenarios questions like what if we increased our marketing budget by 20%, what will be impact on sales? Monte Carlo simulation is very popular.
What do think should come for X, Y, Z?

Views: 3093

Comment

You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Muhammad Kashif Iqbal on Sunday

XGBoost is a library designed and optimized for boosting trees algorithms. Gradient boosting trees model is originally proposed by Friedman et al. The underlying algorithm of XGBoost is similar, specifically it is an extension of the classic gbm algorithm.

Comment by Chandrasekhara S. ("C.S.") Ganti on Friday

Yes, this is a well know summary from DS folks on the use and usability of these various software and ther relevance and applicability in various domain space..

Comment by Arupratan Bagchi on April 20, 2017 at 3:12pm
Excellant Post
Comment by Mukesh Sehgal on April 20, 2017 at 10:58am

Y: YARN (Yet Another Resource Negotiator).

Z: Z-Test - Statistical test that can be run for a normally distributed data. This test can be used by anyone to determine whether sample size will give the target result (based on variance and standard deviation). For E.G. if car part manufacturer wants to make sure that average size of the part will fall under target size, he should be running this test.

Follow Us

Videos

  • Add Videos
  • View All

Resources

© 2017   Data Science Central   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service