Till your good is better and better is best!

ENSEMBLE

Yes, the above quote is so true. We humans have the ability to rate things, find metrics to measure them, and evaluate their performance. Similarly, in Data Science you can measure your model's accuracy and performance! From the very first model you build, you can record its metrics from the results, and those metrics become your benchmark. Moving ahead, you strive to build other models on the same data that can break that benchmark: which predicts better, which has better accuracy, and so on. The metric can be anything. This vital technique of building one model, setting its performance as the measuring stick, and then trying to build other models that beat it is known as 'Ensemble'. This is the topic of the week! We will learn how to ensemble models on a very interesting "Diabetes" data set.

THE DATA

This data is taken from Kaggle, and its best description is the one provided on the portal: "The data was collected and made available by "National Institute of Diabetes and Digestive and Kidney Diseases" as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here belong to the Pima Indian heritage (subgroup of Native Americans), and are females of ages 21 and above."

Below are the R-console summary and structure of this data for better interpretation:

Summary of Diabetes Data
Structure of Diabetes Data

For this data we will build models to predict the "Outcome", i.e. diabetes yes or no, and we will apply ensemble techniques to better our predictions. We will use several techniques to do that; each technique is briefed as we build the corresponding code below.
Here is the link from which the data can be downloaded: https://www.kaggle.com/kandij/diabetes-dataset

THE MODEL

EDA – Exploratory Data Analysis

EDA is typically done to better understand our data before we actually start building our models. This model has a lot to do with the CARET library of R; caret has a few functions with which you can easily plot some graphs for EDA. Below are the listings for loading the data and computing some EDA plots.

Listing for loading data and scatter plot matrix of first 4 data fields
Scatter plot matrix for first 4 Diabetes Data fields
Listing for scatter plot matrix of last 4 data fields
Scatter plot matrix for last 4 Diabetes Data fields
Listing for overlayed density plots of first 4 data fields
Overlayed density plots of first 4 data fields
Listing for overlayed density plots of last 4 data fields
Overlayed density plots of last 4 data fields
Listing for box plots of first 4 data fields
Box plots for first 4 data fields
Listing for box plots of last 4 data fields
Box plots for last 4 data fields

Just eyeballing and juggling around all the plots above gives you a better idea of the data. This is why we do EDA: to better understand our data.

KNN Model

The very first model we will build is the 'k nearest neighbor' model. For this model we will use 10-fold cross validation with 10 repeats to evaluate model performance, and in the output we will focus on accuracy. The accuracy of this model will be our benchmark for the ensemble. Below are the listings for building the knn model.

Listing for building knn model
knn model output

As we can see above, with 10-fold cross validation repeated 10 times, the best model formed was for k = 21, with a corresponding accuracy of 0.7532, or 75.32%. This value is our benchmark.
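Since the original listing images are not reproduced here, a minimal sketch of the knn benchmark described above might look like the following. The file name "diabetes.csv" and the seed are assumptions, not part of the original listing.

```r
# Hedged sketch of the knn benchmark model; assumes the Kaggle file
# has been saved locally as "diabetes.csv".
library(caret)

diabetes <- read.csv("diabetes.csv")
diabetes$Outcome <- as.factor(diabetes$Outcome)

# 10-fold cross validation repeated 10 times, as described above
set.seed(7)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

fit_knn <- train(Outcome ~ ., data = diabetes,
                 method = "knn",
                 trControl = ctrl,
                 tuneLength = 10)  # tries a range of k values

print(fit_knn)  # the accuracy of the best k becomes our benchmark
```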
Moving ahead, building ensemble models, we will strive to attain an accuracy better than 75.32%.

Bagging

Bagging, or Bootstrap Aggregation, is a process in which multiple random subsets of a given data set are created, and an algorithm is run on each of them; because the subsets are drawn randomly, the same data samples may get picked up in multiple subsets. Below is the listing for building the bagging model; we will use the "treebag" method.

Listing for building bagging model
Output of bagging model

The above is the output of the bagging model. There is a small hike in model accuracy: from 75.32% we have attained 75.58%. This is not a big leap, but still a better version of the model in terms of accuracy.

Random Forests Model

Our next model is the random forests model. A random forest consists of a large number of individual decision trees, each of which spits out a class prediction; by majority rule, the class with the most votes becomes the output. Below is the listing for building the random forests model.

Listing for building random forests model
Random forests model output

As we see above, the random forests model gives its best accuracy for mtry = 2, which is 76.60%, again better than our knn accuracy of 75.32%.

Boosting

This is how boosting works: it builds a model with a Machine Learning algorithm that makes predictions on the entire data set. Among these predictions there are some misclassified instances, which are used for the second model; the second model learns from the misclassified instances of the first. This process goes on until a stopping condition is reached, and the final output is voted from all the outcomes. There are a number of boosting algorithms; we will be working with Gradient Boosting Machines (GBM) and XGBoost.
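Before moving on to the boosting output, the bagging and random forest models described above can be sketched as follows, reusing the same cross-validation control as the knn model. The object names are illustrative.

```r
library(caret)

set.seed(7)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Bagging with the "treebag" method
fit_bag <- train(Outcome ~ ., data = diabetes,
                 method = "treebag", trControl = ctrl)

# Random forest; caret tunes mtry automatically
fit_rf <- train(Outcome ~ ., data = diabetes,
                method = "rf", trControl = ctrl)

print(fit_bag)  # compare accuracy against the 75.32% knn benchmark
print(fit_rf)
```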
Listing for building GBM model
Output of boosting model

With the boosting model we got an accuracy of 76.61%, again better than knn's 75.32%.

XGBoost

Below is the listing for XGBoost.

Listing for XGBoost
Output of XGBoost

The output above states that the best model, the one with colsample_bytree = 0.7 and max_depth = 3, was selected, and its accuracy is 75.54%, which again beats our knn model accuracy.

Stacking

This is a very interesting model to end with. It can be called a model of models, and that too without manipulation. What does that mean? So far, we have built models which were exposed to subsets of the data rather than the entire data and its relationships. Stacking runs multiple Machine Learning algorithms (the ones we specify) on the entire original data set. Stacking gets the combined power of several ML algorithms by obtaining the final prediction through voting or averaging, as we do in other types of ensembles. Below is the listing for building the stacking model.

Listing for stacking
Stacking results

Stacking gives us a plethora of output from which we can analyze the best outcome across multiple ML algorithms. In the results above, looking at the mean and median, all the accuracies are above our knn accuracy of 75.32%. The highest accuracy attained across all the models is 87.01%, which is way better than our knn accuracy.

CONCLUSION

Ensemble models can be deployed to get better results, thus enhancing our final model performance.
The highest accuracy we could attain is 87.01%, much higher than our first knn model's accuracy of 75.32%.
To go beyond ensembles, one can focus on feature engineering of the model.
This is all about ensemble modeling in data science, where we keep pushing the boundaries of our model metrics so that we can improve model performance and hence increase our model's impact on data-driven decisions.
It will always be a wise idea to use these techniques to better our model results rather than just building a single model and deploying it.


Deploying Natural Language Processing for Product Reviews

Introduction

We have data all around us, and it comes in two forms: tabular and text. If you have good statistical tools, tabular data has a lot to convey, but it is really hard to get something out of text, especially natural spoken language. So what is natural language? We humans have a very complex language, and natural language is the true form of human language, spoken or written with sincerity and often surpassing grammatical rules. The best example of where you can find this language is in "reviews". You write a review mainly for two reasons: either you are very happy with the product or very disappointed with it. With your reviews and a Machine Learning algorithm, entities like Amazon can figure out whether a product they are selling is good or bad, and depending upon the results of the analysis they can make further decisions on that product for their betterment.

Consider an example wherein we have thousands of comments in the reviews of a product sold by Amazon. It is very time consuming and hard for an individual to sit back, read all of those thousands of comments, and decide whether people like the product or not. We can simply make machines do this work by implementing a Machine Learning algorithm from which machines can learn natural human language and make decisions. One approach is for the machine to "bag" words of basically two types, positive and negative. After bagging these words, the machine applies majority rule to identify whether there are more positive or negative words, thus helping us gauge the acceptance of the product sold.

The technique described above is the "Bag of Words (BoW)" Natural Language Processing (NLP) approach, which is what we are going to learn today. That mouthful could also have been the name of this blog, but phew, I managed with a better one.
So, there are other techniques too, such as Word Embedding NLP, and maybe more, but as far as I know it is these two. Today we are focusing only on the BoW type of NLP in this article.

BoW NLP Implementation Road Map

Implementing this algorithm is a little complex and can be confusing, as it is longer than the usual algorithm process. So, below is a block diagram, or road map, to give you an idea of the steps to implement this algorithm and finally make predictions with it. You can always fall back on this diagram if you are confused about why a particular thing is being done!

Flow chart of implementing the BoW NLP algorithm

For machines to learn our natural language, the data needs to be provided in a certain format. Thus, we do data pre-processing to present the data in the way the machines expect it. In pre-processing we have to make sure the data is in strictly two columns, one for Rating and the other for RatingText. Rating is either 1 or 2 (negative and positive respectively) and RatingText is the corresponding review text. RatingText should have no unusual spacing and no punctuation. Below are two snippets showing the data before and after pre-processing so that you understand better.

Data before pre-processing
Data after pre-processing

The next step is creating the corpus (bag of words), which can easily be done by a function in R (addressed later in the listings). From this corpus we then create a Document Term Matrix (DTM), a matrix that represents the number of times each word is repeated throughout the entire corpus. Once we have the document term matrix, we split our data into training and testing sets. The last step is implementing our model. Here we deploy the Naive Bayes model: after fitting the model on the training data set, we make predictions on the test data set using the model created.
In the end, we will compute the accuracy of our predictions to estimate how well our model performed.

Model Data

The data is the Amazon review data set, an open data set made available by Xiang Zhang. It was constructed with review score 1 for negative and 2 for positive ratings, and it has in total 1,800,000 training and 200,000 testing samples. Don't worry, we are using just 1000 of the instances to avoid computational complexity. The data is available for download at the link below:

https://drive.google.com/file/d/1SncEr1yOBW4OmxWIYaKsNw_MS1Rh8S8j/view?usp=sharing

You just need to make sure that, once you have the file downloaded, you convert it into .csv format so that you can use it for analysis in R.

Model

Pre-processing Data

First things first: as mentioned before, we begin with data pre-processing. We load the data, convert it into a data frame from a character string, and then separate it into two columns, "Rating" and "RatingText", using an R function from the "tidyr" library. Below are the listings.

Listing for loading 1001 lines of the Amazon review data set
View of the first 10 rows of the loaded data
Listing for separating data into the 2 columns "Rating" and "RatingText"
View of the first 10 rows of the separated data

Now we need to retain only the alphanumeric entries in the data, removing anything else. Below is the listing to do that for both separated columns.

Retaining only alphanumeric entries in the data

Lastly, we need to remove unwanted spacing in the text. Below is the listing for removing the unusual spacing and maintaining single spacing throughout.

Removing unusual spacing
View of the data with spacing removed

We now simply put our data into an object named "text" and check the data dimensions to be sure that the complete data was written to the object.
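The pre-processing steps above might be sketched as follows. The file name, the comma separator, and the exact cleaning expressions are assumptions on my part, not the original listings.

```r
library(tidyr)

# Load the first 1001 lines of the review file (file name assumed)
raw <- readLines("amazon_reviews.csv", n = 1001)
reviews <- data.frame(raw_text = raw, stringsAsFactors = FALSE)

# Separate into the two columns "Rating" and "RatingText"
reviews <- separate(reviews, raw_text,
                    into = c("Rating", "RatingText"),
                    sep = ",", extra = "merge")

# Retain only alphanumeric characters, then normalize spacing
reviews$RatingText <- gsub("[^[:alnum:] ]", " ", reviews$RatingText)
reviews$RatingText <- gsub("\\s+", " ", trimws(reviews$RatingText))

text <- reviews
dim(text)  # confirm the complete data made it into the object
```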
Console view of defining an object "text" and storing our data

The dimension printed above makes sure we have the entire data in the object text.

Creating the Corpus

Our next step is creating the corpus. This can easily be done with the function VCorpus() (Volatile Corpus) from the library tm. To access this function, install the package with install.packages("tm") and load it into the R environment with library(tm). Once loaded, we can use VCorpus() to convert our review text into a corpus.

Listing for creating the corpus using VCorpus()

From the volatile corpus, we create a Document Term Matrix (DTM). A DTM is a sparse matrix created using the tm library's DocumentTermMatrix function. The rows of the matrix indicate documents and the columns indicate features, that is, words. The matrix is sparse because all unique unigrams of the dataset become columns in the DTM and, as each review comment does not contain every unigram, most cells hold a 0, indicating the absence of that unigram. While it is possible to extract n-grams (unigrams, bigrams, trigrams, and so on) as part of the BoW approach, the tokenize parameter can be set and passed as part of the control list in the DocumentTermMatrix function to accomplish n-grams in the DTM. It must be noted that using n-grams creates a very high number of columns in the DTM. This is one of the demerits of the BoW approach and, in some cases, it can stall the execution of the project due to limited memory. As our specific case is also limited by hardware infrastructure, we restrict ourselves to unigrams in this project. Apart from generating unigrams, we also perform some additional processing on the review text documents by passing parameters to the control list of the tm library's DocumentTermMatrix function.
The processing we do on the review text documents during the creation of the DTM is:

- Convert all text to lower case
- Remove numbers, if any
- Remove stop words such as a, an, in, and the, as stop words do not add any value in analyzing sentiment
- Remove punctuation, which again adds no meaning when analyzing sentiment
- Perform stemming, that is, converting a word into its natural form, for example removing the s from plurals

Below is the listing for creating the DTM from our volatile corpus and doing the above-mentioned pre-processing.

Listing for creating the DTM and pre-processing
Inspect view of our train_dtm

Above we can see how our DTM is created, where rows indicate documents and columns indicate features, as explained before. The output has 1001 documents with 5806 columns representing unique unigrams from the reviews. We also see our DTM is 99% sparse, with only 34,557 non-zero cells. Each non-zero cell represents the frequency of occurrence of the column's word in the document on that row of the DTM. The DTM tends to get very big, even for normal-sized datasets. Removing sparse terms, that is, terms occurring in only very few documents, is a technique that can be tried to reduce the size of the matrix without losing significant relations inherent to it. Let's remove sparse columns from the matrix: we will remove those terms that have at least 99% sparse elements with the following listings.

Removing 99% sparse elements
Inspection after removing sparse terms

We now see the sparsity reduced to 97% and the number of unigrams reduced to 688. We can now implement our algorithm, but first let us split our data into training and testing sets.
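Putting the corpus and DTM steps together, a sketch using the tm API might look like this; the control-list options mirror the pre-processing points above, and the object names are illustrative.

```r
library(tm)

# Volatile corpus from the review text column
train_corpus <- VCorpus(VectorSource(text$RatingText))

# DTM with lower-casing, number/stop-word/punctuation removal, and stemming
train_dtm <- DocumentTermMatrix(train_corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))
inspect(train_dtm)

# Drop terms that are at least 99% sparse to shrink the matrix
train_dtm_small <- removeSparseTerms(train_dtm, 0.99)
inspect(train_dtm_small)
```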
Splitting data into training and testing sets for implementing the algorithm

Once we have the data split, we can build our model on the training set and later test its predictions on the testing set.

Implementing the Naive Bayes Model

One point to note is that the Naive Bayes model works on nominal data. Our data currently has entries counting the number of times a word is repeated, and that count does not affect the sentiment, so we can convert our data into nominal data with the following rule: if the count is non-zero then Y, and if 0 then N. Below is the listing to do this:

Listing for converting data to nominal data
Snippet view of the converted nominal data

Now that we have nominal data, we can implement a Naive Bayes model. The algorithm can be implemented with the function naiveBayes() in the R library e1071. If you do not have this library, you can use the following code to install the package and load it into the R environment:

Listing to install and use the e1071 package

Below is the listing to implement the Naive Bayes model.

Listing to implement the Naive Bayes model

Finally, with the model in hand, we make predictions for the test data set and check the accuracy of those predictions. Below are the listings for computing predictions and checking accuracy.

Listing for computing predictions and checking accuracy
Accuracy obtained from the model

With this quick implementation of BoW alongside a Naive Bayes model we got an accuracy of 79.5%. The accuracy can be increased further by focusing on additions such as new feature creation.

BoW NLP – Takeaways

Bag of Words (BoW) Natural Language Processing (NLP) using a Naive Bayes model is a very simple algorithm to implement when it comes to having machines examine natural language. Without much effort, the model gives us a prediction accuracy of 79.5%, which is really good for such a simple implementation.
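The conversion and Naive Bayes steps described above can be sketched as follows. The 80/20 split and the object names are assumptions, not the original listings.

```r
library(e1071)

# Convert term counts to nominal Y/N factors, as Naive Bayes expects
dtm_df <- as.data.frame(as.matrix(train_dtm_small))
dtm_df[] <- lapply(dtm_df, function(x)
  factor(ifelse(x > 0, "Y", "N"), levels = c("N", "Y")))

# Illustrative train/test split on the ~1000 reviews
train_idx <- 1:800
nb_model <- naiveBayes(x = dtm_df[train_idx, ],
                       y = as.factor(text$Rating[train_idx]))

# Predict on the held-out rows and compute accuracy
pred <- predict(nb_model, dtm_df[-train_idx, ])
mean(pred == text$Rating[-train_idx])
```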
The only drawback of BoW is that it does not capture the semantics of words; for example, "buy a used bike" and "purchase old motorcycle" will be treated as two very different sentences. To overcome this we need to consider another method, known as Word Embedding.

Introduction

Data Science and Machine Learning are furtive: they go unnoticed but are present everywhere and in every way possible. They contribute significantly to all the fields in which they are applied and leave us with evidence on which we can rely to take data-driven directions. Today, a very interesting area where we are going to see an example of Data Science and Machine Learning is 'crime'. We are going to focus on the types of crimes taking place across the 50 states of the USA and cluster them. We cluster them for the following reasons:

- To understand state-wise crime demographics
- To make laws applicable in states depending upon the type of crime taking place most often
- To get police and forces ready for the type of crime most common in their respective states
- To predict the crimes that may happen and thus take measures in advance

The above are a few applications to consider, but they are not exhaustive; depending upon the data reports produced by the algorithms, more such applications can be deployed. Our concern today is to understand how we can deploy hierarchical clustering in two ways, DIANA and AGNES, on the USA arrests data.

Theory

Hierarchical Clustering

Hierarchical clustering is a clustering algorithm that groups similar objects into clusters. A detailed treatment of hierarchical clustering is not feasible in this blog, as it is a vast topic; to get a detailed understanding, follow this link, which covers the formation of hierarchical clusters and the role of density and distances, with visual diagrams. What we will focus on today is DIANA and AGNES hierarchical clustering.

DIANA Hierarchical Clustering

DIANA is also known as the DIvisive ANAlysis clustering algorithm. It is the top-down form of hierarchical clustering, where all data points are initially assigned to a single cluster. The cluster is then split into the two least similar clusters.
This is done recursively until distinct cluster groups are formed.

Notional representation of DIANA

To better understand, consider the above figure. In step 1, the blue outlined circle can be thought of as all points assigned to a single cluster. It is then divided into 2 red-colored clusters based on the distances/density of points, so in step 2 we have two red-colored clusters. Lastly, in step 3, each red cluster is further divided into 2 black dotted clusters, again based on density and distances, giving us the final four clusters. Since the points in each of the 4 clusters are very similar to each other and very different from the other cluster groups, they are not divided further. This is how we get DIANA clusters, or top-down hierarchical clusters.

AGNES Hierarchical Clustering

The AGglomerative NESting hierarchical clustering algorithm is exactly the opposite of the DIANA we just saw. It is a bottom-up approach: every data point is initially assigned as its own cluster, so if there are n data points, n clusters are formed initially. In each following iteration, similar clusters are merged (again based on density and distances), and this continues until similar points are clustered together and are distinct from the other clusters.

Notional representation of AGNES

As we can infer from the above representation of AGNES, in step one all the data points are assigned as clusters. In step two, depending upon density and distances, the points are clubbed into clusters. Lastly, in step 3, all the similar points are clustered together, distinct from the other clusters, forming our final clusters. With this much theory and a working idea of DIANA and AGNES hierarchical clustering, we can now move on to building our model in R.

The Model

Data

The data we are using today for modeling is an inbuilt R data set in the MASS library.
The data set is named 'USArrests'; it has 50 observations (for the 50 US states) with 4 variables or data fields. It holds arrests per 100,000 residents (for the year 1973) in the following fields:

$Murder – numeric – Murder arrests per 100,000
$Assault – integer – Assault arrests per 100,000
$UrbanPop – integer – Percent urban population
$Rape – numeric – Rape arrests per 100,000

Below is the listing for loading the library followed by loading the USArrests data set. The later images show the summary and structure of the data set generated in R.

Loading data in R
Summary and structure of USArrests

Hopkins Statistic

Okay, we know we are clustering, but are we sure the data we are clustering can be clustered? Will the clusters that are formed be valid? To answer these questions we compute the Hopkins statistic. It is a measure of cluster tendency that takes a value between 0 and 1. The cut-off for a useful value is 0.5: if the Hopkins statistic is below 0.5, valid clusters are likely to be formed, and if it is over 0.5 and closer to 1, the structures in the data set are random rather than real. The 'factoextra' library has the function get_clust_tendency() (load the library factoextra), which returns the Hopkins statistic for the data. Below is the listing to compute the clustering tendency, followed by the Hopkins statistic obtained.

Listing for computing the Hopkins statistic
Hopkins statistic obtained

Since we obtained a Hopkins statistic of 0.4141, which is less than 0.5, we can say that the structures in the data set are not random but real structures out of which clusters can be formed.

Determining 'k'

The next big question to answer is the optimal k value, i.e. the number of clusters to be formed. There will be scenarios where you are directly given a 'k' value depending upon the application.
And there are as many as 30+ ways to determine this k value. An efficient and excellent way is to use the NbClust() function from the NbClust library of R. It determines the optimal number of clusters through extensive computation, varying combinations of the number of clusters, distance measures, and clustering methods all at once; once all the computations are done, the k value is given by majority rule. We can then use the computed k value. Below is the listing to compute the k value using NbClust(). You may have to install and load NbClust if it is not already in use.

Deploying NbClust()
Snippet of the summary generated by NbClust()

As we can see above, NbClust suggests the optimal number of clusters is 2, so we will go ahead with this value as k.

Modeling DIANA

Now we can proceed to model DIANA. For this we first load the library cluster, which has functions for both DIANA and AGNES. Below is the listing for computing DIANA and plotting its dendrogram using the pltree function.

Listing for computing DIANA and plotting the DIANA dendrogram
Dendrogram of DIANA, with rectangles indicating 2 clusters

For DIANA we have a metric called the divisive coefficient, which can be extracted from the DIANA model as the attribute $dc. This measure takes a value between 0 and 1, and the closer to 1 the better.

Listing for extracting the DIANA attribute $dc
Obtained divisive coefficient $dc

As the $dc value obtained is very close to 1, we can say the clusters are accurately formed and valid. Now, to see the actual clusters formed and their relationships, we use the function cutree() to group the clusters, which we can then print and inspect.

Using cutree() to group the clusters formed in DIANA
Cluster sizes of DIANA

As we can see above, cluster 1 has 21 states and cluster 2 has 29 states, together making up the 50 states.
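For reference, the DIANA steps walked through above (optimal k, model, dendrogram, $dc, cutree) might be sketched as follows. Standardizing the data first is an assumption on my part, not stated in the original listings.

```r
library(cluster)
library(NbClust)

dat <- scale(USArrests)  # standardizing is an assumption, not from the post

# Optimal number of clusters by majority vote over NbClust's indices
nb <- NbClust(dat, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "ward.D2")

# Divisive hierarchical clustering
d_model <- diana(dat)
d_model$dc                        # divisive coefficient; closer to 1 is better
pltree(d_model, main = "DIANA")   # dendrogram

# Cut the tree into k = 2 groups and show cluster sizes
groups <- cutree(as.hclust(d_model), k = 2)
table(groups)
```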
Further, we can print the clusters, see the relationships between them, and understand why the clusters were formed the way they were. Below is the listing for creating a function to print the clusters, followed by printing the DIANA clusters.

Listing for creating a function to print the clusters
Printing clusters using the cluster print function
Print of DIANA clusters

Comparing the two clusters in general, we can infer that the states are divided into two groups: the states in cluster 1 have high murder, assault, and rape cases per 100,000 compared to the states in cluster 2. Below is the mean comparison of murder, assault, and rape cases for both clusters.

            Murder   Assault   Rape
Cluster 1     11.8     255.0   28.1
Cluster 2      4.7     109.7   16.2

Thus we can conclude that cluster 1 states have more average arrests per 100,000 than cluster 2 states; cluster 2 states are safer than cluster 1 states. Lastly, we visualize the clusters using the function fviz_cluster() from the factoextra library. The plot is in two dimensions, based on the first two principal components, which explain the majority of the variance. Following is the listing for visualization.

Listing for visualizing DIANA clusters
DIANA clusters – visualization

Modelling AGNES

So far, whatever we have done for DIANA we will execute for the AGNES clusters as well. Below is the entire listing for modeling, evaluating, and visualizing AGNES clusters; please refer to the comments in the listing for the details.

Listing for modelling, evaluating and visualizing AGNES

As we saw the dc coefficient in DIANA, we have ac (the agglomerative coefficient) for AGNES. Again, it takes a value between 0 and 1, and the closer to 1 the better; here we have 0.90, which is a good ac value, indicating the clusters formed can be relied upon.
ac coefficient of AGNES
Cluster table of AGNES

Below is the print of the clusters formed by AGNES.

AGNES clusters print

AGNES clusters summary:

            Murder   Assault   Rape
Cluster 1     11.8     266.3   28.3
Cluster 2      5.8     122.8   17.8

As we can see from the above summary table, AGNES has formed clusters consistent with DIANA, in the sense that cluster 1 states have a higher number of murders, assaults, and rapes per 100,000 than the states of cluster 2. Below is the fviz_cluster() plot for AGNES.

AGNES cluster plot

Takeaways

- Use the Hopkins statistic to check whether the structure in the data is random or real enough to form clusters; it should be below 0.5.
- No matter which approach you take, DIANA or AGNES, both will give you clusters with the same meaning; only the number of entries in the clusters may differ.
- Both are easy to compute, and you can view the dendrogram to understand either clustering visually.
- Use the fviz_cluster function to visualize clusters in the first two principal components, where the majority of the information lies.

Conclusion

Clustering is an unsupervised machine learning algorithm in which we compute analytics, mostly without a pre-defined aim, to understand the relationships in the data. Once we understand the trends in the data, we can take the necessary actions and data-driven decisions accordingly. In this case, there could be stricter law enforcement in the cluster 1 states to reduce the number of murder, assault, and rape cases, and police and officials could be better equipped and trained for such situations.

EndNote

Datum was started with a vision to share the thinking and workings of Data Science and Machine Learning algorithms. Continuing with this vision, The Datum is growing and getting better, and we would like to thank all the supporters, followers, and viewers from more than 65 countries worldwide. We at Datum appreciate this and hope for your continued support.

Multiple Datums form Data; mindless Data can do wonders.
Mindful you and I together can do magic! A source where you can gain/contribute Data Science, because together we can get better!

Your Data Scientifically,
The Datum

Introduction

How often are you in a situation where you have 2 alternatives: yes or no, black or white, and so on? These are instances where you 'classify' your scenario into just two solutions; the number of solutions may vary, but usually there are two. This is what we call 'classification': we classify the outcomes into a set number of classes, usually two. This week at The Datum we look at how we can use neural networks as a classification model. Once we have the model in hand, we will make predictions with it, and lastly we will evaluate our model and its predictions for correctness.

Model Brief

Today we have a very interesting model to build, one which can classify breast cancer. Cancer is the uncontrolled multiplication of cells, and breast cancer is caused by the multiplication of cells in the mammary gland that are transformed into malignant cells. These cells have the ability to detach from the tissues in which they are formed and invade the surroundings. The breast is formed from multiple types of cells, but the most common breast cancers arise from glandular cells or from those forming the walls of the ducts. The objective of this model is to classify the benign and malignant classes, which form the two most common types of breast tumor. For this model we will use the BreastCancer data set already present in R, available in the mlbench package, and we will classify benign and malignant types of breast cancer using a neural network as the classifier.

Model Data

The data, as mentioned, is readily available in the R library mlbench. Except for the first variable, each of the 11 attributes is numerical, with values ranging from 0 through 10, and there are some missing values as well.
Following are the data fields of this data set:

$Id: Sample code number
$Cl.thickness: Clump thickness
$Cell.size: Uniformity of cell size
$Cell.shape: Uniformity of cell shape
$Marg.adhesion: Marginal adhesion
$Epith.c.size: Single epithelial cell size
$Bare.nuclei: Bare nuclei
$Bl.cromatin: Bland chromatin
$Normal.nucleoli: Normal nucleoli
$Mitoses: Mitoses
$Class: Class

With a good amount of information on the model brief and the data, we are ready to implement the Neural Network classifier model in R. For an in-depth understanding of Neural Network concepts and functioning, the blog Prediction Analysis with Neural Networks and Linear Regression is recommended (here), which has all the information needed to understand the working of Neural Networks. For complete access to this blog, with the entire listings and data, read here.
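The classification workflow described above can be sketched in a few lines of R. This is a minimal illustration only: it uses the nnet package (shipped with R as a recommended package) and the built-in iris data rather than the BreastCancer data, so it stands in for, but does not reproduce, the model built in this post.

```r
library(nnet)  # single-hidden-layer neural networks; ships with R

set.seed(1)
# Reduce iris to a two-class problem, as a stand-in for benign/malignant
d <- droplevels(subset(iris, Species != "setosa"))

# Fit a neural network with one hidden layer of 3 units
fit <- nnet(Species ~ ., data = d, size = 3, trace = FALSE)

# Predict classes and check training accuracy
pred <- predict(fit, d, type = "class")
mean(pred == as.character(d$Species))
```

In the real model, BreastCancer from mlbench would replace iris, with the $Class field as the response.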


1. What is Machine Learning

Humans have evolved over the years to become what we call the smartest species on planet Earth. Going back to ancient days, what do you think was the basis of human evolution into what we are today? 'Learning' was the key to this advancement, from the very first man to the most mindful species on the planet. For example, consider the insight that rubbing two stones against each other can light a fire. This one source of energy, discovered by early man out of need, was a gateway to multiple tangents: preparing food, protection against animals and insects, and a light source in the dark nights of the forests. Fire has progressed so far that you can find its applications in daily life today. This is a classic example of learning and its growth, used to better ourselves and find useful applications moving forward.

The 1940s brought the first electronic computers. Since then, just as with fire, there has been a massive expansion of the computing world, up to the smart machines and supercomputers we have today. What was the cause of this advancement? Yes, you are right: learning. Humans researched and learned new technologies, blocks kept being added to the system, and today we see smart computers.

Growth is an ongoing process and always advances when plotted against time. It is an integral part of our ecosystem, as things keep getting bigger and better on the basis of learning. In recent times, we humans have been giving a new dimension to computers, and this is called 'Machine Learning'. Machine Learning is a method where a system is fed data; the machine then interprets the data, finds trends, and builds models that bring out insights from which we can make the most.
These trends help us get better in fields like sales, precision medicine, location tracking, fraud detection and handling, advertising, and, of course, entertainment media[2]. We keep bettering ourselves by learning, and now, by giving this power of learning to machines, we are knocking on the door of another miracle in the expansion of technology.

Basically, there are two methods (discovered so far; we never know what's next) by which machines can learn: the Supervised and Unsupervised methods of learning. This article focuses on the latter, 'Unsupervised Learning'. Let's go back to the discovery of fire by the first human. Was that first human supervised to do so? Did he see somewhere that this could happen and simply replicate it? Was there any source he could refer to for the procedure to light a fire? The answer clearly is no. What he had was his instinct and two stones that could fire up dry grass with a spark, an idea that may have occurred to him through daily observation of things around him. That is unsupervised learning. After the first discovery of fire, the later applications and evolution of fire for preparing food, lighting dark forests, protection against animals and insects, and so on can be termed supervised learning: the human already had the tools and the basic procedure to light a fire; only the applications differed. These applications also required learning, but since the basic procedure was already known, it is termed supervised learning.
3.1 Definitions

Unsupervised Learning: In unsupervised learning, we try to relate the input data in some way or another so that we can find relationships in the data and capitalize on our service based on the trends or relations developed by the learning.

Example: Based on the 'likes' of people in an online music library, we can cluster people having similar tastes in music and accordingly recommend them similar music, keeping them engaged with our music library, which is a service. You now have an idea of how you get those associations and recommendations on Amazon, YouTube, Netflix, etc.

Below is a notional visualization explaining the difference between supervised learning (algorithms coming soon) and unsupervised learning (hierarchical clustering, addressed in this blog).

Notional Visuals: Unsupervised vs Supervised Machine Learning

3.2 Unsupervised Learning Algorithm

Clustering is one of the methods of unsupervised learning. Here we observe the data and try to relate each datum with the data similar to it in characteristics, thus forming clusters. Each cluster holds similar data and is distinct from the other clusters. For example, a cluster of people liking jazz music is distinct from a cluster of people enjoying pop music. This work will help you gain knowledge of one clustering method in particular: hierarchical clustering.

Hierarchical Clustering: As the name describes, clustering is done on the basis of hierarchy (by mapping a dendrogram, explained further in the practical example of this work)[2].

4. Hierarchical Clustering: Approach

4.1 Density and Distances

Clustering from its early stages was developed on the basis of two notions: density and distance.
Let us understand each of them:

Density: As the name suggests, if you have dense data in one region of a plane and another dense patch in the same plane but at a distance, those dense regions form clusters in the data.

Distance: Whether two clusters are distinct, or whether a datum belongs to cluster 1 or cluster 2, depends upon the separation distance between them. Distance in Data Science can be computed as Euclidean distance, Manhattan (city-block) distance, Hamming distance, or cosine distance. These distances rest on basic algebra and geometry and can easily be refreshed by going back to basics if you are not well versed in them.

The figure below illustrates density and distance and how they produce clustering. There are two groups of data, each with a specific density in its respective space; they are clustered separately because each cluster has distinct characteristics based on the data it contains. Further, there is also a distance between the two clusters; such distances can be measured between individual datums or between the clusters considered as wholes.

Notional Illustration of Density and Distance in Hierarchical Clustering

4.2 Implementing Hierarchical Clustering in R

4.2.1 The Model Data

I have used the R language to code clustering on World Health Organization (WHO) data containing 35 countries, limited to the region of 'America'[3].
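The distance measures described above (Euclidean, Manhattan, and so on) map directly onto base R's dist() function; a small sketch with made-up points:

```r
# Two points in the plane: (0, 0) and (3, 4)
pts <- rbind(a = c(0, 0), b = c(3, 4))

# Euclidean distance: sqrt(3^2 + 4^2) = 5
dist(pts, method = "euclidean")

# Manhattan (city-block) distance: |3 - 0| + |4 - 0| = 7
dist(pts, method = "manhattan")
```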
Following are the data fields:

$Country: names of the countries of the America region (datatype: string)
$Population: population count of the respective countries (datatype: integer)
$Under15: count of population under 15 years of age (datatype: number)
$Over60: count of population over 60 years of age (datatype: number)
$FertilityRate: fertility rate of the respective countries (datatype: number)
$LifeExpectancy: life expectancy of people of the respective countries (datatype: integer)
$CellularSubscribers: count of people in the respective countries possessing cellular subscriptions (datatype: number)
$LiteracyRate: literacy rate of the respective countries (datatype: number)
$GNI: Gross National Income of the respective countries (datatype: number)

4.2.2 Setting Goals and Expectations

The goal here is to group countries in terms of their health using the data fields mentioned above. The data is loaded in R as the 'data' object, and the next image shows the R-generated summary of our data:

Summary of Data generated in R

4.2.3 Data Preparation

Here it becomes important that all the data fields are scaled in a disciplined way. This can be considered the biggest trick to getting your clusters right. If the data is not scaled uniformly, we will not get the distances right, and the result will be low-confidence, poorly formed clusters. Scaling brings all the data fields onto a common footing and thus eases the task of computing distances. I am using R's scale() function to scale all fields except the first ($Country) to a mean of 0 and a standard deviation of 1. The scale function generates two attributes, 'scaled:center' and 'scaled:scale'. Both are saved in separate objects; they can be used later to transform the data back to its original units if required. The image below shows the code:

Listing for scaling the data
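As a minimal sketch of this scaling step (on synthetic numbers, not the WHO data):

```r
set.seed(7)
# Synthetic numeric matrix standing in for the WHO fields
m <- matrix(rnorm(20, mean = 50, sd = 10), ncol = 2)

# Centre each column to mean 0 and scale to standard deviation 1
s <- scale(m)

# scale() stores the original means and sds as attributes,
# so the transformation can be undone later if needed
centers <- attr(s, "scaled:center")
sds     <- attr(s, "scaled:scale")

colMeans(s)      # effectively 0 for each column
apply(s, 2, sd)  # 1 for each column
```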
4.2.4 Hierarchical Clustering

To compute hierarchical clusters, I first compute distances using R's dist() function. I have used Euclidean distance, but other distances like Manhattan can also be used, and for binary categorical variables one might use method = "binary". Having defined the distance object, I use the hclust() function to compute hierarchical clusters with the 'ward.D2' method. Ward's[2] method is the classic method for such clusters: it starts with individual data points and iteratively merges them into clusters, an approach that minimizes the total within sum of squares (WSS) of the clustering. The figures below show the cluster model, the groups formed, and the dendrogram of the generated clusters.

Listings for computing the hierarchical model
Groups formed in the model
Dendrogram generated by the hierarchical model

Further, we can print the clusters by writing a function to paste together the groups formed by cutree. Below are the listings and the printed clusters.

Listings for printing the clusters formed
Clusters formed by Hierarchical Clustering

The above information can be interpreted as follows:

Cluster groups are formed with some geographic similarity (with minor trade-offs) taken into consideration.
Cluster 1 has a high literacy rate, with a minimum and maximum of 94.10 and 99.0 respectively.
Cluster 2 has lower literacy rates, with a minimum of 61.0 and a maximum of 92.46; these countries also have the lowest GNI amongst all the cluster groups.
Cluster 3 has literacy rates around the 90s, but not as widely spread as cluster 2.
Cluster 4 has the highest literacy rates, with a minimum of 99.
Cluster 5, the United States, stands alone, mainly because its numbers differ from those of all other countries in the American region, including the highest GNI amongst all of the countries at $48,820.

Interpreting cluster outputs is also about perception: what you can see, maybe I cannot, and vice versa.
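The dist → hclust → cutree pipeline described in this section can be sketched end to end in base R. The built-in mtcars data stands in for the WHO data here, and k = 3 is an arbitrary choice for illustration:

```r
# Scale the data, compute Euclidean distances, cluster with Ward's method
d   <- dist(scale(mtcars), method = "euclidean")
fit <- hclust(d, method = "ward.D2")

# Cut the dendrogram into 3 groups and list the members of each
groups <- cutree(fit, k = 3)
split(rownames(mtcars), groups)

# plot(fit) would draw the dendrogram
```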
It is always good to spend some time and bring out as many insights as possible.

4.2.5 Visualizing Clusters

In this section I am going to show how you can visualize the clusters formed above using R's ggplot2 library. Cluster visualization can be done with any two principal components, but visualizing the clusters in the first two principal components will show you most of the useful information. This is because, if you have N variables in your data, the principal components describe the hyper-ellipsoid in N-space that bounds the data, and the first two components capture the largest share of the data's variation in two dimensions. Thus I am using the first two principal components, calling R's prcomp() function to perform the principal component decomposition. Below is the listing, followed by the obtained plot.

Listing for plotting clusters in the first 2 principal components
Visualizing clusters represented in the first 2 principal components

From the above we observe:

Cluster 2 is widely spread and is distinct from the other clusters.
Cluster 4, which has just two countries, Canada and Cuba, has its own space.
Cluster 5 (the USA) is also distinct, but sits in and around the mingling of the other clusters.
The rest, clusters 1 and 3, co-mingle among themselves.

Now we have clustered our data, and this is what unsupervised learning is all about: finding relations among the individual data so that we can form groups. All done and dusted? But wait, are you sure the clusters formed are based on real relationships in the data? Or are they just an artifact of the clustering algorithm?

4.2.6 Bootstrap Evaluation of the Clusters

Bootstrapping our clusters is the best way to test them against variations in our dataset. We can bootstrap using the clusterboot() function in the R package fpc.
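Before turning to the bootstrap, the principal-component projection described in 4.2.5 can be sketched in base R (the styled plot in the post uses ggplot2; mtcars again stands in for the WHO data):

```r
x      <- scale(mtcars)
groups <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

# Principal component decomposition of the scaled data
pc <- prcomp(x)

# Component standard deviations come sorted in decreasing order,
# so PC1 and PC2 capture the largest share of the variance
proj <- pc$x[, 1:2]

# Colour each point by its cluster membership
plot(proj, col = groups, pch = 19, xlab = "PC1", ylab = "PC2")
```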
clusterboot() has an interface to hclust(), which makes it possible to evaluate the formed clusters by re-sampling, telling us how stable each cluster is.

J(A, B) = |A ∩ B| / |A ∪ B|

The similarity measure above is known as Jaccard similarity, and it is the basis on which clusterboot() operates. Jaccard similarity states that the similarity between two sets A and B is the ratio of the number of elements in the intersection of A and B to the number of elements in the union of A and B.

Following are the four steps to bootstrap your clusters:

1. Cluster the data as usual, as I have done above.
2. Create a new dataset of similar size by re-sampling the original WHO dataset with replacement (some data points replicated more than once, some not included at all), then cluster this new dataset.
3. For each original cluster, find the most similar cluster in the new clustering, i.e. the one giving the maximum Jaccard coefficient. The maximum Jaccard coefficient linked to each cluster indicates whether the cluster dissolved. The threshold can be set at 0.5: if the maximum Jaccard coefficient is less than 0.5, the original cluster is considered dissolved, meaning it did not show up in the new clustering. A cluster that dissolves frequently is probably not a 'real' cluster.
4. Repeat steps 2 and 3 many times.

The following maximum Jaccard coefficient values can be taken as stability indicators for the clusters:

< 0.6: unstable cluster
0.6 – 0.75: average, but not to be trusted
> 0.85: highly stable

I now run clusterboot() on my data. I load the fpc package and set the desired number of clusters to 5 (as in our original clustering), then call clusterboot() with the appropriate arguments and print the resulting clusters. Below is the listing and output.
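The Jaccard similarity underlying clusterboot() is simple to compute directly; a base-R sketch with made-up cluster memberships (clusterboot() itself lives in the fpc package and is not used here):

```r
# Jaccard similarity: |A intersection B| / |A union B|
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# An original cluster vs. its best match in a re-sampled clustering
orig      <- c("Canada", "Cuba", "USA")
resampled <- c("Canada", "Cuba", "Mexico")

jaccard(orig, resampled)  # 2 shared members / 4 distinct members = 0.5
```

A value of 0.5 sits exactly at the dissolution threshold described in step 3 above.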
Listing to clusterboot our clusters
Printed clusters from clusterboot (same as the original clusters)

Now trust me, if you get this, you get a good night's sleep as a Data Analyst. In a nutshell, I got the same clusters as the original clusters, exactly the same. This shows that our clusters were formed well enough to attain high maximum Jaccard coefficients, which indicate highly stable clusters. In the next two steps, I take this further to verify just how stable and useful the clusters are. Remember I told you that a cluster is considered dissolved based on its maximum Jaccard coefficient? Now I will show how you can find the number of dissolutions each cluster went through, along with each cluster's maximum Jaccard coefficient. Below is the listing and the generated output.

Listing to extract the maximum Jaccard coefficient for each cluster and the cluster dissolution count (out of 100)
Maximum Jaccard coefficients and each cluster's dissolution count (out of 100)

bootmean gives the maximum Jaccard coefficients; the numbers for our clusters are acceptable, as almost all are above 0.6, within the acceptable stability range.
bootbrd gives the number of times each cluster was dissolved across the 100 'boot' runs clusterboot performed.

4.2.7 Are you picking the exact K value for your clusters?

For the example illustrated, I picked the number of clusters (5) visually, just by observing the dendrogram, but will you get such easy pickings every time? The answer is no: real-world data is very complex, and it will not always be possible to pick an optimal number of clusters visually. 'Total within sum of squares' and 'Calinski-Harabasz' are two computational criteria that yield an optimal K value for our clusters. They are not listed in this document, but it is always good to know of them.
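The 'total within sum of squares' criterion mentioned above can be sketched in base R by cutting the tree at several values of K and comparing WSS. The helper function here is my own, written for illustration, and mtcars again stands in for real data:

```r
# Total within-cluster sum of squares for one cluster assignment
wss <- function(x, groups) {
  sum(sapply(split(as.data.frame(x), groups), function(g) {
    # Centre each cluster's columns and sum the squared deviations
    sum(scale(as.matrix(g), center = TRUE, scale = FALSE)^2)
  }))
}

x   <- scale(mtcars)
fit <- hclust(dist(x), method = "ward.D2")

# WSS shrinks as K grows; the 'elbow' in this curve suggests a good K
sapply(2:6, function(k) wss(x, cutree(fit, k)))
```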
5. Conclusion

Hierarchical clustering is one of the best modeling algorithms in unsupervised machine learning. The key takeaway is the basic approach to model implementation and how you can bootstrap the implemented model so that you can act on your findings with confidence in practice. Remember the following:

Prepare your data by scaling it to a mean of 0 and a standard deviation of 1.
Implement the clustering and print/visualize the clusters.
Bootstrap your clusters to test the cluster model's confidence.

Cheers,
Neeraj