Till your good is better and better is best!

ENSEMBLE

Yes, the above quote is so true. We humans have the ability to rate things and find one metric or another to measure and evaluate them or their performance. Similarly, in Data Science you can measure your model's accuracy and performance! From the very first model that you build, you can record its metrics, and those metrics become your benchmark. Moving ahead, you strive to build other models on the same data that can beat that benchmark: which predicts better, which has better accuracy, and so on; the metric can be anything. This vital technique is known as 'Ensembling', where you build one model, set its performance as the measuring metric, and try to build other models that outperform it. This is the topic of the week! We will learn how to ensemble models on a very interesting "Diabetes" dataset.

THE DATA

This data is taken from Kaggle, and the best description of it is the one provided on the portal: "The data was collected and made available by the 'National Institute of Diabetes and Digestive and Kidney Diseases' as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here belong to the Pima Indian heritage (subgroup of Native Americans), and are females of ages 21 and above."

Below are the R-console summary and structure of this data for better interpretation:

Summary of Diabetes Data
Structure of Diabetes Data

Here, we will build models to predict the "Outcome", i.e. diabetes yes or no, and we will apply ensemble techniques to better our predictions. We will be using several techniques to do this; each technique is briefed as we build the code below.
Here is the link to the data, from where it can be downloaded: https://www.kaggle.com/kandij/diabetes-dataset

THE MODEL

EDA – Exploratory Data Analysis

EDA is typically done to better understand our data before we actually start building our models. This walkthrough has a lot to do with the caret library of R. caret has a few functions with which you can easily plot graphs for EDA. Below are the listings for loading the data and computing some EDA plots.

Scatter plot matrix
Listing for loading the data and the scatter plot of the first 4 data fields
Scatter plot matrix for the first 4 Diabetes Data fields
Listing for the scatter plot matrix of the last 4 data fields
Scatter plot matrix for the last 4 Diabetes Data fields

Overlaid density plots
Overlaid density plots of the first 4 data fields
Listing for the overlaid density plot of the last 4 data fields
Overlaid density plots of the last 4 data fields

Box plots
Listing for the box plot of the first 4 data fields
Box plots for the first 4 data fields
Listing for the box plot of the last 4 data fields
Box plots for the last 4 data fields

Just eyeballing and juggling around all the plots above gives you a better idea of the data. This is why we do EDA.

kNN Model

The very first model we will build is a 'k nearest neighbours' (kNN) model. For this model we will use 10-fold cross-validation with 10 repeats to evaluate performance, and for the output we will focus on accuracy. The accuracy of this model will be our benchmark for ensembling. Below are the listings for building the kNN model.

Listing for building the kNN model
kNN model output

As we can see above, with 10-fold cross-validation repeated 10 times, the best model was the one with k = 21, and its accuracy is 0.7532, or 75.32%. This value is our benchmark.
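For readers following along without the original listings, here is a minimal sketch of what the kNN benchmark step might look like, assuming the Kaggle file has been saved locally as "diabetes.csv" (the file name, factor labels, and seed are assumptions, not the author's exact code):

```r
library(caret)

# Load the data; "diabetes.csv" is an assumed local file name
diabetes <- read.csv("diabetes.csv")
diabetes$Outcome <- factor(diabetes$Outcome, labels = c("No", "Yes"))

# 10-fold cross-validation, repeated 10 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

set.seed(7)
knn_fit <- train(Outcome ~ ., data = diabetes,
                 method = "knn",
                 trControl = ctrl,
                 tuneLength = 10)   # tries several values of k

print(knn_fit)   # reports accuracy per k; the best k's accuracy is the benchmark
```

caret's `featurePlot()` (with `plot = "pairs"`, `"density"`, or `"box"`) is the function family behind the EDA plots described above.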
Moving ahead to the ensemble models, we will strive to attain an accuracy better than 75.32%.

Bagging

Bagging, or Bootstrap Aggregation, is a process in which multiple random subsets of a given dataset are created by sampling with replacement, so the same data sample may get picked up in several subsets; a model is trained on each subset and their predictions are combined. Below is the listing for building the bagging model; we will use the "treebag" method.

Listing for building the bagging model
Output of the bagging model

The above is the output of the bagging model, and here there is a small hike in accuracy: from 75.32% we have moved to 75.58%. This is not a big leap, but it is still a better version of the model in terms of accuracy.

Random Forests Model

Our next model is the random forests model. A random forest consists of a large number of individual decision trees; each tree spits out a class prediction, and by majority rule the class with the most votes becomes the output. Below is the listing for building the random forests model.

Listing for building the random forests model
Random forests model output

As we see above, the random forests model attains its best accuracy at mtry = 2, which is 76.60%, again better than our kNN accuracy of 75.32%.

Boosting

This is how boosting works: it builds a model with a machine learning algorithm that makes predictions on the entire dataset. Among these predictions there are some misclassified instances, and a second model learns from, and focuses on, the misclassified instances of the first. This process goes on until a stopping criterion is met, and the final output is a vote over all the models' outcomes. There are a number of boosting algorithms; we will be working with Gradient Boosting Machines (GBM) and XGBoost.
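As a sketch of the bagging and random forest steps, assuming caret plus its backing packages (ipred for "treebag", randomForest for "rf") are installed and the data is in "diabetes.csv" (all names here are assumptions):

```r
library(caret)

diabetes <- read.csv("diabetes.csv")          # assumed file name
diabetes$Outcome <- factor(diabetes$Outcome, labels = c("No", "Yes"))
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Bagged trees: many trees on bootstrap resamples, predictions aggregated
set.seed(7)
bag_fit <- train(Outcome ~ ., data = diabetes,
                 method = "treebag", trControl = ctrl)

# Random forest: caret tunes mtry, the number of features tried at each split
set.seed(7)
rf_fit <- train(Outcome ~ ., data = diabetes,
                method = "rf", trControl = ctrl, tuneLength = 3)

print(bag_fit$results)   # cross-validated accuracy of the bagged model
print(rf_fit$results)    # accuracy per mtry value
```

Using the same `set.seed()` before each `train()` call keeps the resampling folds comparable across models.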
Listing for building the GBM model
Output of the boosting model

With the boosting model we got an accuracy of 76.61%, again better than the kNN accuracy of 75.32%.

XGBoost

Below is the listing for XGBoost.

Listing for XGBoost
Output of XGBoost

The output above states that the best model, the one with colsample_bytree = 0.7 and max_depth = 3, was selected, and its accuracy is 75.54%, again better than our kNN model accuracy.

Stacking

This is a very interesting model to end with. It can be called a model of models. What does that mean? Stacking runs multiple machine learning algorithms (the ones we specify) on the same dataset and gets the combined power of several ML algorithms, producing the final prediction by voting or averaging, as we do in other types of ensembles, or by training a higher-level model on their outputs. Below is the listing for building the stacking model.

Listing for stacking
Stacking results

Stacking gives us a plethora of output from which we can pick the best outcome across multiple ML algorithms. In the results above, if we look at the mean and median accuracy, all of them are above our kNN accuracy of 75.32%. The highest accuracy attained across all the models is 87.01%, which is way better than our kNN accuracy.

CONCLUSION

Ensemble models can be deployed to get better results, thus enhancing our final model performance.
The highest accuracy that we could attain is 87.01%, much higher than the 75.32% of our first kNN model.
To go beyond ensembling, one can focus on feature engineering.

This is all about ensemble modeling in data science, where we keep pushing the boundaries of our model metrics so that we can improve model performance and hence increase our models' impact on data-driven decisions.
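One common way to do the stacking step in R is with the caretEnsemble package; the following is a minimal sketch under the assumption that caretEnsemble, gbm, and the data file "diabetes.csv" are available (the method list and seed are illustrative choices, not necessarily the author's):

```r
library(caret)
library(caretEnsemble)   # provides caretList() and caretStack()

diabetes <- read.csv("diabetes.csv")          # assumed file name
diabetes$Outcome <- factor(diabetes$Outcome, labels = c("No", "Yes"))

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     savePredictions = "final", classProbs = TRUE)

# Train several base learners on identical resamples
set.seed(7)
models <- caretList(Outcome ~ ., data = diabetes,
                    trControl = ctrl,
                    methodList = c("knn", "treebag", "rf", "gbm"))

# Compare cross-validated accuracy across all base models
print(summary(resamples(models)))

# Combine the base learners with a simple GLM meta-learner
stack_fit <- caretStack(models, method = "glm", trControl = ctrl)
print(stack_fit)
```

The `summary(resamples(...))` table is the kind of output in which the per-model mean, median, and maximum accuracies can be compared against the kNN benchmark.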
It will always be a wise idea to use these techniques to better our model results rather than just building a single model and deploying it.



Deploying Natural Language Processing for Product Reviews

Introduction

We have data all around us, and it comes in two main forms: tabular and text. With good statistical tools, tabular data has a lot to convey. But it is really hard to get something out of text, especially natural language text. So what is natural language? We humans have a very complex language, and natural language is the true form of human language, spoken or written with sincerity and often bypassing grammatical rules. The best example of where you find this language is in reviews. You write a review mainly for two reasons: either you are very happy with the product or very disappointed with it. With your reviews and a machine learning algorithm, entities like Amazon can figure out whether a product they are selling is good or bad, and depending on the results of the analysis they can make further decisions about that product.

Consider an example wherein we have thousands of comments in the reviews of a product sold on Amazon. It is very time consuming and hard for an individual to sit back, read all those thousands of comments, and decide whether people like the product or not. We can simply make machines do this work by implementing a machine learning algorithm through which machines can learn natural human language and make decisions. One approach is for the machine to "bag" words of basically two types: positive and negative. After bagging these words, the machine applies a majority rule to identify whether there are more positive or more negative words, thus helping us judge the acceptance of the product.

The technique described above is the "Bag of Words (BoW)" approach to Natural Language Processing (NLP), which is what we are going to learn today; that mouthful could have been the name of this blog too, but phew, I managed with a better one.
There are other techniques too, such as Word Embedding NLP, and maybe more, but as far as I know these are the two main ones. Today we are focusing only on the BoW type of NLP.

BoW NLP Implementation Road Map

Implementing this algorithm is a little complex and can be confusing, as it is longer than the usual modeling process. So below is a block diagram, a road map, to give you an idea of the steps to implement this algorithm and finally make predictions with it. You can always fall back on this diagram if you are confused about why a particular thing is being done!

Flow chart of implementing the BoW NLP algorithm

For machines to learn our natural language, the data must be provided in a certain format. Thus we need to do pre-processing so that the data is presented the way the machine expects it. In pre-processing we have to make sure the data is in strictly two columns, one for Rating and the other for RatingText. Rating is either 1 or 2 (negative and positive, respectively) and RatingText is the corresponding review text. The RatingText should have no unusual spacing and no punctuation. Below are two snippets showing the data before and after pre-processing so that you understand better.

Data before pre-processing
Data after pre-processing

The next step is creating the corpus (the bag of words), which can easily be done by a function in R (addressed later in the listings). From this corpus we then create a Document Term Matrix (DTM), a matrix that records the number of times each word appears in each document of the corpus. Once we have the DTM, we split our data into training and testing sets.

The last step is implementing our model. Here we deploy the Naive Bayes model; once the model is fit on the training set, we make predictions on the test set using it.
In the end, we will compute the accuracy of our predictions to estimate how well our model performed.

Model Data

The data is an Amazon review dataset, an open dataset made available by Xiang Zhang. It was constructed with a review score of 1 for negative and 2 for positive ratings, and it has in total 1,800,000 training and 200,000 testing samples. Don't worry, we are using just 1,000 of the instances to avoid computational complexity. The data is available for download at the link below:

https://drive.google.com/file/d/1SncEr1yOBW4OmxWIYaKsNw_MS1Rh8S8j/view?usp=sharing

Just make sure that once you have the file downloaded you convert it into .csv format so that you can use it for analysis in R.

Model

Pre-processing Data

First things first: as mentioned before, we begin with data pre-processing. We load our data first, convert it from a character string into a data frame, and then separate the data into two columns, "Rating" and "RatingText". We can do the separation with a function from the R library "tidyr". Below are the listings.

Listing for loading 1001 lines of the Amazon review dataset
View of the first 10 rows of the loaded data
Listing for separating the data into the 2 columns "Rating" & "RatingText"
View of the first 10 rows of the separated data

Now we need to make sure we retain only the alphanumeric entries in the data, removing anything other than that. Below is the listing to do this for both of the separated columns.

Retaining only alphanumeric entries in the data

Lastly, we need to remove unwanted spacing in the text. Below is the listing for removing the unusual spacing and maintaining single spacing throughout.

Removing unusual spacing
View of the data with spacing removed

We now simply put our data into an object named "text" and check the data's dimensions to be sure that the complete data was written to the object.
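The cleaning steps just described can be sketched on a couple of toy review lines; the raw "score: text" line format shown here is an assumption for illustration, not a claim about the exact file layout:

```r
library(tidyr)

# Toy raw lines; each is assumed to look like "rating: review text"
raw <- data.frame(line = c("1: Battery died fast!!",
                           "2: Great value, works well."),
                  stringsAsFactors = FALSE)

# Split each line into Rating and RatingText at the first ": "
reviews <- separate(raw, line, into = c("Rating", "RatingText"),
                    sep = ": ", extra = "merge")

# Keep only alphanumeric characters and spaces
reviews$RatingText <- gsub("[^a-zA-Z0-9 ]", "", reviews$RatingText)

# Collapse repeated whitespace to single spaces and trim the ends
reviews$RatingText <- gsub("\\s+", " ", trimws(reviews$RatingText))

print(reviews$RatingText)
```

The same two `gsub()` calls, applied to the real RatingText column, give the punctuation-free, single-spaced text the algorithm expects.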
Console view of defining an object "text" and storing our data

The printed dimensions above confirm that we have the entire data in the object text.

Creating the Corpus

Our next step is creating the corpus, which can easily be done with the VCorpus() (Volatile Corpus) function from the library tm. To access this function, install the package with install.packages("tm") and load it into the R environment with library(tm). Once loaded, we can use the VCorpus() function, which will convert our review text into a corpus.

Listing for creating the corpus using the VCorpus() function

From the volatile corpus we create a Document Term Matrix (DTM). A DTM is a sparse matrix created using the tm library's DocumentTermMatrix function. The rows of the matrix represent documents and the columns represent features, that is, words. The matrix is sparse because every unique unigram in the dataset becomes a column in the DTM and, as each review comment does not contain every unigram, most cells hold a 0, indicating the absence of that unigram.

While it is possible to extract n-grams (unigrams, bigrams, trigrams, and so on) as part of the BoW approach, a tokenizer can be passed in the control list of the DocumentTermMatrix function to get n-grams into the DTM. It must be noted that using n-grams creates a very high number of columns in the DTM. This is one of the demerits of the BoW approach, and in some cases it can stall a project due to limited memory. As our case is also limited by hardware, we restrict ourselves to unigrams in this project. Apart from generating unigrams, we also perform some additional processing on the review text by passing parameters to the control list of the DocumentTermMatrix function.
The processing we do on the review text during creation of the DTM is:

- Convert all text to lower case
- Remove numbers, if any
- Remove stop words such as a, an, in, and the, as stop words add no value when analyzing sentiment
- Remove punctuation, which again adds no meaning when analyzing sentiment
- Perform stemming, that is, reducing a word to its root form, for example removing the s from plurals

Below is the listing for creating the DTM from our volatile corpus while doing the pre-processing mentioned above.

Listing for creating the DTM with pre-processing
Inspect view of our train_dtm

Above we can see how our DTM is created, with rows for documents and columns for features, as explained before. The output shows 1001 documents and 5806 columns representing the unique unigrams of the reviews. We also see our DTM is 99% sparse, with only 34,557 non-zero cells; a non-zero cell holds the frequency of the column's word in the row's document.

The DTM tends to get very big, even for normal-sized datasets. Removing sparse terms, that is, terms occurring in only very few documents, is a technique that can reduce the size of the matrix without losing the significant relations inherent in it. Let's remove sparse columns from the matrix: we will drop terms that are sparse in at least 99% of documents with the following listing.

Removing terms with at least 99% sparsity
Inspection after removing sparse terms

We now see the sparsity reduced to 97% and the number of unigrams reduced to 688. We can now implement our algorithm, but first let us split our data into training and testing sets.
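The corpus and DTM steps above can be sketched as follows, assuming the tm package (and SnowballC, which tm uses for stemming) is installed; the toy texts are placeholders for the real review column:

```r
library(tm)

# Toy review texts standing in for the cleaned RatingText column
texts <- c("I loved this product", "Terrible battery, do not buy",
           "Loved it, battery lasts long")

# Build a volatile (in-memory) corpus from the character vector
corpus <- VCorpus(VectorSource(texts))

# DTM with the five pre-processing steps passed via the control list
dtm <- DocumentTermMatrix(corpus, control = list(
  tolower = TRUE,            # lower-case everything
  removeNumbers = TRUE,      # drop numbers
  stopwords = TRUE,          # drop stop words (a, an, the, ...)
  removePunctuation = TRUE,  # drop punctuation
  stemming = TRUE            # reduce words to their stems
))

inspect(dtm)   # shows documents, terms, and sparsity

# Drop terms that are sparse in at least 99% of documents
dtm_small <- removeSparseTerms(dtm, 0.99)
```

On the real 1001-document corpus, `inspect()` is what produces the document/term counts and sparsity percentages quoted above.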
Splitting the data into training and testing sets

Once the data is split, we can build our model on the training set and later test its predictions on the testing set.

Implementing the Naive Bayes Model

One point to note is that the Naive Bayes model here works on nominal data. Our data currently has entries counting the number of times a word is repeated, and the repeat count does not affect the sentiment, so we can convert the counts to nominal data with the following rule: if a non-zero number exists the cell becomes Y, and if it is 0 it becomes N. Below is the listing to do this:

Listing for converting the data to nominal data
Snippet view of the converted nominal data

Now that we have nominal data, we can implement a Naive Bayes model. The algorithm is available as the naiveBayes() function in the R library e1071. If you do not have this library, you can install the package and load it into the R environment as follows:

Listing to install and load the e1071 package

Below is the listing to implement the Naive Bayes model.

Listing to implement the Naive Bayes model

Finally, with the model in hand, we make predictions on the test dataset and check their accuracy.

Listing for computing predictions and checking accuracy
Accuracy obtained from the model

With this quick implementation of BoW together with a Naive Bayes model, we got an accuracy of 79.5%. The accuracy can be increased further by focusing on additions such as new feature creation.

BoW NLP – Takeaways

Bag of Words (BoW) NLP with a Naive Bayes model is a very simple algorithm to implement when it comes to having machines examine natural language. Without much effort the model gives us a prediction accuracy of 79.5%, which is really good for such a simple implementation.
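To make the nominal conversion and Naive Bayes step concrete, here is a minimal self-contained sketch; the tiny count matrix stands in for the real DTM, and all names here are illustrative assumptions, not the author's exact listing:

```r
library(e1071)

# Toy term-count matrix standing in for the DTM (word counts per document)
counts <- matrix(c(2, 0, 1,
                   0, 3, 0), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("love", "bad", "batteri")))

# Convert counts to nominal presence/absence: any count > 0 becomes "Y"
to_nominal <- function(m) {
  out <- apply(m, 2, function(x) ifelse(x > 0, "Y", "N"))
  data.frame(out, stringsAsFactors = TRUE)
}

train_x <- to_nominal(counts)
train_y <- factor(c("positive", "negative"))

# Fit Naive Bayes on the nominal features, then predict and score accuracy
nb_fit <- naiveBayes(train_x, train_y)
pred <- predict(nb_fit, train_x)
accuracy <- mean(pred == train_y)
print(accuracy)
```

On the real data, the same `predict()` call is run on the held-out test portion of the DTM, and `mean(pred == actual)` is the accuracy figure reported above.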
The only drawback of BoW is that it does not capture the semantics of words; for example, BoW will treat "buy a used bike" and "purchase old motorcycle" as two very different sentences. To overcome this, we need to consider another method, known as Word Embedding.

Deploying Natural Language Processing for Product ReviewsIntroductionWe have data all around us and there are of two forms of data namely; tabular and text. If you have good statistical tools tabular data has a lot to convey. But it is really hard to get something out of the text, especially the natural language spoken text. So what is natural language? We, humans, have very complex language and natural language is the true form of human language which is spoken/written with sincerity also surpassing grammatical rules. To consider the best example where you can find this language is in “Reviews”. You write review mainly for two reasons, either you are very happy with the product or very disappointed with it and, with your reviews and a Machine Learning Algorithm, entities like Amazon can figure out whether the product they are selling is good or bad. Depending upon the results on the analysis of the reviews they can make further decisions on that product for their betterment.Consider an example wherein we have thousands of comments in a review of a product sold by Amazon. It is very time consuming and hard for an individual to sit back and read all the thousands of comments and make a decision about whether people are liking the product or not. We can simply make machines do this work by implementing a Machine Learning Algorithm from which Machines can learn the Natural Human Language and make decisions. An approach to this is Machine “bags” words basically of two types viz; positive and negative words. After bagging these words machines apply majority rules to identify: are there more of positive or negative words thus helping us identifying the acceptance of the product sold.The technique described above is called as “Bagging of Words (BoW)” Natural Language Processing (NLP) Machine Learning Algorithm which is what we are going to learn today and also could have been this complex name of this blog but phew, I managed with a better one. 
So, there are other techniques too called as Word Embedding NLP and might be more but as far as I know, they are these two. Today we are only focusing on “BoW” type of NLP in this article.BoW NLP Implementation Road MapImplementing this algorithm is little complex and can be confusing as it is longer than the usual algorithm process. So, below is the block diagram or a road map to give you an idea about the steps to implement this algorithm and finally make predictions with it. You can always fall back on this diagram if confused to understand why a particular thing is being done! Flow chart of implementing BoW NLP algorithm For machines to learn our natural language they need to be provided a certain format. Thus, we need to do data pre-processing so that we present data in a certain way expected by the machines to make them learn. In pre-processing we have to make sure data is in strictly two columns one for Rating and other for RatingText. Rating is either 1 or 2 (negative and positive respectively) and Rating Text will be equivalent rating text or review. Rating Text should not have unusual spacing and no punctuation. Below is the two snippets showing the data before and after pre-processing so that you understand better. Data before pre-processing Data after pre-processing The next step is creating Corpus (Bag of Words) which can be easily done by a function in R (addressed later during listings). From this corpus, we then create Document Term Matrix (DTM) wherein a matrix represents the number of times a word is repeated throughout the entire corpus. Once we have the document term matrix we then split our data into training and testing sets.The last step is implementing our model. Here we are deploying the Naive Bayes model, once implementing the model on the training data set we later make predictions on the test data set using the model created. 
In the end, we will compute the accuracy of our prediction to estimate how well our model performed in making predictions.Model DataThe data is Amazon review data set, it is open data set available for use which was made available by Xiang Zhang. It was constructed as review score 1 for negative and 2 for positive ratings, it has in total 1,800,000 training and 200,000 testing samples. Don’t worry we are just using 1000 of the instances to avoid computational complexities. The data is available for download at the link below:https://drive.google.com/file/d/1SncEr1yOBW4OmxWIYaKsNw_MS1Rh8S8j/view?usp=sharingJust you need to make sure, when you have the file downloaded convert it into .csvformat so that you can use for analysis in R.ModelPre-processing DataFirst things first as mentioned before we will begin with our data pre-processing. We will load our data first, followed by converting it into a data frame from a character string and then we will separate the data into two columns viz. “Rating” and “RatingText”. We can separate using a R function of the library “tidyr”. Below are the listings. Listing for loading 1001 lines of Amazon Review Data set View of the first 10 rows of the loaded data Listing for Separating data into 2 columns “Rating” & “RatingText” View of first 10 rows of data separated Now we need to make sure we retain only alphanumeric entries of the data and removing anything other than that. Below is the listing to do that for both the separated columns. Retaining only alphanumeric entries in the data Lastly, we need to remove unwanted spacing in the text. Below is the listing for removing the unusual spacing and maintaining a single spacing throughout Removing unusual spacing View of removed spacing data We now simply put our data into an object named “text” and check the data dimension to be sure that complete data is written to the object. 
Console view of defining an object "text" and storing our data The dimension print above confirms we have the entire data in the object text.

Creating Corpus

Our next step is creating the corpus. This can be easily done with the function VCorpus() – Volatile Corpus – from the library tm. To access this function, install the package with install.packages("tm") and load it into the R environment with library(tm). Once loaded, we can use the VCorpus() function, which converts our review text into a corpus. Listing for creating corpus using the VCorpus() function From the volatile corpus, we create a Document Term Matrix (DTM). A DTM is a sparse matrix created using the tm library's DocumentTermMatrix() function. The rows of the matrix represent documents and the columns represent features, that is, words. The matrix is sparse because every unique unigram in the dataset becomes a column in the DTM and, as each review comment does not contain every unigram, most cells hold a 0, indicating the absence of that unigram. While it is possible to extract n-grams (unigrams, bigrams, trigrams, and so on) as part of the BoW approach, the tokenize parameter can be set and passed as part of the control list in the DocumentTermMatrix() function to include n-grams in the DTM. It must be noted that using n-grams creates a very high number of columns in the DTM. This is one of the demerits of the BoW approach and, in some cases, it can stall the execution of the project due to limited memory. As our specific case is also limited by hardware infrastructure, we restrict ourselves to unigrams in this project. Apart from just generating unigrams, we also perform some additional processing on the review text documents by passing parameters to the control list in the tm library's DocumentTermMatrix() function.
The processing we do on the review text documents during the creation of the DTM is:
– Convert all text to lower case
– Remove numbers, if any
– Remove stop words such as a, an, in, and the, as stop words do not add any value in analyzing sentiment
– Remove punctuation, which again adds no meaning when analyzing sentiment
– Perform stemming, that is, converting a word to its root form, for example removing the 's' from plurals
Below is the listing for creating the DTM from our volatile corpus and doing the pre-processing mentioned above. Listing for creating DTM and pre-processing Inspect view of our train_dtm Above we can see how our DTM is created, where rows represent documents and columns represent features, as explained before. The output shows 1001 documents with 5806 columns representing the unique unigrams from the reviews. We also see our DTM is 99% sparse, with only 34,557 non-zero cells. Each non-zero cell holds the frequency of occurrence of the column's word in the row's document. The DTM tends to get very big, even for normal-sized datasets. Removing sparse terms, that is, terms occurring in only very few documents, is a technique that can reduce the size of the matrix without losing significant relations inherent to it. Let's remove sparse columns from the matrix: we will remove those terms that are sparse in at least 99% of documents with the following listing. Removing 99% sparse elements Inspection after removing sparse terms We now see the sparsity reduced to 97% and the number of unigrams reduced to 688. We can now implement our algorithm, but first let us split our data into training and testing sets.
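A sketch of the corpus and DTM steps described above, assuming the cleaned reviews live in a data frame `text` with a `RatingText` column as in the earlier pre-processing:

```r
library(tm)

# Build a volatile (in-memory) corpus from the review text
corpus <- VCorpus(VectorSource(text$RatingText))

# Create the DTM, applying the pre-processing listed above via the control list
train_dtm <- DocumentTermMatrix(corpus, control = list(
  tolower           = TRUE,  # all text in lower case
  removeNumbers     = TRUE,  # drop numbers
  stopwords         = TRUE,  # drop stop words (a, an, in, the, ...)
  removePunctuation = TRUE,  # drop punctuation
  stemming          = TRUE   # reduce words to their stems
))
inspect(train_dtm)

# Remove terms that are sparse in at least 99% of documents
train_dtm <- removeSparseTerms(train_dtm, 0.99)
inspect(train_dtm)
```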
Splitting data into training and testing sets for implementing the algorithm Once we have the data split, we can build our model on the training set and later test its predictions on the testing set.

Implementing Naive Bayes Model

One point to be noted is that the Naive Bayes model works on nominal data. Our data currently holds entries counting how many times a word is repeated, and the repeat count does not affect the sentiment, so we can convert the data to nominal form with the following rule: if a count exists, code it Y; if it is 0, code it N. Below is the listing to do this: Listing for converting data to nominal data Snippet view of converted nominal data Now that we have nominal data, we can implement a Naive Bayes model. The Naive Bayes algorithm is available as the function naiveBayes() in the R library e1071. If you do not have this library, you can install the package and load it into the R environment as follows: Listing to install and use the e1071 package Below is the listing to implement the Naive Bayes model Listing to implement Naive Bayes model Finally, with the model in hand, we make predictions for the test data set and check the accuracy of those predictions. Below are the listings. Listing for computing predictions and checking accuracy Accuracy obtained from the model With this quick implementation of BoW along with a Naive Bayes model we got an accuracy of 79.5%. The accuracy can be increased further by focusing on additions such as new feature creation.

BoW NLP – Takeaways

Bag of Words (BoW) Natural Language Processing (NLP) with a Naive Bayes model is a very simple algorithm to implement when it comes to having machines examine natural language. Without much effort the model gives us a prediction accuracy of 79.5%, which is really good for such a simple implementation.
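The split, Y/N conversion, model fit, and accuracy check above might be sketched as follows (the 80/20 split ratio and the object names are assumptions, since the actual listings are images):

```r
library(e1071)

# Recode term counts as nominal data: any non-zero count becomes "Y", 0 becomes "N"
dtm_mat <- as.matrix(train_dtm)
dtm_nom <- as.data.frame(ifelse(dtm_mat > 0, "Y", "N"), stringsAsFactors = FALSE)
dtm_nom[] <- lapply(dtm_nom, factor, levels = c("N", "Y"))

# Train/test split (an 80/20 split is assumed here)
set.seed(42)
idx <- sample(nrow(dtm_nom), 0.8 * nrow(dtm_nom))
y   <- factor(text$Rating)

# Fit Naive Bayes on the training rows
model <- naiveBayes(dtm_nom[idx, ], y[idx])

# Predict on the held-out rows and compute accuracy
pred <- predict(model, dtm_nom[-idx, ])
mean(pred == y[-idx])
```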
The only drawback of BoW is that it does not capture the semantics of words; for example, BoW will treat "buy a used bike" and "purchase old motorcycle" as two very different sentences. To overcome this we need to consider another method, known as Word Embedding.

Introduction

Data Science and Machine Learning are furtive: they go unnoticed but are present everywhere and in every way possible. They contribute significantly to every field in which they are applied, leaving us with evidence we can rely on to take data-driven directions. Today, a very interesting area in which we are going to see an example of Data Science and Machine Learning is 'Crimes'. We are going to focus on the types of crimes taking place across the 50 states of the USA and cluster them. We cluster them for the following reasons:
– To understand state-wise crime demographics
– To make laws applicable in states depending upon the type of crime taking place most often
– To get police and forces ready for the type of crime most common in their state
– To predict the crimes that may happen and thus take measures in advance
The above are a few of the applications which can be considered, but they are not exhaustive; depending upon the reports produced by the algorithms, more such applications can be deployed. Our concern today is to understand how we can deploy Hierarchical clustering in two ways, i.e. DIANA and AGNES, on the USA Arrests data.

Theory

Hierarchical Clustering

A clustering algorithm that groups similar objects in the form of clusters. A detailed treatment of Hierarchical clustering is not feasible in this blog, as it is a vast topic; to get a detailed understanding of Hierarchical clusters follow this link. There you will find hierarchical clusters covered inside and out, concerning their formation and the role of density and distances, with visual diagrams. What we will focus on today is DIANA and AGNES hierarchical clustering.

DIANA Hierarchical Clustering

DIANA is also known as the DIvisive ANAlysis clustering algorithm. It is the top-down form of hierarchical clustering, where all data points are initially assigned to a single cluster. The cluster is then split into the two least similar clusters.
This is done recursively until cluster groups are formed which are distinct from each other. Notional representation of DIANA To better understand, consider the above figure. In step 1, the blue outlined circle can be thought of as all the points assigned to a single cluster. Moving forward, it is divided into 2 red-colored clusters based on the distances/density of the points, so we have two red-colored clusters in step 2. Lastly, in step 3, the two red clusters are each further divided into 2 black dotted clusters, again based on density and distances, to give us the final four clusters. Since the points in each of the 4 clusters are very similar to each other and very different from the other cluster groups, they are not divided further. This is how we get DIANA clusters, or top-down Hierarchical clusters.

AGNES Hierarchical Clustering

The AGglomerative NESting hierarchical clustering algorithm is exactly the opposite of the DIANA algorithm we just saw. It is a bottom-up approach. Here, every data point is initially assigned its own cluster: if there are n data points, n clusters are formed initially. In each following iteration, similar clusters are merged (again based on density and distances); this continues until similar points are clustered together and distinct from the other clusters. Notional representation of AGNES As we can infer from the above representation of AGNES, in step one all the data points are assigned as clusters. In step two, depending upon density and distances, the points are clubbed into clusters. Lastly, in step 3, all the similar points are clustered together, distinct from the other clusters, forming our final clusters. With this much theory and a working idea of DIANA and AGNES Hierarchical Clustering, we can now move on to building our model in R.

The Model

Data

The data we are using today for modeling is a built-in R data set (USArrests ships with base R).
The data set is named 'USArrests'; it has 50 observations (for the 50 US states) with 4 variables or data fields, recorded for the year 1973:
– $Murder – numeric – Murder arrests per 100,000
– $Assault – integer – Assault arrests per 100,000
– $UrbanPop – integer – Percent urban population
– $Rape – numeric – Rape arrests per 100,000
Below is the listing for loading the library followed by loading the USArrests data set. The later images show the summary and structure of the data set generated in R. Loading Data in R Summary and Structure of USArrests

Hopkins Statistic

Okay, we know we are clustering, but are we sure the data we are clustering can actually be clustered? Will the clusters that are formed be valid? To answer these questions we compute the Hopkins statistic. It is a measure of cluster tendency that takes a value between 0 and 1. The cut-off for a useful value here is 0.5: if the computed Hopkins statistic is below 0.5, valid clusters are likely to form; if it is above 0.5 and closer to 1, the structures in the data set are random rather than real. The 'factoextra' library has the function get_clust_tendency() {load the library factoextra}, which returns the Hopkins statistic for the data. Below is the listing to compute the clustering tendency, followed by the Hopkins statistic obtained. Listing for computing Hopkins Statistic Hopkins Statistic obtained Since we obtained a Hopkins statistic of 0.4141, which is less than 0.5, we can say that the structures in the data set are not random but real structures out of which clusters can be formed.

Determining 'k'

The next big question to answer is the optimal k value, i.e. the number of clusters to be formed. There will be scenarios where you are directly given a 'k' value depending upon the application.
And there are as many as 30+ ways to determine this k-value. An efficient and excellent way is to use the NbClust() function from the R library NbClust. It determines the optimal number of clusters through extensive computation, varying combinations of the number of clusters, distance measures, and clustering methods all at once. Once all the computations are done, the k-value is chosen by majority rule, and we can use that computed k-value. Below is the listing to compute the k value using NbClust(). You may have to install and load NbClust if it is not already in use. Deploying NbClust() Snippet of the summary generated by NbClust() As we can see above, NbClust suggests the optimal number of clusters we should form is 2, so we will go ahead with this value as k.

Modeling DIANA

Now we can proceed to model DIANA. For this we first load the library cluster, which has the functions for both DIANA and AGNES. Below is the listing for computing DIANA and plotting its dendrogram using the pltree() function. Listing for computing DIANA and plotting the DIANA dendrogram Dendrogram of DIANA, with rectangles indicating 2 clusters For DIANA we have a metric called the divisive coefficient, which can be extracted from the fitted DIANA model as the attribute $dc. This measure takes a value between 0 and 1, and the closer to 1 the better. Listing for extracting DIANA attribute $dc Obtained divisive coefficient $dc As we can see above, the $dc value obtained is very close to 1, so we can say the clusters are accurately formed and valid. Now, to see the actual clusters formed and their relationships, we use the function cutree(), which groups the clusters; we can then print the clusters and examine their relationships. Using cutree() function to group clusters formed in DIANA Cluster sizes of DIANA As we can see above, cluster 1 has 21 states and cluster 2 has 29 states, together making up all 50 states.
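A sketch of the Hopkins check, the NbClust vote, and the DIANA fit described above (scaling the data first and the NbClust method choice are assumptions, since the actual listings are images):

```r
library(factoextra)
library(NbClust)
library(cluster)

data("USArrests")
df <- scale(USArrests)   # standardizing first is an assumption (common practice)

# Hopkins statistic: a measure of cluster tendency
get_clust_tendency(df, n = nrow(df) - 1, graph = FALSE)$hopkins_stat

# Let NbClust vote on the optimal k (the "complete" method is an assumption)
nb <- NbClust(df, distance = "euclidean", min.nc = 2, max.nc = 10,
              method = "complete", index = "all")

# Fit DIANA and draw its dendrogram with pltree()
dv <- diana(df)
pltree(dv, cex = 0.6, main = "Dendrogram of DIANA")

dv$dc   # divisive coefficient: the closer to 1, the better
```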
Further, we can print the clusters, see the relationships between them, and understand why they were formed the way they were. Below is the listing for creating a function to print the clusters, followed by printing the DIANA clusters. Listing for creating a function to print the clusters Printing clusters using cluster print function Print of DIANA Clusters Comparing the two clusters in general, we can infer that the states divide into two groups: the states in cluster 1 have high murder, assault, and rape arrests per 100,000 compared to the states of cluster 2. Below is the comparison of mean Murder, Assault, and Rape arrests for both clusters.

            Murder   Assault   Rape
Cluster 1   11.8     255       28.1
Cluster 2   4.7      109.7     16.2

From this we can understand that cluster 1 states have more average arrests per 100,000 than cluster 2 states; cluster 2 states are safer than cluster 1 states. Lastly, we visualize the clusters using the function fviz_cluster() from the factoextra library. The plot is in two dimensions, based on the first two principal components, which explain the majority of the variance. Following is the listing for visualization. Listing for visualizing DIANA clusters DIANA clusters – Visualization

Modelling AGNES

Whatever we have done for DIANA so far, we will now execute for AGNES as well. Below is the entire listing for modeling, evaluating, and visualizing the AGNES clusters; please refer to the comments in the listing for details. Listing for modelling, evaluating and visualizing AGNES As we saw the dc coefficient in DIANA, we have ac (the agglomerative coefficient) for AGNES. Again, it takes a value between 0 and 1, and the closer to 1 the better; here we have 0.90, a good ac value, indicating the clusters formed can be relied upon.
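Assuming the `dv` and `df` objects from a DIANA fit as above, the grouping, summary, visualization, and AGNES steps might be sketched as:

```r
library(cluster)
library(factoextra)

# Cut the DIANA tree into k = 2 clusters and look at the cluster sizes
grp <- cutree(as.hclust(dv), k = 2)
table(grp)

# Mean Murder, Assault, and Rape arrests per cluster, as in the table above
aggregate(USArrests[, c("Murder", "Assault", "Rape")],
          by = list(cluster = grp), FUN = mean)

# Visualize the clusters on the first two principal components
fviz_cluster(list(data = df, cluster = grp))

# AGNES, the bottom-up counterpart; $ac is the agglomerative coefficient
ag <- agnes(df)
ag$ac
pltree(ag, cex = 0.6, main = "Dendrogram of AGNES")
```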
ac coefficient of AGNES Cluster table of AGNES Below is the print of the clusters formed by AGNES AGNES Clusters print AGNES clusters summary:

            Murder   Assault   Rape
Cluster 1   11.8     266.3     28.3
Cluster 2   5.8      122.8     17.8

As we can see from the above summary table, AGNES has formed clusters consistent with DIANA, in the sense that cluster 1 states have a higher number of murder, assault, and rape arrests per 100,000 than the states of cluster 2. Below is the fviz_cluster() plot for AGNES AGNES cluster plot

Takeaways

– Make use of the Hopkins statistic to determine whether the structure in the data is random or real enough to form clusters; it should be below 0.5
– No matter which approach you take, DIANA or AGNES, both will give you clusters with the same meaning; only the number of entries per cluster might differ
– Both are easy to compute, and you can view the dendrogram to understand either clustering visually
– Make use of the fviz_cluster() function to visualize clusters in the first two principal components, where the majority of the information lies

Conclusion

Clustering is an unsupervised machine learning algorithm in which we compute analytics, mostly without a pre-defined aim, to understand the relationships in the data. Once we understand the trends in the data, we can take the necessary actions and data-driven decisions accordingly. In this case, looking at the states in cluster 1, stricter law enforcement could be applied in those states to reduce the number of murder, assault, and rape cases, and police and officials can be better equipped and trained for such situations as well.

EndNote

Datum was started with a vision to share the thinking and workings of Data Science and Machine Learning algorithms. Continuing with this vision, The Datum is growing and getting better, and would like to thank all the supporters, followers, and viewers from more than 65 countries worldwide. We at Datum appreciate this and hope for your continued support. Multiple Datums form Data; mindless Data can do wonders.
Mindful you and I together can do magic! A source where you can gain/contribute Data Science, because together we can get better! Your Data Scientifically, The Datum


Introduction

How often are you in a situation where you have 2 alternatives: yes or no, black or white, and so on? These are instances where you 'classify' your scenario into solutions; the number of solutions may vary, but usually there are two. This is what we call 'Classification': we classify the outcomes into a set number of classes, usually two. This week at The Datum we look at how Neural Networks can be used as a classification model. Once we have the model in hand, we will make predictions with it and, lastly, evaluate our model and its predictions for correctness.

Model Brief

Today we have a very interesting model to build, one which can classify breast cancer. Cancer is the uncontrolled multiplication of cells, and breast cancer is caused by the multiplication of cells in the mammary gland that are transformed into malignant cells. Through multiplication they gain the ability to detach themselves from the tissue in which they formed and invade the surroundings. The breast is formed from multiple types of cells, but the most common breast cancers arise from glandular cells or from those forming the walls of the ducts. The objective of this model is to classify the benign and malignant classes, which form the two most common types of breast cancer. For this model we will use the BreastCancer data already present in R, available in the mlbench package. Using this data we will classify benign and malignant types of breast cancer using a neural network as the classifier.

Model Data

The data, as mentioned, is readily available in the R library mlbench; each variable except the first was converted into 11 primitive numerical attributes with values ranging from 0 through 10, with some missing values as well.
Following are the data fields of this data set:
– $Id: Sample code number
– $Cl.thickness: Clump thickness
– $Cell.size: Uniformity of cell size
– $Cell.shape: Uniformity of cell shape
– $Marg.adhesion: Marginal adhesion
– $Epith.c.size: Single epithelial cell size
– $Bare.nuclei: Bare nuclei
– $Bl.cromatin: Bland chromatin
– $Normal.nucleoli: Normal nucleoli
– $Mitoses: Mitoses
– $Class: Class
With a good amount of information on the model brief and the data, we are ready to implement the Neural Network classifier model in R. For an in-depth understanding of Neural Network concepts and functioning, the blog Prediction Analysis with Neural Networks and Linear Regression is recommended (here), which has all the information needed to understand the working of Neural Networks. To have complete access to this blog, with the entire listings and data for building the algorithm, read the full post here.
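The article's full listings are behind the link above, but a minimal sketch of such a classifier might look as follows. The nnet package, the 70/30 split, and the hidden-layer size are all assumptions here, not the article's actual choices:

```r
library(mlbench)
library(nnet)   # a simple feed-forward neural network; the package choice is an assumption

data("BreastCancer")
bc <- na.omit(BreastCancer)   # drop rows with missing values
bc$Id <- NULL                 # the sample code number is not a predictor

# The nine predictors are stored as factors; convert them to numeric
bc[, 1:9] <- lapply(bc[, 1:9], function(x) as.numeric(as.character(x)))

# 70/30 train/test split (an assumption)
set.seed(1)
idx <- sample(nrow(bc), 0.7 * nrow(bc))

# Single-hidden-layer network classifying benign vs. malignant
fit <- nnet(Class ~ ., data = bc[idx, ], size = 5, maxit = 200, trace = FALSE)

# Confusion table of predictions on the held-out set
pred <- predict(fit, bc[-idx, ], type = "class")
table(predicted = pred, actual = bc$Class[-idx])
```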
