This article was written by S. Richter-Walsh. S. Richter-Walsh likes to write about environmental science and also enjoy posting interesting data analyses and visualisations.
A very useful machine learning method which, for its simplicity, is incredibly successful in many real world applications is the Naive Bayes classifier. I am currently taking a machine learning module as part of my data science college course and this week’s practical work involved a classification problem using the Naive Bayes method. I thought that I’d write a two-part piece about the project and the concepts of probability in order to consolidate my own understanding as well to share my method with others who may find it useful. If the reader wishes to get straight into the Bayes classification problem, please see Part 2 here.
The dataset used in the classification problem is titled “Breast cancer data” and is sourced from Matjaz Zwitter and Milan Soklic from the Institute of Oncology, University Medical Center in Ljubljana, Slovenia (formerly Yugoslavia). It is an interesting dataset which contains categorical data on two classes with nine attributes. The dataset can be found here. The two outcome classes are “no-recurrence events” and “recurrence events” which refer to whether breast cancer returned to a patient or not having previously being diagnosed with the disease and treated for it.
Our goal was to use these data to build a Naive Bayes classification model which could predict with good accuracy whether a patient diagnosed with breast cancer is likely to experience a recurrence of the disease based on the attributes. First, I will briefly discuss the basic concepts of probability before moving on to the classification problem in Part 2.
To check out all this information, click here.
Top DSC Resources