The two main data types in business are nominal (categorical or qualitative) data and interval (quantitative or continuous) data. Nominal data are categories, such as customer name or marital status, and you cannot perform mathematical operations on them. Bar charts and pie charts are usually used to describe nominal data. Interval data, on the other hand, hold numerical values for variables such as income, age, and invoice amount, and you can perform mathematical operations on them. Histograms are commonly used to describe interval data.
Classification is the fundamental activity in Management
Given that classification is the fundamental step in management, a variable that holds interval data can be classified into categories, i.e., transformed into nominal data. For example, a telecom company like Bell can classify its customers into distinct groups based on billing amount (an interval data variable). Let’s say Bell gets billing amount data for 200 customers in a specific geographic area, as shown in the table. How can this data be categorized?
Sturges’s rule can help determine the number of groups (classes) into which the interval data set should be divided.
Below are the key steps in classifying the interval data set, or rather transforming the data from interval type to nominal type.
Step 1: Find the Range in the data set
Range = Max Value – Min Value = $129.63 – $10.00 (say) = $119.63
Step 2: Apply Sturges’s rule to determine the number of classes
# of Classes = 1 + 3.3 × log10(n), where n is the number of observations
# of Classes = 1 + 3.3 × log10(200) = 1 + 3.3 × 2.30 = 8.59, rounded down to 8 groups (you can select 9 if you prefer)
Step 3: Determine the Class Width
Class Width = Range / Number of Classes = 119.63 / 8 = 14.95, rounded up to 15
This means there will be 8 groups/classes, each $15 wide.
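Steps 1 to 3 can also be reproduced outside Excel. The following is a minimal Python sketch of that arithmetic, assuming the maximum and minimum billing amounts quoted in Step 1 and the 1 + 3.3 × log10(n) form of Sturges’s rule used above. The function name sturges_classes is illustrative and not part of any library.

```python
import math

def sturges_classes(n: int) -> float:
    """Unrounded class count from the approximation 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

n = 200                              # number of customers in the example
max_bill, min_bill = 129.63, 10.00   # values quoted in Step 1

# Step 1: range of the data set
data_range = max_bill - min_bill

# Step 2: Sturges's rule, rounded down as in the article
raw_classes = sturges_classes(n)
num_classes = math.floor(raw_classes)

# Step 3: class width, rounded up so the classes cover the whole range
class_width = math.ceil(data_range / num_classes)

print(f"Range: {data_range:.2f}")
print(f"Classes (unrounded): {raw_classes:.2f}")
print(f"Classes used: {num_classes}")
print(f"Class width: {class_width}")
```

Rounding the width up rather than to the nearest integer is a choice made in this sketch: it guarantees that 8 classes of width 15 span at least the full range of 119.63.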
Step 4: Use Excel to Plot the Histogram (and get the frequency of customers in each of the 8 classes)
Make sure that the “Analysis ToolPak” add-in is enabled in Excel. Then go to Data -> Data Analysis -> Histogram.
The output (in a new tab) is as shown.
Step 5: Clean up the Table and the Histogram
Select the bars in the histogram and click on “Format Data Series”. Then reduce the “Gap Width” from 150% (default) to 0%.
The final/clean Histogram with 8 groups is as shown.
Now you have subdivided your customers into 8 homogeneous classes based on billing amount, and you can, for example, run promotional events aimed at a specific group of customers, say the ones in class 1 (the class with the highest frequency, i.e., the most customers, of the 8 classes).
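The same classification can be sketched in Python as a rough alternative to the Excel histogram. The 200 billing amounts below are randomly generated placeholders, not Bell’s data, and the helper names class_index and class_label are hypothetical. The sketch assumes a minimum of $10, a class width of $15, and 8 classes, as derived in Steps 1 to 3.

```python
import random
from collections import Counter

MIN_BILL = 10.0      # lowest billing amount assumed in Step 1
CLASS_WIDTH = 15     # class width from Step 3
NUM_CLASSES = 8      # number of classes from Step 2

def class_index(amount: float) -> int:
    """Return the 1-based class of a billing amount.

    Each class includes its lower edge; amounts outside the
    overall range are clamped into the first or last class.
    """
    idx = int((amount - MIN_BILL) // CLASS_WIDTH) + 1
    return min(max(idx, 1), NUM_CLASSES)

def class_label(k: int) -> str:
    """Human-readable dollar range for class k."""
    lo = MIN_BILL + (k - 1) * CLASS_WIDTH
    return f"${lo:.0f}-${lo + CLASS_WIDTH:.0f}"

# Placeholder data only: 200 random amounts between the assumed min and max
random.seed(0)
bills = [round(random.uniform(10.0, 129.63), 2) for _ in range(200)]

# Frequency of customers in each class (the histogram table)
freq = Counter(class_index(b) for b in bills)
for k in range(1, NUM_CLASSES + 1):
    print(f"Class {k} ({class_label(k)}): {freq[k]} customers")

# Class with the most customers, e.g. a candidate for a targeted promotion
largest = max(range(1, NUM_CLASSES + 1), key=lambda k: freq[k])
print(f"Largest class: {largest} ({class_label(largest)})")
```

Including the lower edge of each class is a convention chosen for this sketch; other tools may assign boundary values differently, so counts at the class edges can differ slightly.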
Classification holds the key to good management. While you might be able to capture large amounts of time-series/continuous data, categorizing data is a fundamental building block for deriving insights and pursuing appropriate actions.
Prashanth Southekal brings over 20 years of Data and Information Management experience, consulting and working for companies such as SAP AG, Shell, Apple, P&G, and General Electric. He has published two books on Information Management, including the most recent, "Data for Business Performance". Please connect with him on LinkedIn or email him at [email protected] for a no-obligation discussion on transforming your business data into a monetizable asset.
Comment
Clarifications for completeness and accuracy (always a good thing when talking about data):
o The four types of data are nominal, ordinal, interval, and ratio.
o Depending on your business, all four may be commonly used.
o Information content of the data *increases* going from nominal to ratio (a very desirable trait). Other than binning data to construct histograms, it is generally not good practice to transform data to a lower information-content type, because doing so results in loss of information.
o Income, age, and invoice amount are examples of ratio data: they have an absolute zero value and two values may be expressed as a ratio such that, for example, 40 is twice as old as 20 (as is 10 vs. 5), $100K is twice the income of $50K. [Unlike interval data, such as temperature: 40 degF is not twice as hot as 20 degF, but the "distance" (interval) from 20 to 40 degF is the same as the distance from 60 to 80 degF.]
Thanks for the response.
In order to feed as an input to a neural network.
Like one way to convert a categorical column is to use one hot encoding. What other techniques are commonly used?
Hi Nitin,
I have not come across any situation at work where I had to convert non numerical values to numerical values. I personally believe, changing the format and the values will affect data integrity. But curious to know your use case. Why would you want to convert non numerical values to numerical values ?
Cheers!
P
Is there a list of strategies to convert non numerical columns of a table to numerical values? For example, word2vec is a nice algorithm which assigns vectors to English words.
© 2018 Data Science Central™