Transforming Quantitative Data to Qualitative Data

The two main data types in business are nominal data (categorical or qualitative data) and interval data (quantitative or continuous data). Nominal data are simply categories, held in variables such as customer name and marital status, and you cannot perform mathematical operations on them. Bar charts and pie charts are usually used to describe nominal data. Interval data, on the other hand, hold numerical values in variables such as income, age, and invoice amount, and you can perform mathematical operations on them. Histograms are commonly used to describe interval data.

Classification is the fundamental activity in Management

Given that classification is the fundamental step in management, a variable that holds interval data can be classified into different categories, i.e. transformed into nominal data. For example, a telecom company like Bell can classify its customers into distinct groups based on their billing amounts (an interval data variable). Let’s say Bell gets the billing amounts of 200 customers in a specific geographic area, as shown in the table. How can this data be categorized?

Sturges’s rule can help in determining the number of groups, i.e. how many classes the interval data set should be divided into.

Below are the key steps in classifying the interval data set or rather transforming the data from interval type to nominal type.

Step 1: Find the Range in the data set

Range = Max Value - Min Value = $129.63 - $10 (say) = $119.63

Step 2: Apply Sturges’s rule to determine the number of classes

# of Classes = 1 + 3.3 (log n); where n is the number of observations and log is the base-10 logarithm

# of Classes = 1 + 3.3 (log 200) = 1 + 3.3*2.3 = 8.59, taken here as 8 groups (You can select 9 if you prefer)
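
The arithmetic in Step 2 can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the base-2 form 1 + log2(n) is included because the 3.3 factor is an approximation of it (3.3 is roughly 1/log10(2)).

```python
import math

def sturges_classes(n: int) -> float:
    """Sturges's rule using the 3.3 * log10(n) approximation from Step 2."""
    return 1 + 3.3 * math.log10(n)

def sturges_classes_base2(n: int) -> float:
    """The same rule written with a base-2 logarithm: 1 + log2(n)."""
    return 1 + math.log2(n)

n = 200                      # number of customers in the example
k = sturges_classes(n)

print(round(k, 2))                        # raw (non-integer) class count
print(math.floor(k), math.ceil(k))        # the two candidate whole numbers
print(round(sturges_classes_base2(n), 2)) # base-2 form, for comparison
```

Both forms land between 8 and 9 for 200 observations, which is why either class count is defensible.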

Step 3: Determine the Class Width

Class Width = Range/Number of Classes = 119.63/8 = 14.95 ≈ 15 (rounded)

This means there will be 8 groups/classes, each $15 wide.

  • Class 1 = $0 to $15 billing
  • Class 2 = $16 to $30 billing
  • Class 3 = $31 to $45 billing
  • Class 4 = $46 to $60 billing
  • Class 5 = $61 to $75 billing
  • Class 6 = $76 to $90 billing
  • Class 7 = $91 to $105 billing
  • Class 8 = $106 to $120 billing
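
The transformation itself, from a billing amount to a class label, can be sketched in Python as follows. The sample amounts are made up for illustration (the 200-customer table is not reproduced here), and `billing_class` is a hypothetical helper name. Two assumptions are worth stating: each class's upper bound is treated as inclusive, so an amount such as $15.50 falls in Class 2, and amounts above $120 are reported separately, because eight $15 classes starting at $0 stop short of the $129.63 maximum.

```python
import math

NUM_CLASSES = 8                                   # from Step 2
value_range = 129.63 - 10.00                      # Step 1: max - min
CLASS_WIDTH = round(value_range / NUM_CLASSES)    # Step 3: 14.95 rounds to 15

def billing_class(amount: float) -> str:
    """Map a billing amount to its class label (upper bounds inclusive)."""
    if amount < 0:
        raise ValueError("billing amount cannot be negative")
    # ceil() puts 15.00 in class 1 and 15.01 in class 2; 0 is forced into class 1
    index = max(1, math.ceil(amount / CLASS_WIDTH))
    if index > NUM_CLASSES:
        return f"Above Class {NUM_CLASSES}"
    return f"Class {index}"

# Hypothetical billing amounts, not the article's data
sample = [10.00, 15.00, 15.50, 47.25, 105.00, 129.63]
labels = [billing_class(x) for x in sample]
print(labels)
```

If every customer must land in one of the eight classes, one option is to start the first class at the minimum ($10) instead of $0, so that the last class ends at $130.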

Step 4: Use Excel to Plot the Histogram (and get the frequency of customers in each of the 8 classes)

Make sure that you have the “Analysis ToolPak” add-in enabled in Excel. Then go to Data -> Data Analysis -> Histogram.

Enter the data as shown.

The output (in a new tab) is as shown.
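
For readers without Excel, a rough Python equivalent of the frequency table is sketched below. It assumes the Histogram tool treats each bin value as an inclusive upper bound and collects anything above the last bin in a "More" row. The billing amounts are hypothetical and `frequency_table` is an illustrative name.

```python
from bisect import bisect_left
from collections import Counter

def frequency_table(values, bins):
    """Count values per bin, where each bin is an inclusive upper bound."""
    bins = sorted(bins)
    counts = Counter()
    for v in values:
        # index of the first bin whose upper bound is >= v;
        # len(bins) means the value exceeds every bin
        counts[bisect_left(bins, v)] += 1
    rows = [(str(b), counts[i]) for i, b in enumerate(bins)]
    rows.append(("More", counts[len(bins)]))
    return rows

bins = [15, 30, 45, 60, 75, 90, 105, 120]       # upper bounds of the 8 classes
billing = [10.00, 12.40, 15.00, 22.10, 30.00, 58.90, 91.75, 129.63]  # hypothetical

table = frequency_table(billing, bins)
for label, count in table:
    print(f"{label:>5}  {count}")
```

A non-zero count in the "More" row is a signal that the classes do not cover the full range of the data.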

Step 5: Clean up the Table and the Histogram

Select the bars in the Histogram and click on “Format Data Series”. Then reduce the “Gap Width” from 150% (default) to 0%.

The final/clean Histogram with 8 groups is as shown.

You have now subdivided your customers into 8 homogeneous classes based on billing amount, and you can run promotional events targeted at a specific group of customers, say those in Class 1 (the class with the most customers of the 8).

Classification holds the key to good management. While you might be able to capture large amounts of time-series/continuous data, categorizing data is a fundamental building block for deriving insights and pursuing appropriate actions.

Prashanth Southekal brings over 20 years of experience in Data and Information Management, consulting and working for companies such as SAP AG, Shell, Apple, P&G, and General Electric. He has published two books on Information Management, including the most recent, "Data for Business Performance". Please connect with him on LinkedIn or email him at [email protected] for a no-obligation discussion on transforming your business data into a monetizable asset.

Comment

Comment by Wayne G Fischer on February 5, 2018 at 10:20am

Clarifications for completeness and accuracy (always a good thing when talking about data):

o  The four types of data are nominal, ordinal, interval, and ratio.

o  Depending on your business, all four may be commonly used.

o  Information content of the data *increases* going from nominal to ratio (a very desirable trait).  Other than binning data to construct histograms, it is generally not good practice to transform data to a lower information-content type, because it results in loss of information.

o  Income, age, and invoice amount are examples of ratio data: they have an absolute zero value and two values may be expressed as a ratio such that, for example, 40 is twice as old as 20 (as is 10 vs. 5), $100K is twice the income of $50K.  [Unlike interval data, such as temperature: 40 degF is not twice as hot as 20 degF, but the "distance" (interval) from 20 to 40 degF is the same as the distance from 60 to 80 degF.]
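
The interval-versus-ratio distinction in the last point can be illustrated numerically. The sketch below is not part of the original comment. It assumes the standard Fahrenheit-to-Kelvin conversion and shows that the "twice as hot" ratio disappears on an absolute scale, while differences stay equal.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to kelvins (an absolute-zero scale)."""
    return (deg_f - 32) * 5 / 9 + 273.15

# Interval data: the ratio depends on where the scale puts its zero
naive_ratio = 40 / 20
kelvin_ratio = fahrenheit_to_kelvin(40) / fahrenheit_to_kelvin(20)
print(naive_ratio, round(kelvin_ratio, 3))

# ...but equal intervals stay equal after conversion
gap_low = fahrenheit_to_kelvin(40) - fahrenheit_to_kelvin(20)
gap_high = fahrenheit_to_kelvin(80) - fahrenheit_to_kelvin(60)
print(round(gap_low, 3), round(gap_high, 3))

# Ratio data: the ratio survives a change of unit (dollars to cents)
income_ratio_dollars = 100_000 / 50_000
income_ratio_cents = (100_000 * 100) / (50_000 * 100)
print(income_ratio_dollars, income_ratio_cents)
```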

Comment by Nitin Pasumarthy on February 5, 2018 at 9:34am

Thanks for the response.

I want to feed the data as input to a neural network.

One way to convert a categorical column is to use one-hot encoding. What other techniques are commonly used?
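
For reference, the one-hot encoding mentioned in this comment can be sketched in plain Python as below. This is an illustration added for readers, not part of the original comment. The `one_hot` helper and the marital-status values are hypothetical, and categories are simply ordered alphabetically.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    position = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1          # exactly one column is "hot" per row
        encoded.append(row)
    return categories, encoded

marital_status = ["single", "married", "single", "divorced"]  # hypothetical column
categories, encoded = one_hot(marital_status)
print(categories)
for row in encoded:
    print(row)
```

Each category becomes its own 0/1 column, so no artificial ordering is imposed on the nominal values.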

Comment by Prashanth Southekal, PhD on February 5, 2018 at 9:27am

Hi Nitin,

I have not come across any situation at work where I had to convert non-numerical values to numerical values. I personally believe that changing the format and the values will affect data integrity. But I am curious to know your use case. Why would you want to convert non-numerical values to numerical values?

Cheers!

P

Comment by Nitin Pasumarthy on February 1, 2018 at 4:15pm

Is there a list of strategies to convert non numerical columns of a table to numerical values? For example, word2vec is a nice algorithm which assigns vectors to English words.

© 2018 Data Science Central ®