Terminology question - what kind of categorization is this?

I'm just beginning a formalized effort to study & learn data science, but (like a lot of people) this is happening long after my first seat-of-the-pants effort to actually do it.  That means I'm now reviewing that work, trying to understand whether it was even valid and, if so, what the proper descriptions are for the techniques I was using.

The quick backstory is that I was analyzing data relating to process deviations.  A worker, aided by a software application, had an "ideal process" that he was to follow for a repetitive task.  However, there were three basic categories of deviations that could occur:

  1. Worker needed to stop & get further information to complete task correctly
  2. Worker started to perform the task incorrectly, but then corrected himself before completing it
  3. Worker completed the task incorrectly, was alerted to it by a software error message, and had to repeat the task from the start

I had a software log file recording about 1000 instances of this task being performed.  When I first set out to explore this data, the initial questions I had in mind were:

  • How common is it that the worker has a deviation of any kind, vs. adhering to the ideal process?
  • How often is each of the three particular deviations encountered?
  • How often do the various deviations occur in concert with each other?  (E.g. worker started to perform the task incorrectly, then requested further information, and having done that he completed it correctly.)

The approach I settled on to answer these questions was to assign each deviation a "point value," where each value was a power of two and so corresponded to a single bit position.  Deviation #1 was 1 point, #2 was 2 points, and #3 was 4 points.  Summing the points for each task instance yielded a "score" of 0 to 7, determined by which of the deviations occurred for that instance.
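To make the scoring concrete, here is a minimal sketch of it in Python. The constant and function names are hypothetical, chosen only for illustration; the original analysis was not necessarily written this way.

```python
# Hypothetical names; each deviation gets a power-of-two point value,
# so each one occupies its own bit in the score.
NEEDED_INFO = 1      # deviation #1 -> bit 0
SELF_CORRECTED = 2   # deviation #2 -> bit 1
REPEATED_TASK = 4    # deviation #3 -> bit 2


def deviation_score(needed_info: bool, self_corrected: bool, repeated_task: bool) -> int:
    """Combine three yes/no deviation flags into a single score from 0 to 7."""
    score = 0
    if needed_info:
        score |= NEEDED_INFO
    if self_corrected:
        score |= SELF_CORRECTED
    if repeated_task:
        score |= REPEATED_TASK
    return score


# No deviations at all: the worker followed the ideal process.
print(deviation_score(False, False, False))  # 0
# Deviations #1 and #2 together: 1 + 2.
print(deviation_score(True, True, False))    # 3
```

Because the point values are distinct powers of two, adding them and OR-ing them give the same result, and every one of the 8 combinations maps to a different score.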

(Sidebar... I was quick to pat myself on the back for this clever approach.  It offered an efficient, compact, yet lossless way to characterize the 8 possible combinations, while making it just as easy to combine, say, all the instances where #2 occurred, whether alone or in concert with other deviations.  A few histograms were all it took to answer my exploratory questions and provide useful insights to guide further analysis.)
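The kind of tallying described here can be sketched roughly as follows. The list of scores is made up purely for illustration (it is not data from the log file), and the names are hypothetical.

```python
from collections import Counter

SELF_CORRECTED = 2  # point value of deviation #2 (bit 1)

# Made-up scores for ten task instances; NOT the real log data.
scores = [0, 0, 2, 3, 6, 1, 0, 7, 4, 2]

# Histogram of the 8 possible combinations (score -> number of instances).
combo_counts = Counter(scores)

# Instances with any deviation at all vs. the ideal process (score 0).
any_deviation = sum(1 for s in scores if s != 0)

# Instances where deviation #2 occurred, alone or with others:
# test its bit with a bitwise AND.
with_self_correction = sum(1 for s in scores if s & SELF_CORRECTED)


def decode(score: int) -> tuple[bool, bool, bool]:
    """Recover the three yes/no flags from a score (the encoding is lossless)."""
    return bool(score & 1), bool(score & 2), bool(score & 4)


print(combo_counts[0])       # 3
print(any_deviation)         # 7
print(with_self_correction)  # 5
print(decode(6))             # (False, True, True)
```

The bitwise AND is what makes "all instances where #2 occurred" a one-line filter, and `decode` shows that the three original flags can always be read back out of the score.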

Now that I'm starting the formal data science training, I'm trying to decide how to properly describe the technique I used.  It doesn't seem to fit the definition of ordinal encoding, nor is it one-hot encoding, because the three categorizations are not mutually exclusive.  Would it be considered binary encoding?  Or is there (maybe) such a thing as multiple hot encoding?
