I'm just beginning a formalized effort to study & learn data science, but (like a lot of people) this is happening long after my first seat-of-the-pants effort to actually do it. Which means that now I'm reviewing that work, trying to understand whether it was even valid, and if so the proper descriptions of the techniques I was using.
The quick backstory is that I was analyzing data relating to process deviations. A worker, aided by a software application, was to follow an "ideal process" for a repetitive task. However, there were three basic categories of deviations that could occur:
I had a software log file recording about 1000 instances of this task being performed. When I first set out to explore this data, the initial questions I had in mind were:
The approach I settled on to answer these questions was to assign each deviation a "point value," where the points were chosen to correspond to bit positions: deviation #1 was 1 point, #2 was 2 points, and #3 was 4 points. Summing these for each task instance yielded a "score" of 0 to 7, which encoded exactly which deviations occurred (or didn't) for that instance.
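To make the scheme concrete, here is a minimal sketch of that scoring step. The function name and the boolean inputs are my own illustration, not from the original log format:

```python
def score(dev1: bool, dev2: bool, dev3: bool) -> int:
    """Encode which deviations occurred as a 3-bit flag value (0-7).

    Each deviation category owns one bit: #1 -> 1, #2 -> 2, #3 -> 4.
    """
    return (1 if dev1 else 0) | (2 if dev2 else 0) | (4 if dev3 else 0)

# Example: deviations #1 and #3 occurred, #2 did not -> 1 + 4 = 5.
print(score(True, False, True))  # 5
```

Because each category has its own bit, no two combinations of deviations can collide on the same score, which is what makes the encoding lossless.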
(Sidebar... I was quick to pat myself on the back for this clever approach. It offered an efficient, compact, yet lossless way to characterize the 8 possible combinations, while making it just as easy to pull out, say, all the instances where #2 occurred, whether alone or in concert with other deviations. A few histograms were all it took to answer my exploratory questions and provide useful insights to guide further analysis.)
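The "combine all instances where #2 occurred" step falls out of the encoding via a bitwise AND. A small sketch with made-up scores (the actual log data isn't shown here):

```python
# Hypothetical per-instance scores (0-7), one per task execution.
scores = [0, 3, 2, 5, 7, 1, 4, 6, 2, 0]

# Select every instance where deviation #2 occurred, alone or with
# others, by testing its bit (value 2) with a bitwise AND.
dev2_instances = [s for s in scores if s & 2]
print(dev2_instances)  # [3, 2, 7, 6, 2]
```

The same test with masks 1 and 4 isolates deviations #1 and #3, and a histogram of the raw scores shows the frequency of each of the 8 combinations directly.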
Now that I'm starting the formal data science training, I'm trying to decide how to properly describe the technique I used. It doesn't seem to fit the definition of ordinal encoding, nor is it one-hot encoding, because the three categories are not mutually exclusive. Would it be considered binary encoding? Or is there (maybe) such a thing as multi-hot encoding?