Dealing with Unstructured Input

It isn’t unusual for surveys to contain open-ended questions, where respondents are free to enter their comments in any manner.  On a more operational basis, such comments might be held not in surveys but in client systems, and entered not by the respondents themselves but by customer service agents.  These same agents would likely classify the nature of the exchange or comments, perhaps using drop-down menus, radio buttons, and check-boxes.  One approach to dealing with the comments is to use tools to interpret the text and compile the distribution of keywords and keyword combinations.  Assuming resources are available, if there are many thousands of comments each day and the nature of the comments is already known in advance, perhaps the most viable approach to studying the input is to use algorithms.  However, in this blog I will be exploring the tiresome manual approach in order to raise some concepts.
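For reference, the tool-based approach of compiling keywords and keyword combinations might look something like the following minimal sketch.  The stop-word list, the sample comments, and the function names are all assumptions made for illustration.  Note also that "keyword combination" is taken here to mean two keywords that are adjacent once stop words are removed, which is only one of several possible readings.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be tuned to the business.
STOP_WORDS = {"the", "a", "an", "is", "was", "my", "and", "to", "of", "it", "i"}

def tokenize(comment):
    """Lowercase a comment and keep only the words that are not stop words."""
    words = re.findall(r"[a-z']+", comment.lower())
    return [w for w in words if w not in STOP_WORDS]

def keyword_distribution(comments):
    """Count single keywords and adjacent keyword pairs across all comments."""
    keywords = Counter()
    pairs = Counter()
    for comment in comments:
        tokens = tokenize(comment)
        keywords.update(tokens)
        # Pairs are adjacent *after* stop words have been dropped.
        pairs.update(zip(tokens, tokens[1:]))
    return keywords, pairs

# Made-up comments, for illustration only.
comments = [
    "The warranty claim was rejected",
    "My warranty claim is still pending",
    "Delivery was late",
]
keywords, pairs = keyword_distribution(comments)
print(keywords.most_common(3))
print(pairs.most_common(1))
```

A tally like this says how often "warranty claim" appears, but not whether the organization is handling warranty claims well or badly, which is part of why the manual approach is still worth discussing.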

At the outset, I just want to point out that humans can still do the analysis directly, even if there is a huge amount of data, simply by doing random sampling.  (I am referring specifically to the examination of comments.)  The compilation of totals, which requires accurate balances, should of course still be done by a machine.  There is certainly no reason why the machine cannot sift through the comments to arrive at metrics.  The human should later go through the comments to extract more elaborate data.  The idea of staking the fate of an organization entirely on the evaluation of words and word-combinations seems outlandish to me.  I suspect that the most likely proponents are those who are sold on the idea of using algorithms and want to appear intelligent, but who don’t have any programming skills.  They are unable to distinguish good ideas and results from bad.
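Drawing the random sample itself is the one part of the manual approach that is easy to hand to the machine.  The following is a minimal sketch using Python's standard library; the function name, the sample size of 50, and the placeholder comments are assumptions for illustration, not recommendations.

```python
import random

def sample_for_review(comments, sample_size, seed=None):
    """Return a simple random sample of comments for a human to read.

    Passing a seed makes the draw repeatable, so a second reviewer can be
    handed exactly the same batch.
    """
    rng = random.Random(seed)
    if sample_size >= len(comments):
        return list(comments)
    return rng.sample(comments, sample_size)

# Placeholder data standing in for one day's comments.
todays_comments = [f"comment {i}" for i in range(1, 5001)]
review_batch = sample_for_review(todays_comments, sample_size=50, seed=1)
print(len(review_batch))
```

Sampling is done without replacement, so no comment lands in the review batch twice.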

Since there is a “continuous” inflow of data, unstructured input from a client system isn’t like data from a public survey where a submission deadline is applied or imposed.  Since the inflow is continuous, so too is the outflow.  I note the rather unsettling notion of having enough data to get the gist of the entries – thus eliminating the need to continue collecting data over a longer time period.  This practice might be relevant within certain contexts.  But it is definitely irrelevant in terms of evaluating a perpetual flow of client comments.  The business landscape might be changing.  Possibly client tastes are steering towards the competition.  Some part of the organization might be failing – e.g. its ability to deal with warranty claims.  Being sensitive to change is of primary concern – it is not just a passing consideration.  It doesn’t take long for a business to discover clients turning away from existing product lines.  Even the finest horseshoes in the world sell poorly in a world of cars.

Creating Categories

Imagine a gushing flow of data from client comments.  It is probably not time to deploy that algorithm yet; or maybe for some people it is.  For me, the first major concern is to determine whether to externally define the meaning of the data or to allow the meaning to be internally conceived by the data.  If I intend to externally define the meaning, I create the categories in advance, even before reviewing any data; then as I go through the submissions, I put them in their designated categories.  If I allow the data to determine its own meaning, I create categories based on the contents of the submissions, which has the potential of altering the schema of categories and subcategories.  In practice, I suspect that a combination of the two might be most desirable.  A combined method balances the instrumental needs of the organization with the underlying meaning of the data.
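A combined method could be recorded along the lines of the following sketch.  It assumes a human reviewer has already read each comment and given it a label; the predefined category names and the sample comments are hypothetical.  Categories defined in advance always exist in the schema, while any label the reviewer introduces extends the schema and is reported separately.

```python
# Hypothetical categories defined in advance by the organization.
PREDEFINED = {"billing", "delivery", "warranty"}

def file_comments(labelled_comments, predefined=PREDEFINED):
    """Group comments by category, letting new categories emerge from the data.

    labelled_comments is an iterable of (comment, label) pairs, where the
    label is whatever the human reviewer judged the comment to be about.
    Returns the filled schema and the set of categories that emerged.
    """
    # Externally defined categories exist even before any data arrives.
    schema = {name: [] for name in sorted(predefined)}
    emergent = set()
    for comment, label in labelled_comments:
        if label not in predefined:
            emergent.add(label)  # the data has altered the schema
        schema.setdefault(label, []).append(comment)
    return schema, emergent

# Made-up reviewer output, for illustration only.
labelled = [
    ("Charged twice this month", "billing"),
    ("Box arrived crushed", "delivery"),
    ("App keeps logging me out", "mobile app"),
    ("Invoice total is wrong", "billing"),
]
schema, emergent = file_comments(labelled)
print(sorted(schema))
print(emergent)
```

Keeping the emergent categories in their own set makes it easy to see where the data has pushed beyond what the organization expected.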

Determining Distribution

Irrespective of how the categories are defined, my next question normally relates to whether some categories can be used to safely distribute the responses.  Some responses will tend to fit in more than one category.  If the categories are chosen such that only a single response is possible or likely in each case, this schema reduces complexity.  For example, a commercial vehicle can be a car, truck, van, bus, taxi, streetcar, subway, boat, helicopter, or plane.  Is it better to have a category car that might contain a subcategory of commercial vehicle; or is it better to have a category commercial vehicle that might have a subcategory of car?  Well, some vehicles can only be considered commercial; or they seem likely to be commercial.  Therefore, if one has a vehicle that is unlikely to be anything but commercial, needless complexity is created by having the type of vehicle as the main category and the nature of use as the subcategory.  Sure, it is “possible” that there is a freight-train, ocean-liner, or passenger plane for personal use, but it is highly improbable.  Anyway, just give it some thought.
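One rough way to give it some thought is to build the schema both ways and count the branches.  The sketch below does this for a handful of made-up vehicle records; the field names `type` and `use`, and the idea of using node count as a stand-in for complexity, are my assumptions for illustration.

```python
from collections import defaultdict

def build_tree(records, top, sub):
    """Map each main-category value to the set of subcategory values seen under it."""
    tree = defaultdict(set)
    for record in records:
        tree[record[top]].add(record[sub])
    return tree

def node_count(tree):
    """Count main categories plus all subcategory branches beneath them."""
    return len(tree) + sum(len(children) for children in tree.values())

# Made-up records: vehicles that are unlikely to be anything but commercial.
vehicles = [
    {"type": "freight train", "use": "commercial"},
    {"type": "ocean liner", "use": "commercial"},
    {"type": "passenger plane", "use": "commercial"},
    {"type": "bus", "use": "commercial"},
]

type_first = build_tree(vehicles, top="type", sub="use")
use_first = build_tree(vehicles, top="use", sub="type")
print(node_count(type_first), node_count(use_first))
```

With these four records, putting the type of vehicle first yields eight nodes, since every type carries its own lone "commercial" branch, while putting the nature of use first yields five.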

Asserting Designations

The clients express themselves in the data if their responses are allowed to shape the data.  The company expresses itself in the data if the responses are forced into pigeonholes.  These two positions seem entirely opposed to each other.  However, I believe in using what I call “precompiled interest” to designate the data.  In this case, the responses exist plainly, but the interests of the organization are used to pre-populate (i.e. pre-compile) specific categories in relation to the responses.  In effect, I survey myself on matters that cannot be “directly” taken from the client responses: e.g. “vulnerability,” “liability,” or “opportunity.”  Why not have customer service agents make such determinations?  Apart from the complex ontology, they might not have access to the supporting criteria or proper training to make reasonable assessments.  I also doubt that they want to spend a great deal of time with the data.  Consequently, my third focal point would be the packaging, wrapping, or labeling on the data.
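As a data structure, this kind of labeling might be sketched as follows.  This is only one possible reading of “precompiled interest”: the client's words are stored untouched, and a fixed set of organizational interests is pre-populated as empty slots for the analyst to fill in.  The class name, the field names, and the sample text are hypothetical.

```python
from dataclasses import dataclass, field

# Interests named in the text; an organization would define its own.
INTERESTS = ("vulnerability", "liability", "opportunity")

@dataclass
class DesignatedResponse:
    """A client response kept as-is, wrapped with organizational designations."""
    text: str
    designations: dict = field(
        default_factory=lambda: {name: None for name in INTERESTS}
    )

    def designate(self, interest, assessment):
        """Record the analyst's assessment under one of the pre-populated interests."""
        if interest not in self.designations:
            raise KeyError(f"unknown interest: {interest}")
        self.designations[interest] = assessment

# Made-up response and assessment, for illustration only.
response = DesignatedResponse("The heater sparked when I plugged it in.")
response.designate("liability", "possible product safety issue")
print(response.text)
print(response.designations)
```

Because the response text is never rewritten, the client's expression and the organization's interests sit side by side rather than one displacing the other.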

Agility Matters

The exact process to use for the above-mentioned considerations is a little idiosyncratic.  I leave this for later since it depends on the resources available and certain philosophical inclinations.  In many cases, spreadsheets work fine, although it is necessary to make use of many.  Spreadsheets allow for flexibility and ease of change; but it takes discipline and training to maintain whatever system emerges.  The same can be said of integrated environments.  I know that there has been a movement towards more integrated approaches.  I don’t have issues with integration if it is being initiated by me.  If somebody imposes an integrated environment on the process, the data might lose the ability to express the needs of the client.  When dealing with unstructured comments, if at all possible, I personally would be thinking small, light, and agile.

Missing Occupation

Periodically I come across an article written by somebody proposing that a new profession be recognized.  On the other side of the coin, I still recall a coworker apparently agitated by the lack of documentation to help employees do each other’s work.  She said that a person should be able to replace anybody else if proper documentation exists.  I pointed out that this theory might be difficult to apply to specialists.  I am taking the middle position in relation to the handling of client comments:  there might not be a missing “profession,” but rather there is or should be an “occupation.”  I recognize that some people within a profession might not be able to obtain the desired results in a particular occupation.  The difference is that an occupation, rather than a profession, emerges from the needs of a particular business.  I am suggesting that in order to handle unstructured comments in a meaningful manner, it is necessary for the person occupying a particular position to have exceptional knowledge of his or her business.  Parachuting somebody with an external knowledge base into the business and expecting him or her to know how client comments fit into the picture seems overly optimistic.

Missing Profession

The ability to make “effective use” of unstructured input delivers a type of intelligence.  Obtaining insights is only the beginning of the process of making effective use.  The term “use” in this case for me means the allocation and deployment of capital.  It is important to distinguish between the science of articulating phenomena and the science of managing situations.  The same way the inflow of data is continuous, so too must the deployment of capital adapt to changing conditions.  These are dynamic dynamics – not static dynamics.  It might not be obvious which people are best suited both to collect data and offer suggestions in terms of responses.  The interesting nature of unstructured input is how its meaning can potentially be unconfined.  There are different levels of consequences:  e.g. operational, strategic, political, public safety, and public relations.  Competing interests make analysis more than a simple technical process.  It is possible to evaluate spectra of impacts between recognized events and performance metrics through the use of technologies and methodologies that I myself have developed over the years: e.g. crosswave differential algorithm, event model, and supporting body of applications.