I find that different types of surveys represent a large source of data for many organizations: client questionnaires; recruitment interviews; incident debriefings; interrogations; borehole drilling surveys; quality control checks; marketing surveys; security and patrol logs; and inventory audits. I believe that for many people, the idea of collecting information using surveys makes sense; and they recognize the need for the data. Problems arise in the transition from survey to database. The information gets stored somewhere; then there is a need to make sense of the data. "Ah, I just use a database program like Access to store the records. Then I can index the records and prepare presentations from the data." I don't dispute that some surveys can be handled by certain commercial software packages. But for a recruitment interview, can something like Access be used to examine the responses? In my opinion, not well. The software can handle the demographic data, assuming it were legal to collect this sort of data from job candidates. (Canada and its provinces have human rights legislation preventing discrimination based on ethnicity, age, religion, and disability.) In this blog, I will be discussing how the structure of a questionnaire gets converted into relational format. I realize that this is a fairly straightforward process for some people. Then I will consider how to make use of details that are relationally evasive: e.g. "I went hiking in Tibet. I think I really found myself. A leader who doesn't know him- or herself can't possibly guide an organization struggling to find its own identity." Don't let this get reduced to "hobbies = hiking" on a database.
One of the earliest classes that I attended in university covered how to do public surveys. I didn't realize at the time that I would be periodically drawing guidance from this class over the course of my adult life. Excluding general research, I have done a number of formal surveys: four academic (three of these public); and three non-academic (one of these for production). In each case, there was a specially prepared "survey document," resulting in survey data and a database. In general, when doing a survey, each possible choice should translate into a separate and unique "code." In relation to a spreadsheet, this coding refers to fields on the table. A data-management program that converts a No/Yes response into either a 0 or 1 in a single column (a field, variable, or header) makes it possible to tabulate based on inference rather than explicit count. Presumably the total equals the number of Yes responses; from this, it is possible to deduce the number of No responses. To demonstrate how this is inferential, consider a No/Yes/Maybe response being converted into 0, 1, or 2 in a particular column. When adding up this single column, the decontextualized total doesn't provide an explicit count: i.e. the total numbers for No, Yes, and Maybe. It is only possible to infer the constitution of the data; and the inference is likely to be incorrect (except perhaps through inferential modeling). However, if a separate column exists for each possibility, and each incident (cell) is noted with a 1, the resulting totals from these columns are correct since they are determined explicitly.
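The distinction can be sketched in a few lines of Python. This is a minimal illustration only; the list of responses is invented, and the coding scheme (No = 0, Yes = 1, Maybe = 2) follows the example above:

```python
# Invented sample responses to a single No/Yes/Maybe question.
responses = ["No", "Yes", "Maybe", "Yes", "Maybe", "Yes"]

# Single-column coding: No=0, Yes=1, Maybe=2.
code = {"No": 0, "Yes": 1, "Maybe": 2}
single_column = [code[r] for r in responses]
total = sum(single_column)
# The total (here 7) could arise from many different mixes of
# No/Yes/Maybe, so the counts can only be inferred, not read off.

# One column per possibility: each response marks exactly one column with a 1.
columns = {"No": 0, "Yes": 0, "Maybe": 0}
for r in responses:
    columns[r] += 1
# Each column total is now an explicit, unambiguous count.
print(columns)  # {'No': 1, 'Yes': 3, 'Maybe': 2}
```

The decontextualized total of 7 from the single column is consistent with several different distributions of responses, whereas the per-option columns recover the counts exactly.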
Similarly, an anchored behavioural scale might seem to present a perfect opportunity for additive analysis: e.g. does job = 1; demonstrates abilities = 2; shows initiative = 3; supports others = 4; and shows leadership = 5. If the data is maintained not as numbers per se but as incidents of events, there will probably be no loss of data or misunderstanding of meaning. The behavioural anchors should be handled as separate events or fields. When a data-management program - for instance, a client accounts system - has a dropdown menu with different options, it can be tempting to have that system insert the chosen option into a single column. Although this makes sense purely from the standpoint of data retention, it doesn't support data processing well. The interpreting program would need to recognize strings and conduct a tabulation based on many conditionals: e.g. if this line says "Satisfied" then add 1 to the total. Or it might be necessary to sort and do subtotals. I am not saying this is necessarily a bad approach; in fact, I will be explaining shortly how it might make sense in certain circumstances. Nonetheless, the efficiency of having a relational database would be reduced if the processor had to compare a lot of conditions. Having to update and reindex a large database repeatedly for production purposes might at some point lead to chopped liver. (I'm unsure whether this is directly related, not having inside information on the data involved, but this article captures what I mean by chopped liver.)
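As a sketch of what that conditional tabulation looks like in practice, the interpreting program ends up with one branch per recognizable string. The option strings and the sample column contents here are invented for the example:

```python
# Invented contents of a single dropdown-backed column.
rows = ["Satisfied", "Dissatisfied", "Satisfied", "Neutral", "Satisfied"]

satisfied = neutral = dissatisfied = 0
for value in rows:
    # One conditional per recognizable string, in effect:
    # "if this line says 'Satisfied' then add 1 to the total."
    if value == "Satisfied":
        satisfied += 1
    elif value == "Neutral":
        neutral += 1
    elif value == "Dissatisfied":
        dissatisfied += 1

print(satisfied, neutral, dissatisfied)  # 3 1 1
```

Every new option added to the dropdown requires another branch in the tabulating logic, which is the maintenance burden the single-column approach carries.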
If a separate field is used for each possible response, then a survey of 20 questions containing 5 options each will require 20 x 5 = 100 columns. However, a production facility with high quality standards might have several hundred checking points for a particular product; multiply this many times over if something complicated is being assembled from the parts. The database can therefore quickly become challenging for a human to handle manually. Moreover, because most of the fields will contain 0s - since a response with 5 options will lead to a single positive and four 0s - the relational structure seems wasteful. It would be useful therefore simply to "attach" an event code for each positive. Those reading this blog carefully might point out the following: if one does precisely this, then there is no point having a separate field for each possible choice. True enough, in a "non-relational" environment where the processor is designed to tabulate based on recognizable event codes, there might not be a need for a separate field for each possible choice. This brings me to the question of why somebody might consider a non-relational database to store survey data.
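To make the arithmetic concrete, here is a minimal Python sketch contrasting the wide 100-column layout with a list of event codes for the positives. The question/option numbering scheme and the respondent's answers are invented for illustration:

```python
# A 20-question survey with 5 options each: 20 * 5 = 100 columns.
questions, options = 20, 5
wide_row = [0] * (questions * options)  # one respondent's row, all zeros

# Invented answers: the option index chosen for each question.
answers = [q % options for q in range(questions)]
for q, opt in enumerate(answers):
    wide_row[q * options + opt] = 1  # one positive per question

# Event-code alternative: record only the positives, e.g. "Q3=2".
event_codes = ["Q%d=%d" % (q, opt) for q, opt in enumerate(answers)]

print(sum(wide_row), len(wide_row))  # 20 positives spread over 100 cells
print(len(event_codes))              # 20 codes, with no zeros stored at all
```

The wide row stores 80 zeros per respondent that carry no information; the event-code list stores only the 20 positives, which is exactly the "attach a code for each positive" idea.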
Handling Open-ended Input
When I was taught to do public surveys, I was told to gather all of the surveys at the end of the collection period and try to identify common themes in the open-ended responses. Having a data-gathering end-date would be useful in terms of control: the resulting table, no matter how large, would at least have a fixed number of fields across. However, in business, there might not be an "end to the collection period." One is therefore faced with the challenge of determining the nature of all open-ended responses even before they have been collected. Taking the meaning of "open-ended" at face value, a person should not expect to be able to anticipate what sort of responses will be submitted. Consequently, having a fixed number of columns on a database is problematic. Even if we ignore the expansion constraints, there is the time-consuming deliberation during database design, which I personally consider wasteful: "So, what should go in column 1? Er, column 2 . . . column 536 . . . column 1248?" The responses are open-ended. We don't know what to expect. Moreover, any attempt to constrain the response also limits our ability to conceptualize the phenomena. "This act of terrorism . . . involved what . . . trees, doors, hydrants, perfume . . . um?" The analyst ticks off a bunch of boxes on a survey. Reality gets shaped by these pigeon-holes, which reduce the resolution of reality but do not enhance understanding.
I think many would agree that, given how numerous and lengthy the open-ended responses might be, the fields for them should sit to the right of the fields for fixed responses; this allows for spill-over to the right. This is an easy sell. The real problem is the potentially massive number of blank or 0 fields for each record and, as already mentioned, the hard limit on the number of fields across the table. In relation to open-ended questions, it is probably worthwhile when dealing with large surveys to consider a non-relational environment. This means coding from scratch or using some kind of commercial package that handles data using event symbols rather than fields on a table. Once a person can freely use an unlimited number of event codes (in a non-relational environment), there will likely be questions relating to ontological delineation. I consider the attachment of event symbols to unstructured input something of a philosophical exercise. When is something important enough to exist? How does the existence of events, or lack thereof, affect outcomes? Consequently, even with a survey that seems relatively straightforward on the surface, there are issues of organizational strategy. Ontology is close to the front line of production systems, helping to shape outcomes.
What Is a Survey Anyways?
I gave a fairly lengthy list of different types of documents at the beginning of the blog, all of which I described as types of surveys. It would be reasonable to ask me what exactly I mean by a survey. Well, let's consider this question in relation to a "surveyor" - such as a surveying algorithm or full-blown data extraction program. In this case, the documentation serves as a little snapshot of reality to feed the algorithmic processes. The survey is that which we consider useful for the attachment of meaning. In my previous blog, I described these stubs or channels as "analogs." Research is a kind of abstraction powered by these analogs. For me, surveys function as ontological constructs. Therefore, the analogs that I described supporting algorithms are ontological in nature. How I choose to make data recognized or recognizable affects the algorithm or conversion process making use of the data. Consequently, surveys are types of analogs. (I might be the only person who interprets surveys in this way.) As I pointed out in my previous blog, an analog isn't just a channel of data but also a means of geometric and structural conveyance. For example, consider a survey attempting to establish the presence of "oppression," which is no simple metric. There are structural similarities between scenarios that exist only relationally but not in terms of the individual components. Racial discrimination might not have all the same components as gender stereotyping; but there is likely relational congruence supporting certain preconceptions of oppression. An analog or survey can therefore be designed to concentrate on the geometric patterns.
Canada will be having a federal election in a couple of weeks. The governing Conservatives have proposed a "snitch line" for Canadians to report incidents of barbaric practices among certain people living in Canada. The government likely means immigrants from countries where stoning might still be practiced - where honour killings and forced marriages have been known to take place. (Canada has an exceptionally diverse population.) It is debatable what form the documentation will take to log calls to such a snitch line. However, there will likely be input from callers of the sort normally associated with open-ended survey responses. Earlier on, I described how every possible response to a survey question should be convertible into a code; further, each code can be the header for a column on a relational database. I also said that codes can be attached to input for events on a non-relational database. Attachment of event coding can occur irrespective of the underlying data on any form. Consequently, an algorithm sifting through snitch-line records can extract event coding. I will also mention here that a relational database approach, if one were used for a snitch line, would probably limit ontological construction.
I believe the term "survey" tends to be associated with the tangible paper that people fill out. But of course this paper can be electronically simulated. Then we must ask ourselves what makes an electronic image a survey. When police patrol a neighborhood from a cruiser, their use of visual inspection represents a way of surveying the community. It is not really a form per se that creates a survey but rather the gathering of symbolic proxies, representations, and renderings recognized to be significant or important for a particular purpose. Now, a snitch line is actually an interesting example because there are two competing perspectives: 1) what the government considers important; and 2) what people calling the snitch line consider worth mentioning. The ontological demands are far greater in the case of the latter. In a manner of speaking, if a snitch line were to use a relational database to record only those specific events important to a government, the information collected might only reinforce the government's perspective, thereby insulating or alienating it from the underlying phenomena. Of course, some might argue that the government's perspective should be the main perspective. I'm certainly no authority on national priorities. But if the idea is to gather data regarding barbaric practices, it probably makes sense to go beyond any kind of fixed relational structure - for the sake of gathering data. Pigeon-holing everything might lead to loss of useful details. I do appreciate however that this practice sometimes serves to make the input manageable.
Retention of Structural Data
Having escaped the confines first of surveys and second of their associated relational databases, one starts to regard input as an opportunity to distribute event coding for later tabulation; or, for an algorithm, the input might be a resource from which to extract and tabulate event coding. Of course, if an organization only has spreadsheets, any further discussion on event coding is rather superfluous. But let us consider a situation where event coding can be attached to input. Not only this, imagine that we have the ability to skim through many thousands of event codes and also to rapidly match the correct codes with the evidence in the input. On a primitive level, one might try to match coding with specific objects in the input. For example, if a school is mentioned in the input, the code for the school might be attached to the text. Or, there might be interesting word combinations. Each combination can also be associated with a code. An ordered list, a concept, a euphemism, a derogatory remark, a compliment: all of these can be delineated by coding. The structure or logic of comments can be codified and the resulting objects redeployed over future input.
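A toy Python sketch of this primitive level of matching follows. The codebook, its event-code names, and the keyword lists are entirely invented for illustration; a production system would need far richer recognition than substring matching:

```python
# Invented codebook mapping event codes to trigger words or combinations.
CODEBOOK = {
    "EVT_SCHOOL": ["school", "classroom"],
    "EVT_PRAISE": ["well done", "excellent work"],
    "EVT_HIKING": ["hiking", "trail"],
}

def attach_codes(text):
    """Return event codes whose keywords appear in the open-ended input."""
    lowered = text.lower()
    return sorted(code for code, keywords in CODEBOOK.items()
                  if any(k in lowered for k in keywords))

print(attach_codes("I went hiking near the school."))
# ['EVT_HIKING', 'EVT_SCHOOL']
```

Because the codes are attached to the input rather than slotted into fixed columns, the codebook can grow indefinitely as new objects, word combinations, or structures become worth delineating.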
I consider surveying a challenging task of ontological construction. There is no limit to what we can attach to the input - no ceiling in terms of conceptualization. I believe that most organizations are poorly suited - and certainly poorly equipped - to handle data at this structural level. Perhaps a fair number still struggle just using a traditional relational approach - taking count of the filled pigeon-holes. I don't want to diminish their efforts. I just want to point out that there are other options if different technologies are considered. I doubt that companies would necessarily benefit from initiating in-house development in the direction described here. In-house development is sometimes expensive and organizationally taxing. I am also unaware of a commercial package that does the job; I doubt one exists, simply because the return from selling such a product might be inadequate. Perhaps outsourcing is the best course of action, possibly creating a market for some data scientists. Surveys are worthwhile data resources. However, the transformation from survey to data involves choices that can influence the usefulness of such intellectual assets. It is important to recognize how ontological construction can both diminish and enhance the value of surveys. Ultimately, in order to be guided by the market, production has to be sensitized to its needs.