Like many students about to finish their undergraduate degree, I decided to artificially inflate my grades by taking some "bird courses." These are not courses about birds. Other students assured me that the courses were designed to bolster my marks and to help me complete my program requirements. Considering the many bird courses available, I decided to take introductory music, which was essentially a history course focused on music. It required a lot of listening. I had to attend some live performances. This course turned out to be one of the most enjoyable I ever took. It never covered the relationship between music and math, but it certainly drew strong connections between culture and music. Writing as a person who can program in several computer languages, I find some similarities between programming and music; this is not to say that I can play a musical instrument. In terms of production environments, I think that a person would have to spend some time on the factory floor to appreciate its rumble and patter - the patterns shaped by philosophies, policies, practices, and procedures. Even if one cannot hear the music emanating from a city, time-lapse photography would reveal a chorus of lights, flows of vehicular and pedestrian traffic, the hustle of people in and out of buildings. There are systems upon systems, sub-systems upon sub-systems. We are surrounded by music. Our hearts beat inside, but we ourselves are part of a large machine that pulsates all around. Further out, there are even deeper beats in the ecosystems and habitats that we occupy. When we think about data, we take into account aspects that are near the surface - for this is where we exist. We don't always notice how we help to form and sustain massive structures - how rivers of data flow through us.
I was born on a tropical island. There are bananas, pineapples, and coconuts still growing there. There are natural waterfalls, monkeys, snakes, and giant birds. In Toronto where I live, the biggest local bird is probably the Canada goose. That's a pretty big bird. The Philippines is home to large predatory birds, which I think would rather enjoy eating the Canada goose. I never thought the call of the Pacific would still reach me so many years after my separation. But for as long as I can remember, I have always been drawn to the water. Just imagine being in a fishing village with the Pacific and possibly volcanoes nearby. I sometimes wonder if a person can experience sympathetic vibrations near massive bodies of water. I know I was talking about music from the city just a moment ago. But I want to add different dimensions to this discussion. What if the depth of something enormous like the Pacific could form the basso continuo to one's inner heartbeat or the patterns of one's deepest thoughts? A person's pulse might never be complete unless that faint elemental bass is nearby - no matter how obscure or entwined it might be with the sounds and sensations of daily life. In this blog, I will be discussing the "depth" of data: its internal and external rhythms. I think that due to our need to accomplish difficult goals and endure challenging situations, we sometimes become estranged from the deep rhythms that influence the success of our decisions in organizational settings. This is not a discussion about specific settings but rather about the alienation of competing rhythms resulting from our reluctance to deepen our use of data.
Why this Discussion Matters
I believe that there has always been a strong movement to simplify processes and approaches in many aspects of production. There is no reason to think that data science is being treated differently. Automated programs can both gather data and provide statistics, day after day, perhaps around the clock, assuming that the underlying data is quite shallow. Shallowness reduces the need for creative intervention. When a computer program can - through sophisticated processes - mimic the behaviours and decisions of experts, it seems likely that organizations would applaud the opportunity to streamline operations. This of course means eliminating redundancies and rationalizing surplus capacity. Routine website traffic analysis might be handled by a computer better than any person. I therefore wish to emphasize that if a person can perform the same tasks as a computer, the computer will get the job if at all possible, without the person taking up space nearby. I believe that one of the main tasks of a data scientist is to ensure that companies relying heavily on shallowness remain confined within their boundaries of effectiveness, where they have the greatest likelihood of survival. At the same time, there is a great need to ensure that innovative companies dominate the market. We simply have to accept that these are different types of operations; these companies have unique needs. The search for greatness begins with one's clients. So there must be options that substantially enhance their competitiveness rather than add to their risks. In order for companies to become "innovative," they need to have access to deeper data and to start making use of it. I regard such a transition as hazardous. These are not simple or straightforward matters. In this blog I will only be considering the distinction between shallow and deep data, which has important business implications.
If an organization chooses to deepen its data, this means changing not just the type of data collected but also how the company interacts with it. There would have to be changes to structural capital. So it is a management concern rather than just an informational one.
Fundamental Investing and Technical Trading - a Useful Analogy
In the investment industry, there are many types of strategies. But I suggest that there are only two major "genres": types of technical trading versus types of fundamental investing. Technical trading goes by the numbers - trading prices and sometimes volume. There tends to be great use of trading patterns. As a general genre - although I recognize my perspective might not be widely shared - I consider any type of summary approach fairly technical. For instance, if I had an atlas of the world containing some basic economic stats for each country, and my basis of comparison were just a handful of indexes and ratios, this too seems rather technical. Scientific management as it was promoted by Taylor was highly technical. I used to enjoy listening to technical trading commentary in part because I found it entertaining. Particular trading formations - such as "head and shoulders" and moving-average crossovers - are said to hold special significance. I personally encountered two notable concepts: correction to historical averages (mitigating the pattern); and momentum buying or selling (triggering movements away from the average). Of particular interest to data scientists should be how this discipline emerged despite its limited use of anything but shallow data. I suspect - now after having watched many seasons of "24" on DVD - that U.S. intelligence adjusts its terror alert level based on online chatter. Well, such a strategy is highly metrics-oriented. It makes use of what I would describe as shallow data. A keyboard logger can measure how many keystrokes a person makes in an hour: the assumption here is that a high number of keystrokes means a person is working hard; deeper yet is the assumption that the company is better off if people appear to be busy typing on their keyboards.
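To make the "shallowness" of a technical signal concrete, here is a minimal sketch (with entirely hypothetical prices) of the moving-average crossover mentioned above. Notice that it consumes nothing but a list of prices:

```python
# A purely "technical" signal: a moving-average crossover computed from
# nothing but a price series. All figures are hypothetical.

def moving_average(prices, window):
    """Simple trailing moving average over the last `window` prices."""
    return sum(prices[-window:]) / window

def crossover_signal(prices, short=3, long=5):
    """'buy' when the short average crosses above the long one; 'sell' on the reverse."""
    if len(prices) < long + 1:
        return "hold"  # not enough history yet
    prev_short = moving_average(prices[:-1], short)
    prev_long = moving_average(prices[:-1], long)
    cur_short = moving_average(prices, short)
    cur_long = moving_average(prices, long)
    if prev_short <= prev_long and cur_short > cur_long:
        return "buy"
    if prev_short >= prev_long and cur_short < cur_long:
        return "sell"
    return "hold"

crossover_signal([10, 10, 10, 10, 10, 14])  # "buy"
```

The point is not the trading rule itself but how little data it requires: no financial statements, no market analysis - just the price series.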
On the other side of the spectrum, there is the idea of fundamental investing. Here the emphasis is not on timing, pace, or the metrics of engagement but rather the substance of decisions. Gone is the idea of growth through management of process; instead, the objective is to achieve strategic effect. A fundamental investor purchases and sells stocks just like a technical trader; but the process of "trading" is not meant to be the impetus of growth in the portfolio. Rather, the focus is on the specific selection of worthwhile assets. Some might describe this as "investing as an owner." There tends to be more of a "buy and hold" approach. One form is value investing, which involves picking stocks that have been badly beaten down and bruised: the purchase isn't inspired by hopes of a correction per se but by the fact that the market has left for dead something that might still hold a lot of promise. Perhaps the classic fundamental approach involves recognizing potential in a company that others have missed: this idea is applicable to both start-ups and established companies, whether beaten down or not. I suppose that the downside of fundamental investing is that it is more difficult to explain. It often requires considerable use of all sorts of data. Whereas prices and perhaps volume were fine for technical trading, suddenly there is a need for market analysis, audited financial reports, and solvency and cash flow studies. So in terms of the actual amount of data, fundamental investing uses an entirely different "level." It is deeper. If investors consider not just what the company has to say but also the perspectives of customers, competitors, consumers, and society, the data would indeed contain great depth.
In practice, it seems illogical for an investment company not to incorporate a bit of both approaches. Sometimes a company will decide to take a position through rational choice; but once committed, it remains possible to acquire positions through a technical strategy, perhaps using an algorithm. Although it is true that professional speculators have attempted to make a living out of technical trading - sometimes trading in and out on the same day, leading to the term "day-trader" - I think it is far more common for a portfolio manager to take a strategic stance and then time the points of entry and exit. I am describing here a scenario involving two distinct levels: 1) the rational level involving substantive evaluative processes, deliberation, and reflection; and 2) the technical level that is unlikely to question the rational choice. Similarly in relation to data, there is the productive or operational side where the focus is on execution; this is rather similar to technical trading. Then there is the business side of data: there is an emphasis on making the right choices. The data must be deeper. I'm not aware of anybody else drawing a conceptual delineation between technical trading and fundamental investing based on the "level" of the data from a structural standpoint. In any case, this is the gist of my argument. Different levels of data are required to support particular management approaches and decisions. Now, if we step back and listen attentively, I think it would be apparent that these levels of data occupy different patterns or routines in our lives. The technical level can be found in the quotidian routines anchoring our day-to-day behaviours. The fundamental level is not as loud or discernible but much deeper.
Simulated Sales Data
Although the distinction between something fundamental and technical perhaps makes sense in relation to selecting and managing equities, its applicability to other concerns might seem less coherent. I have generated some simulated sales data (data.zip) to help with the discussion. The structure of the "sales data" is similar to that which I normally handle in real life, although I am generally focused on many more things besides the number of units sold. In this simulation, the company has twelve sales agents. The data from these agents populates a number of performance tables - one for each agent and also one for the entire group as a whole. By the way, for those who lack any kind of formal in-house processing system - that is to say, a quantitative environment dedicated to handling data - I consider it possible to perform many data-intensive tasks without one. My daily processes have become rather complicated since I rely mostly on spreadsheets, but in many cases this is adequate. In the simulated data, the agents sell a number of items. The strongest seller in the group is Carlie. The weakest sales agent is Ellie. As one examines a team of sales people, it is sometimes worthwhile to determine their relative strengths and weaknesses over a period of time. Not all sales people only handle sales, or do so for the same number of hours or under the same conditions. For example, Carlie might make most of her sales in a call centre while Ellie does so from a retail outlet. Clearly I am framing this simulation in relation to quotidian activity. The data is rather technical or operational in nature.
A company might be interested in its superficial data: its technicals. In other cases, it might need to know how those numbers emerged: its fundamentals. As I pointed out in my previous blog dealing with different metrics formats, the data that a company generates can have different structures. I am suggesting now that these structures influence the usefulness of data to achieve strategic business outcomes. In the discourse surrounding the emerging use of big data, I believe there is a bias towards the technicals, which I suggest has limited strategic value. It is possible to compile technicals from an organization that is losing money and that will continue to do so. A technical approach advances the cause of monitoring, but it offers limited guidance for future improvement.
In order to ascertain relative competitive strengths and weaknesses, I created a measurement that I call the "lead differential." I'm uncertain about its prevalence in the broader community: it is the sum of the differences between competitors for a particular product or service. The lead differential for Carlie would be obtained by subtracting the sales of each of the other agents from her own: (Carlie - Bob) + (Carlie - Ellie) + (Carlie - Gerald) + (Carlie - Karen) and so forth. The total represents the extent to which Carlie's performance differs from the others. Using the lead differential represents a technical approach that doesn't take into account any kind of rational or intellectual process. The illustration below contains the profile for Carlie in relation to her sale of major items. The choppiness is due to changes in tempo. Sometimes Carlie has lots of sales - sometimes few. If we assume that Carlie indeed works in a call centre, she would have little control over the calls directed to her. Some component of Carlie's success as an agent is therefore "internal" - dependent on her skills - while another is "external," connected to the market.
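For readers who prefer code to formulas, here is a minimal sketch of the lead differential as defined above. The agent names and sales figures are hypothetical:

```python
# The "lead differential" as described above: the sum of the differences
# between one agent's sales and every other agent's sales for a period.
# Agent names and figures are hypothetical.

def lead_differential(sales, agent):
    """Sum of (agent - other) over all other agents for one period."""
    return sum(sales[agent] - s for other, s in sales.items() if other != agent)

sales = {"Carlie": 12, "Bob": 7, "Ellie": 3, "Gerald": 9}
lead_differential(sales, "Carlie")  # (12-7) + (12-3) + (12-9) = 17
```

For n agents this reduces to n times the agent's own sales minus the group total, which makes it just as easy to compute in a spreadsheet.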
If we gather the lead differential over a long period of time, it will be possible to determine in general terms where an agent stands: for example, one would be able to say rather confidently that Carlie can't sell much more beyond 150; conversely, her lead differential probably can't fall below -75. (These numbers do not represent the actual number of units sold.) Most data occupying a shallow depth characterized by fluctuating metrics can be handled in a similar manner. Although subject matters might differ, interestingly enough the patterns follow certain noticeable conventions. There are many similarities regardless of the exact source. Just to emphasize my point, I passed Carlie's lead differential data through an application that I originally designed for stocks. I later used the graphical environment to view a large variety of data: earthquakes, tidal levels, strong winds, and electro-cardiogram readings. The image appears in the next section.
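The kind of boundary statement made above ("Carlie can't sell much more beyond 150") can be sketched as a simple empirical-bounds calculation over historical readings. The figures and the trimming rule below are my own assumptions for illustration, not the actual method used by the application:

```python
# Empirical bounds for a fluctuating metric, taken from its history.
# Hypothetical readings; a real estimate would use a long sample.

def empirical_bounds(readings, trim=0.05):
    """Return low/high bounds after discarding the most extreme tails."""
    ordered = sorted(readings)
    k = int(len(ordered) * trim)  # number of readings to ignore at each end
    return ordered[k], ordered[-1 - k]

readings = [-80, -40, -10, 0, 15, 30, 60, 90, 120, 155]
empirical_bounds(readings, trim=0.1)  # (-40, 120)
```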
Behavioural Patterns Beget Shallow Data
The application shown below uses what I call the "Storm" family of kinetic algorithms. Storm is able to successfully confine the boundaries of Carlie's lead differential for sales as indicated on the rightmost pattern. Storm algorithms are interesting in that the end-objective is to remove the impacts of magnitude. The algorithms are focused on the level of movement back and forth across the neutral plane. The technicals allow for "some" level of prediction. What is unclear however is whether we are attempting to predict something internal to Carlie - perhaps an ability or skill - or some aspect of the external environment that she occupies - such as the demand for her products. When I use the term external, it is understandable for readers to think about the market. However, some externalities might also be much closer: maybe a new information system recently set up in Carlie's company; an addition to the management team; a change in strategic position; new workloads and schedules; it could even be repairs to the ventilation system of the building. Using a technical approach, we have merely decided to omit the complexity in order to make the underlying goal of monitoring possible. The complexity simply cannot participate in the data. As such, one might say that a technical or operational perspective focuses on the internal - for example, inside the organization. I am talking about the patterns and rhythms of production. A fundamental or strategic perspective is more aligned with externalities. In this case, I mean the deeper continuum of the market and its environment.
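The internals of Storm are not shown in this blog, so the following is only a simplified stand-in for the idea described - discarding magnitude and attending to movement back and forth across the neutral plane:

```python
# Simplified stand-in (NOT the actual Storm algorithm): reduce a series
# to its signs relative to the neutral plane (zero) and count crossings,
# so that magnitude drops out entirely.

def sign(x):
    """-1, 0, or 1 depending on which side of the neutral plane x sits."""
    return (x > 0) - (x < 0)

def neutral_plane_crossings(series):
    """Count how many times the series crosses zero, ignoring magnitude."""
    signs = [sign(x) for x in series if sign(x) != 0]  # drop exact zeros
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

neutral_plane_crossings([5, 12, -3, -1, 8, -6])  # three crossings
```

Whether the values are 5 and -3 or 500 and -300 makes no difference here; only the back-and-forth movement survives.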
Storm operates using only quantitative data. At no time does the program try to explain the underlying cause of the dynamics giving rise to the data. Therefore, any attempt to "predict" the future is premised on the persistence and stability of structures supporting equilibrium dynamics. In technical analysis of investments, there is the idea of the market correcting back to its historical averages. Similarly, it is sometimes said of a strong housing market that there is a bubble that will eventually burst and give rise to more historically consistent levels. These are all technical assertions that might later prove to be substantially incorrect. Issues of timing and pace are pervasive among technical analysts. I would almost say that the predominance of this perspective has caused data science to become synonymous with the technicals. There is indeed data, but it is shallow. I'm guessing that I could fit the trading data for every stock in every stock market in the world over the history of humanity on a single computer. Shallow data does not represent much data at all. Quite the opposite is true. Reducing society to some kind of estranged metric such as stock-market fluctuations is a great way to minimize the amount of data that must be handled.
It is worthwhile to consider the technical perspective because it reveals a world driven by small amounts of data. My emphasis here isn't the "small amounts of data" since this merely restates the concept of shallowness. I draw the attention of readers to the word "driven." For instance, a cruise missile cut off from any external feeds might have to deal with the data that it can access directly through its sensors. The missile might nonetheless achieve its primary objective. In a sales environment, the end result of any interaction is probably a sale (or non-sale); so here too it is possible to function without a large amount of supporting data. Indeed, it is unnecessary to have large amounts of data to successfully perform many day-to-day functions. The question in relation to sales is whether the primary function is the only function and whether the data should flow in a single direction. It is often worthwhile to know why people are buying a particular product; why they have chosen to buy now using this particular method; why they have chosen not to buy from a competitor. The answers to these questions do not affect the primary function but the business circumstances that brought about the function. Using a highly technical perspective, there is no framework on which to secure the deeper data. Carlie for instance might be forced to set aside or ignore useful information since she has no means of converting it into persistent data structures. The efficiency gained from shallowness can jeopardize the continuance of the organization.
To be "driven" by shallow data means staying on course possibly even if that course is wrong; making small corrections to remain on course; conforming to the business model as it exists. We therefore face a situation where - even if the business model has changed or there is some desire to change it - the shallowness of the data might remain in place for some time. It serves to reinforce the status quo. The way in which people interact with data is part of the structural capital of an organization; it is designed to promote continuity. Even the most radical boardroom decisions and upheavals resulting from mergers might barely be felt at the quotidian level. This is not to say that people deliberately want to prevent change. It is just that small amounts of data are required to achieve primary functionality: there are only so many different ways to sell a bag of rice. So to the extent that the business model is premised on the shallowness of data, there are limitations on how the model can be changed in relation to that data. Let us say for example that a company wants to double its sales next year by emphasizing a particular product line. Perhaps the company wants to open 10,000 new accounts by next month. So these are metrics, targets, or benchmarks. It doesn't take much thought to set lofty goals. I guess it's quite a big topic in itself - how to achieve ambitious targets. Well, a technical approach doesn't address the question of "how": the implicit response is simply "to do more of the same," perhaps only with the slightest variations in methodology.
Detecting Externalities for Deeper Meaning
An index or metric based on "sums" rather than "fluctuations" is characterized by having no distribution below 0. The pattern only heads up, although the slopes might change. I introduced this illustration in my last blog, where I revealed that I usually call it a "plough chart." (I guess Americans might call it a "plow chart.") On the plough chart presented below, the index is based on the sum of Carlie's lead differentials (the sum of all of those fluctuations in the previous illustration). This chart indicates that there is a mysterious dip near the middle of the pattern. I added some slope lines just to emphasize the dip. I hope readers can see it. As I mentioned earlier, Carlie is among a team of sales agents all with different performance numbers. Dips in performance might have nothing to do with Carlie herself but rather the market exercising external influence over the sales data. Because the lead differential is a measurement of competition, it is not possible to determine if the dip is due to Carlie's relatively poorer performance or some aspect of the external environment. However, since I wrote the code (Production.java), I can say without hesitation that the decline in sales is due to an external competitor entering the market. During the middle of the sampling period, I introduced the competitor, which had the effect of bringing down Carlie's lead differential.
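Since a plough chart is just a running sum, it is easy to sketch. The figures below are hypothetical, and the slope comparison is merely one way a dip like the one described might be flagged:

```python
# A plough chart is a running sum of the period-by-period figures.
# Comparing average slopes over adjacent windows flags dips in the line.
# All figures are hypothetical.

def plough(series):
    """Cumulative sum: the plough line."""
    line, total = [], 0
    for x in series:
        total += x
        line.append(total)
    return line

def average_slope(line, start, end):
    """Average per-period rise of the plough line between two indices."""
    return (line[end] - line[start]) / (end - start)

diffs = [10, 12, 11, 3, 2, 4, 11, 12]   # choppy figures with a dip in the middle
line = plough(diffs)                    # [10, 22, 33, 36, 38, 42, 53, 65]
early = average_slope(line, 0, 2)       # 11.5 units per period
middle = average_slope(line, 2, 5)      # 3.0 units per period - the dip
```

Notice how the choppiness of the raw figures disappears in the plough line, while the change in slope remains visible.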
I feel that a plough chart offers certain benefits over fluctuations: at least the analyst is made aware of the change in slope. Ploughing serves to remove the impacts of tempo, where there might be momentary and temporary points of compression or expansion in sales (tempo rubato). It is my experience that even extreme choppiness in sales can lead to coherent plough lines with stable slopes. I therefore suggest that an index based on sums is inherently less technical than fluctuations; or stated differently, there is a greater focus on fundamentals. The chart seems to suggest that Carlie is losing ground in relation to the other sales agents; but this is not the case. In order to confirm that an external force is adversely affecting sales, I posted the plough chart for Ellie, the worst-performing agent. This time rather than using the lead differential, I decided to show cumulative sales in terms of units sold. I hope readers notice that even for Ellie - whose chart does not take internal competition into account - sales suffered during the same time period.
Although the data is not in itself any deeper, plough lines allow for greater depth if one also chooses to take into account external developments. The same can be said in relation to fluctuations. But associating external events with fluctuations can lead to faulty attribution due to the unstable nature of data at this level. While trends in fluctuations might be sensitive to seasonal shifts, the overall demand for products might not be apparent. Shallow data would probably show abrupt changes in purchases, and this can point to the sudden need to adjust staffing and inventory levels. However, the decision of whether or not to keep a product should be based on the overall demand as indicated by market-level analysis. It would be irrational to rely on fluctuations since these tell us little about the market.
When Taylor was studying how well workers shoveled coal and gravel, perhaps he didn't consider the possibility that people shouldn't be shoveling coal and gravel. Deciding to find alternatives to human shoveling is a business decision; it is a rational undertaking that considers the fundamentals. Getting people to be more productive by shoveling greater amounts involves mostly technical issues: the shape of the shovel; the ideal amount of coal or gravel to fit in the shovel; the best posture to hold the shovel; perhaps even the most appropriate breathing. I am not dismissing the importance of any of these things. However, we sometimes find companies out of rhythm with their surrounding environments. An organization might be rhythmically sound in terms of successfully carrying out its predefined tasks. Yet this internal music might disagree with the patterns of interaction required to survive external developments. There is actually a large body of business literature dealing with this topic, but usually it is characterized as the trade-off between "efficiency" and "effectiveness." It has been observed that highly efficient operations seem to become less effective over time. Consider how this might be possible through the data. In order for an organization to become estranged from its market, I would argue that it is necessary to systematically remove or ignore aspects of effectiveness from the data. So the metrics would have to become more operational (focused on operations) than business oriented. I therefore suggest that a preoccupation with operations can lead to a systemic entrenchment of efficient but potentially ineffective behaviours.
Using History to Understand the Deeper Music
Large data samples occupying significant periods of time can exist for shallow data. Time itself does not lead to a deepening of data. If our understanding is advanced, it would be in a purely technical sense: e.g. the "head and shoulders" formation mentioned earlier. Time is not the same as history. Time is required for history to unfold. But in history is the idea of relevance. From the standpoint of books or written history, there are often debates about the accuracy of history, for example in relation to Christopher Columbus. Being Canadian, I'm still under the impression that we won the War of 1812 against the United States. There are people trying to shape our perceptions of reality by dictating history. Once in a while, somebody like President George Bush might go on an aircraft carrier and declare total victory in Iraq at hardly any cost to taxpayers and with no loss of American life. So these people get it. History is not about events occupying time but the perspectives of those affected - those living in the time. Shallow data is temporally connected. Deep data extends from the phenomena of people living in history. For instance, the "baby boom" is not a metric of production per se - although true enough there were lots of babies - but it was a social development. We confront a continuum related to the level of data as indicated below.
This blog draws on disciplines not normally associated with data science: history and music. So I will restate their roles. Imagine entering the digital archives of a country that is totally unfamiliar and, for the sake of argument, disconnected from the rest of the world. We would like to learn about the people. Where would we start? Basic stats can be summarized in a small booklet. To learn about the "people," we would need to check their history. From a functional standpoint, this history isn't the data itself but our means of accessing it. A person doesn't just choose a year such as 1993 or 2001 and start skimming through the data available. Access is partitioned by social relevance. The data is socially relevant. How it became regarded as such takes some thought and probably requires a much longer blog. When we access through history, we are not looking just for any data. We are really searching for some sort of "production" (a musical production - but not necessarily like a stage musical). Without a production, it would be necessary to piece together unfamiliar details - like putting together a jigsaw puzzle - without a picture to go by. From an organizational standpoint, we find that companies often repeat mistakes, indicating a disassociation maybe not with data but with data posed historically. When data is objectified and complex, it is necessary to access by relevance (history) and render by composition (music). Of course, I chose music because it fits the idea of a serial data feed absent from, say, drawn artwork or stained-glass windows; but I certainly don't dismiss the importance of other forms of expression to help guide us on our quest.
I just want to close off with something people can actually do or use because I realize the blog is a little esoteric. Consider the idea of noting interesting events as they take place. For instance, a person might add the following annotation to a row of data: "Christmas special starts today for 5 days." If sales increase over the following days, it seems reasonable to attribute some portion of the increase to the special. Alternatively, in order to preserve the integrity of the data files, the annotations could be added to separate files perhaps as regular text. Some major divisions to consider include internal (organizational) and external (market) history. An organization can retain "history" to help it explain its metrics. History can be connected to the slope or velocity of a plough curve. So if I later investigate a particular velocity such as 100 units sold per day, I can determine the organizational setting in which this production level occurred. On the prototype that I use, "history" is broken down into symbolic "events." These events are distributed to specific metrics using a process that I describe as "mass data assignment." As I point out in this blog, without using any special software beyond maybe a spreadsheet and text editor, it is possible to gather large amounts of data for all the right conceptual reasons. The need for dedicated technology relates to the challenges posed by large amounts of data if results are needed as quickly as possible. Through the inclusion of history, we become sensitive to the rhythms and beats pulsing around us. The data becomes structurally inclusive. The world participates in the data.
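As a minimal sketch of this annotation practice - the dates, figures, and layout below are all hypothetical - "history" can live in a simple structure beside the metrics and be consulted whenever a reading needs explaining:

```python
# Hypothetical sketch of keeping "history" beside the metrics: annotations
# stored separately by date, then looked up when a particular reading
# needs explaining. A spreadsheet plus a text file could do the same job.

events = {
    "2013-12-20": "Christmas special starts today for 5 days",
    "2013-12-27": "Competitor opens nearby outlet",
}

daily_sales = {
    "2013-12-19": 40,
    "2013-12-20": 95,
    "2013-12-21": 110,
}

def explain(date):
    """Pair a metric with whatever history was recorded for that date."""
    return daily_sales.get(date), events.get(date, "(no event recorded)")

explain("2013-12-20")  # (95, 'Christmas special starts today for 5 days')
```

Keeping the annotations in their own structure preserves the integrity of the metric files while still letting the organizational setting travel with the numbers.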