I have been using the term "mass data assignment" in my blogs. I thought I should offer the community some simulated examples. These are simple simulations: all the data is in one place in an agreeable format. The file contents are meant to be easy to peruse. When I was younger, there was a television series called "Stargate SG-1." I have a number of seasons on DVD. In this series, a special branch of the U.S. Air Force visits offworld sites using stable wormholes: teams enter the wormholes through large circular gates called stargates. In my data attachments and also in the television series, there are four regular members of SG-1: Jack, Daniel, Samantha, and the alien Teal'c. Although he didn't normally participate in offworld missions, I also added General Hammond to the team. I wasn't thinking about gender parity during the generation of the data; so I certainly apologize for the over-representation of men.

In these simulations, an annoying NID performance expert has been assigned to score the outcomes of the offworld missions. She believes that performance is intrinsic to the person: in order to determine the best performers, she forces General Hammond to randomly assign members of the team to offworld missions. There are two simulations, each containing data from 1,000 missions. There are controls: eating green jello prior to a mission improves performance; eating red jello reduces performance. The contribution of jello to performance is not taken into account by the expert, leaving me as an outside consultant to review the data. In simulation #1 (sim.green.zip), only the green jello has active properties. In simulation #2 (sim.greenred.zip), both green and red jello affect performance.
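For readers who want to build a comparable data set, here is a minimal Python sketch of the scenario described above. It is not the generator behind sim.green.zip or sim.greenred.zip. The function names, the 50% chance of each member being deployed, the single jello per mission, the uniform base score, and the one-level shift for green or red jello are all assumptions made for illustration.

```python
import random

LEVELS = ["Terrible", "Bad", "Fair", "Good", "Great"]
JELLO = ["Red", "Green", "Blue", "Purple", "Yellow"]
TEAM = ["Samantha", "Jack", "Daniel", "Teal'c", "Hammond"]


def score_mission(base, jello, red_active=False):
    """Shift a base level (0..4) by the assumed jello effect, clamped to the scale."""
    shift = 0
    if jello == "Green":
        shift = 1            # assumed size of the green jello benefit
    elif jello == "Red" and red_active:
        shift = -1           # assumed size of the red jello penalty
    return max(0, min(len(LEVELS) - 1, base + shift))


def simulate(n_missions=1000, red_active=False, seed=1):
    """Return a list of missions, each with its events and its assigned level."""
    rng = random.Random(seed)
    missions = []
    for _ in range(n_missions):
        # Members are deployed at random; who goes never changes the score.
        members = [m for m in TEAM if rng.random() < 0.5]
        jello = rng.choice(JELLO)
        base = rng.randrange(len(LEVELS))
        level = score_mission(base, jello, red_active)
        missions.append({"events": set(members) | {jello},
                         "level": LEVELS[level]})
    return missions


green_only = simulate(red_active=False)      # in the spirit of simulation #1
green_and_red = simulate(red_active=True)    # in the spirit of simulation #2
```

Because the team members never enter the scoring function, any apparent effect of a particular person in such data is noise, which is the point of the control.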

Although the parameters of the scenario are fairly interesting, I will be using this blog to discuss the setting or circumstances surrounding mass data assignments. I point out the most important setting: jello. Apart from the selection of offworld participants, the only other type of data included in the simulations is choice of jello. However, all sorts of data can be assigned en masse to the performance metrics. I could for example include the following: number of hours slept; weight; amount of caffeine consumed; type of soap used for bathing; type of pajamas worn; indoor carbon dioxide levels. Similarly for an organization, all sorts of operational data can be assigned to performance metrics or attribution schemes. The clincher I think is the idea of assignment "en masse." This means that a person doesn't really have to give the matter of inclusion or exclusion much thought. There is less need for a-priorist deliberation. We don't know in advance what is important: this is the general concept. So just include everything. The exclusion of facts premised on the absence of knowledge is illogical, unscientific, and quasi-intellectual. In these simulations, jello is critical to performance. (The SG-1 team seems to eat lots of jello.) Researchers should not systematically go through every possibility under the sun testing hypotheses individually. Using mass data, a resplendent feast of numbers and qualitative events awaits further exploitation.


"Tendril" is the development name of the first prototype that I created to handle mass data assignments. (The Tendril in my blogs should not be confused with any commercial products on the market having the same name.) Tendril is a "research prototype" intended to increase understanding of methods rather than get much done. The tables appearing on this blog are from Tendril's successor: its development name is "Elmira." This is the first time I have mentioned Elmira on a blog. Both of these applications are written in the Java programming language. Although these programs achieve similar outcomes, Elmira uses completely different methods. Elmira supports elaborate attribution regimes: rather than make use of an obscure metric such as "performance," as per the NID expert, the attribution can be much more structured. It can be made up of specific military protocols, settings, geographic zones, encryption codes, frequencies, and mission objectives.

Elmira, which by the way is a pleasant small town in the Kitchener-Waterloo region, in Ontario, is not meant to research methods; but rather it gets things done. I don't expect to develop Elmira itself much in the future, but I intend to use it on all sorts of data. I am comfortable keeping the methods static at this point and focusing on the accumulation of post-development intellectual capital.

Data Versus Attributes

Data can be mass assigned to an attribute. Attributes are meant to be modeled or structured. The data can be unstructured. "Performance" is an attribute - not inherent but assigned. The production level associated with performance is data. Ideally, data is something we can control; but we want to know how to control it; it is antecedent. The attribute is something we would like to influence through the control of data; it is outcome. Consequently, in order to achieve a higher level of performance, it should be apparent what data to target and in what manner. I have already explained the controls in the simulation. I could tell the SG-1 team, "It doesn't really matter who gets sent offworld. It seems you should eat more green jello in order to appease the NID bean counters." That's my Jack O'Neill impression. Let's consider the proof. The table below was generated by Elmira, but the results are presented using Excel. The attributes are on the left: Terrible, Bad, Fair, Good, and Great. The event data is noted in square brackets along the top: Red, Green, Blue, Purple, and Yellow in regards to the jello; Samantha, Jack, Daniel, Teal'c, and Hammond for the deployment team. The screenshot will show that performance is highly correlated between "treated" (SG-1 member joins) and "untreated" (SG-1 member sits out) for each member of the team, indicating that treatment (inclusion of specific members) has little impact. This does not mean that their contributions are irrelevant but rather that the specific choice of members in the deployment doesn't seem to affect performance. On the other hand, green jello seems to have a big influence.
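The treated versus untreated comparison described above can be sketched in a few lines of Python. This is not Elmira's code or output format. The `tabulate` and `pearson` helpers and the (events, level) tuple layout are assumptions chosen for illustration. The sketch counts, for one event, how many missions landed on each attribute level when the event was present (treated) and when it was absent (untreated), then correlates the two rows.

```python
from math import sqrt

LEVELS = ["Terrible", "Bad", "Fair", "Good", "Great"]


def tabulate(missions, event):
    """Count missions per level, split by whether `event` was present."""
    treated = [0] * len(LEVELS)
    untreated = [0] * len(LEVELS)
    for events, level in missions:
        row = treated if event in events else untreated
        row[LEVELS.index(level)] += 1
    return treated, untreated


def pearson(xs, ys):
    """Pearson correlation of two equal-length rows; NaN if a row is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


# Hypothetical missions: (set of events, assigned attribute level).
missions = [
    ({"Jack", "Green"}, "Great"),
    ({"Jack", "Blue"}, "Fair"),
    ({"Green"}, "Good"),
    ({"Blue"}, "Fair"),
    ({"Jack", "Red"}, "Bad"),
    ({"Yellow"}, "Bad"),
]
jack_rows = tabulate(missions, "Jack")
green_rows = tabulate(missions, "Green")
print(pearson(*jack_rows), pearson(*green_rows))
```

A high correlation between the two rows suggests that the event's presence changes little; a low or negative one suggests that the distribution of outcomes shifts when the event is present.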

The above image should show the relationship between the data and its attributes reasonably well. I didn't have time to replace the file paths with aliases: one\sim\traits\great.txt does indeed mean "Great." I will make the upgrades to Elmira later. It might take a bit of thought deciding what should be an attribute and what to include as data. For example, if I were concerned about stress, I could include blood pressure as an attribute. At the same time, blood pressure might also affect performance; this means that it could legitimately be included as data. It can be both attribute and data. The characterization need not be the same - or it could be. However, if the characterization is identical, I would expect the data to coincide with its attributional counterpart exactly. I used this technique as a form of diagnostics on Tendril in order to confirm that the data has been processed properly.

Crosswave Differential

Using the correlation is not ideal. By this I mean that it doesn't seem to work most of the time! It works conceptually. Sadly the math won't cooperate. Due to my non-statistical background, I can't exactly say when those times are. I notice faulty correlations when both red and green jello are active. For mass data tabulations, I instead use what I consider to be a simpler and more reliable method that I call the "crosswave differential." I'm unsure if I am the first person to make use of this approach; but I admit that I developed it independently. In order to demonstrate how the crosswave differential works, I will return to the previous example of green jello and Jack, which have poor and high correlation respectively. The crosswave for green jello will show a differential between treated and untreated as indicated by the downward-pointing arrows; the presence of differential means that the green jello makes a difference. However, the crosswaves for treated and untreated practically occupy the same locations for Jack, meaning that there is almost no differential. Green jello should make a difference as per the control. Jack's specific inclusion offworld shouldn't matter.

I hope the math is pretty self-explanatory in terms of how I came to the differential series. For example, here is the series for treated green jello: 0=0; 93=0+93; 187=0+93+94; 273=0+93+94+86 . . . the reverse on the other side. Does this approach look familiar? I wrote an entire blog on plough patterns. A crosswave is an intersection of plough patterns. Each performance level is qualitative in nature. So treating performance in a purely mathematical manner ignores the fundamental fact that it isn't mathematical at all; but it is certainly hierarchical. If we consider two incidents of "fair," they might occupy the same qualitative boundaries; but this does not mean that they are the same quantitatively. I'm trying to think of a good example from the plough patterns. If a company maintains monthly stats, an event occurring during a month might happen at the beginning of the month in one incident and near the end in another. They coincide in terms of periodicity - the conceptual or imposed boundaries - but not genuinely in relation to mathematical placement. In any event, this approach works really well. Just before leaving this discussion, I point out that qualitative boundaries can be "externally defined," for example by a performance expert, or they can be "internally extended" in the case of phenomenology.
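The running totals in the series above can be reproduced with a short Python sketch. This is not the Tendril implementation. The first four counts (0, 93, 94, 86) are read from the series in the text, while the final count of 80 is made up to complete the scale. Comparing treated and untreated as shares of their own totals is also an assumption added here so that rows of different sizes can be compared; the original may measure the differential differently.

```python
from itertools import accumulate


def crosswave(counts):
    """Return (rising, falling) running totals over ordered level counts."""
    rising = list(accumulate(counts))                     # totals from the low end
    falling = list(accumulate(reversed(counts)))[::-1]    # totals from the high end
    return rising, falling


def differential(treated, untreated):
    """Per-level gap between treated and untreated rising totals, as shares."""
    t_rising, _ = crosswave(treated)
    u_rising, _ = crosswave(untreated)
    t_total, u_total = sum(treated), sum(untreated)
    return [t / t_total - u / u_total for t, u in zip(t_rising, u_rising)]


# Counts per level, Terrible through Great; the last value is hypothetical.
treated_green = [0, 93, 94, 86, 80]
rising, falling = crosswave(treated_green)
print(rising)    # begins 0, 93, 187, 273 as in the series above
print(falling)
```

In this sketch, gaps near zero at every level correspond to treated and untreated waves occupying the same locations, as described for Jack; larger gaps correspond to a visible differential, as described for green jello.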

In the simulation for red and green jello, purple jello and Samantha should at least conceptually present high correlations. Samantha's inclusion or exclusion should not affect the score. The consumption or non-consumption of purple jello should likewise have no impact. In other words, neither purple jello nor Samantha's inclusion should matter. However, the correlation both for Samantha and purple jello is quite poor, which indicates that treatment influences performance. Since the data is controlled, I know for a fact that correlation is not providing useful guidance. However, the lack of bias is generally confirmed using the crosswave differential approach as shown below: both purple jello and Samantha exhibit minimal differential. The crosswave differential was developed for Tendril. I will be adding the feature to Elmira later. However, I mapped out the crosswaves below manually using Excel on the output generated by Elmira. The crosswaves are indicated under T+/T- for treated and U+/U- for untreated.

Question: Although the x-axis clearly contains the score, what is along the y-axis? Answer: The y-axis contains the number of event incidents - e.g. the total number of times "Samantha" occurred at a score of "Good" or better.

Hypothesis Bias

The NID performance expert exhibits a-priorist data exclusion. She is focused primarily on the selection of team members to determine performance. She believes that performance is something intrinsic: it is the result of involvement of particular individuals. I acknowledge how individual performance in real life is likely to be reflected in the mission performance metrics. Green jello would probably be ignored. Although I added jello to the event data deliberately, I would expect discoveries to occur incidentally or accidentally. Similarly, an organization might blame poor performance on individual workers rather than the following: lack of quality control; unreliable systems; inadequate resources; poor air quality; inadequate training; faulty performance metrics. Poor lighting and inadequate safety can affect people working at night. Apart from jello, performance during a military operation seems likely to be influenced by the equipment, supplies, communications, and stability of command.

A performance or efficiency expert stepping into unfamiliar surroundings seems likely to select areas of concern that might not affect performance much at all. She might regurgitate insights from business school case studies. She might have no choice but to do so because of the primitive nature of her analytics combined with the organization's prehistoric data-collection methods. If she attempts to use a more traditional scientific approach in order to support "evidence-based decision-making," there would likely be quite a lag between hypotheses and actionable insights. On the other hand, testing hypotheses by mining field data might not lead to conclusions that are particularly supportable from the evidence available. In short, I believe that there tends to be a high risk of bias; it is present to guide the process towards a conclusion that is marketable. The process can be scientific in form but not necessarily in substance. I believe that a mass data approach is rather unscientific in form - at first glance - but it is scientific in substance.

Focusing on Event Deployment and Attributional Modeling

I use the jello example to emphasize that the data can be anything. An organization can throw all sorts of events. This does not mean that the events will be found relevant. This approach is quite different from the idea of collecting just the right kind of data - or having a clear reason to collect particular data. It is important to show some discipline when throwing events to prevent the processing system from being incapacitated; but the lengthy deliberation and rationalism that might go towards the screening-out or prescreening of data - a priori, in the absence of data - can be greatly reduced. Apart from the deployment of events, some effort should be made to develop attribution models - that is to say, the objects receiving assignments. In the case of the NID expert, the attribution model is a simple scale: Terrible, Bad, Fair, Good, and Great. This is the type of simple progression that Tendril is designed to support. However, the attribution scheme may be much more sophisticated on an application like Elmira.

Consider the coordinated attacks in Paris on Friday the 13th (2015-11-13). Piecing together sequences of events leading up to the attacks is not just an exercise in investigation but also data inclusion and exclusion - questioning the ontological basis of security that resulted in recognition failure. This is a challenging situation that cannot simply be overpowered by, say, a supercomputer. Furthering the challenge is the complexity of selection and prioritization. Although I don't claim that Elmira can solve this type of problem in its current state of development, I point to the usefulness of being able to simply canvass all of the data - or at least to have a framework to proceed in this direction - since going after individual leads is likely to miss the truth. I used an analogy on a past blog once: it is tempting in data science to build a landing strip and then expect the enemy to land on it. When truth is defined by the researcher rather than the phenomena, he or she is likely to extract alienated data - that is, disassociated from reality. When an attack presents itself as a masterful orchestration, on one hand this is an extreme insult that I'm sure the French of all people can appreciate; but I guess more metaphorically, it means that the planes in fact will not be landing on the landing strip.

Mass Fractal Indexing of Event Differentials

In several past blogs, I wrote about the Storm family of kinetic algorithms that I developed many years ago. I was thinking one day, why not examine "event differentials" kinetically? An event differential is the subtraction of past events from more recent events involving two mass data files - past and recent. This is all pure research at this point. Since I am not aware of anybody else on the planet carrying out "subtraction" of mass data objects, I developed my own approach. The Storm algorithms can be sequenced in different ways. In the image below, I use the "reluctant" sequence. Reluctant adheres to the following differential pattern: 2-1, 3-2, 4-3, 5-4, 6-5 and so forth. Don't confuse 2 with the number 2 or 3 with the number 3. Think of the series as incidents rather than quantities. In the upper image, each time the mission index (numbered 0 to 999) modulo 10 equals 0, the NID mission score is increased. Moreover, eating green jello results in an increase. In contrast, in the lower image, only eating green jello causes an increase as per the first example earlier in the blog.

The reluctant indexing caused the crosswaves to separate vertically. Moreover, there has been some right-shifting of the crosswave differential. There are all sorts of possibilities using temporal fractalization. In principle, an event differential "should" promote the differences in events. This means that if a number of events occur on day 2 that didn't occur on day 1, only those different events should be tabulated by the compiler. Therefore, the tabulation should be able to focus on sudden shifts. It is an entirely new application for me using kinetics to examine qualitative events. By the way, for those interested in the types of images generated by a reluctant-type "container," below I use Storm's imaging system to show how earthquake data from Western Canada is transformed by the sequence. So just imagine the rather novel idea of yanking out this container designed for a quantitative data stream and pouring into it qualitative differentials. Sadly, if I fill the container with multifarious events rather than a quantitative data stream, I don't expect to be able to generate plumes. Not that plumes would be all that meaningful in relation to qualitative events, anyway.
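The idea of subtracting past events from recent events in the reluctant order (2-1, 3-2, 4-3 and so on) can be illustrated with a small Python sketch. This does not reproduce the Storm algorithms or their imaging, which are not described here. Treating each period as a plain set of event labels, and ignoring how often each event occurred, is a simplifying assumption; the function names are hypothetical.

```python
def reluctant_pairs(n):
    """Incident pairs (later, earlier) in the pattern 2-1, 3-2, 4-3, ..."""
    return [(i + 1, i) for i in range(1, n)]


def event_differentials(periods):
    """For each consecutive pair, keep events present later but absent earlier."""
    return [later - earlier for earlier, later in zip(periods, periods[1:])]


# Hypothetical event sets for three consecutive days.
days = [
    {"Green", "Jack"},
    {"Green", "Daniel"},
    {"Green", "Daniel", "Red"},
]
print(reluctant_pairs(len(days)))
print(event_differentials(days))
```

In this example the event that persists across all three days drops out of every differential, leaving only the newly arriving events to be tabulated, which is the focus on sudden shifts described above.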

Twilight of My Programming

After 40 years of programming, I decided to make Elmira my final programming effort. I will be discarding a number of projects that I had on the backburner. This is also my final blog, I regret to mention. I still expect to do lots of writing, perhaps to complete more substantial outcomes such as books. I might also try to reach a different audience. I'm searching for a diverse crowd - people open to the possibility that jello might indeed affect performance. These are not necessarily scientists; they can be open-minded dreamers. I don't have any long-term development plans for Elmira. So what has been presented in this blog likely reflects Elmira's future condition for years to come. I never mentioned this in the past, but I blog in order to help me make sense of my projects. I externalize the thinking process. If I don't expect to do any more projects, then there is much less need for me to blog. Similarly, I find it difficult to do a project without blogging about it. When I tried to stop blogging a few months ago, I was thwarted by my continued programming efforts, as I habitually started blogging about them.

The cessation of blogging is therefore unrelated to any change in desire to blog; but rather it is due to the absence of programming. In terms of why I plan to cut my coding endeavours, it is related to physical stress and strain more than anything else. A person would have to debug code for many hours a day either as a hobby or professionally - day after day, year after year - to understand the physical costs. Having studied the experiences of those with repetitive strain disorders, I know that the damage can appear many years later. I am being proactive. Finally I am at the end of the blog. I feel really privileged having spent my last year of programming in an open setting so frequently reflecting my interests. I expect to work on Elmira's coding a few hours a month; so this is not a total departure for me. But it is nearly so. My programming will be so diminished that I can essentially say that it has come to an end. I am going to eat all sorts of chocolate bars tonight and reflect on a job well done - by any human standard - possibly even extraterrestrial alien standard.


Tags: algorithm, assignments, attributional, catching, comparisons, correlation, crosswave, data, delta, differential, distribution, elmira, event, examples, external, fractal, indexing, internal, mass, measurement, methodologies, metrics, modeling, modelling, multivariant, performance, reluctant, samples, simulated, simulations, stargate, storm, syscatch, tendril, throwing, treatment

