
In the Old World, the original genus names for fruit trees were all taken from women's names, representing the implicit importance of the relationship between a fruit-bearing tree, its fruit, and the human condition. It is a simple and elegant link between human perception and the reality of living. Now we take liberties with fruit, since it is available from all over the world at any time of year. To look at a lovely ripe peach on a late summer day and taste its wonderful essence overshadows the long, rich story of survival. For this peach has come to you through many generations of trees, cultivated for humans by humans for your enjoyment. This is the essence of wild data.

Taxonomy, or the classification of plants, has been in flux and full of disagreement for thousands of years. Fruit-bearing trees, and plants in general, are now divided into complex, conflicting, and incomplete taxonomies. The newer taxonomies are based on DNA, region of origin, seed type, or some other attribute, and they are still incomplete and may always be. A taxonomy still does not describe why some trees survive and others perish. What is important here is that a classification is just that: a classification. This alone is significant, because the classification may change, yet the plant continues its search for an environment in which to bear fruit. Some years more fruit is produced than in others. The overall fact is that seeds are produced wrapped in whatever will help the tree survive. Sometimes in wild data it is hard to tell whether you are looking at the seed, the fruit, or a combination of both.

The low-hanging fruit is sometimes the most vulnerable and the most important for finding a story. Regardless of how a plant is classified, in reality fruit trees have survived to bring forth fruit. This simply means that all the right circumstances came together to make fruit. The bees were there to pollinate, and the male tree survived in the right place so the wind could ensure pollination if the bees were not available. Water, soil, and sun all had the right qualities for the tree to produce fruit. In wild data, how pollination happens may influence how much fruit is produced, yet it still does not tell you that the orchard was pruned and picked, and its seeds distributed to other orchards, by a guy named Johnny Appleseed 100 years ago.

So metaphor can be offensive to some people, yet in this case it is important, because there is typically a purpose to the scrubbing of data. Overall, it really ends up being more about the interpretation of the results. Human cycles typically have a beginning, a middle, and an end; or, to be more specific, there are threads or recurring cycles that show up and are not necessarily linear. This also means all the math in the world still may not bring meaning or interest to your work. In some projects a beginning piece or an end piece may be available, yet no middle piece. Then again, once the analysis has started, the whole question may change.

The analysis process always starts with a question, a search for a story, or the justification of a story. In real data situations, the data in question always has a partial context and typically has to be mixed with other data, which has its own context, to answer a question or find a story. It is important to ask good questions and to make sure they have a basis in a human view of the world, or the real world. The statistics or math side still matters, yet it may also weigh in on the negative side and say that this is impossible and there are no answers. If you give up too soon, though, you may miss the fruit at the top of the tree. It is human nature that someone will pick that fruit if you do not.

For example, take the mixing of 100 years of UFO encounters with close to 100 years of missing-person data. The original question was: Is there any correlation between UFO encounters and missing persons? This led to a refined question: Are there any missing persons in the same location as UFO sightings, at or around the same time as those sightings? Answering that seemingly simple question was more than a trivial exercise. Just one example from the worldwide data set on UFO encounters was converting the sighting times to a consistent value. The task was unbelievably difficult, because on the first pass almost half of the 80,000 UFO records were not usable because of the time duration field. The other task was the geolocation of the records, which also had a very high error margin. An error margin in this context is the share of whole records that could not be converted to something useful. Useful in this case means a valid start time, duration, and location for UFO records, and a valid date missing and location for missing persons.
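To make the idea of an error margin concrete, here is a minimal Python sketch, not the code used in the project. It assumes each record is a dictionary with hypothetical `start`, `duration`, and `location` fields, and that durations arrive as free text such as "5 minutes". The unit spellings and the sample rows are invented for illustration only.

```python
import re

# Hypothetical unit table; these spellings are assumptions, not the
# vocabulary actually found in the UFO data set.
UNIT_SECONDS = {
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
}

# A number, optional whitespace, a unit word, and an optional trailing period.
DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\.?\s*$")


def duration_to_seconds(text):
    """Return the duration in seconds, or None if the text cannot be parsed."""
    if not text:
        return None
    match = DURATION_RE.match(text.lower())
    if match is None:
        return None
    number, unit = match.groups()
    factor = UNIT_SECONDS.get(unit)
    if factor is None:
        return None
    return float(number) * factor


def error_margin(records):
    """Fraction of records lacking a usable start, duration, or location."""
    if not records:
        return 0.0
    unusable = sum(
        1 for r in records
        if not r.get("start")
        or not r.get("location")
        or duration_to_seconds(r.get("duration")) is None
    )
    return unusable / len(records)


# Made-up rows for illustration; they are not taken from the real data.
sample = [
    {"start": "1999-07-04 21:30", "duration": "5 minutes", "location": "Phoenix, AZ"},
    {"start": "1999-07-04 21:30", "duration": "a while", "location": "Phoenix, AZ"},
    {"start": "", "duration": "2 hrs", "location": "Reno, NV"},
    {"start": "2001-01-01 00:05", "duration": "30 sec", "location": "Austin, TX"},
]
print(error_margin(sample))
```

Returning `None` instead of raising keeps the unusable records countable, which is what the error margin needs.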

So after the first pass, the error margins were important for a quality outcome. Realistically, an error margin under ten percent was desirable. So how do you get from a large error margin down to a small one without spending a year hand-correcting the data? The exercise is left to the reader. The clue is that it is solved the way we solve everything else: with human ingenuity and fortitude. I failed three times to convert the data. With each attempt the result set or the code became more polluted, and it all pointed to something very simple in the end. Take the low-hanging fruit, then get the ladder and circle the tree, look at what you have left, and process it. Then move on to the next tree until the whole orchard is picked, and leave what is on the ground and unusable.
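One possible reading of "take the low-hanging fruit, then look at what you have left" is a series of simple passes, each applied only to the records the previous pass could not convert. The Python sketch below illustrates that reading for timestamps; it is not the author's solution, and the format list and pass names are assumptions chosen for the example.

```python
from datetime import datetime

# Hypothetical formats, ordered from the most common (the low-hanging fruit)
# to rarer ones. The real data would dictate this list.
PASSES = [
    ("iso", "%Y-%m-%d %H:%M"),
    ("us_slash", "%m/%d/%Y %H:%M"),
    ("date_only", "%Y-%m-%d"),
]


def try_format(text, fmt):
    """Parse text with one format; return None instead of raising on failure."""
    try:
        return datetime.strptime(text.strip(), fmt)
    except ValueError:
        return None


def run_passes(raw_values, passes=PASSES):
    """Apply one format per pass to whatever earlier passes left behind.

    Returns the converted values keyed by original position, the leftover
    (position, text) pairs, and the error margin recorded after each pass.
    """
    converted = {}
    remaining = list(enumerate(raw_values))
    total = len(raw_values)
    history = []
    for name, fmt in passes:
        still_left = []
        for index, text in remaining:
            parsed = try_format(text, fmt)
            if parsed is None:
                still_left.append((index, text))
            else:
                converted[index] = parsed
        remaining = still_left
        margin = len(remaining) / total if total else 0.0
        history.append((name, margin))
    return converted, remaining, history


# Made-up values for illustration only.
raw = ["1999-07-04 21:30", "7/4/1999 21:30", "2001-01-01", "summer of 1975"]
converted, remaining, history = run_passes(raw)
for name, margin in history:
    print(f"after pass {name}: {margin:.0%} still unconverted")
```

Whatever is left in `remaining` at the end is either a candidate for manual correction or the fruit left on the ground.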

The complete database, result set, and selects are available for verification. There is also a complete write-up in “Is it a bird, a plane or superman.” Most of the tools used were standard Unix tools such as bash, awk, perl, python, and shell scripts. Some of these tools have their limitations, so keeping each task or step separate and simple worked better, because each could be modified to move the result forward with every iteration. When the iterations produced an error margin below ten percent, there was a little fiddling to see whether it could go lower. Then, finally, the leftover ten percent of the records were examined and manually corrected. After that, indexing and selecting the matching data sets became the priority, ending with some interesting results. By sheer numbers there is no correlation, yet the result set left more interesting questions than answers. Now that’s wild data.
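The indexing and selecting step can be sketched in Python as follows. This is an illustration under assumptions, not the selects used in the project: the field names (`id`, `location`, `date`, `date_missing`), the exact-match treatment of place names, and the seven-day window are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def normalize_place(place):
    """Lowercase and collapse whitespace so 'Reno,  NV' and 'reno, nv' match."""
    return " ".join(place.lower().split())


def build_index(missing_persons):
    """Group missing-person records by normalized location."""
    index = defaultdict(list)
    for person in missing_persons:
        index[normalize_place(person["location"])].append(person)
    return index


def match(sightings, missing_persons, window_days=7):
    """Pair each sighting with missing persons at the same place within the window."""
    index = build_index(missing_persons)
    window = timedelta(days=window_days)
    pairs = []
    for sighting in sightings:
        candidates = index.get(normalize_place(sighting["location"]), [])
        for person in candidates:
            if abs(person["date_missing"] - sighting["date"]) <= window:
                pairs.append((sighting["id"], person["id"]))
    return pairs


# Made-up records for illustration only.
sightings = [
    {"id": "s1", "location": "Reno, NV", "date": date(1980, 6, 1)},
    {"id": "s2", "location": "Austin, TX", "date": date(1990, 1, 1)},
]
missing = [
    {"id": "m1", "location": "reno,  nv", "date_missing": date(1980, 6, 5)},
    {"id": "m2", "location": "Reno, NV", "date_missing": date(1985, 6, 5)},
    {"id": "m3", "location": "Dallas, TX", "date_missing": date(1990, 1, 2)},
]
print(match(sightings, missing))
```

Indexing by place first means each sighting is compared only with missing persons from the same location, rather than with every record in the other data set.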


© 2021 TechTarget, Inc.
