We now live in a data-immersed society. “Data” used to be a term heard mostly from folks in white lab coats; now it is thrown around by just about everyone--salespeople, soccer players, surfers, you name it. “How much data do you get in your plan?” “Do you get unlimited data?” So the burning question is, what is data?

Data is basically just raw information about something--anything--in some form that allows it to be captured and stored. That could be anything from the massive files stored on AWS servers to the Dead Sea Scrolls sitting in clay jars. Furthermore, what is considered “data” may be highly subjective. You may not consider a chimpanzee splashing paint on a canvas to be data, but a primatologist just might.

With that said, data does, for the most part, fall into categories that are useful for business folks, educators, IT and data scientists alike. First, let’s look at data from the perspective of those tasked with analyzing it, who tend to look at data as either numeric or categorical.

Numeric, categorical and time-series

Numeric data is pretty much what it sounds like--numbers that represent measurements or values. So if you’re building a data table on the housing in U.S. cities, the price of a house would of course be numeric, as would square footage. Numeric data is typically continuous, meaning that it can fall just about anywhere within some given range that lies within the natural limits of what you’re measuring (you’re unlikely to find a house that costs a trillion dollars). It can also be “discrete” if it can only take specific, countable values--like the number of members in a family. In fancy scientific terms, numeric data is also called “quantitative” data because it describes a quantity of something.

Other data is considered categorical, in that it assigns an item or event to one of a few different categories. For example, ethnicity, sex, and eye color would all be considered categorical data points. This is sometimes called “qualitative” data because it describes a quality.
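To make the distinction concrete, here is a minimal Python sketch. The housing records, the column names, and the `kind_of` helper are all made up for illustration, and guessing a column's kind from its Python type is only a rough rule of thumb.

```python
# Hypothetical housing records; every column name and value here is made up.
houses = [
    {"price": 350000.0, "square_feet": 1800.0, "family_members": 3, "style": "ranch"},
    {"price": 512500.0, "square_feet": 2400.0, "family_members": 5, "style": "colonial"},
    {"price": 289900.0, "square_feet": 1350.0, "family_members": 2, "style": "ranch"},
]

def kind_of(rows, column):
    """Guess a column's kind from the Python type of its values."""
    values = [row[column] for row in rows]
    if all(isinstance(v, str) for v in values):
        return "categorical"          # labels such as "ranch" or "colonial"
    if all(isinstance(v, int) for v in values):
        return "discrete numeric"     # whole-number counts
    return "continuous numeric"       # measurements that can fall anywhere in a range

for column in houses[0]:
    print(column, "->", kind_of(houses, column))
```

The type-based guess breaks down quickly in practice. A ZIP code, for instance, is written with digits but behaves like a category, so a real analysis usually needs someone who knows what each column means.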

A third kind of data is time-series data, which pairs a time--e.g. 11:33:32 AM, Dec. 14, 1968--with some kind of value, such as blood pressure, the speed of a car, the amount of sunshine or rainfall, and so forth. This gets a little murky, because time-series data is clearly numeric in nature--perhaps it’s best to think of it as a special type of numeric data. Either way, time-series data is becoming extremely important now because of the Internet of Things. When you hear about “data coming in from sensors” it’s almost always time-series in nature.
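A time series can be pictured as a list of (time, value) pairs. The sketch below is one way to write that down in Python; the blood pressure readings and the `changes` helper are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical blood pressure readings, one per minute, starting at the
# example time used above.
start = datetime(1968, 12, 14, 11, 33, 32)
values = [118, 121, 125, 122]
series = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]

def changes(series):
    """Return (timestamp, change since the previous reading) pairs."""
    return [(t2, v2 - v1) for (t1, v1), (t2, v2) in zip(series, series[1:])]

for timestamp, delta in changes(series):
    print(timestamp.isoformat(), delta)
```

Because every value carries a timestamp, questions about order and rate of change (is the reading rising or falling?) become easy to ask, which is much of what sets time-series data apart from a plain column of numbers.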

Structured vs. Unstructured

Sometimes we think about data in terms of how it is organized, as is the case with structured and unstructured data. Structured data is more of what you’d traditionally think of as data--organized in a data table or spreadsheet, typically in columns and rows. Unstructured data, on the other hand, often isn’t so easy to organize, and can include a wide range of things from images to emails to an mp3 of a phone message. This too gets a little murky, as sometimes unstructured data can actually be organized in a structured manner--emails, for example, could be formatted into a table according to time sent, sender, etc., even though the body of each message remains free text.
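The email example can be sketched with Python's standard `email` module. The two messages and the `email_to_row` helper are invented for illustration; the point is only that the headers fit neatly into columns while the body stays free text.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Two made-up messages in standard email format.
raw_emails = [
    "From: alice@example.com\nDate: Mon, 07 Jan 2019 09:15:00 +0000\n"
    "Subject: Lunch?\n\nAre you free at noon?",
    "From: bob@example.com\nDate: Tue, 08 Jan 2019 14:02:10 +0000\n"
    "Subject: Report\n\nThe quarterly report is attached.",
]

def email_to_row(raw):
    """Pull the structured parts of an email into one table row."""
    msg = message_from_string(raw)
    return {
        "sent": parsedate_to_datetime(msg["Date"]),
        "sender": msg["From"],
        "subject": msg["Subject"],
        "body": msg.get_payload(),  # still unstructured free text
    }

rows = [email_to_row(raw) for raw in raw_emails]
for row in rows:
    print(row["sent"], row["sender"], row["subject"])
```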

Big, bigger, and biggest data

Of course, no discussion of data would be complete without talking about “Big Data.” As the term refers to amounts of data, and not the type, Big Data can come in just about any form, and the only qualifier is that there needs to be a lot of it. It can be structured, coming from enterprise systems like ERP, or it can be unstructured, such as photographs, CAD files, and social media posts (which are actually HUGE contributors to the Big Data phenomenon). Unstructured data, in fact, may make up one of the fastest growing categories of data, which isn’t too surprising considering that the number of channels that people communicate through is proliferating rapidly. Time-series data is also a major contributor to the mountain of Big Data that companies are grappling with, as many IoT systems take readings in sub-second intervals from massive networks of thousands of sensors--it adds up quickly!
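A quick back-of-the-envelope calculation shows how fast sensor data adds up. The numbers below (5,000 sensors, 10 readings per second, 16 bytes per reading) are assumptions picked for illustration, not figures from any real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def bytes_per_day(sensors, readings_per_second, bytes_per_reading):
    """Estimate the raw data volume a sensor network produces in one day."""
    return sensors * readings_per_second * SECONDS_PER_DAY * bytes_per_reading

# Assumed figures, for illustration only.
daily = bytes_per_day(sensors=5_000, readings_per_second=10, bytes_per_reading=16)
print(f"{daily / 10**9:.2f} GB per day")  # 69.12 GB per day
```

Under these assumptions a single network produces roughly 69 GB of raw readings every day, before any backups or derived data are counted.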

Big Data has created a unique set of challenges in terms of processing, storage and retrieval. As Big Data requires big storage and may also be collected rapidly, most organizations find it difficult to maintain it in an orderly fashion. In fact, there’s an entire category called “Dark Data” that essentially describes data an organization has collected and stored but never uses, sometimes because no one can find it anymore. But Big Data also presents a major opportunity in terms of analytics. For example, many of the algorithms used for prediction in business, medicine, you name it, gain accuracy with access to larger data sets. And we may find that there are certain questions that can only be answered when massive amounts of data are analyzed.

Of course, regardless of the form of the data, if it’s stored on a computer, it’s converted into “bits” of 1s and 0s. Eight bits make a “byte”, and a gigabyte (GB) is about a billion bytes, so when your friend talks about a GB of data on their cell phone, you can impress them by telling them that they’re actually talking about a collection of about 8 billion 1s and 0s (use your discretion of course). Amazingly, those 1s and 0s can be combined in such complicated ways that they can represent just about anything that human beings can dream up--everything from an Excel spreadsheet to the special effects in the latest Star Wars movie.
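The arithmetic behind that claim is short enough to check in a few lines of Python. This sketch assumes the decimal gigabyte (one billion bytes), and the helper names are made up for the example.

```python
BITS_PER_BYTE = 8
BYTES_PER_GB = 10**9  # assumes the decimal gigabyte: one billion bytes

def bits_in_gigabytes(gigabytes):
    """Convert a number of gigabytes into a number of bits."""
    return gigabytes * BYTES_PER_GB * BITS_PER_BYTE

def to_bits(text):
    """Show the 1s and 0s behind a short piece of ASCII text."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))

print(bits_in_gigabytes(1))  # 8000000000
print(to_bits("Hi"))         # 01001000 01101001
```

The second function shows the same idea at a small scale: the two letters of “Hi” take up two bytes, or sixteen 1s and 0s.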


