
Can Lack of Data Always Provide Valuable Insights?

Prashanth H Southekal and Matthew Joyce

Today, data, both structured and unstructured, is seen as the most valuable business asset for solving problems and improving productivity. An article in Forbes says every company today is a data company! However, clients often ask us whether data can offer insights when the complete data set is not available. The short answer is YES; data can provide insights even when the complete data set is NOT available! Here is one example to illustrate this.

Crude oil is generally transported to the refineries through steel pipes. Any rupture in these pipes will have adverse consequences not only for the oil company, but also for the community and the environment. So, oil companies take extreme care and precaution while transporting oil to ensure that pipelines remain the safest and cheapest mode of crude oil transportation. One mechanism for ensuring that oil pipes don’t rupture is applying predictive analytics and machine learning (ML) techniques to the available data and taking preventive or corrective actions.

Let’s say there is a hypothetical regression model which predicts the rupture of oil pipes based on 3 independent variables: operating pressure, pipe corrosion levels, and the amount and quality of the soil where the pipe resides. Typically, the data on operating pressure and pipe corrosion levels is easily available; the operating pressure details are provided by the pump and the pipe corrosion levels are provided by the PIGs (Pipe Inspection Gauges). However, the data on the soil around the pipe might not be easy to acquire, as conditions such as landslides, soil erosion, and humidity affect the quality and quantity of the soil where the pipe resides. Hence, of the 3 independent variables, data is available for only 2: operating pressure and pipe corrosion levels. So, when the regression analysis is performed to predict the pipe rupture using data from only these 2 variables, the adjusted R-squared value (and often the P-values) will indicate a weak association, because the “Soil” variable, which is statistically significant, is missing from the regression model.
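The effect of the missing “Soil” variable can be illustrated with a small simulation. This is only a sketch on synthetic data, not a real pipeline model: the coefficients, the noise level, the continuous `rupture_risk` score, and the helper names `solve` and `adjusted_r2` are all assumptions made for illustration. It fits an ordinary least squares regression twice, once with all 3 variables and once without soil, and compares the adjusted R-squared values.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def adjusted_r2(columns, y):
    """Fit OLS with an intercept on the given predictor columns; return adjusted R-squared."""
    n, p = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yv for r, yv in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yv - fv) ** 2 for yv, fv in zip(y, fitted))
    ss_tot = sum((yv - mean_y) ** 2 for yv in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


rng = random.Random(42)
n = 500
# Synthetic, scaled readings in [0, 1); real data would come from pumps, PIGs, and soil surveys.
pressure = [rng.random() for _ in range(n)]
corrosion = [rng.random() for _ in range(n)]
soil = [rng.random() for _ in range(n)]
# Assumed data-generating process: soil dominates the (hypothetical) rupture risk score.
rupture_risk = [
    pr + co + 3.0 * so + rng.gauss(0, 0.3)
    for pr, co, so in zip(pressure, corrosion, soil)
]

adj_full = adjusted_r2([pressure, corrosion, soil], rupture_risk)
adj_reduced = adjusted_r2([pressure, corrosion], rupture_risk)
print(f"adjusted R-squared with soil:    {adj_full:.3f}")
print(f"adjusted R-squared without soil: {adj_reduced:.3f}")
```

Under these assumptions the model without soil explains far less of the variation in the risk score, which is the signal that an important variable is missing.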

In simple words, the “Soil” variable is indeed a significant independent variable, which is needed in order to get good insights from the regression model. Given that the data on the soil is missing, not all the pertinent data is available to do the predictive analytics on pipeline rupture. The converse is also true; if data collected for the regression analysis turns out to be of no use in predicting the pipe rupture (as indicated by a high P-value), it is better to stop gathering that data. A study by Forrester says that 73% of data collected in an enterprise is actually never used!
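The converse case, a variable that is collected but contributes nothing, can be sketched in the same spirit. The example below is an illustration on synthetic data only: `coating_batch` is a hypothetical variable invented here, the `rupture_risk` score and its coefficients are assumptions, and for brevity each variable is tested in a simple one-variable regression using a large-sample normal approximation for the P-value rather than the exact t distribution a statistics package would use.

```python
import math
import random


def slope_p_value(x, y):
    """Two-sided P-value for the slope in a simple linear regression of y on x.

    Uses a normal approximation to the t distribution, which is reasonable
    only for large samples.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(ss_res / (n - 2) / sxx)
    t_stat = slope / std_err
    # 2 * (1 - Phi(|t|)) expressed through the complementary error function
    return math.erfc(abs(t_stat) / math.sqrt(2))


rng = random.Random(7)
n = 500
corrosion = [rng.random() for _ in range(n)]
coating_batch = [rng.random() for _ in range(n)]  # hypothetical variable, unrelated to risk
# Assumed data-generating process: risk depends on corrosion only.
rupture_risk = [2.0 * c + rng.gauss(0, 0.5) for c in corrosion]

p_corrosion = slope_p_value(corrosion, rupture_risk)
p_coating = slope_p_value(coating_batch, rupture_risk)
print(f"P-value for corrosion:     {p_corrosion:.3g}")
print(f"P-value for coating batch: {p_coating:.3g}")
```

A variable like the hypothetical coating batch will usually show a high P-value, although with a 0.05 threshold an irrelevant variable can still appear significant by chance about 1 time in 20, so one test alone is weak grounds for stopping data collection.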

Knowing that you don’t have all the data needed to take an action is itself an insight. Good insights provide the right direction. In this example, the next step in the right direction is exploring new data or finding proxy data to simulate conditions that closely model the physical environment. This is exactly why data can provide insights where physics problems typically don’t. Data analytics problems start with a hypothesis, unlike physics problems, which are deterministic and where the outcome is known. Despite not being complete, data can still be a valuable resource for the organization!

*************************************************************************************************************************

Dr. Prashanth H Southekal is the Managing Principal of DBP-Institute (www.dbp-institute.com), an Enterprise Data Analytics firm. DBP-Institute has helped numerous clients derive value from data and technology. He brings over 20 years of Data and Analytics Management experience from companies such as SAP AG, Shell, Apple, P&G, and General Electric, working on SAP Solutions, Data Analytics, and Solution/Data Architecture. Prashanth is an adjunct faculty member in Data Analytics at the University of Calgary (UoC) and sits on the advisory boards of DocAuthority and Grihasoft. He is the author of the book Data for Business Performance (DBP) and is currently working on his next book on Enterprise Analytics.

Matt Joyce is a Solution Specialist at SAS Institute (www.sas.com), the largest independent vendor in the business intelligence market. Through innovative solutions, SAS helps customers at more than 70,000 sites improve performance and deliver value by giving customers THE POWER TO KNOW®. Matt has an MA in Economics and specializes in solving business problems using advanced analytics and designing AI solutions across Western Canada. He has delivered numerous value assessment programs to various clients to maximize the use of SAS solutions and services that address customer needs.