In the first part here, I discussed missing, outdated and unobserved data, data that is costly to produce, as well as dirty, unbalanced and unstructured data. This second part deals with data that is biased, inconsistent, siloed, too big or fast-flowing. I also cover security and privacy, data leakage and precision issues. Then, I address the case when features are too numerous (wide data) and issues related to high-dimensional data. To avoid missing future articles, sign up to receive the Data Science Central newsletter, here.
Inconsistencies arise when blending data from multiple sources. A specific field may have the same name in two data sets, but may be measured differently. Or your own data set or data collection may have changed over time: a metric counting total users now includes international users. It is always a good idea to keep a log of all data and measurement changes, and then match these changes with their impact on key performance indicators. Actually, my last job at Microsoft consisted of doing just that. I had to detect change points in time series, then match them against various events. It was a blind test, as I did not know the events in question until I completed my analysis. Eventually, the change point algorithm ran automatically in production mode, every week.
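As a rough illustration of the change-point matching described above, here is a minimal sketch in Python. It is not the algorithm that ran in production, which this article does not describe. It only finds the single split that best separates a series into two segments with different means, then looks up nearby entries in a change log. The function names, the toy series and the log entries are all hypothetical.

```python
from statistics import mean

def find_change_point(series, min_size=3):
    """Return the index k that best splits the series into two segments
    with different means (smallest total squared error), or None."""
    n = len(series)
    best_k, best_cost = None, float("inf")
    for k in range(min_size, n - min_size + 1):
        left, right = series[:k], series[k:]
        m_left, m_right = mean(left), mean(right)
        cost = (sum((x - m_left) ** 2 for x in left)
                + sum((x - m_right) ** 2 for x in right))
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

def match_events(change_index, change_log, tolerance=1):
    """Return the logged events within `tolerance` time steps of the change."""
    return [name for idx, name in change_log
            if abs(idx - change_index) <= tolerance]

# Hypothetical weekly user counts: the metric jumps at index 6
weekly_users = [100, 102, 99, 101, 100, 98, 150, 152, 149, 151, 150, 153]
# Hypothetical log of data and measurement changes: (week index, description)
change_log = [(2, "new dashboard"), (6, "international users now counted")]

k = find_change_point(weekly_users)
events = match_events(k, change_log)
print(k, events)
```

A real series would likely contain several change points, trends and seasonality, so this single-split search is only a starting point. What matters for the audit is that the log is kept separately from the detection step, which is what makes a blind test possible.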
Various methods are available to reconcile seemingly inconsistent data sets. My patent “Preservation of Scores of the Quality of Traffic to Network Sites across Clients and over Time” addresses this issue.
In one infamous Kaggle competition, the winner used the “Hospital ID” feature to predict with incredible accuracy which patients were most likely to end up with cancer. These IDs were encrypted, yet patients with the most severe conditions were always sent to the same hospitals. Encrypting the IDs did not help. This is what data leakage is: some artifact in your data set allows you to make good predictions, but it has no real predictive meaning. An automated ML system could exploit it just as well as a human being. Imagine that the worst patients suddenly go to different hospitals: your fantastic predictive model will then fail completely.
One way to address this issue is to use synthetic data. Or better, a blend of synthetic and real data, known as augmented data. Synthetic data would untangle Hospital IDs from case severity in this example.
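Before fixing a leak, you need to spot it. The sketch below shows one simple diagnostic, under my own assumptions: measure how well a single ID-like feature predicts the target on its own, and compare that with the accuracy of always guessing the majority class. The toy hospital IDs and outcomes are made up, and the function names are hypothetical.

```python
from collections import defaultdict

def single_feature_accuracy(ids, labels):
    """In-sample accuracy when each row is predicted by the majority
    label of its ID group. Labels must be 0 or 1."""
    counts = defaultdict(lambda: [0, 0])  # id -> [negatives, positives]
    for group_id, y in zip(ids, labels):
        counts[group_id][y] += 1
    correct = sum(max(neg, pos) for neg, pos in counts.values())
    return correct / len(labels)

def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)

# Hypothetical toy data: encrypted hospital IDs and outcome (1 = cancer)
hospital_ids = ["a9f", "a9f", "a9f", "c31", "c31", "c31", "7be", "7be"]
outcome = [1, 1, 1, 0, 0, 0, 0, 1]

leak_score = single_feature_accuracy(hospital_ids, outcome)
base_score = baseline_accuracy(outcome)
print(leak_score, base_score)
```

A large gap between the two scores for a feature that should carry no causal information, such as an ID, is a warning sign. Because the score is computed in-sample, it is inflated when groups are tiny, so a held-out split is the safer check.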
Wide data is when you have more features than observations. You may also have few observations, as in clinical trials (a problem that, again, can be addressed using synthetic data). Some models such as decision trees work well with wide data. Regression models don’t do well in this context. However, you can put some constraints on your features to reduce dimensionality, or use data reduction techniques such as principal component analysis. Or you can segment your data set and use a different subset of features depending on the segment. Typically, wide data leads to non-uniqueness in the optimum solution. While many practitioners view this as an issue, I personally embrace non-uniqueness. It gives you more insights about your data, by showing you a wide range of potential models and explanations. In practice, when dealing with many features, some are redundant and may be ignored.
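To make the non-uniqueness concrete, here is a tiny sketch with more features (three) than observations (two). The numbers are invented for illustration. Several different coefficient vectors reproduce the observed responses exactly, each one suggesting a different explanation of the same data.

```python
def predict(X, w):
    """Linear prediction: dot product of each row of X with weights w."""
    return [sum(x_i * w_i for x_i, w_i in zip(row, w)) for row in X]

# Hypothetical wide data set: 2 observations, 3 features
X = [[1, 0, 1],
     [0, 1, 1]]
y = [2, 3]

w_a = [2, 3, 0]  # explains y without using the third feature
w_b = [0, 1, 2]  # explains y mostly through the third feature

# Any point on the line through w_a and w_b is also an exact fit
t = 0.5
w_c = [a + t * (b - a) for a, b in zip(w_a, w_b)]

for w in (w_a, w_b, w_c):
    print(w, predict(X, w))
```

All three weight vectors give a perfect fit. Adding a constraint, for instance forcing the third coefficient to zero, is one way to pick a single solution from this family.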
If you build a recommendation system customized to each of your 300 million users, it may seem impossible to avoid big data. Or maybe you need to store millions of deep space videos to study exoplanets. In my opinion, the only issue with big data is storage. You can reduce storage by keeping only summarized data in production mode or for old data. If you assign a segment ID to each customer or video (which amounts to creating a taxonomy), you can group them in clusters, thus reducing storage. Your summarized data should be just granular enough to make your predictions.
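The summarization idea can be sketched as follows: once each record carries a segment ID, you keep one small summary per segment instead of every raw record. The segment labels, the values and the choice of summary statistics (count and mean) are assumptions made for this example.

```python
from collections import defaultdict

def summarize_by_segment(records):
    """Collapse (segment_id, value) records to a count and mean per segment."""
    totals = defaultdict(lambda: [0, 0.0])  # segment -> [count, sum]
    for segment_id, value in records:
        totals[segment_id][0] += 1
        totals[segment_id][1] += value
    return {seg: {"count": n, "mean": s / n} for seg, (n, s) in totals.items()}

# Hypothetical raw records: (segment ID, some per-customer metric)
records = [("A", 10.0), ("A", 14.0), ("B", 3.0), ("B", 5.0), ("B", 4.0)]

summary = summarize_by_segment(records)
print(summary)
```

Which statistics to keep depends on what the predictions need. If the model uses variances or quantiles per segment, those must be stored too, at a correspondingly higher storage cost.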
Due to measurement errors, seeking increased accuracy may be a futile exercise, at least in many contexts. However, if you run algorithms with many iterations and propagating errors (say, to find an optimum), you want to make sure that the already limited accuracy of your measurements does not deteriorate further when processing the data. I provide a dramatic example in my previous DSC article: see section “When a Wrong Solution is OK, and When it is Not”, in this article. A simple strategy is to reduce accuracy from 14 to 10, 7 and 4 digits, to assess the impact on final results. Also, find out (by checking the literature or contacting your vendor) if and when the algorithm in question is numerically unstable. A well-designed system should give you a warning, for instance “Determinant close to zero, regression coefficients meaningless in this case”.
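The digit-reduction strategy can be sketched as below. This is my own minimal illustration, not the example from the article referenced above: each intermediate value is rounded to a given number of significant digits, and the same iteration is rerun at 14, 10, 7 and 4 digits. The two iterations (Newton's method for the square root of 2, and the logistic map, which is known to amplify small errors) are chosen only for contrast.

```python
import math

def round_sig(x, digits):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    return round(x, digits - 1 - math.floor(math.log10(abs(x))))

def iterate(step, x0, n_iter, digits):
    """Apply `step` n_iter times, rounding after every iteration."""
    x = x0
    for _ in range(n_iter):
        x = round_sig(step(x), digits)
    return x

def newton_sqrt2(x):
    """One Newton step toward sqrt(2): errors shrink at each iteration."""
    return 0.5 * (x + 2.0 / x)

def logistic_map(x):
    """Chaotic map on [0, 1]: small errors grow at each iteration."""
    return 4.0 * x * (1.0 - x)

for digits in (14, 10, 7, 4):
    stable = iterate(newton_sqrt2, 1.0, 50, digits)
    unstable = iterate(logistic_map, 0.3, 50, digits)
    print(digits, stable, unstable)
```

If the final results barely move as digits are removed, as with the Newton iteration, the extra precision was not buying anything. If they change completely, as tends to happen with the logistic map, the output cannot be trusted at any of these precisions without further analysis.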
While there are benefits to having siloed data (each team working on a particular set of features or observations, stored locally), the problem is consistency and synchronization with a central repository. Stakeholders should decide when this is important (to avoid the problems discussed in the first paragraph in this article), and when it is not. In my case, I frequently worked on my own data sets, whether generated myself, downloaded from a central database, or coming from a third party, mostly to prototype predictive systems. My data sets were not always synched with a central database, and it worked well that way. A benefit of siloed data is reducing the number of accesses to central databases, so as not to slow the system for everyone. However, I always made sure that all my experiments were fully replicable.
Security and Privacy
Non-qualified employees should not have access to data sets containing credit card data, social security numbers, or non-anonymous medical records. These fields should be removed or encrypted, possibly using a better technique than MD5 (see why here). Likewise, storing email addresses and personal info must be done only when necessary. Allow the user to “sign-out” fairly simply and quickly. It is always better to share anonymized data, if you need to share it with third parties. Again, having your machine learning modelers work on synthetic data whenever possible (or a blend of synthetic and real data) is one way to avoid these risks.
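As one possible alternative to a plain MD5 digest, the sketch below replaces sensitive fields with a keyed SHA-256 digest (HMAC), using only Python's standard `hmac` and `hashlib` modules. This is pseudonymization rather than full anonymization or reversible encryption. The field names, the record and the key are hypothetical, and in practice the key would have to be stored and rotated securely, outside the data set.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Return a keyed SHA-256 digest (HMAC) of a sensitive string."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def scrub(record, sensitive_fields, secret_key):
    """Copy a record, replacing the sensitive fields with their digests."""
    return {
        field: pseudonymize(value, secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }

# Hypothetical key and record, for illustration only
SECRET_KEY = b"replace-with-a-securely-stored-key"
record = {"card": "4111-1111-1111-1111", "ssn": "000-00-0000", "country": "FR"}

safe_record = scrub(record, {"card", "ssn"}, SECRET_KEY)
print(safe_record)
```

Because the digest is keyed, someone without the key cannot simply hash a list of candidate card numbers and compare. Identical inputs still map to identical tokens, so joins across tables keep working.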
Bias in Data or Results
Under-sampling some segments of the population can result in bias. Rebalancing the data is one way to deal with this. Common sense can help identify vast segments that are missing, for instance recovered people in the early days of Covid. It boils down to capturing the right data to begin with, or complementing it with external sources or proxy data (like sewage data for Covid prevalence). Using very rich synthetic data helps reduce bias as well. If you test your model only on well-separated clusters, you don’t know how it performs on asymmetric data. Synthetic data can help you include all potential cases, for better model fine-tuning: for instance, skewed data, outliers, unbalanced mixtures, non-symmetric distributions, overlapping clusters and so on.
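Rebalancing can take many forms. The sketch below shows one of the simplest, random oversampling: rows from under-represented segments are duplicated at random until every segment is as large as the biggest one. The segment names and rows are invented, and the function name is hypothetical.

```python
import random
from collections import Counter, defaultdict

def oversample(rows, segment_of, seed=0):
    """Duplicate rows at random so that all segments reach the same size."""
    rng = random.Random(seed)  # fixed seed keeps the experiment replicable
    groups = defaultdict(list)
    for row in rows:
        groups[segment_of(row)].append(row)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical sample: 8 urban respondents, only 2 rural ones
rows = [("urban", i) for i in range(8)] + [("rural", i) for i in range(2)]

balanced = oversample(rows, segment_of=lambda row: row[0])
print(Counter(segment for segment, _ in balanced))
```

Duplicating rows adds no new information about the rare segment. It only changes the weight the segment receives, so it complements collecting better data or generating synthetic observations rather than replacing them.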
That said, seeking total absence of bias for the sake of it may be a waste of time or sometimes impossible. A good simple model with little bias is better than a complicated bias-free solution that is more difficult to implement and interpret. Anyway, the absence of bias is linked to the model, not to your data. In the end, your data is not as ideal as your model. In some instances, biased data can result in litigation. Some black-box systems for loan approval are now illegal, because of bias somewhere in the system, and lack of interpretability.