How *big* your data is depends on the quantity of information it contains (measured with entropy metrics) rather than on the number of terabytes. Huge data that is sparse or shallow is in fact not huge, and it can be compressed very efficiently. What do you think?
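To make the point concrete, here is a rough sketch (not a rigorous benchmark) that compares two byte strings of the same length: one sparse, one dense. The helper names, the 1% density, and the use of `zlib` as a stand-in for "compression" are assumptions chosen for illustration.

```python
import math
import random
import zlib
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical zeroth-order Shannon entropy, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# Sparse data: about 1% of the bytes are 1, the rest are 0 (illustrative assumption).
sparse = bytes(1 if rng.random() < 0.01 else 0 for _ in range(n))
# Dense data: every byte value is roughly equally likely.
dense = bytes(rng.randrange(256) for _ in range(n))

for name, blob in (("sparse", sparse), ("dense", dense)):
    print(name, len(blob), "bytes,",
          round(entropy_bits_per_byte(blob), 3), "bits/byte,",
          "zlib ratio", round(compression_ratio(blob), 3))
```

Both strings occupy the same space on disk, yet the sparse one carries only a small fraction of a bit per byte and shrinks dramatically, while the dense one barely compresses at all.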

Here's Stefan Conrady's viewpoint (founder of Bayesia Networks):

If I may cross-post the following from our blog at www.conradyscience.com, which speaks to the same point:

**Learning = Data Compression**

"It has long been understood that even when confronted with a ten-gigabyte file containing data to be statistically analyzed, the actual information-theoretic amount of information in the file might be much less, perhaps merely a few hundred megabytes. This insight is currently most commonly used by data analysts to take high-dimensional real-valued datasets and reduce their dimensionality using principal components analysis, with little loss of meaningful information. This can turn an apparently intractably large data mining problem into an easy problem." [1]
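The dimensionality-reduction idea in the quoted passage can be illustrated with a toy case: two real-valued columns where one is almost a linear function of the other. The sketch below is a minimal, standard-library-only PCA for the 2-D case, using the closed-form eigenvalues of the 2x2 covariance matrix. The function name `pca_2d` and the synthetic data are assumptions made up for this example.

```python
import math
import random

def pca_2d(points):
    """Return ((lam1, lam2), axis): covariance eigenvalues (descending) and
    the unit vector of the first principal axis for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector belonging to lam1.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

rng = random.Random(42)  # fixed seed so the run is repeatable
points = []
for _ in range(1000):
    x = rng.gauss(0, 1)
    points.append((x, 2 * x + rng.gauss(0, 0.1)))  # y is nearly 2 * x

(lam1, lam2), axis = pca_2d(points)
explained = lam1 / (lam1 + lam2)
print("share of variance on the first component:", round(explained, 4))
print("first principal axis:", axis)
```

Nearly all of the variance sits on a single component, so the two columns can be replaced by one with little loss. Real datasets have far more columns and would normally be handled with a linear algebra library rather than by hand.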

As an alternative to dimension reduction, we can exploit existing regularities in the data to create a more compact and thus more tractable representation with Bayesian networks. "In context of Bayesian network learning, we describe the data using DAGs [Directed Acyclic Graphs] that represent dependencies between attributes. A Bayesian network with the least MDL [Minimum Description Length] score (highly compressed) is said to model the underlying distribution in the best possible way. Thus the problem of learning Bayesian networks using MDL score becomes an optimization problem." [2] Consequently, learning Bayesian networks is inherently a form of data compression.
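As a small illustration of structure learning as compression, the sketch below scores two candidate structures for a pair of binary variables. It assumes one common MDL formulation (the BIC-style penalty of half of log2 N bits per free parameter, plus the data cost in bits under maximum-likelihood parameters); this is not necessarily the exact score used in [2]. The function name `mdl_score`, the variables `A` and `B`, and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter

def mdl_score(data, parents, arity):
    """MDL score in bits of a discrete Bayesian network; lower is better.

    data:    list of dicts mapping variable name -> observed state
    parents: dict mapping each variable -> tuple of its parent variables
    arity:   dict mapping each variable -> number of possible states
    """
    n = len(data)
    log_lik = 0.0   # log2-likelihood under maximum-likelihood parameters
    n_params = 0    # free parameters across all conditional tables
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        log_lik += sum(c * math.log2(c / marg[cfg])
                       for (cfg, _), c in joint.items())
        n_params += (arity[var] - 1) * math.prod(arity[p] for p in pa)
    model_bits = 0.5 * math.log2(n) * n_params  # cost of describing the model
    data_bits = -log_lik                        # cost of the data given the model
    return model_bits + data_bits

rng = random.Random(1)  # fixed seed so the run is repeatable
data = []
for _ in range(2000):
    a = rng.randrange(2)
    b = a if rng.random() < 0.9 else 1 - a  # B copies A 90% of the time
    data.append({"A": a, "B": b})

arity = {"A": 2, "B": 2}
no_edge = {"A": (), "B": ()}        # A and B modeled as independent
with_edge = {"A": (), "B": ("A",)}  # A -> B

print("MDL without edge:", round(mdl_score(data, no_edge, arity), 1), "bits")
print("MDL with A -> B: ", round(mdl_score(data, with_edge, arity), 1), "bits")
```

Because `B` mostly copies `A`, the structure with the edge describes the same data in far fewer bits, even after paying for its extra parameter. A search over many candidate DAGs with this kind of score is the optimization problem the quote refers to.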

References:

[1] Davies, S., and A. Moore. “Bayesian networks for lossless dataset compression.” In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 391, 1999.

[2] Hamine, Vikas. “Learning Optimal Augmented Bayes Networks” (n.d.). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.6100.

Posted 1 March 2021
