Last week brought a number of exciting announcements in the big data and machine learning space. They show that there are still plenty of problems to solve in 1) working with and deriving insights from big data, and 2) integrating those insights into business processes.

Probably the biggest (data) headline was that Google open-sourced TensorFlow, its graph-based computing framework. Many stories refer to TensorFlow as Google's AI engine, but it is actually much more. Like Spark and Hadoop, it embodies a computing paradigm based on the directed acyclic graph (DAG). DAGs have been around in mathematics since the days of Euler and have been used in computer science for decades. The past 10-15 years have seen DAGs become popular as a way to model systems, with noteworthy examples being SecDB/Slang from Goldman Sachs and its derivatives (Athena, Quartz, etc.).
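The core idea behind DAG-based computing is simple: express a computation as nodes with dependencies, then evaluate the nodes in an order that respects those dependencies. As a minimal sketch (pure Python, not TensorFlow; the graph and node names are made up for illustration, and `graphlib` requires Python 3.9+):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each node maps to (function, names of the nodes it depends on).
# Leaf nodes take no inputs and just produce constants.
graph = {
    "a": (lambda: 2.0, ()),
    "b": (lambda: 3.0, ()),
    "sum": (lambda x, y: x + y, ("a", "b")),
    "prod": (lambda x, y: x * y, ("a", "sum")),
}

def evaluate(graph):
    # Topological order guarantees every node's inputs are
    # computed before the node itself runs.
    deps = {name: set(args) for name, (_, args) in graph.items()}
    values = {}
    for name in TopologicalSorter(deps).static_order():
        fn, args = graph[name]
        values[name] = fn(*(values[a] for a in args))
    return values

print(evaluate(graph)["prod"])  # 2.0 * (2.0 + 3.0) = 10.0
```

Because the dependency structure is explicit, a scheduler is free to run independent branches in parallel or on different machines — which is precisely what frameworks like Spark and TensorFlow exploit.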

Two things differentiate TensorFlow. First, it transparently scales across various hardware platforms, from smartphones to GPUs to clusters. Anyone who's tried to do parallel computing in R knows how significant this seamless scaling can be. Second, TensorFlow has built-in primitives for modeling recurrent neural networks, which are used in deep learning. After Spark, TensorFlow delivers the final nail in the coffin for Hadoop. I wouldn't be surprised if in a few years the only thing remaining in the Hadoop ecosystem is HDFS.

A good place to get started with TensorFlow is the basic MNIST handwriting tutorial. Note that TensorFlow has bindings for Java, Python, and C/C++. One of Google's goals in open sourcing TensorFlow is to see more language bindings emerge. One example is this simple R binding via RPython, although integrating via Rcpp is probably preferable. If anyone is interested in collaborating on proper R bindings, do reach out via the comments.
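The model that tutorial builds is softmax (multinomial logistic) regression trained by gradient descent. As a rough framework-free illustration of the same idea, here is a NumPy sketch on synthetic data standing in for MNIST (the dataset shape, learning rate, and iteration count are assumptions for the example, not the tutorial's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: 200 samples, 64 features, 3 classes,
# each class drawn around its own random mean so they are separable.
n, d, k = 200, 64, 3
y = rng.integers(0, k, size=n)
X = rng.normal(size=(n, d)) + 3.0 * np.eye(k)[y] @ rng.normal(size=(k, d))

W = np.zeros((d, k))
b = np.zeros(k)
onehot = np.eye(k)[y]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(500):
    p = softmax(X @ W + b)
    grad = (p - onehot) / n        # gradient of mean cross-entropy w.r.t. logits
    W -= 0.001 * (X.T @ grad)
    b -= 0.001 * grad.sum(axis=0)

accuracy = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(accuracy)
```

TensorFlow's contribution is not the model itself but that the same few lines of graph definition run unchanged on a laptop CPU, a GPU, or a cluster.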

What's in a name exactly? A tensor is a mathematical object commonly said to generalize vectors. For the most part, the TensorFlow documentation refers to tensors as multidimensional arrays. Of course, there's more to the story: for the mathematically inclined, tensors can also be viewed as functions, just like matrix operators. The mechanics of tensors are nicely described in Kolecki's An Introduction To Tensors For Students Of Physics And Engineering, published by NASA, and in this (slightly terse) chapter on tensors from U Miami.
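In the multidimensional-array sense, NumPy makes the idea concrete: a rank-0 tensor is a scalar, rank-1 a vector, rank-2 a matrix, and so on — and a rank-2 tensor also behaves as a function (a linear map) on vectors:

```python
import numpy as np

scalar = np.array(5.0)               # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])   # rank 1: shape (3,)
matrix = np.diag([1.0, 2.0, 3.0])    # rank 2: shape (3, 3)
cube = np.zeros((2, 3, 4))           # rank 3: shape (2, 3, 4)

print([t.ndim for t in (scalar, vector, matrix, cube)])  # [0, 1, 2, 3]

# The "tensors as functions" view: a rank-2 tensor maps vectors to vectors.
print(matrix @ vector)  # [1. 4. 9.]
```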

Another notable computing platform is Ufora, founded by Braxton McKee. Braxton's platform differs from TensorFlow and the others mentioned above in that it doesn't impose a computing paradigm on you. All the magic happens behind the scenes, where the platform acts as a dynamic code optimizer, figuring out how to parallelize operations as they happen.

What made headlines is that Ufora decided to open source their kit as well. This is great for everyone, as their technology will likely find its way into all sorts of places. A good place to start is the codebase on GitHub. Do note that you'll need to roll up your sleeves for this one.

Last week in my class, we discussed ways of visualizing multidimensional data. Part of the assignment was clustering data via k-means. One student suggested using PCA to reduce the data to three dimensions so it could be visualized. In Kuhn & Johnson, PCA is cited as a useful data preparation step to remove noise in extra dimensions, which suggests pre-processing with PCA and then applying k-means. Which is right?

It turns out that PCA and k-means are intimately connected. In K-means Clustering via Principal Component Analysis, Ding and He prove that PCA is actually the continuous solution of the cluster membership indicators of k-means. Whoa, that was a mouthful. To add some color, clustering algorithms are typically discrete: an element is either in one cluster or another, but not both. In this paper, the authors show that if cluster membership is considered continuous (akin to probabilities), then the k-means solution is the same as applying PCA!

Back to the original question: in practice both approaches are valid, and it really boils down to what you want to accomplish. If your goal is to remove noise, pre-processing with PCA is appropriate. If the dataset becomes easier to visualize, that's a nice side effect. On the other hand, if the original space is already optimal, there's no harm in clustering first and reducing dimensions via PCA afterward for visualization purposes. If you take this approach, I think it's wise to communicate to your audience that the visualization is an approximation of the relationships.
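To make the two orderings concrete, here is a NumPy sketch (the synthetic dataset and the minimal Lloyd's-algorithm k-means are assumptions for illustration): it clusters once in the original 10-D space, and once after projecting to 3-D with PCA.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three well-separated clusters in 10 dimensions, 50 points each.
centers = rng.normal(scale=10.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(50, 10)) for c in centers])

def pca(X, k):
    Xc = X - X.mean(axis=0)
    # Principal directions are the top right-singular vectors
    # of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        labels = np.linalg.norm(X[:, None] - centroids, axis=2).argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels

labels_orig = kmeans(X, 3)           # cluster in the original 10-D space
labels_pca = kmeans(pca(X, 3), 3)    # PCA first, then cluster in 3-D
```

With clusters this well separated, both orderings should recover essentially the same partition (up to label permutation), which is the Ding and He result at work; the choice matters more when the discarded dimensions carry signal rather than noise.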

*What are your thoughts on using PCA for visualization? Add tips and ideas in the comments.*

It's true, the jet pack is finally a reality. Forty years after their first iteration of the RocketBelt, the inventors have succeeded in extending the flight time from 30 seconds to 10 minutes, with a top speed of around 100 km/h.

https://www.youtube.com/watch?v=f3AwBSwFV2I

© 2020 Data Science Central ®
