I became a Data Science practitioner in what is probably the most peculiar, counter-intuitive way possible. Before being confronted by the much-debated Python or R dilemma, before being exposed to Data Analytics Platforms, Math and Statistical theory, I concerned myself primarily with what I can best term “*Data Transmutation*”. I intuitively knew that before I could *do* Data Analytics, I had to somehow transmute the Data. I would primarily be mining Legal Data; its unstructured nature and textual rigidity would compel me to develop a way of *modifying* it into a form that is Analytics-receptive. This was before I even knew what ETL was.

I soon began quantifying Legal permutations and expressing them as mathematically weighted numerals for the purposes of efficient Data Mining. This method is designed specifically for Machine Learning Algorithms that require numerical attributes and weights for their calculations. I realized, however, that beyond mathematically quantifying Legal Data, the method was in fact *altering* the Data completely, and (to my surprise) not just Legal Data. That is how Data Transmutation, as I understand it, was born.

Data Transmutation creates a synthesis of multiple events that are first encapsulated into a single expression, then transposed into a math function and finally transmuted into a coherent data point. It condenses Data into mathematically calculable numerals and symbolic expressions that weight the factual permutations of events and occurrences. The results are highly potent Data points; it’s like condensing Lite Beer into a liquid with an alcohol content of 100%. This distillation of Large Data sets into highly concentrated but rational Data points is very helpful.

Data Transmutation is a bit like gene extraction. All data tells a story (some stories more exciting than others) and every story has a core phenotypical structure and genome. Data Transmutation is a way of delineating the Data into “*DNA*” strands that represent the foundational archetype of the story the Data is telling. Transmutations mine the primary consequence of events and occurrences by isolating the systemic functions of data, thereby extracting only salient truths.

Think of the process of Diffusion in Biology, in which a substance moves from a region of high concentration to one of low concentration as it spreads out and occupies a larger space. Through Diffusion, a gas loses its potency and efficacy as it begins to spread, which is very similar to what happens during the collection and architecture of Data. When you Extract, Transform and Load Data, you’re essentially taking a series of events and fragmenting them into scalable features for the purposes of Algorithmic enquiry. This fragmentation increases density, widens factual parameters, increases variation and ultimately “*diffuses*” the efficacy of the story. However, when Data is transmuted into a condensed form, factual parameters aren’t unnecessarily expanded, the features become more salient, the density remains the same and variation is kept at a healthy level.
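To make the encapsulate, transpose, transmute sequence a little more concrete, here is a minimal Python sketch. It is only an illustration of the general idea, not the actual method described in this post: the event names, the weights in `EVENT_WEIGHTS`, and the choice of a weighted sum as the "math function" are all assumptions made up for the example.

```python
# Hypothetical event weights. None of these names or numbers come from
# the article; they exist only to make the sketch runnable.
EVENT_WEIGHTS = {
    "contract_signed": 1.0,
    "payment_missed": 2.5,
    "notice_served": 1.5,
}


def encapsulate(events):
    """Step 1: collapse a sequence of events into a single expression (event counts)."""
    counts = {}
    for name in events:
        counts[name] = counts.get(name, 0) + 1
    return counts


def transmute(events, weights=EVENT_WEIGHTS):
    """Steps 2 and 3: apply a weighted sum (the assumed math function)
    to the expression and return one condensed data point."""
    expression = encapsulate(events)
    return sum(weights[name] * count for name, count in expression.items())


# Four separate events in one hypothetical legal matter become one numeral.
case_events = [
    "contract_signed",
    "payment_missed",
    "payment_missed",
    "notice_served",
]
score = transmute(case_events)
print(score)
```

Because the function is a plain weighted sum, the same value could be worked out by hand from the event counts and the weight table, which is the property the post attributes to transmuted data points.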

There is of course Sampling, Feature Optimisation, Feature Generation (combinational vector creation) and many other tools, all of which are used to distil data into a state of optimum lucidity. However, there is a difference between Transmutation and Segmentation, which is essentially what the above tools perform. They minimize and optimally abridge the Data to be analysed; they do not fundamentally mutate it.

Discretization methods are ubiquitous on all good Analytic platforms; they are probably the closest you can come to changing the aesthetic identity of data points without using Data Transmutation. You could certainly use Discretization to convert numerical attributes (where some entries are “0”) into binary attributes detailing “Yes” or “No”. This, however, cannot be done without compromising the structural and probative integrity of those data points.
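The information loss described above can be seen in a small Python sketch. The `amounts` values and the `binarize` helper are hypothetical, chosen only to show a numerical attribute with some zero entries being discretized into “Yes”/“No”.

```python
def binarize(values):
    """Map each numeric entry to "Yes" (non-zero) or "No" (zero)."""
    return ["Yes" if v != 0 else "No" for v in values]


# Hypothetical numerical attribute in which some entries are 0.
amounts = [0, 3, 250, 0, 0.5]
labels = binarize(amounts)
print(labels)

# 3, 250 and 0.5 all collapse into the same label, so the original
# magnitudes cannot be recovered from the binary attribute.
distinct_before = len(set(amounts))
distinct_after = len(set(labels))
print(distinct_before, distinct_after)
```

Four distinct numeric values are reduced to two labels, which is one way of reading the claim that discretization compromises the integrity of the original data points.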

Nevertheless, the defining feature of Data Transmutation is the ability to mathematically calculate the value of a transmuted data point without using Analytics to do so. Consider a classification model for Gold, for example. One of the data points under the attribute “Pressure and Temperature Data” has been Transmuted from 27.0 GPa of Shear Modulus (original data point) into a mathematically weighted expression of (P+) 4.833 (Transmutation Value). Because the data point has been delineated into a math function, it is possible to calculate the mathematically representative value of Shear Modulus, in short-hand form, without using any code or software: with pen and pad. Think of Machine Learning Algorithms that have the ability to produce Formulas for their results or, at a very abstract level, even MapReduce; Data Transmutation works in a similar way.

I am not saying that this method of altering data is a divine panacea. Just like a Machine Learning Algorithm, it has conditions and parameters that it must satisfy to perform optimally. All I am saying is that there is a way of changing data to facilitate a more advanced method of Machine Learning. Unfortunately, Transmutation will inevitably lengthen the already protracted ETL process; the rewards, however, are bountiful. As Data Science practitioners we should let go of the fear of “corrupting” data. Change it as you see fit and you may be pleasantly surprised by the results.

© 2020 Data Science Central ® Powered by
