A Discussion: IT Data, Ambiguities & Classification Model Performance

“Ambiguity is pervasive” holds true to its definition: as more and more data is generated and system connectivity reaches its peak, data and outcomes are diverging. IT systems are evolving from “BIG DATA” to “BIGGER DATA” systems. Not all of this data is structured and easily consumable, so a challenge is posed by the nexus of technology and “Data Greed”.

Having said this, the fact is that the future is found in ambiguity and chaos. We will never have complete and perfect information, or a full understanding of the data, systems, experts, people, processes and “partially hidden” technology. The challenges are less about the increasing volume of data and more about finding the meaning in the data and associating it with relevant actions that deliver the intended business value. Why is the data ambiguous? One reason is a lack of clarity about outcomes and business value, and about how they correlate with the data required. If the data required is not clear, then the data sources and the means of acquisition are not clear either. Thus the whole ecosystem of gathering, acquiring, transferring, sharing, storing and distributing data carries a good amount of uncertainty, which is amplified further down the data supply chain. Changing business goals, markets and expectations add to this chaos even further!

One of the requirements for coming out of data ambiguity and the “information dark spot” is the ability to explore the data, get to the root of the “usable” information and develop a cloud of meaning around the data so that it can be presented in a meaningful form.

Thus, cleaning the data, imputing missing values and deriving a minimum viable set from it is the first step in de-cluttering an ambiguous data set. Once features and intents have been identified from the data set, ample time should be spent on brainstorming and discussion with SMEs to understand each feature and assign it a category. Feature categorization can be manual, which is cumbersome, or automatic. Discussing these steps in detail is outside the scope of this article; here I outline classification techniques and the limits of their performance on IT data.

Once sufficient information and knowledge has been gathered about the nature of the data and its features, effort should be put into categorizing the features. Various algorithms are available to assist with this for all forms of data. One of them is the Support Vector Machine (SVM), a supervised classification algorithm that can address limitations posed by other classification models. SVMs have built-in protection against overfitting through regularization, and they can handle large feature spaces. For IT data, SVM gives better recall and precision in predicting true positives.
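Since recall, precision and f-score are the yardsticks used throughout this discussion, here is a minimal sketch of how they are computed for one category. The ticket categories and predictions below are invented purely for illustration and are not taken from the study; the function name is likewise my own.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) with one class treated as positive."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1   # correctly predicted positive
        elif pred == positive:
            counts["fp"] += 1   # predicted positive, actually another class
        elif truth == positive:
            counts["fn"] += 1   # missed a real positive
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, for illustration only.
y_true = ["network", "network", "storage", "network", "storage", "database"]
y_pred = ["network", "storage", "storage", "network", "network", "database"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="network")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of everything predicted as this category, how much was right?", while recall answers "of everything truly in this category, how much was found?". The f-score is the harmonic mean of the two.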

Another classifier is maximum entropy. The principle of maximum entropy states that, subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with the largest entropy. For the IT data under discussion, maximum entropy has the shortest convergence time, and its accuracy is above 60%, which is as good as SVM.
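To make the principle concrete, the sketch below compares the Shannon entropy of three candidate distributions over four categories. It covers only the simplest case, where nothing is known beyond the number of categories; a maximum entropy classifier applies the same idea under constraints derived from the training features, which this sketch does not attempt. The distributions are made up for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Three hypothetical distributions over the same four categories.
candidates = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "skewed": [0.70, 0.10, 0.10, 0.10],
    "certain": [1.00, 0.00, 0.00, 0.00],
}

entropies = {name: shannon_entropy(p) for name, p in candidates.items()}
best = max(entropies, key=entropies.get)

for name, value in entropies.items():
    print(f"{name:8s} entropy = {value:.3f} bits")
print("largest entropy:", best)
```

With no other information, the uniform distribution has the largest entropy, which is the formal way of saying it assumes the least about the data.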

Elastic net fails because of difficulties in getting lambda to converge for larger feature sets. Similarly, random forest and neural networks take the longest time to converge.
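For readers unfamiliar with the lambda in question, the elastic net objective, in the form used by GLMNET, can be written as below. Here \ell is the per-observation loss (for classification, the negative log-likelihood), \lambda sets the overall penalty strength and \alpha mixes the lasso and ridge penalties. This is the general formulation only, not a record of the exact settings used in the study.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{N}\sum_{i=1}^{N} \ell\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right)
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2} + \alpha\,\lVert\beta\rVert_1\right]
```

The model is typically fitted over a whole sequence of \lambda values, so every additional feature adds a coefficient that has to be re-estimated at each value in that sequence.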

Random forest and SVM are non-parametric models, which means their complexity grows as the size of the training set increases. Training a non-parametric model can be cumbersome and expensive when compared with generalized linear models [GLMNET etc.].

It has been observed from the study that all the classifiers fail to converge within a few hours for a feature sample size > 3000, irrespective of whether R or Python is used. One wonders what the right feature set and training size is for parametric as well as non-parametric models. The answer lies in understanding the theory of classification models. SVM requires the “right” kernel, regularization penalties and slack variables to be tuned. Thus, SVM works better with a smaller feature set and fewer outliers. SVM hyperparameters are commonly tuned by grid search, which trains a model for each pair (C, γ) in the Cartesian product of the candidate values. Grid search suffers from the “curse of dimensionality”, but it is easy to run in parallel because the hyperparameter settings it evaluates are typically independent of each other. The data set under discussion has multiple features that overlap among various categories, so finding a distinct separating hyperplane and a workable grid structure is a challenge. This leads to convergence issues, which are reflected in “lower” recall and f-score.
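The grid search idea can be sketched in a few lines. The candidate values for C and γ are arbitrary, and `toy_score` is a hypothetical stand-in for "train an SVM with these settings and return its cross-validated score"; it is not a real model and does not reproduce the study's numbers.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Arbitrary candidate values, chosen only for illustration.
C_values = [0.1, 1, 10, 100]
gamma_values = [0.001, 0.01, 0.1, 1]

# Every (C, gamma) pair in the Cartesian product is one model to train.
grid = list(product(C_values, gamma_values))


def toy_score(params):
    """Hypothetical stand-in for a cross-validated score of one setting."""
    C, gamma = params
    return -((math.log10(C) - 1) ** 2) - (math.log10(gamma) + 2) ** 2


# The settings do not depend on each other, so they can be scored in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(toy_score, grid))

best_params = grid[scores.index(max(scores))]
print("settings evaluated:", len(grid))
print("best (C, gamma):", best_params)

# Curse of dimensionality: with 4 candidate values per hyperparameter,
# the number of models to train grows exponentially with their count.
grid_sizes = {n: 4 ** n for n in (2, 3, 5)}
print(grid_sizes)
```

In practice each call would train and validate a full model, so the cost of the grid is the number of settings multiplied by the training time, which is why large feature sets make the search so expensive.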

Random forest, on the other hand, works by building a multitude of decision trees and outputting the class that most of the individual trees predict (for regression, it outputs the mean of their predictions). Individual decision trees tend to overfit, which the forest counters by combining many trees. A random forest can also be viewed as a weighted neighborhood scheme: the neighborhood is built on the training set, and when a new feature set is input, its class is predicted from the trained neighborhood. The challenge here is “overlapping features” combined with a large number of features, which leads to enormous decision trees and hence to overlapping weighted probabilities across the trees. Since the whole forest is again a weighted neighborhood over the training set, the computation time, and hence the time to converge, is huge, which makes it unfit for real-time implementation.
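The voting step can be sketched as follows. The per-tree predictions are invented for illustration; in a real forest each one would come from a decision tree trained on a bootstrap sample of the data. The function names are my own.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


def vote_share(tree_predictions):
    """Return the fraction of trees voting for each class label."""
    total = len(tree_predictions)
    return {label: n / total for label, n in Counter(tree_predictions).items()}


# Hypothetical outputs of five trees for one new ticket.
clear_case = ["network", "network", "network", "storage", "network"]
# With overlapping features the trees disagree and the margin shrinks.
overlapping_case = ["network", "storage", "network", "database", "storage"]

print(majority_vote(clear_case), vote_share(clear_case))
print(vote_share(overlapping_case))
```

In the second case the vote is split almost evenly between categories, which is how overlapping features show up at prediction time: the forest still returns a class, but with little margin behind it.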

Thus, the performance of these classifiers on larger feature sets exposes the challenges of real-time/online implementation of text classification models. Such limitations of classification models, together with insufficient feature sets, are currently driving data into ambiguity and ambiguity into chaos. To come out of such scenarios, intensive feature engineering and supervised modelling should be applied.

In the next article I shall cover in detail the right number of features to select, the practical steps involved in feature engineering when dealing with IT infrastructure, and classifiers versus operational scenarios.