I am new to machine learning techniques and I am not sure whether there is a solution to the problem I am facing. The task is to find a highly nonlinear decision boundary in three dimensions, but I cannot use those dimensions directly as features. So, although I want a decision boundary in the x,y,z space, the inputs I have are a,b,c,d, etc. Of course, the inputs are correlated with the x,y,z dimensions. The input features should model the boundary; in other words, as the input feature values change, the decision boundary changes. I have a large amount of data for the inputs (a,b,c,d,...) and the respective decision boundary for those particular inputs.
Can anyone suggest a suitable algorithm or methodology for such a problem? Any help is appreciated. Thanks in advance.
- This is the problem with purely tree-like decision rules. Reducing the number of variables is a goal of various classification algorithms (see the books coauthored by Hastie, e.g., The Elements of Statistical Learning). However, eliminating variables is sometimes disadvantageous from the perspective of interpretation or planning.
- Instead, you might consider reducing a,b,c,d,... to your x,y,z dimensions with principal components analysis (PCA), using the principal component scores in place of the raw inputs. In PCA, a,b,c,d,... are likely to be correlated with each of the x,y,z dimensions to a greater or lesser extent; rotation, usually by the "varimax" method, helps make that pattern interpretable. If a,b,c,d do not correlate systematically with x,y,z as principal components (a type of latent variable), then you should revisit your theory of how a,b,c,d should relate to x,y,z. Otherwise, the question moves to the next level (next paragraph). Also, please understand that correlation analysis and PCA are not the only ways of tying a,b,c,d to x,y,z, but PCA is the most basic, and other methods tend to build on its principles. PCA scores are linear (weighted) combinations of the inputs a,b,c,d that provide a score for each of x,y,z.
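- To make the "scores as linear combinations" point concrete, here is a minimal PCA sketch in plain NumPy, assuming a hypothetical 200 x 4 input matrix standing in for a,b,c,d (no varimax rotation; this only extracts the unrotated scores):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input matrix: 200 samples of the correlated features a, b, c, d.
X = rng.normal(size=(200, 4))

# Center the columns, then take the SVD; the rows of Vt are the PCA loadings.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the first three components as stand-ins for the x, y, z dimensions.
# Each score is a linear (weighted) combination of the inputs a, b, c, d.
scores = Xc @ Vt[:3].T   # shape (200, 3): one (x, y, z)-like score per sample

print(scores.shape)
```

  The three score columns are mutually uncorrelated by construction, which is part of why downstream classifiers handle them gracefully.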
- Next level. PCA scores for each of x,y,z can easily be used to build tree-like or linear model-based decision rules. Something like C5.0 or the more popular CHAID program would work for generating tree models. Alternatively, a multinomial/binary logit or multinomial/binary discriminant analysis could be used to generate classification rules from linear combinations of the scores themselves (side note: PCA scores tend to be more "normal" in distribution because they are additive in makeup).
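- Both routes (tree-like rules and a binary logit on the scores) can be sketched with scikit-learn pipelines; the data here are hypothetical, and `DecisionTreeClassifier` stands in for C5.0/CHAID, which are separate commercial/SPSS tools:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Hypothetical data: features a, b, c, d plus a binary class label.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Route 1: PCA scores feeding a binary logit (a linear rule on the scores).
logit = make_pipeline(PCA(n_components=3), LogisticRegression())
logit.fit(X, y)

# Route 2: the same scores feeding a tree-like classifier.
tree = make_pipeline(PCA(n_components=3), DecisionTreeClassifier(max_depth=3))
tree.fit(X, y)

print(logit.score(X, y), tree.score(X, y))
```

  The pipeline keeps the PCA step and the classifier fitted together, so new samples are projected onto the same components before classification.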
- In data science, we rely on classification accuracy as long as it is robust (e.g., 10 random samples from the population have equally, or nearly equally, good correct-classification rates) and reliable (strongly statistically significant parameter estimates and overall model fit statistics). A neural net is also an alternative, combining aspects of tree and linear models.
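- The "10 random samples" robustness check above can be sketched as 10 random train/test splits, again on hypothetical data (this is one way to implement the idea, not the only one):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(2)

# Hypothetical data: features a, b, c, d plus a binary class label.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Ten random train/test splits, standing in for ten samples from the population.
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
acc = cross_val_score(LogisticRegression(), X, y, cv=cv)

# "Robust" here means the ten accuracies are all close to one another.
print(acc.mean(), acc.std())
```

  A large spread across the ten accuracies would suggest the model's performance depends on which sample it happened to see, i.e., it is not robust in the sense above.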
- Final level. Does your predictive model, now robust and reliable, produce enough return on investment to be used in your organization? Because we (data scientists) are often seen as a "dark" resource, we become vulnerable if we waste time on things that aren't going to "lift all boats," as the metaphor goes. Even if that principle has never been articulated by management, you can bet it will surface if there is a lack of gainful, profitable output from the DS team.