I am solving a multi-label, multi-class classification problem with a random forest, using the binary relevance method. I use a label encoder to convert the categorical values.
How should I handle a value that appears in the test set but was never present in the training set?
Many libraries handle an unseen value as if it were a NaN, i.e. the unseen value takes the same branch in the tree as a NaN would.
This should be avoided as much as possible, because an unseen value carries no information the model can use and can degrade predictive performance.
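For reference, the fallback behaviour described above can be sketched in a few lines. This is not scikit-learn's LabelEncoder; `SafeLabelEncoder` and the reserved code `-1` are hypothetical names chosen for illustration, and the example assumes the downstream model can treat `-1` as a "missing" category.

```python
class SafeLabelEncoder:
    """Hypothetical helper: integer-encodes categories and reserves -1 for unseen ones."""

    UNKNOWN = -1  # assumed sentinel code for values not seen during fit

    def fit(self, values):
        # Assign a stable integer code to every distinct training value.
        self.mapping_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Values absent from the training mapping fall back to the sentinel.
        return [self.mapping_.get(v, self.UNKNOWN) for v in values]


encoder = SafeLabelEncoder().fit(["Paris", "Berlin", "Paris"])
codes = encoder.transform(["Berlin", "Paris", "Lyon"])  # "Lyon" was never seen
```

Here "Lyon" ends up with the sentinel code, so the model has nothing specific to say about it. That is the situation the rest of this answer tries to limit.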
One way to limit the impact of unseen values is to create new features for which we know no unseen value can occur.
Such features need a small cardinality (number of possible values), so that every possible value is certain to appear in the training data.
If an unseen value appears in the test data for a feature F, you can try to create a new feature F', derived from F, with lower cardinality by grouping values together.
For example if F is the feature "City" with all possible cities in the world as values, you can add the feature F' which is "Country". That way all cities of the same country will have the same value for feature F'.
Another example is to add the feature "Domain" from the feature "Job" (job="developer" => domain="computer science", job="data scientist" => domain="computer science", job="psychiatrist" => domain="medicine").
If a city was unseen but you have seen another city of the same country, you can use this information thanks to the new feature F'.
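The City to Country idea can be sketched as follows. The lookup table `CITY_TO_COUNTRY`, the helper `add_country` and the `"Other"` fallback are assumptions made for this example; in practice the mapping would come from domain knowledge or an external reference, not from the training set itself.

```python
# Hypothetical lookup built from domain knowledge, independent of the training data.
CITY_TO_COUNTRY = {
    "Paris": "France",
    "Lyon": "France",
    "Berlin": "Germany",
    "Munich": "Germany",
}


def add_country(rows, lookup=CITY_TO_COUNTRY, default="Other"):
    """Return copies of the rows with a lower-cardinality 'country' feature added."""
    return [{**row, "country": lookup.get(row["city"], default)} for row in rows]


train = add_country([{"city": "Paris"}, {"city": "Berlin"}])
test = add_country([{"city": "Lyon"}, {"city": "Munich"}])

seen_cities = {row["city"] for row in train}
seen_countries = {row["country"] for row in train}

# Both test cities are new to the model, but their countries are not.
unseen_cities = [r["city"] for r in test if r["city"] not in seen_cities]
unseen_countries = [r["country"] for r in test if r["country"] not in seen_countries]
```

In this toy data every test city is unseen, while `unseen_countries` stays empty, so the model can still split on the country feature.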
You can use clustering on F to create F'. Every cluster of values of F will form a value for F'.
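A rough sketch of the clustering variant is given below. It assumes that every value of F can be described by some auxiliary numeric attributes, here approximate (latitude, longitude) pairs in a hypothetical `CITY_COORDS` table, so that a city never seen in training can still be assigned to the nearest cluster. The tiny k-means is written by hand to keep the example self-contained, and plain Euclidean distance on coordinates is a simplification.

```python
import math

# Hypothetical auxiliary description of each value of F: approximate (lat, lon).
CITY_COORDS = {
    "Paris": (48.9, 2.4),
    "Lyon": (45.8, 4.8),
    "Berlin": (52.5, 13.4),
    "Munich": (48.1, 11.6),
    "Marseille": (43.3, 5.4),
}


def nearest(point, centres):
    """Index of the centre closest to point (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def kmeans(points, k, n_iter=20):
    """Very small k-means; the first k points serve as initial centres."""
    centres = list(points[:k])
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else old
            for g, old in zip(groups, centres)
        ]
    return centres


# Clusters are learned from the cities present in the training set only.
train_cities = ["Paris", "Berlin", "Lyon", "Munich"]
centres = kmeans([CITY_COORDS[c] for c in train_cities], k=2)


def cluster_of(city):
    """Value of the derived feature F' for a city, seen in training or not."""
    return nearest(CITY_COORDS[city], centres)
```

With these numbers, "Marseille" does not appear in `train_cities` but falls into the same cluster as "Paris" and "Lyon", so its F' value is one the model has already seen.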
The lower the cardinality of a feature, the easier it is to cover all possible values in the training set.