Hi all,

I would like to apply a classification model to two real datasets. The first dataset has 5 groups/class labels (e.g. A, B, C, D, and E), and the second has only 3 (e.g. A, B, and C). I have to use the first one as the training set to build the classification model and the second one as the testing set.

Note, all the values of each feature/variable are available and not missing. 

Which of the two options is the appropriate one?

1) I should eliminate all samples of the two extra labels (i.e. D and E) from the training set and then build the classification model.

2) I should keep all samples of the two extra labels in the training set and then build the classification model.

I am in favor of option 2, but I am not so sure.

I will appreciate any help.

Thanks in advance,

Samer


Replies to This Discussion

The test set is used to check the model's metrics, but in real-time prediction you may encounter any of the possible classes (A, B, C, D, and E). If the data you are going to predict on contains only classes A, B, and C, it is better to remove D and E from the training set. Otherwise, you can move some D and E samples from the training set to the test set, so that both sets cover the same classes.

Interesting... could you give more context about the problem and about why classes D and E are missing in the test data? For instance, if they just happen not to occur but in general are expected to occur then I would say option 2 is the most sensible (this implies that if the model predicts class D or E, it should be marked as an error).

Daniel
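A minimal sketch of how option 2 could be scored, assuming the model was trained on all five labels and the test set contains only A, B, and C. The labels and predictions are made up for illustration, and `score_predictions` is a hypothetical helper, not part of any library.

```python
def score_predictions(y_true, y_pred, test_labels):
    """Score predictions when the model knows more classes than the test set contains."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    # A prediction of D or E can never equal a true label of A, B, or C,
    # so it is counted as an error without any special handling.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    # Share of predictions that fall on classes absent from the test set.
    out_of_test = sum(p not in test_labels for p in y_pred)
    return {"accuracy": correct / n, "out_of_test_rate": out_of_test / n}


# Illustrative data only: true labels come from the 3-class test set,
# predictions come from a model trained on all 5 classes.
y_true = ["A", "A", "B", "B", "C", "C", "C", "A"]
y_pred = ["A", "D", "B", "B", "C", "E", "C", "A"]

result = score_predictions(y_true, y_pred, test_labels={"A", "B", "C"})
print(result)
```

Reporting the share of predictions that land on D or E separately from the accuracy shows how much of the error comes from the classes the test source did not provide.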

Raghunandan,

First, I want to thank you for your help. I really appreciate it.

Your suggestion of removing D and E will mean that the training data is influenced by the testing data. Is that allowed? Both sets should be independent (i.e. when we deal with one we pretend that we don't know anything about the other one)

I don't want to mix the training dataset with the testing dataset. Each dataset was produced by a different lab/source. Keeping them separate will show that the classification can be applied to data from independent sources. So, I will avoid this option.

Thanks a lot for your help,

Best,

Samer

Daniel,

I am glad that the question was interesting to you.

Regarding your question: why are classes D and E missing from the test set?

I am downloading each dataset from a different source, and the source of the testing data did not include any samples of D and E. In reality, though, samples of D and E could be generated in the future.

Regarding your point that when the model predicts D or E it should be marked as an error: do you mean a false positive, or an error meaning that the result should not be considered?

Thanks again for your help.

Best,

Samer

I mean that the prediction should be marked as an incorrect prediction (thinking in terms of classification table).
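As a rough illustration of such a classification table, the sketch below builds one over all five labels in plain Python. The data are invented and `classification_table` is a hypothetical helper. Rows are true labels and columns are predicted labels, so the D and E rows stay empty while any count in the D or E columns is an incorrect prediction.

```python
from collections import Counter


def classification_table(y_true, y_pred, labels):
    """Return a table where rows are true labels and columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["A", "B", "C", "D", "E"]

# Illustrative data only: the test set has no true D or E samples.
y_true = ["A", "A", "B", "B", "C", "C", "C", "A"]
y_pred = ["A", "D", "B", "B", "C", "E", "C", "A"]

table = classification_table(y_true, y_pred, labels)

print("true/pred", *labels)
for label, row in zip(labels, table):
    print(label, *row)

# Everything off the diagonal is an incorrect prediction,
# including predictions of D or E.
errors = sum(
    table[i][j]
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j
)
print("incorrect predictions:", errors)
```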

Hi Samer 

Option 1:

If you don't remove the other two features, the model will be biased, which might result in bad classification (depending on the importance of those two extra features).

Option 2:

If you don't want to remove them, we need to find a pattern inside the dataset (a missing-value strategy; look on the web for more details), but you need to be very careful.

Hi Ravindra,

Thank you for your interest in my question. 

I just want to clarify that the samples/classes of group D and E are missing from the testing data. All the values of each feature/variable are available and not missing. 

I will update the discussion to clarify the problem.

Best,

Samer

© 2019 Data Science Central ®