I'm new to the Data Science Central community and I need some help. I'm starting a discussion in the hope that the community can help me understand how wrong I might be about Python. I come from the Business Intelligence field and am trying to advance in Machine Learning and Big Data.
I started learning R maybe 5 years ago. No spectacular projects, just a few courses on EDX and Coursera with their applications. This year I wanted to learn Spark to start a project with a lot of uncertainty about the volume of data, the features, the applications and so on. I saw that I needed a new language for better compatibility with Spark, and I chose Python. Then I started to do the same things I did in R but with Python - for example the Analytics Edge course on EDX, which offers a good learning opportunity in my opinion.
My first surprise was to see how many different libraries in Python do the same things. Then there is sklearn, which doesn't offer even minimal details for, say, a linear regression summary: no statistical significance of the coefficients and not even adjusted R2 for the model, only R2, intercept and coefficients. To get these details I have to use statsmodels.
Then I moved on to classification and regression trees, random forests, and clustering. Python produced completely different trees and clusters. I took the default parameters from R and used them in Python. Same outcome: completely different trees, with different splitting nodes on different variables and different split values. The results also look wrong - for example, the movie Men in Black was clustered as a Comedy rather than as an Action + Adventure + SciFi movie.
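For reference, adjusted R2 can be recovered from the plain R2 that sklearn reports, given the number of observations and predictors. A minimal sketch: the function name and the example numbers are made up for illustration, and it does not cover coefficient significance, which still calls for something like statsmodels.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Illustrative numbers only: R2 = 0.90 from 100 rows and 5 predictors.
example = adjusted_r2(0.90, n_samples=100, n_features=5)
print(round(example, 4))
```

The penalty grows with the number of predictors, so adjusted R2 is always at or below plain R2 for a model with at least one predictor.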
I remember when I learned R it was so smooth and the results were very accurate.
I wonder what to do. I just needed Python to be able to approach Spark. Now I'm so confused: if we take two people, one using R and the other using Python, these two people can come to completely different results and recommend completely different actions to their management.
With so much randomness already in the field, the differences between platforms add one more layer of uncertainty.
Thank you very much for contributing your advice.
If you have one watch you always know the time, but if you have two watches you can never be sure. :-)
Different implementations often yield different answers for a given problem. Explaining a difference typically requires getting into the specifics of the implementation. It's hard to speak in generalities because it's usually something in the details that drives any differences.
So for example, if we're talking about DecisionTreeClassifier in sklearn and the code is doing a split, there are two ways to run it: best split (the default) and random split, so depending on that setting a given problem may yield different splits and a different tree. Within best split the code samples up to max_features features without replacement, Fisher-Yates style, so you might get splits which are not deterministic unless max_features equals your number of features (and even then, ties between equally good splits can be broken differently from run to run). Setting random_state to some integer should yield the same "random" answer each time. We also have parameters such as min_samples_split that control the outcome.
If we're talking about tree (a classification tree in R), then we're talking about an implementation that will use either Gini or deviance to measure impurity when deciding the split, with additional parameters like mincut and mindev that control splitting.
Any and all of these details can lead to different answers, and this is just for two implementations of a moderately complex procedure.
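On the sklearn side, the effect of random_state can be checked directly. The sketch below is only an illustration: the iris dataset, the choice of max_features=2, and the helper name fit_tree are my assumptions, not anything from the original question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def fit_tree(seed):
    # max_features=2 makes each split consider a random subset of the 4 features.
    clf = DecisionTreeClassifier(
        splitter="best", max_features=2, min_samples_split=2, random_state=seed
    )
    return clf.fit(X, y)


tree_a = fit_tree(0)
tree_b = fit_tree(0)
tree_c = fit_tree(1)

# Same seed: the split features and thresholds should be identical.
same_seed_match = bool(
    np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
    and np.allclose(tree_a.tree_.threshold, tree_b.tree_.threshold)
)
print("same seed, same tree:", same_seed_match)

# Different seed: the trees may or may not differ; compare the root split.
print("root feature, seed 0:", tree_a.tree_.feature[0])
print("root feature, seed 1:", tree_c.tree_.feature[0])
```

Matching an R tree would additionally require lining up the impurity measure and the stopping parameters, which this sketch does not attempt.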
Even if algorithms are deterministic, you'll see different tie-breaking behavior in different implementations, but this usually doesn't yield very different answers.
Another thing to watch out for, and this happens even for simple functions, is an apparent difference due to conventions. So for example one statistical package might return kurtosis after subtracting 3 (e.g., the Excel KURT() function), whereas others do not subtract 3 by default.
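The kurtosis convention is easy to see with a small hand-rolled function. This is a sketch using the plain population formula; the function name and data are made up, and real packages may also apply sample-size corrections that are ignored here.

```python
def kurtosis(values, excess=False):
    """Population kurtosis m4 / m2**2; subtract 3 when excess=True."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    k = m4 / m2 ** 2
    return k - 3.0 if excess else k


data = [1.0, 2.0, 3.0, 4.0, 5.0]
plain = kurtosis(data)                 # convention without subtracting 3
excess = kurtosis(data, excess=True)   # convention with 3 subtracted
print(plain, excess)
```

Same data, same formula, two numbers that differ by exactly 3: neither is wrong, but comparing one package's output against the other's without checking the convention looks like a disagreement.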
Hope this helps. Good luck!
Hi Justin,
Thank you very much for your explanations!
I understood the hyper-parameters part earlier, but apparently not in detail: only conceptually, not fundamentally. Now I'm having a moment of reflection: the randomness doesn't stop at random variables, it runs through the entire chain:
- random variables
- learning algorithm
- hyper-parameters
- platform...?
Did I miss anything? :) Probably the design of the study, the designer, the person collecting the data, the ML team experience... shall I stop?
So, now I have a bit bigger problem: do we have to include a distribution of algorithms and platforms in solving a ML/AI problem?
I think for some problems, the number of algorithms is very limited (CNN, RNN), but for other problems, the number of algorithms is larger (Linear Regression, Logistic Regression, SVM, Decision Tree, Random Forest, etc.).
On the other hand, selecting a platform could be simpler (?) - take one of the many that satisfy a list of criteria: performance, scalability, learning curve, available features/algorithms/graphical interface, support, etc. I think I've seen lists of criteria even on Data Science Central.
But this takes me to another discussion topic, maybe: if someone tries to replicate a study, it can easily happen that they reach a different conclusion... Anyway, how does it finally work? It is clear that things do work, since we can see a lot of applications. What makes us being confident in saying: this solution is fit for this problem?
These are interesting questions. If when you have two watches you can no longer be sure of the time, when you have three or more watches you appreciate the difficulties and subtleties of telling the time.
These days we certainly have many choices of platform, algorithm, etc., and like the person with many watches may find it harder to feel confident in saying we've found "the right answer" to a problem or even that there exists a single right answer.
Concerning this item:
So, now I have a bit bigger problem: do we have to include a distribution of algorithms and platforms in solving a ML/AI problem?
It makes sense to consider different algorithms. When we're early in the process and still getting a feel for which features are best, running through multiple algorithms is helpful for identifying important features: if we see the same features coming out on top again and again, we ought to study these features very closely. What we learn by studying these features can then better inform our selection of candidate algorithms, which then feeds into feature selection, additional discovery, etc., hopefully after some iterations of feedback eventually cycling into something that works well.
Another thing comes to mind: having competing answers can be a good thing. For example, home loan lenders in the United States typically obtain three credit scores for each applicant, one score from each credit reporting agency. These scores range from 300-850 and estimate the applicant's willingness to repay. The scores may differ between agencies (each agency uses a different model and may have data differences), so lenders take the middle of the three scores as being representative. One agency may report a much higher or lower score than the others without affecting the lending decision, whereas if only one agency was used, such outliers would change the outcome. Having competing answers helps us here.
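The middle-of-three rule is just a median, and a few lines show why it is robust. The scores below are invented for illustration, and the function name is hypothetical.

```python
from statistics import mean, median


def representative_score(scores):
    """Return the middle of three agency scores."""
    if len(scores) != 3:
        raise ValueError("expected exactly three scores")
    return median(scores)


# Made-up example: two agencies agree closely, one reports far lower.
scores = [700, 710, 560]
middle = representative_score(scores)
average = mean(scores)
print("middle score:", middle)
print("average score:", round(average, 1))
```

The middle score stays with the two agencies that agree, while the average is dragged toward the outlier, and a lender using only the third agency would see 560.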
Concerning this question:
What makes us being confident in saying: this solution is fit for this problem?
A few things come to mind. First: conceptual soundness. Consider for example the classic application (going back to the 1950s) of modeling the motion of cars on a freeway using the mathematical models for the motion of molecules in a gas. It must have seemed bizarre to many at first to treat cars as gas molecules (it struck me as strange when I first learned of this), but upon reflection you can see how drivers like to fill in gaps in the freeway, so there's some degree of soundness. Using credit score again as an example: it is conceptually sound to consider that someone with a history of paying all their bills on time might have a higher willingness to repay than someone who has often paid their bills late or perhaps defaulted on their obligations.
Second: outcomes analysis. The tests depend on the specifics, so we might be talking about accuracy or evaluation of rank-ordering or whatever is appropriate. It's hard to generalize here, but there will be some test or tests that evaluate the outcome, and if approach A does better than approach B, A may be preferred. Back testing is called for certainly. A good example from the financial industry is value-at-risk (VaR).
Third: sensitivity analysis. Vary several inputs simultaneously; look for unexpected big shifts. Check extreme values. If approach A is more finicky than approach B, prefer approach B. Or you may find that approach A works better over a certain domain, approach B over a different domain. Perhaps a given approach has a limitation, e.g., model A works only in a stable interest rate environment, model B works over interest rates as long as they are non-negative.
Fourth: ongoing monitoring. Maybe approach A worked just fine when first adopted but its performance degraded over time. For example, maybe the movie classification algorithm worked well for movies up through the early 1990s, but as movies changed the algorithm struggled with edge cases like Men In Black (it has a lot of jokes, maybe it's a comedy, but it has aliens, maybe it's Sci-Fi). Always have some kind of ongoing monitoring to see if a solution needs re-tuning or replacement.
This is a very interesting topic; a lot could be said here.
Hopefully this is helpful. Best of luck, Iosiv!
Thank you very much Justin!
Very helpful indeed.
I hope I'll be able to talk about the subject as easily as you do. It's definitely a very large field, requiring more knowledge than other fields.
Now we have technology to do whatever we want. We just need skills. And the period needed to learn the field cannot be too short from what I've seen so far.
© 2020 Data Science Central ® Powered by