
Of course, one way to win is to play by the rules and submit the best answer. But since most of these challenges are about predicting something, what about a candidate who creates 5 accounts with 5 different IP addresses and submits 5 different predictions to the same contest? Wouldn't he increase his odds of winning from 1 out of 10 to 1 out of 2? This could create professional cheaters who participate in many contests and regularly win. Since Kaggle claims to have 100,000 data scientists (does that include you?), it is possible that many accounts are duplicates.
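As a rough check on those odds, here is a minimal sketch. It assumes (and this is only an assumption, not something Kaggle publishes) that every entry is equally likely to win, that there are 10 distinct competitors, and that everyone else submits from a single account. The function name `win_probability` is made up for illustration.

```python
from fractions import Fraction


def win_probability(n_competitors: int, n_accounts: int) -> Fraction:
    """Chance that one person wins, ASSUMING every entry is equally likely to win.

    n_competitors counts distinct people, including the one with extra accounts.
    n_accounts is how many entries that person submits.
    """
    if n_competitors < 1 or n_accounts < 1:
        raise ValueError("both arguments must be at least 1")
    honest_entries = n_competitors - 1  # everyone else submits exactly once
    return Fraction(n_accounts, honest_entries + n_accounts)


print(win_probability(10, 1))  # 1/10
print(win_probability(10, 5))  # 5/14
```

Under these assumptions the odds go from 1 in 10 to 5 in 14 (about 36%), because the four extra accounts also enlarge the pool of entries. Real contests are not lotteries, so this is only an upper-level intuition, not a model of actual outcomes.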

What do you think? Are there any barriers in place to prevent this fraud from happening?

Disclaimer: I have never participated in a Kaggle competition. I am not one of the 100,000 Kaggle data scientists.

Replies to This Discussion

I disagree a bit. That's why you have a test dataset: it's not just ONE observation. I've never joined such a competition, but I bet this approach will actually work.

It would not really work. This is because the distribution of entries by someone who does not have a good model would be very different from the distribution of answers of someone with a good model. The exception is when it is possible to learn from the results of your submissions. In this case, every submission creates a piece of information (the score of that submission) that can be used to tune the guesses. This was the case in the Heritage Health competition: guesses could be used to probe the unknown response to get central tendencies for selected observation subsets. This was countered somewhat by doing the final scoring on a holdout sample. The fact that the top players joined together in teams instead of submitting separately shows that brainpower beats multiple submissions. Also, the fact that you can submit one answer per day and select your top submissions for the final scoring helps reduce the advantage of registering multiple times.
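To illustrate the holdout point, here is a small simulation sketch. Everything in it is an assumption made for illustration: binary labels, purely random guesses, a 100-row public leaderboard set, a 1,000-row holdout, and the function names `accuracy` and `simulate_leaderboard_picking`. It is not a description of how Kaggle or the Heritage Health competition actually scored entries.

```python
import random


def accuracy(guesses, labels):
    """Fraction of positions where the guess matches the label."""
    return sum(g == y for g, y in zip(guesses, labels)) / len(labels)


def simulate_leaderboard_picking(n_submissions=200, n_public=100,
                                 n_holdout=1000, seed=0):
    """Submit many random guesses and keep the one with the best public score.

    Returns (public accuracy, holdout accuracy) of the selected submission.
    """
    rng = random.Random(seed)
    # Hidden labels: one set for leaderboard feedback, one for final scoring.
    public_labels = [rng.randint(0, 1) for _ in range(n_public)]
    holdout_labels = [rng.randint(0, 1) for _ in range(n_holdout)]

    best_public, best_holdout = -1.0, 0.0
    for _ in range(n_submissions):
        # A "model" with no skill: independent coin flips for every row.
        public_guess = [rng.randint(0, 1) for _ in range(n_public)]
        holdout_guess = [rng.randint(0, 1) for _ in range(n_holdout)]
        public_score = accuracy(public_guess, public_labels)
        if public_score > best_public:
            best_public = public_score
            best_holdout = accuracy(holdout_guess, holdout_labels)
    return best_public, best_holdout


public_score, holdout_score = simulate_leaderboard_picking()
print(f"best public accuracy:  {public_score:.3f}")
print(f"its holdout accuracy:  {holdout_score:.3f}")
```

In this sketch the best-looking public score is well above 50% purely by chance, while the same submission stays close to 50% on the holdout, which is the sense in which many unskilled entries do not help once the final ranking uses unseen data.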

Additionally, several cash-prize competitions require the competitor to actually submit the source code. It is almost as if the host buys the license to use the top competitors' code or approach.

Still, the fictitious competitor you suggest could accumulate good results in many competitions and end up eligible for Kaggle Connect (the consulting platform). But as Harlan mentioned, the final ranking is evaluated on a holdout sample, which cripples attempts to overfit using the evaluation feedback.

In most of the competitions I participated in, I ended up moving up several positions in the final evaluation, probably because I never use the submission feedback in my models.

Actually, Kaggle has anticipated this and their official rules specifically state you cannot have duplicate accounts. I'm not sure how they audit this, but they are definitely aware of the potential for fraud.

Account duplication is easy to accomplish if you are a real data scientist with a fraud detection background. Typically, good quality duplication uses multiple IP addresses, multiple email addresses, etc. Read my article Botnets in the cloud: the new generation of spammers. For smart kids in Ukraine, where a $5,000 prize represents tons of money, the temptation to cheat could be high.

There should be a contest where the goal is to register the most accounts. The contest host would run algorithms to detect and delete duplicate accounts. The winner would be the one most successful at fooling those algorithms. The method used by the winner would be published.

I guess my point is that "a real data scientist with fraud detection background" would be highly educated, most likely with an advanced degree. So exactly why would a successful person like that, with very high earning potential, want to risk everything and commit a crime? Highly doubtful. Such a person could make more just playing it safe in his/her profession, or maybe on Wall Street. Smart kids in Ukraine probably don't have the data science skills necessary to pull off a Kaggle fraud.



Vincent Granville said:

Account duplication is easy to accomplish if you are a real data scientist with a fraud detection background. Typically, good quality duplication uses multiple IP addresses, multiple email addresses, etc. Read my article Botnets in the cloud: the new generation of spammers. For smart kids in Ukraine, where a $5,000 prize represents tons of money, the temptation to cheat could be high.

Yes, there is a potential for fraud; yes, Kaggle has measures in place to prevent it; and no, those provisions are probably not perfect.  If you're entering Kaggle contests as a way to feed your children, you may want to consider finding a job.  If you're entering Kaggle contests as a way to improve your modelling skills, cheaters are probably not going to hold you back.

Children - heck if they want to eat, they should be winning contests on their own, right?

Vincent, I don't really see the point in submitting multiple entries (unless it is to grab multiple prizes when there is a 1st, 2nd, 3rd, etc.).

If it were a draw, it would make sense to say multiple entries would increase your chances of being selected. But since most of the competitions are based on the best results, and you are allowed to resubmit a better result that supersedes your previous ones, I think this could even backfire, since a better result could come from any of your models.

As for cheating, I think most people with this kind of knowledge can find better use for their time.

And Mr. Daniel D. Gutierrez, I do believe there are a lot of smart kids in Ukraine with the data science skills necessary to pull off a Kaggle fraud...

 

If you were born into a wealthy family and never had to worry about where your next lunch will come from, and how you are going to get it, cheating on Kaggle might look like a ridiculous idea. For the 80% of the 7 billion people on Earth who were born in poverty, it is attractive to cheat on Kaggle for survival. Granted, only 1% of these poor people are smart enough to succeed, but that's still over 50,000,000 people. And interestingly, many Kaggle participants live in the poorest countries. And many who claim to be in the US could be fake.

One good thing about Kaggle when it started out was that it was a non-elitist opportunity. The only thing that mattered was your ability to solve problems: people living in poor countries without any other opportunity could compete. Now, with the closed competitions, Kaggle is becoming more and more an elitist community. I think that is too bad. But "cheating" or not, you still have to find the top solution to the problem. I think finding the top solution should be the only criterion. It is up to Kaggle to make sure they measure the winning solution in an accurate way. The holdout sample does that. So in order to cheat, you would have to figure out how to game the holdout sample. Other than breaking into the Kaggle database to steal the sample, I don't see any other effective way to cheat.

Vincent Granville said:

If you were born into a wealthy family and never had to worry about where your next lunch will come from, and how you are going to get it, cheating on Kaggle might look like a ridiculous idea. For the 80% of the 7 billion people on Earth who were born in poverty, it is attractive to cheat on Kaggle for survival. Granted, only 1% of these poor people are smart enough to succeed, but that's still over 50,000,000 people. And interestingly, many Kaggle participants live in the poorest countries. And many who claim to be in the US could be fake.


© 2020   Data Science Central ®   Powered by
