Tips by 4 DataHack Winners
Nalin Pasricha, DataHack Rank 1
Nalin is an investment banker turned data scientist who currently works as an independent consultant.
He has participated in 17 hackathons at DataHack. He won Data Hackathon 3.x and emerged as the 1st Runner Up in Black Friday DataHack.
Here’s what Nalin has to say:
- Our mind works subconsciously at night on our problems in a very powerful manner. So I try to start work on the problem as early as possible so that my mind has at least one night to work subconsciously on the problem.
- Read inspirational books or watch inspiring videos during a competition. I think it really helps your mind go beyond its usual limits. I remember reading ‘The Wright Brothers’ by David McCullough during one hackathon. It’s the story of two brothers who were only bicycle manufacturers. They had not attended college and had no funding, yet they managed to build the world’s first aeroplane, beating top scientists and universities. I did really well in that hackathon mainly because this book changed my mindset.
- Try to use a package or language that is new to you. It’ll make you think differently and spur your creativity. I normally use R, but when I try to use Python instead I think I come up with unusual solutions.
Sudalai Rajkumar (SRK), DataHack Rank 2
SRK is a Senior Data Scientist at Tiger Analytics. He is currently positioned at Kaggle Rank 23 and holds the Grandmaster title on Kaggle. He is an inspiration for most of the aspiring data scientists in our community.
Here’s what SRK has to say:
- Feature Engineering – The first and most important thing. We need to concentrate a lot on this, since it makes a huge difference in the scores.
- Solid Validation Strategy – Without this, competitions are more or less like gambling, so it is essential to have a proper local validation strategy. The public leaderboard (LB) can be misleading at times.
- Ensembling / Stacking – This is an important last step which helps us go that extra mile at the end.
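To make the ensembling tip concrete, here is a minimal sketch of the simplest form of it: averaging the predictions of two models on a held-out set. It is not SRK's own method. The numbers, the two "models" and the `rmse` and `blend` helpers are all hypothetical, and RMSE is only assumed as the competition metric.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, assumed here as a stand-in competition metric."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def blend(predictions, weights=None):
    """Weighted average of several models' predictions (one list per model)."""
    if weights is None:
        weights = [1.0 / len(predictions)] * len(predictions)
    return [sum(w * p for w, p in zip(weights, row)) for row in zip(*predictions)]

# Hypothetical held-out targets and predictions from two made-up models
# whose errors point in opposite directions.
y_valid = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 38.0]
model_b = [9.0, 23.0, 28.0, 43.0]

blended = blend([model_a, model_b])
scores = {
    "model_a": rmse(y_valid, model_a),
    "model_b": rmse(y_valid, model_b),
    "blend": rmse(y_valid, blended),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In this toy case the blend scores better than either model because their errors partly cancel out. On real data the gain is usually much smaller, and it should be measured on your local validation set rather than on the public leaderboard.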
Rohan Rao, DataHack Rank 5
Rohan is the Lead Data Scientist at AdWyze. He is currently positioned at Kaggle Rank 70 and holds the prestigious Kaggle Master title. He has represented and brought laurels to India in World Sudoku championships.
Here’s what Rohan has to say:
- Understand the Problem: Without understanding the problem statement, the data and the evaluation metric, most of your work is fruitless. Spend time reading as much as possible about them. Only once you are very clear about the objective can you proceed with exploration. I spend a good amount of time reading and re-reading all the available information. It usually helps me figure out an approach / direction before writing a single line of code.
- Summarize / Visualize Data: Data Science competitions are driven by data. It’s all about the data. Sometimes you can have a great problem statement but noisy data. Sometimes you can have really clean data but a tricky evaluation metric. Sometimes you might have a good model, but with skewed outliers. While there are huge advancements being made to automate a lot of this, there is still a lot of value in exploring data yourself. Cleaning data, handling outliers, transforming data, engineering features, etc. are all winning moves. I’ve found these to be major factors in Machine Learning projects. Feature engineering is the most useful output of data exploration. I believe that if you find the right and useful features, you can build a single powerful model that is better than any ensemble. Remember the Garbage In, Garbage Out philosophy: if you input noisy/unclean data into a model, no matter how powerful the model is, it will result in noisy output.
- Validation Framework: A lot of people jump into building models by dumping data into the algorithms. While that is useful for getting a sense of basic benchmarks, you need to take a step back and build a robust validation framework. Without validation, you are just shooting in the dark. You will be at the mercy of overfitting, leakage and other possible evaluation issues. By replicating the evaluation mechanism, you can make faster and better improvements by measuring your validation results, while making sure your model is robust enough to perform well on various subsets of the train/test data.
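As a rough illustration of the validation framework idea, the sketch below runs a hand-rolled k-fold cross-validation that scores each held-out fold with the same metric the competition would use. It is not Rohan's own setup. The toy data, the one-feature linear model and all function names are hypothetical, and RMSE is only assumed as the evaluation metric.

```python
import math
import random

def rmse(y_true, y_pred):
    """Replicate the (assumed) competition metric locally."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices once and split them into n_folds disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(x, y, fit, predict, metric, n_folds=5, seed=0):
    """Score each held-out fold with a model trained only on the other folds."""
    scores = []
    for valid_idx in k_fold_indices(len(x), n_folds, seed):
        held_out = set(valid_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [x[i] for i in valid_idx])
        scores.append(metric([y[i] for i in valid_idx], preds))
    return scores

def fit_line(x, y):
    """Least-squares fit for a single feature; returns (slope, intercept)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / var_x
    return slope, mean_y - slope * mean_x

def predict_line(model, x):
    slope, intercept = model
    return [slope * a + intercept for a in x]

# Hypothetical toy data: y = 3x + 1 plus a little Gaussian noise.
rng = random.Random(42)
x = [float(i) for i in range(50)]
y = [3.0 * a + 1.0 + rng.gauss(0, 0.5) for a in x]

fold_scores = cross_validate(x, y, fit_line, predict_line, rmse, n_folds=5)
mean_score = sum(fold_scores) / len(fold_scores)
spread = max(fold_scores) - min(fold_scores)
print(f"mean RMSE: {mean_score:.3f}, spread across folds: {spread:.3f}")
```

Looking at the spread across folds as well as the mean is one way to check that a model performs well on various subsets of the data. For time-ordered or grouped data a random shuffle like this can leak information, so the split itself should mirror how the test set was built.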
Shantanu Dutta, DataHack Rank 6
Shan is a Senior Associate at ACME. He is a self-taught data scientist who specializes in BFSI and marketing. So if you ever doubted that self-learning can make you a data scientist, you were wrong.
Shan has participated in 37 hackathons on DataHack. He won the Date Your Data and Re-date Your Data competitions.
Here’s what Shan has to say:
- Understand the Data: Do not worry about needing huge amounts of compute power; it is possible to do well in these competitions with moderate setups. Understand the data and generate a hypothesis. This part is important.
- Pre-processing & Feature Engineering: Spend a considerable amount of time on pre-processing and feature engineering. I have participated in many competitions, and no dataset is ever perfectly clean. There’s always some sort of inherent noise in the dataset that creates hiccups in models, such as missing values or outliers. Being able to visualize the data at each level of extraction will avoid many frustrations at the end.
- Algorithm Selection: Select the algorithm best suited to the data. Have confidence in your handcrafted cross-validation results.
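The missing values and outliers mentioned above can be handled in many ways. The sketch below shows just one simple option, assuming a single numeric column: fill the gaps with the median, then cap extreme values at the usual 1.5 × IQR fences. The column values and helper names are hypothetical, and this is not necessarily what Shan does.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def clip_outliers_iqr(values, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

# Hypothetical numeric feature with two missing values and one extreme outlier.
raw_feature = [42.0, 38.0, None, 45.0, 40.0, 41.0, None, 39.0, 900.0, 43.0]

filled = impute_median(raw_feature)
cleaned = clip_outliers_iqr(filled)

print("before:", raw_feature)
print("after: ", cleaned)
```

Printing the column before and after each step is a cheap way to keep the data visible at every stage. Whether capping, removing or keeping an outlier is the right call depends on the hypothesis you have formed about the data.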
Now, you have the winning potion. It’s time to test your winning habit. Use these tips in our upcoming competition.