Top Mistakes Developers Make When Using Python for Big Data Analytics

Interesting article by Karolina Alexiou.

Regarding mistake #1, I disagree. I reinvent the wheel all the time, and it's faster than finding, understanding, and fine-tuning an existing piece of code that will work for you, unless you are looking for something basic such as computing correlations for weighted observations. If you are good, your reinvented wheel will be better, more robust, more customized, and implemented faster than existing ones.
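
To give a sense of what "something basic" looks like, here is a minimal sketch of a weighted Pearson correlation in plain Python. The function name `weighted_corr` and the choice to normalize by the sum of the weights are my own assumptions for illustration; this is not code from the original article, and a library routine may use a different convention.

```python
import math

def weighted_corr(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Weighted means of each variable.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances, all normalized by the same total.
    cov_xy = sum(wi * (xi - mean_x) * (yi - mean_y)
                 for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov_xy / math.sqrt(var_x * var_y)

# y is an exact linear function of x, so the result should be very close to 1
# whatever the weights are.
r = weighted_corr([1, 2, 3, 4], [3, 5, 7, 9], [1, 1, 2, 4])
```

With equal weights this reduces to the ordinary Pearson correlation, which is a quick sanity check before trusting it on real data.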

Actually, the mistakes below apply to any language, not just Python.

Here's the original list:

  1. Reinventing the wheel
  2. Not tuning for performance
  3. Not understanding time and timezones
  4. Manual integration with heavier technologies or other scripts
  5. Not keeping track of data types & schemata
  6. No data provenance tracking
  7. No (regression) testing
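
As an illustration of the time zone item in the list above, here is a small sketch of how two identical wall-clock readings can be hours apart once their offsets are taken into account. The fixed offsets and the helper `to_utc` are hypothetical, chosen only for the example; real zones have daylight saving rules, for which the standard `zoneinfo` module is probably a better fit.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets, for illustration only (no daylight saving).
UTC_PLUS_1 = timezone(timedelta(hours=1))
UTC_MINUS_5 = timezone(timedelta(hours=-5))

def to_utc(local_dt, tz):
    """Attach a known offset to a naive timestamp and convert it to UTC."""
    if local_dt.tzinfo is not None:
        raise ValueError("expected a naive datetime")
    return local_dt.replace(tzinfo=tz).astimezone(timezone.utc)

# Two log entries showing the same wall-clock time, recorded in different zones.
a = to_utc(datetime(2015, 2, 21, 5, 15), UTC_PLUS_1)
b = to_utc(datetime(2015, 2, 21, 5, 15), UTC_MINUS_5)

# Once both are in UTC, the real gap between the events becomes visible.
gap_hours = (b - a).total_seconds() / 3600
```

Treated as naive timestamps, the two entries look simultaneous; converted to UTC, they are six hours apart. Storing everything in UTC and converting only for display is one common way to avoid this class of bug.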

I would also add:

  1. Not refreshing lookup tables with the correct frequency
  2. Writing obfuscated code
  3. Ignoring special characters and formatting, especially for imported data, in NLP applications
  4. Running into collisions with hash table indexes
  5. Letting multiple users access generated files at the same time, with no file lock
  6. Not creating a data dictionary
  7. Creating lookup tables that are too big, too small, or not optimized
  8. Creating hash tables that mix many small values with a few extremely long ones
  9. Writing code that can't easily be restarted (with one click) when crashing, with little or no data loss
  10. Poor error handling
  11. Poor encapsulation
  12. Poor load balance when splitting a task via a Map-Reduce framework
  13. Poor joins, too many joins, joins that are not optimized
  14. Not outputting a detailed log (with time stamps) showing progress when code is run (to help with debugging and fine-tuning performance)
  15. Meaningless or poorly chosen variable or file names
  16. System calls (to be avoided), and poor process / scheduling management
  17. Poor memory management (not de-allocating all memory as needed)
  18. Errors with pointers
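
Items 9 and 14 in the list above (restartable code, and a time-stamped progress log) can be sketched together. The example below is only one possible approach, under assumptions of my own: the checkpoint is a small JSON file, the "job" just sums numbers, and the names `load_checkpoint`, `save_checkpoint`, `run`, and the `fail_at` parameter (which simulates a crash) are all hypothetical.

```python
import json
import logging
import os
import tempfile

# Every log line carries a time stamp, which helps debugging and tuning.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO
)
log = logging.getLogger("job")

def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_index": 0, "total": 0}

def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)

def run(records, path, fail_at=None):
    """Sum the records, checkpointing after each one; fail_at simulates a crash."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(records)):
        if i == fail_at:
            raise RuntimeError(f"simulated crash at record {i}")
        state["total"] += records[i]
        state["next_index"] = i + 1
        save_checkpoint(path, state)
        log.info("processed record %d of %d", i + 1, len(records))
    return state["total"]

# First run crashes part-way; the second run resumes where the first stopped.
path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
records = [10, 20, 30, 40]
try:
    run(records, path, fail_at=2)
except RuntimeError:
    log.warning("job crashed; restarting from checkpoint")
total = run(records, path)
```

The restart is just a second call with the same arguments, and no record is counted twice because the checkpoint is written only after a record has been fully processed. Checkpointing after every record is wasteful for large jobs; in practice you would probably save every N records or every few minutes.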

Comment by jaap Karman on February 21, 2015 at 5:15am

All APIs are some kind of system call. Unless you are some kind of superhero capable of doing the impossible, you are using: 16 System calls (NOT to be avoided)

The real failure is not understanding the system, or not having a guide on how to use it.

NEW 16: Ignoring the system you are using and the technical details of how to use it correctly.
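
If "system calls" in item 16 is read as shelling out to external programs (an assumption; neither the post nor the comment defines the term), then using the system correctly might look something like the sketch below. The wrapper name `run_external` and the default timeout are hypothetical. It relies only on the standard `subprocess` module, passes arguments as a list rather than through a shell, and turns failures and timeouts into explicit errors.

```python
import subprocess
import sys

def run_external(args, timeout_s=10.0):
    """Run an external command without a shell and return its stdout."""
    try:
        result = subprocess.run(
            args,
            capture_output=True,  # collect stdout and stderr
            text=True,            # decode bytes to str
            check=True,           # raise if the exit code is non-zero
            timeout=timeout_s,    # do not hang forever
        )
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"command failed with exit code {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"command timed out after {timeout_s} s") from exc
    return result.stdout

# Use the current Python interpreter as a portable stand-in for any external tool.
out = run_external([sys.executable, "-c", "print(6 * 7)"])
```

The point is less the wrapper itself than the habit: check the exit code, bound the run time, and keep the error output, so that a failing external step does not pass silently.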

© 2017 Data Science Central