Here I compare these 5 rules, published in 1999, with the new 2014 version. Data has changed so much that the opposite rules are now followed. Yet many statisticians and big businesses still stick to the outdated rules.
These rules first appeared in the featured book (see picture), published in 1999, when software (e.g. SPSS) could not adapt to data; instead, data had to adapt to software.
The 1999 edition is priced at $72 on Amazon. The new version, published in 2012, is $59. Maybe the 1999 version is considered an antique, and thus commands a higher price. But in my opinion, these prices don't reflect demand and are not determined by the market, but rather by production costs. I bought the 1999 version for under $10, as a used book.
The 5 rules, in 1999
From the book, pages 15-18.
- All data must be numeric
- Each variable must occupy the same location for each case
- All codes for all variables must be mutually exclusive
- Each variable should contain maximum information
- For each case, there should be a numeric code for every variable
Failing to comply with these specifications would make software crash or behave erratically. It was also a time when variable names could not be longer than 8 characters (an SPSS requirement), and variables were usually named VAR001, VAR002 and so on. Now, descriptive, precise names are preferred. Imagine a data set with one million variables!
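To make these constraints concrete, here is a minimal sketch of the kind of fixed-column, all-numeric record the 1999 rules describe. The column layout, the meaning of each variable and the sample values are hypothetical, chosen only for illustration; they are not taken from the book.

```python
# Hypothetical fixed-column layout: each variable occupies the same
# character positions in every record, and every value is a numeric code.
LAYOUT = {
    "VAR001": (0, 3),  # case id, characters 1-3
    "VAR002": (3, 5),  # age in years, characters 4-5
    "VAR003": (5, 6),  # coded category, character 6
}

def parse_record(line):
    """Slice one fixed-width line into integers, keyed by variable name."""
    return {name: int(line[start:end]) for name, (start, end) in LAYOUT.items()}

# Two made-up records, six characters each.
records = ["001342", "002571"]
parsed = [parse_record(r) for r in records]
print(parsed)
```

In this sketch, a blank or non-numeric field makes `int()` raise a `ValueError`, which gives a rough idea of why every case needed a numeric code for every variable.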
Updated 5 rules, 2014
- Most data is not numeric, and not formatted (raw text such as user reviews)
- Fixed-length format is now obsolete: it wastes far too much space when the size of a field varies between 0 and 100 kilobytes, or when the field could be an image. Besides, using spaces as field separators is a terrible idea, as field values contain spaces themselves.
- Non-exclusive codes provide for a richer data set. Should someone be either Asian or Hispanic? Why not both?
- Don't ask for or collect overly detailed information, because of privacy issues. It is better to have 3 age groups than to ask for the exact age (today, people will lie). Also, any survey question should allow for answers such as "not available", "other", "do not want to answer".
- Numeric codes are not a good idea when the number of potential cases is very large (big data): you may run out of space if you use too few bytes to store them. They also make clustering of cases more difficult. Use a good tagging system instead; text-based tagging or coding works well and will help business analysts work more efficiently.
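As an illustration of the updated rules, the sketch below stores free text, variable-length fields and a set of non-exclusive text tags per case, and writes them with a quoted delimiter rather than fixed columns. The field names, sample cases and the `|` separator used inside the tags field are assumptions made for this example (it also assumes no tag contains `|`); this is one possible approach, not a prescribed format.

```python
import csv
import io

# Hypothetical cases: raw text, coarse age groups, an explicit opt-out
# answer, and a set of text tags instead of one exclusive numeric code.
cases = [
    {"id": "a1", "review": "Great phone, poor battery",
     "tags": {"asian", "hispanic"}, "age_group": "18-34"},
    {"id": "b2", "review": "",
     "tags": set(), "age_group": "do not want to answer"},
]

FIELDS = ["id", "review", "tags", "age_group"]

def to_csv(rows):
    """Write variable-length fields; csv quotes values containing commas."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(FIELDS)
    for row in rows:
        tags = "|".join(sorted(row["tags"]))
        writer.writerow([row["id"], row["review"], tags, row["age_group"]])
    return buf.getvalue()

def from_csv(text):
    """Read the rows back, turning the tags field into a set again."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["tags"] = set(row["tags"].split("|")) if row["tags"] else set()
        rows.append(row)
    return rows

round_trip = from_csv(to_csv(cases))

# Non-exclusive tags: select the cases tagged both "asian" and "hispanic".
both = [c["id"] for c in round_trip if {"asian", "hispanic"} <= c["tags"]]
print(both)
```

Because the tags are plain text and a case can carry several of them, the "why not both?" question becomes a simple subset test, and the review text can contain spaces or commas without breaking the record.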
Other than that, the book is still interesting. Back then, the two big statistical procedures were ANOVA and regression, according to the author (see page 197). The book also contains a section with advice that still applies today (page 290):
- Get comfortable with your data
- Thoroughly explore your data, twice!
- Sometimes pictures speak louder than words
- Replication is under-emphasized and overdue
- Remember the difference between statistical significance and substantive significance
- Remember the difference between statistical significance and effect size
- Statistics do not speak for themselves
- Keep it simple when possible
- Use consultants
- Don't be too hard on yourself