Thanks very much for this resource - I've been following along with the regression exercise to learn (or re-learn) Pandas and scikit-learn. I wanted to pass along some suggested corrections to the text. Most of these are nitpicky things like typos. Page numbers below are for the book PDF:


Page 6: "mathplotlib" should be "matplotlib"

Page 7: "that we see in Tale structures" -> "That we see in Pandas structure"

Page 11: Change "comprises of" to "comprises" or "consists of"

Page 15: Use the same dash for "Exploratory data analysis - numerical" and "Exploratory data analysis - visual"

Page 16: The outline for classification code is in fixed-width font, where regression uses normal font. Would be good to use the same font for both for consistency

Page 17: Remove the redundant second sentence:
describe() provides summary statistics on all numeric columns. describe() function gives descriptive statistics for any numeric columns using describe.

Page 19: It looks like there are two descriptions of histograms (one copied from the notebook, one reworded).

Page 22: "heatmaps for co-relation" should be "Heatmaps for correlation"

Page 22: The text in the Python notebook refers to heatmaps (and a comment in the notebook notes "We can extend this sort of analysis by creating a heatmap"), but no heatmap is shown. It would be good to include one. The heatmap shows the correlation of each feature with itself (of course), as well as a strong negative correlation between LSTAT and the target. These lines generate a pretty heatmap with labels:
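import seaborn as sns
correlation = grouped_df.corr()
# vmin and vmax are passed by keyword; recent seaborn releases reject them as positional arguments
sns.heatmap(correlation, vmin=-1, vmax=1)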

Alternatively, this generates a fairly good-looking heatmap directly from matplotlib, without the seaborn dependency:
import numpy as np
import matplotlib.pyplot as plt

correlation = grouped_df.corr()
fig, ax = plt.subplots()
# draw the correlation matrix as an image on the axes
ax.imshow(correlation, cmap='hot', interpolation='nearest')
# label both axes with the column names
ax.set_xticks(np.arange(len(correlation)))
ax.set_yticks(np.arange(len(correlation)))
ax.set_xticklabels(correlation.columns)
ax.set_yticklabels(correlation.columns)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
plt.show()

Page 22: The section "Analysing the target variable" is a little confusing. It could be broken up into several sentences, or generally reworked.

Page 23: Need a space after `k-nearest neighbour`

Page 24: "skicit learn" should be "scikit-learn"

Page 31: Need to correct this sentence: "R² is always between 0 and 1 or between 0% to 100%." Suggested replacement: "R² has a maximum (best) value of 1, but can be negative". (A simple example shows that it can be negative, and can even be less than -1: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r...)
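To make the point concrete, here is a minimal sketch that computes R² by hand from its usual definition, 1 - SS_res / SS_tot. The function name r2_by_hand and the toy values are my own for illustration, not taken from the book or the notebook:

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


r2_perfect = r2_by_hand([1, 2, 3], [1, 2, 3])   # perfect predictions
r2_mean = r2_by_hand([1, 2, 3], [2, 2, 2])      # always predicting the mean
r2_reversed = r2_by_hand([1, 2, 3], [3, 2, 1])  # worse than predicting the mean

print(r2_perfect, r2_mean, r2_reversed)  # 1.0 0.0 -3.0
```

A model that does worse than simply predicting the mean of the target gets a negative score, which is why the "between 0 and 1" wording doesn't hold.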

Page 50: "`CHAS` faeture" should be `CHAS` feature

Page 51: For consistency with the name boston_X, I'd suggest renaming the variable boston_y to boston_Y

Page 54: "lets start by removing" should be "Let's start by removing"

Page 55: "first lets look" should be "First, let's look"

Page 55: "research waht" should be "research what"

Page 55: The section of code for re-scaling values could be simplified a bit, avoiding the need to define and call the function scale_numeric:

from sklearn.preprocessing import StandardScaler

# a good exercise would be to research what StandardScaler does - it is from the scikit-learn library
scaler = StandardScaler()
boston_X[numeric_columns] = scaler.fit_transform(boston_X[numeric_columns])

# here we can see the result
boston_X[0:10]

Page 56: "more complicate" should be "more complicated"

Page 57: "metrics are dervied" should be "metrics are derived"

Page 58: Might want to replace EDA with "exploratory data analysis (EDA)" here, since the acronym hasn't been defined in the book.

General feedback: The manuscript switches between British and American spellings for some words (normalise/normalize, optimise/optimize, standardise/standardize). Either spelling is correct, but it would look better to see consistent spelling (either British-style or American-style) throughout.

Replies to This Discussion

Following up - here are some other errata (or just nitpicks) with the "Classification" section:

General note: The size of the page numbers varies from page to page (for instance, page 37 has a larger page number than pages 36 or 38). I'm not sure how that happened, but hopefully it's readily fixable!

Page 34: In the formula for accuracy, "FalseNegatives" should be "TrueNegatives"
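For reference, the corrected formula is (TruePositives + TrueNegatives) divided by the total number of predictions. A small sketch, with a hypothetical function name and made-up confusion-matrix counts chosen only for illustration:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that were correct."""
    # correct predictions are the true positives plus the true negatives
    return (tp + tn) / (tp + tn + fp + fn)


# hypothetical counts, not from the breast cancer notebook
acc = accuracy(tp=50, tn=40, fp=5, fn=5)
print(acc)  # 0.9
```

With FalseNegatives in the numerator instead of TrueNegatives, the same counts would give 0.55, so the typo changes the result substantially.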

Page 37: "cross validation ex" should be "cross validation, e.g." or "cross validation such as"

Page 41: It might be worth noting that the LSTAT_2 feature was introduced in the regression example (Boston house prices data-set), not the classification example (breast cancer data-set)

Page 61: In the current Python notebook, the output of X.describe(include = 'all') is silently discarded, because the call is not the last statement in its block. Could either print it - print(X.describe(include = 'all')) - or break things up so that the call to describe() is the last statement in the block.

Page 61: For consistency, I'd suggest using variable names X and Y, rather than upper-case X and lower-case y
(Similarly for X_train / y_train, X_test / y_test)

Page 62: "positilvey" should be "positively"


Page 63: Suggest using np.round($PERCENTAGE,1) instead of just np.round($PERCENTAGE). The latter returns 37.0 for Y, Y_train, and Y_test; the former gives the close-but-not-identical values of 37.0, 37.4, 36.8

Page 65: For formatting, update this line:
"""### Test Alternative Models
To this:
"""### : Test Alternative Models
Or:
"""#### Test Alternative Models

Page 66: The PDF is okay, but the Python notebook switches from variable name "log_clf" to "logistic" midstream

Page 66: Change "lets do" to "let's do"

Page 66: The PDF is okay, but the Python notebook is missing final parens to call mean():
cross_val_score(rnd_clf, X, y, cv=5, scoring="accuracy").mean()

Page 67: "Indeed, Majority" to "Indeed, majority"

Page 67: Can tidy up this commented-out line of code:
# for clf in (log_clf, rnd_clf, svm_clf, voting_clf):

Page 68: "regularisation penelty" should be "regularisation penalty"

Page 68: The 'l2' penalty is not available by default for the logistic regression classifier. On my system, I needed to override the 'solver' parameter to the constructor, in order for the l2 penalty to be usable during the randomized hyperparameter search. I'm not sure why this isn't a problem on colab, but I imagine it might be using some earlier build (or different configuration).
