Thanks very much for this resource - I've been following along with the regression exercise to learn (or re-learn) Pandas and scikit-learn. I wanted to pass along some suggested corrections to the text. Most of these are nitpicky things like typos. Page numbers below are for the book PDF:
Page 6: "mathplotlib" should be "matplotlib"
Page 7: "that we see in Tale structures" -> "that we see in Pandas structures"
Page 11: Change "comprises of" to "comprises" or "consists of"
Page 15: Use the same dash for "Exploratory data analysis - numerical" and "Exploratory data analysis - visual"
Page 16: The outline for classification code is in fixed-width font, where regression uses normal font. Would be good to use the same font for both for consistency
Page 17: Remove the redundant second sentence:
describe() provides summary statistics on all numeric columns. describe() function gives descriptive statistics for any numeric columns using describe.
Page 19: It looks like there are two descriptions of histograms (one copied from the notebook, one reworded).
Page 22: "heatmaps for co-relation" should be "Heatmaps for correlation"
Page 22: The text in the Python notebook refers to heatmaps (and a comment in the notebook notes "We can extend this sort of analysis by creating a heatmap"). But, no heatmap is shown. It would be good to include drawing a heatmap. The heatmap shows correlation of features with themselves (of course), as well as strong negative correlation between LSTAT and target. These lines generate a pretty heatmap with labels:
correlation = grouped_df.corr()
sns.heatmap(correlation, vmin=-1, vmax=1)
Alternatively, this generates a fairly good-looking heatmap directly from matplotlib, without the seaborn dependency:
correlation = grouped_df.corr()
fig, ax = plt.subplots()
im = ax.imshow(correlation, cmap='hot', interpolation='nearest')
ax.set_xticks(range(len(correlation.columns)))
ax.set_xticklabels(correlation.columns)
ax.set_yticks(range(len(correlation.columns)))
ax.set_yticklabels(correlation.columns)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
Page 22: The section "Analysing the target variable" is a little confusing. It could be broken up into several sentences, or generally reworked.
Page 23: Need a space after `k-nearest neighbour`
Page 24: "skicit learn" should be "scikit-learn"
Page 31: Need to correct this sentence: "R² is always between 0 and 1 or between 0% to 100%." Suggested replacement: "R² has a maximum (best) value of 1, but can be negative". (A simple example shows that it can be negative, and can even be less than -1: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r...)
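As a quick demonstration that R² can be negative: any model whose predictions are worse than simply predicting the mean of the targets gets R² below zero (the values here are made up purely for illustration):

```python
from sklearn.metrics import r2_score

# True values have mean 2; a constant prediction of 10 is far worse
# than predicting the mean, so R² comes out strongly negative
score = r2_score([1, 2, 3], [10, 10, 10])
print(score)
```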
Page 50: "`CHAS` faeture" should be `CHAS` feature
Page 51: For consistency with the name boston_X, I'd suggest renaming the variable boston_y to boston_Y
Page 54: "lets start by removing" should be "Let's start by removing"
Page 55: "first lets look" should be "First, let's look"
Page 55: "research waht" should be "research what"
Page 55: The section of code for re-scaling values could be simplified a bit, avoiding the need to define and call the function scale_numeric:
# a good exercise would be to research what StandardScaler does - it is from the scikit-learn library
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
boston_X[numeric_columns] = scaler.fit_transform(boston_X[numeric_columns])
# here we can see the result
Page 56: "more complicate" should be "more complicated"
Page 57: "metrics are dervied" should be "metrics are derived"
Page 58: Might want to replace EDA with "exploratory data analysis (EDA)" here, since the acronym hasn't been defined in the book.
General feedback: The manuscript switches between British and American spellings for some words (normalise/normalize, optimise/optimize, standardise/standardize). Either spelling is correct, but it would look better to see consistent spelling (either British-style or American-style) throughout.
Following up - here are some other errata (or just nitpicks) with the "Classification" section:
General note: The size of the page numbers varies from page to page (for instance, page 37 has a larger page number than pages 36 or 38).
I'm not sure how that happened but hopefully it's readily fixable!
Page 34: In the formula for accuracy, "FalseNegatives" should be "TrueNegatives"
Change "cross validation ex" to "cross validation, e.g." or "cross validation such as"
Page 41: It might be worth noting that the LSTAT_2 feature was introduced in the regression example (Boston house prices data-set), not the classification example (breast cancer data-set)
Page 61: In the current Python notebook, X.describe(include = 'all') is silently discarding the output. Could either print it - print(X.describe(include = 'all')) -
or break things up so that the call to describe() is the last statement in the block.
Page 61: For consistency, I'd suggest using variable names X and Y, rather than upper-case X and lower-case y
(Similarly for X_train / y_train, X_test / y_test)
Page 62: "positilvey" should be "positively"
Page 63: Suggest using np.round($PERCENTAGE,1) instead of just np.round($PERCENTAGE). The latter returns 37.0 for Y, Y_train, and Y_test; the former gives the close-but-not-identical values of 37.0, 37.4, 36.8
Page 65: For formatting, update this line:
"""### Test Alternative Models
"""### : Test Alternative Models
"""#### Test Alternative Models
Page 66: The PDF is okay, but the Python notebook switches from variable name "log_clf" to "logistic" midstream
Page 66: Change "lets do" to "let's do"
Page 66: The PDF is okay, but the Python notebook is missing final parens to call mean():
cross_val_score(rnd_clf, X, y, cv=5, scoring="accuracy").mean()
Page 67: "Indeed, Majority" to "Indeed, majority"
Page 67: Can tidy up this commented-out line of code:
# for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
Page 68: "regularisation penelty" should be "regularisation penalty"
Page 68: The 'l2' penalty is not available by default for the logistic regression classifier. On my system, I needed to override the 'solver' parameter to the constructor, in order for the l2 penalty to be usable during the randomized hyperparameter search. I'm not sure why this isn't a problem on colab, but I imagine it might be using some earlier build (or different configuration).
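One workaround that worked for me (a sketch, assuming a scikit-learn version whose default solver rejects some penalty values) is to pass solver='liblinear' explicitly, since liblinear accepts both the 'l1' and 'l2' penalties:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# 'liblinear' supports both 'l1' and 'l2', so a randomized search
# over penalty values won't raise a solver/penalty compatibility error
log_clf = LogisticRegression(solver='liblinear', penalty='l2', max_iter=1000)
log_clf.fit(X, y)
print(log_clf.score(X, y))
```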