
Free eBook: Applied Data Science (Columbia University)

Published in 2013, this book is still well worth reading and differs from most data science books. Authors: Ian Langmore and Daniel Krasner. It leans toward the statistics side of data science while also getting readers started on basic programming and command-line skills. It does not, however, cover much of the material you would expect from a machine learning text.

Source for picture: check page 68 in the book.

You can download the book here. For other related books, check out our recommended reading list.

Contents

I Programming Prerequisites 

1 Unix 

  • History and Culture
  • The Shell
  • Streams
  • Standard streams
  • Pipes
  • Text
  • Philosophy
  • In a nutshell
  • More nuts and bolts
  • End Notes

2 Version Control with Git 

  • Background
  • What is Git
  • Setting Up
  • Online Materials
  • Basic Git Concepts
  • Common Git Workflows
  • Linear Move from Working to Remote
  • Discarding changes in your working copy
  • Erasing changes
  • Remotes
  • Merge conflicts

3 Building a Data Cleaning Pipeline with Python

  • Simple Shell Scripts
  • Template for a Python CLI Utility

II The Classic Regression Models

4 Notation

  • Notation for Structured Data

5 Linear Regression

  • Introduction
  • Coefficient Estimation: Bayesian Formulation
  • Generic setup
  • Ideal Gaussian World
  • Coefficient Estimation: Optimization Formulation
  • The least squares problem and the singular value decomposition
  • Overfitting examples
  • L2 regularization
  • Choosing the regularization parameter
  • Numerical techniques
  • Variable Scaling and Transformations
  • Simple variable scaling
  • Linear transformations of variables
  • Nonlinear transformations and segmentation
  • Error Metrics
  • End Notes

6 Logistic Regression

  • Formulation
  • Presenter’s viewpoint
  • Classical viewpoint
  • Data generating viewpoint
  • Determining the regression coefficient w
  • Multinomial logistic regression
  • Logistic regression for classification
  • L1 regularization
  • Numerical solution
  • Gradient descent
  • Newton’s method
  • Solving the L1 regularized problem
  • Common numerical issues
  • Model evaluation
  • End Notes

7 Models Behaving Well

  • End Notes

III Text Data

8 Processing Text

  • A Quick Introduction
  • Regular Expressions
  • Basic Concepts
  • Unix Command line and regular expressions
  • Finite State Automata and PCRE
  • Backreference
  • Python RE Module
  • The Python NLTK Library
  • The NLTK Corpus and Some Fun things to do

IV Classification

9 Classification

  • Quick Introduction
  • Naive Bayes
  • Smoothing
  • Measuring Accuracy
  • Error metrics and ROC Curves
  • Other classifiers
  • Decision Trees
  • Random Forest
  • Out-of-bag classification
  • Maximum Entropy

V Extras

10 High(er) performance Python 

  • Memory hierarchy
  • Parallelism
  • Practical performance in Python
  • Profiling
  • Standard Python rules of thumb
  • For loops versus BLAS
  • Multiprocessing Pools
  • Multiprocessing example: Stream processing text files
  • Numba
  • Cython
