Published in 2013, this book is still very interesting and differs from most data science books. Authors: Ian Langmore and Daniel Krasner. It focuses more on the *statistics* end of things, while also getting readers started with (basic) programming and command-line skills. It doesn't, however, cover much of what you would expect to see from the *machine learning* end of things.

*Source for picture: check page 68 in the book.*

You can download the book here. For other related books, check out our recommended reading list.

**Contents**

**I Programming Prerequisites**

**1 Unix**

- History and Culture
- The Shell
- Streams
- Standard streams
- Pipes
- Text
- Philosophy
- In a nutshell
- More nuts and bolts
- End Notes

**2 Version Control with Git**

- Background
- What is Git
- Setting Up
- Online Materials
- Basic Git Concepts
- Common Git Workflows
- Linear Move from Working to Remote
- Discarding changes in your working copy
- Erasing changes
- Remotes
- Merge conflicts

**3 Building a Data Cleaning Pipeline with Python**

- Simple Shell Scripts
- Template for a Python CLI Utility

**II The Classic Regression Models**

**4 Notation**

- Notation for Structured Data

**5 Linear Regression**

- Introduction
- Coefficient Estimation: Bayesian Formulation
- Generic setup
- Ideal Gaussian World
- Coefficient Estimation: Optimization Formulation
- The least squares problem and the singular value decomposition
- Overfitting examples
- L2 regularization
- Choosing the regularization parameter
- Numerical techniques
- Variable Scaling and Transformations
- Simple variable scaling
- Linear transformations of variables
- Nonlinear transformations and segmentation
- Error Metrics
- End Notes

**6 Logistic Regression**

- Formulation
- Presenter’s viewpoint
- Classical viewpoint
- Data generating viewpoint
- Determining the regression coefficient w
- Multinomial logistic regression
- Logistic regression for classification
- L1 regularization
- Numerical solution
- Gradient descent
- Newton’s method
- Solving the L1 regularized problem
- Common numerical issues
- Model evaluation
- End Notes

**7 Models Behaving Well**

- End Notes

**III Text Data**

**8 Processing Text**

- A Quick Introduction
- Regular Expressions
- Basic Concepts
- Unix Command line and regular expressions
- Finite State Automata and PCRE
- Backreference
- Python RE Module
- The Python NLTK Library
- The NLTK Corpus and Some Fun things to do

**IV Classification**

**9 Classification**

- Quick Introduction
- Naive Bayes
- Smoothing
- Measuring Accuracy
- Error metrics and ROC Curves
- Other classifiers
- Decision Trees
- Random Forest
- Out-of-bag classification
- Maximum Entropy

**V Extras**

**10 High(er) performance Python**

- Memory hierarchy
- Parallelism
- Practical performance in Python
- Profiling
- Standard Python rules of thumb
- For loops versus BLAS
- Multiprocessing Pools
- Multiprocessing example: Stream processing text files
- Numba
- Cython
