This book is a “handbook” of sorts on data science and data scientists, and it contains information not found in traditional statistics, programming, or computer science textbooks. The author has compiled what he considers some of the most important information you will need for a career in data science, based on his 20+ years as a leader in the field. Much of the text was initially published over the last three years on the Data Science Central website, which is read by millions of visitors. The book shows how data science differs from related fields and the value it brings to organizations using big data.
This book has three components: a multi-layer discussion of what data science is and how it relates to other disciplines; technical applications of and for data science, including tutorials and case studies; and career resources for practicing and aspiring data scientists. Numerous career and training resources are included (such as data sets, web crawler source code, data videos, and how to build APIs) so you can start practicing data science today and quickly boost your career. Decision makers will find information to help them decide how to build a better analytic team, whether and when they need specialized solutions, and which ones will work best for their needs.
Who This Book Is For
This book is intended for data scientists and related professionals (such as business analysts, computer scientists, software engineers, data engineers and statisticians) who are interested in shifting to big data science careers. It is also for the college student studying a quantitative curriculum with the goal of becoming a future data scientist. Finally, it is for managers of data scientists, and people interested in creating a startup business or consultancy around data science.
These readers will find valuable information throughout the book, and specifically in the following chapters:
What This Book Covers
The technical part of this book covers core data science topics including:
The focus is on recent technology, so you will not find material about older techniques such as linear regression, except for anecdotal references. These are discussed at length in all standard books. There is, in fact, some limited discussion of logistic-like regression in this book, but it is more about blending it with other classifiers and proposing a numerically stable, approximate algorithm; we mention that approximate solutions are often as good as the exact model, as no data fits a theoretical model perfectly.
Besides technology, the book provides useful career resources, including job interview questions (some are technical, some are not). Another important part is the case studies. Some have a statistical/machine learning flair, some have more of a business/decision science or operations research flair, and some have more of a data engineering flair.
Most of the time, I have favored topics that were posted recently and were very popular on Data Science Central (the leading community for data scientists), rather than topics that I am particularly attached to.
How This Book Is Structured
The book consists of three main sets of topics:
The book provides valuable career resources for potential and existing data scientists and related professionals (and their managers and their bosses) and, generally speaking, for all professionals dealing with ever bigger, more complex, and faster-flowing data. The book also provides data science recipes, craftsmanship, concepts (many of them original and published for the first time), and case studies illustrating implementation methods and techniques that have proven successful in various domains for analyzing modern data, either manually or automatically.
What You Need to Use This Book
The book contains a few code samples, in either R or Perl. You can download Perl from http://www.activestate.com/activeperl/downloads and R from http://cran.r-project.org/bin/windows/base/. If you use a Windows machine, I would first install Cygwin, a Linux-like environment for Windows. You can get Cygwin at http://cygwin.com/install.html. Python is also available as open source and has a useful library called pandas.
For most of the book, one or two years of college with some basic quantitative courses is enough for you to understand the content. The book does not require calculus or advanced math; indeed, it barely contains any mathematical formulas or symbols.
Yet some quite advanced material is described at a high level. A few technical notes spread throughout the book are for those who are more mathematically inclined and interested in digging deeper. Two years of college-level calculus, statistics, and matrix theory are needed to understand these technical notes. Some source code (R, Perl) and data sets are provided, but the emphasis is not on coding.
This mixture of technical levels offers the opportunity for you to explore the depths of data science without advanced math knowledge. (A bit like the way Carl Sagan introduced astronomy to the mainstream public.)