Here's the introduction.


This book is a type of “handbook” on data science and data scientists, and contains information not found in traditional statistical, programming, or computer science textbooks. The author has compiled what he considers some of the most important information you will need for a career in data science, based on his 20+ years as a leader in the field. Much of the text was initially published over the last three years on the Data Science Central website, which is read by millions of visitors. The book shows how data science is different from related fields and the value it brings to organizations using big data.

This book has three components: a multi-layer discussion of what data science is and how it relates to other disciplines; technical applications of and for data science, including tutorials and case studies; and career resources for practicing and aspiring data scientists. Numerous career and training resources are included (such as data sets, web crawler source code, data videos, and how to build APIs) so you can start practicing data science today and quickly boost your career. If you are a decision maker, you will find information to help you decide how to build a better analytic team, whether and when you need specialized solutions, and which ones will work best for your needs.

Who This Book Is For

This book is intended for data scientists and related professionals (such as business analysts, computer scientists, software engineers, data engineers and statisticians) who are interested in shifting to big data science careers. It is also for the college student studying a quantitative curriculum with the goal of becoming a future data scientist. Finally, it is for managers of data scientists, and people interested in creating a startup business or consultancy around data science.

These readers will find valuable information throughout the book, and specifically in the following chapters:

  • Data science practitioners will find Chapters 2, 4, 5, and 6 particularly valuable because these chapters contain material on big data techniques (clustering and taxonomy creation, with caveats) and on modern data science techniques such as combinatorial feature selection, hidden decision trees, and analytic APIs, as well as guidance on when Map-Reduce is useful. A number of case studies (fraud detection, digital analytics, stock market strategies, and more) are detailed enough to allow readers to replicate these analyses when facing similar data in their own work. However, the material is also explained in simple words, without spending too much time on technicalities, code, or formulas, to make it accessible to high-level managers.
  • Students attending computer science, data science, or MBA classes will find Chapters 2, 4, 5, and 6 valuable for their purposes. In particular, they will find more advanced material in Chapters 2, 4, and 5, such as practical data science methods and principles, most of it not found in textbooks or taught in typical college curricula. Chapter 6 also provides real life applications and case studies, including more in-depth technical details.
  • Job applicants will find resources about data science training and programs in Chapter 3. Chapters 7 and 8 provide numerous resources for job seekers including interview questions, sample resumes, sample job ads, a list of companies that routinely hire data scientists, and salary surveys.
  • Entrepreneurs who want to launch a data science startup or consultancy will find sample business proposals, startup ideas, and salary surveys for consultants in Chapter 3. Also, throughout the book, consultants will find discussions on improving communication in data science work, lifecycles of data science projects, book and conference references, and many other resources.
  • Executives trying to assess the value of data science, where it most benefits enterprise projects, and when architectures such as Map-Reduce are useful, will find valuable information in Chapters 1, 2, 6 (case studies) and 8 (sample job ads, resumes, salary surveys). The focus of these chapters is usually not technical, except, to a limited extent, in some parts of Chapters 2 and 6, where new analytic technologies are introduced.

What This Book Covers

The technical part of this book covers core data science topics including:

  • Big data and the challenges of applying traditional algorithms to big data (solutions are provided, for instance in the context of big data clustering or taxonomy creation)
  • A new, simplified, data science-friendly approach to statistical science, focusing on robust, model-free methods
  • State-of-the-art machine learning (hidden decision trees, combinatorial feature selection)
  • New metrics for modern data (synthetic metrics, predictive power, bumpiness coefficient)
  • Elements of computer science needed to build fast algorithms
  • Map-Reduce and Hadoop, including numerical stability of computations performed with Hadoop (last section of the book)
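To give a flavor of why numerical stability matters for computations split across many workers, here is a minimal Python sketch. It is not the book's own method (this introduction does not show it); it uses the standard Welford update for each chunk and the well-known pairwise merge formula to combine chunks, which mirrors a map step and a reduce step. The function names and the sample data are hypothetical and chosen only for illustration.

```python
def naive_variance(xs):
    """Textbook one-pass formula E[x^2] - E[x]^2.

    Subtracting two huge, nearly equal numbers loses most of the
    significant digits when the mean is large relative to the spread.
    """
    n = len(xs)
    s = sum(xs)
    s2 = sum(x * x for x in xs)
    return s2 / n - (s / n) ** 2


def partial_stats(xs):
    """'Map' step: summarize one chunk as (count, mean, M2) with Welford's update.

    M2 is the running sum of squared deviations from the mean.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge_stats(a, b):
    """'Reduce' step: combine two chunk summaries without revisiting the data."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2


def chunked_variance(chunks):
    """Population variance of all values across an iterable of chunks."""
    total = (0, 0.0, 0.0)
    for chunk in chunks:
        total = merge_stats(total, partial_stats(chunk))
    n, _, m2 = total
    return m2 / n


if __name__ == "__main__":
    # Hypothetical data: small spread around a very large mean.
    data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
    print("naive:  ", naive_variance(data))
    print("chunked:", chunked_variance([data[:2], data[2:]]))
```

The chunked version only ever works with deviations from a running mean, so it avoids the cancellation that affects the naive formula, and the per-chunk summaries can be computed independently and merged in any order.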

The focus is on recent technology, so you will not find material about old techniques such as linear regression, except for anecdotal references. These techniques are discussed at length in all standard books. There is some limited discussion of logistic-like regressions in this book, but it is more about blending them with other classifiers and proposing a numerically stable, approximate algorithm; approximate solutions are often as good as the exact model, because no data fits a theoretical model perfectly.

Besides technology, the book provides useful career resources, including job interview questions (some are technical, some are not). Another important part is the case studies. Some have a statistical/machine learning flair, some have more of a business/decision science or operations research flair, and some have more of a data engineering flair.

Most of the time, I have favored topics that were posted recently and were very popular on Data Science Central (the leading community for data scientists), rather than topics that I am particularly attached to.

How This Book Is Structured

The book consists of three main sets of topics:

  • What data science and big data are, and are not, and how they differ from other disciplines (Chapters 1, 2, and 3, and the beginning of Chapter 4).
  • Career and training resources (Chapters 3 and 8).
  • Technical material presented as tutorials (Chapters 4 and 5, but also the section on Clustering and Taxonomy Creation for Massive Datasets in Chapter 2, and the section on New Variance for Hadoop and Big Data in Chapter 8), and in case studies (Chapter 7).

The book provides valuable career resources for potential and existing data scientists and related professionals (and their managers and their bosses) and, generally speaking, for all professionals dealing with increasingly bigger, more complex, and faster flowing data. The book also provides data science recipes, craftsmanship, concepts (many of them original and published for the first time), and case studies illustrating implementation methods and techniques that have proven successful in various domains for analyzing modern data, either manually or automatically.

What You Need to Use This Book

The book contains a few code samples, in either R or Perl. You can download Perl from http://www.activestate.com/activeperl/downloads and R from http://cran.r-project.org/bin/windows/base/. If you use a Windows machine, I would first install Cygwin, a Linux-like environment for Windows. You can get Cygwin at http://cygwin.com/install.html. Python is also available as open source and has a useful library called Pandas.

For most of the book, 1 or 2 years of college with some basic quantitative courses is enough for you to understand the content. The book does not require calculus or advanced math — indeed, it barely contains any mathematical formulas or symbols.

Yet some quite advanced material is described at a high level. A few technical notes spread throughout the book are for those who are more mathematically inclined and interested in digging deeper. Two years of calculus, statistics, and matrix theory at the college level are needed to understand these technical notes. Some source code (R, Perl) and datasets are provided, but the emphasis is not on coding.

This mixture of technical levels offers the opportunity for you to explore the depths of data science without advanced math knowledge. (A bit like the way Carl Sagan introduced astronomy to the mainstream public.)



Comment by Carroll Patton on January 10, 2018 at 6:57am

With all of this said, what is the name of your book?

Comment by Sree Ranjini Srivathsala on December 10, 2015 at 7:38am

I am interested in taking the DSA. According to the website, I need to read the book and work on the project.

But can someone tell me how can I publish the project work, that is not clear.

Highly appreciate if someone can guide on this, if you have taken this before.

Thanks & Regards,


Comment by Ananth Ananthapuram on March 26, 2014 at 7:36pm

Congratulations Vincent!

I have read the 1st chapter (sample) in Amazon and very curious to read the rest of the book, hence pre-ordered.  

Looking forward to the apprenticeship training.

Comment by Mahadevaswamy P L on March 21, 2014 at 2:18pm

Waiting for march 31st..

Comment by Pete Charette on March 6, 2014 at 4:17am

I pre-ordered this as well. Looking forward to the apprenticeship. 

Comment by Anzar Hasan on March 4, 2014 at 12:47pm

Waiting  for March 31st....

Comment by Themos Kalafatis on March 4, 2014 at 9:40am

Congratulations Vincent, I found many potentially interesting chapters… Looking forward to reading it.

Comment by Phil Simon on March 4, 2014 at 3:16am

Congratulations on the book!

Comment by Pawan Kumar on March 3, 2014 at 8:31am

I have already pre-ordered this book on Amazon. 

Comment by Rukshan Siriwardhane on March 2, 2014 at 6:46am

Thanks. Waiting to grab a copy.
