
I will update this article regularly. An older version, with many interesting links, can be found here; none of the material presented here appears in that old version. This article is divided into 11 sections.

1. Hardware

A laptop is the ideal device. I've been using Windows laptops for years, and I always install a Linux-like layer on top of Windows (not a separate operating system), known as Cygwin. This way, you get the benefits of Windows (Excel, Word, compatibility with clients and employers, many apps such as FileZilla) together with the flexibility and pleasure of working with Linux. Note that Linux is a UNIX-like operating system. So the first recommended step (to start your data science journey) is to get a modern Windows laptop (under $1,000) and install Cygwin.

Even if you work heavily in the cloud (AWS, or in my case, access to a few remote servers, mostly used to store data, receive data from clients, and keep backups), your laptop is your core device to connect to all external services (via the Internet). Don't forget to do regular backups of important files, using services such as Dropbox.

2. Linux environment on Windows laptop

Once you have installed Cygwin, you can type commands or execute programs in the Cygwin console. Here's what the console looks like on my laptop:

 Figure 1: Cygwin (Linux) console on Windows laptop

You can open multiple Cygwin windows on your screen(s).

To connect to an external server for file transfers, I use the FileZilla freeware for Windows rather than the command-line ftp client offered by Cygwin. If you need full privileges on the remote machine, use PuTTY instead (for Telnet/SSH sessions).

You can run commands in the background using the & operator. For instance,

$ notepad VR3.txt &

will launch Notepad (the standard Windows text editor) from the Cygwin console, in a separate window, and open the file VR3.txt located in your current directory (if this file exists in that directory). Note the $ symbol preceding each command (see Figure 1). In addition, the console also displays the username (Vincent@Inspiron-Switch in my case) as well as the directory I'm in (/cygdrive/c/vincentg/ in Linux, corresponding to the C:\vincentg\ path under Windows). Basic operations:

  • Changing directory is performed with the command cd (examples: cd subfolder/subsubfolder, cd .. to go one level up, cd with no argument to go to your home directory) 
  • Listing the content of a directory is done with the command ls -l (note that -l is a command option used to specify that you want a full, detailed listing; without this option, the listing shown in Figure 1 would be far less detailed).
  • If you don't know which directory you are in, type the command pwd; it will tell you your location (path)

So far you've learned the following Linux concepts: the command line and the $ symbol (sometimes replaced by > depending on the Linux version), the & operator (for background processing), paths, the commands cd, pwd, and ls, command options (-l for ls) and shortcuts (.. for the cd command, meaning the parent directory).

A few more things about files

Files have an extension that indicates what kind of file it is (text, image, spreadsheet) and what software can open and process them. In Figure 1, VR3.txt has the .txt extension, meaning it's a text file - the most widespread type of data file. There are two types of files: binary (used by various programs; compressed/encrypted format) and text (can be processed easily by any program or editor). It is important to know the distinction when doing FTP file transfers (FTP clients allow you to specify the type of file, though it's automated and transparent to the user with FileZilla).
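If you are not sure whether a given file is text or binary, Perl's file test operators can tell you. A minimal sketch (the filename VR3.txt is just the example from Figure 1):

# check whether a file is text or binary, using Perl file test operators
$file = "VR3.txt";                     # replace with any filename
if (-T $file) { print "$file looks like a text file\n"; }
elsif (-B $file) { print "$file looks like a binary file\n"; }
else { print "$file not found\n"; }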

Other extensions include

  • .csv (comma-separated text file that you can open with Excel or Notepad; it can have more than 1 million rows),
  • .xlsx (Excel files, limited to about 1 million rows; this is a binary file), 
  • .gz (compressed files, thus binary files),
  • .png (a lossless image format, often the best choice for charts; other image formats include .gif, .jpg, .jpeg, and .bmp; these are binary files),
  • .docx (Word documents; binary),
  • .html (text files representing source code of a web page),
  • .sql (text file used to store an SQL query, used as input for some database clients such as Brio),
  • .php (PHP code, text format),
  • .pl (Perl code, text format),
  • .js (Javascript code, text format),
  • .r (R code, text format),
  • .py (Python code, text format),
  • .c (C code, text format),
  • .exe (Windows executable),
  • .xml (XML, text format for name-value pairs) 

Files are not stored exactly the same way in Windows and UNIX: in particular, Windows text files end each line with a carriage return followed by a line feed, while UNIX uses a line feed only. Also, some systems use Unicode for file encoding, which takes much more space but allows you to work with, e.g., Chinese characters (stored using two bytes per character). When processing such a file (they are rather rare, fortunately), you'll first need to clean it and standardize it to traditional ASCII (one byte = one character). 
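Here is a minimal cleaning sketch (the filenames raw.txt and clean.txt are placeholders): it removes Windows carriage returns and replaces any non-ASCII byte with a space. For a full Unicode conversion, a dedicated tool such as iconv (available in Cygwin) is a better choice.

# crude standardization of a text file to plain ASCII
open(IN, "<raw.txt");                  # input file (placeholder name)
open(OUT, ">clean.txt");               # cleaned output (placeholder name)
while ($line = <IN>) {
  $line =~ s/\r//g;                    # remove Windows carriage returns
  $line =~ s/[^\x00-\x7F]/ /g;         # replace each non-ASCII byte with a space
  print OUT $line;
}
close(OUT);
close(IN);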

Finally, the best text format that you can use is tab-separated: each column or field is separated by a TAB, an invisible character represented by \t in most programming languages. The reason is that some fields contain commas, so using csv (comma-separated text files) results in broken fields and data that looks like garbage and is hard to process (requiring a laborious cleaning step first, or asking your client to send tab-separated files instead).
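A quick illustration of the problem, using a made-up record whose address field contains a comma: the comma-separated version splits into the wrong number of fields, while the tab-separated version splits cleanly.

# same 3-field record, comma-separated versus tab-separated
$csv = "Vincent,Seattle, WA,Data Scientist";       # the address field contains a comma
$tsv = "Vincent\tSeattle, WA\tData Scientist";
@csvfields = split(/,/, $csv);
@tsvfields = split(/\t/, $tsv);
print "csv: ", $#csvfields + 1, " fields (expected 3)\n";   # prints 4
print "tsv: ", $#tsvfields + 1, " fields (expected 3)\n";   # prints 3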

When processing data, the first step is to produce a data dictionary. It is easily done using a scripting language (see section 4).
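As a preview of section 4, here is a minimal data dictionary sketch in Perl. It assumes a tab-separated input file data.txt with a header row (both the filename and the output name dictionary.txt are placeholders), and counts how many times each value occurs in each field.

# minimal data dictionary: count occurrences of each (field, value) pair
open(IN, "<data.txt");                 # tab-separated input with a header row
$header = <IN>;
$header =~ s/\n//g;
@fields = split(/\t/, $header);        # field names from the header row
%dict = ();
while ($line = <IN>) {
  $line =~ s/\n//g;
  @values = split(/\t/, $line);
  for ($k = 0; $k <= $#values; $k++) {
    $dict{"$fields[$k]\t$values[$k]"}++;   # hash table indexed by field name and value
  }
}
close(IN);
open(OUT, ">dictionary.txt");          # output: field, value, count
foreach $key (sort keys %dict) {
  print OUT "$key\t$dict{$key}\n";
}
close(OUT);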

File management

Filenames should be designed carefully (no spaces or special characters in a filename), especially when you have thousands or millions of files across thousands of directories and sub-directories, and across dozens of servers (the cloud). One of the two core components of Hadoop is actually its distributed file system, known as HDFS (the other component being the distributed MapReduce architecture used to process tasks).

It's always a good idea to have a time stamp embedded in the file name, representing the creation date. Note that in Figure 1, the files all start with VR, an abbreviation for Vertical Response, as these files come from (or relate to) our email service provider, called Vertical Response. File names should be very detailed: keep in mind that sooner rather than later, you might run scripts to process millions of them. Without proper naming conventions, this task will be impossible. 
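A minimal sketch of how to generate such a time-stamped filename in Perl (the VR_report prefix is just an example following the naming convention above):

# build a filename with an embedded creation date, e.g. VR_report_20140807.txt
($sec, $min, $hour, $mday, $mon, $year) = localtime(time);
$stamp = sprintf("%04d%02d%02d", $year + 1900, $mon + 1, $mday);
$filename = "VR_report_$stamp.txt";
open(OUT, ">$filename");               # create the time-stamped file
print OUT "created $stamp\n";
close(OUT);
print "created $filename\n";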

A final word: if you look at Figure 1, the first column indicates who can read (r), write (w) or execute (x) these files, besides me. It's never been an issue on Windows for me, but on a true UNIX operating system (not Cygwin), you might want to set the right protections: for example, Perl scripts (despite being text) must be made executable with the UNIX command chmod 755 filename.pl, where filename.pl is your Perl script. File protections (and locks) are important in organizations where files can be shared by many users, sometimes simultaneously.

3. Basic UNIX commands

You don't need to spend hours learning UNIX and buy 800-page books on the subject. The following commands will get you started, once you have your Cygwin console:

  • cd, pwd, ls (see section 2)
  • tail -100, head -150 to extract the last 100 or first 150 rows of a file
  • cp, mv, mkdir, rmdir respectively copy a file to another location, move or rename a file, create a new directory, and delete a directory (the directory must be emptied of all files first)
  • sort, uniq respectively sort a file and remove duplicate entries among adjacent rows - which is why you usually sort first (you can sort alphabetically or numerically depending on the option; the default is alphabetical order)
  • gzip, gunzip: compress and uncompress files
  • wc: count the number of lines, words, and characters in a text file
  • grep: identify all rows containing a specific string in a text file (it helps to be familiar with regular expressions)
  • cat: display content of text file on your screen
  • chmod: change file protections, see section 2
  • history: lists the last commands you used, as it is very common to re-use the same commands all the time.
  • cron, crontab: to automatically schedule tasks (running an executable once a day)

Operators include > (to save output to a new file), >> (to append output to an existing file), | (the pipe operator, see examples), & (see section 2, used for background or batch mode when executing a command), * (see examples) and ! (see examples).

Examples 

  • sort filename.txt | uniq -c > results.txt (sort filename.txt alphabetically - not numerically - then remove duplicates, and for each remaining entry count the number of duplicates with the option -c; store the results in results.txt)
  • rm -i test*.txt (remove all files starting with test and with extension .txt; the option -i requests manual confirmation before each file gets deleted)
  • grep 'abc' test.txt | wc -l (extract all rows containing abc in test.txt, then count these rows with wc -l)
  • !545 (run command #545, after you run the command history to get the list of previously entered commands)

Check out the details for these commands (exact syntax and options), for instance with the man command (e.g., man sort).

Miscellaneous

Shell scripts (or batch files) are small programs that execute a list of commands, and can be run in batch mode. For regular expressions, see section 4.

4. Scripting language

You can get started in data science with just a few UNIX commands, a tool for statistical analyses such as R (unless you write your own algorithms to get more robust and simpler tools), and a scripting language such as Perl or Python. Python (together with the Pandas library) is the most popular language for data science. Python and machine learning resources are provided later in this article. This article is a good introduction to Python for data science. This reference has tons of resources about Python for data science.

Here I describe fundamental features of Perl, but they apply to all scripting languages. You can download Perl from ActiveState. Numerous short programs (Perl, but also R), easy to read and understand, can be found here. Perl scripts are text files with a .pl extension (say myprogram.pl) that you can execute in the Cygwin console with the command line perl myprogram.pl once you have installed Perl on your laptop.

Our choice of Perl in this tutorial is based on its ease of use:

  • You don't need to worry about memory allocation until you create hash tables with 10+ million entries, that is, when you process data sets with hundreds of millions of rows - and even then, there are simple Hadoop-style workarounds
  • Perl is very flexible, allowing you to think about your algorithm, rather than the code, when you write a program: coding in Perl is as easy as writing a paper
  • No rigid syntax: I've never seen a language causing so few errors. In short, it means very little time spent debugging code; you don't even need to understand object-oriented programming (OOP) to code in Perl, though writing re-usable functions (stored in home-made Perl libraries) with encapsulated code (private variables) and mandatory arguments for external calls is a very good programming practice (and these are core concepts of OOP).
  • Very little overhead: no need to declare all variables (though it is a good idea to do so) or worry about data types - type casting is automated; this allows for very fast prototyping and testing of new algorithms

In short, it's a great language for learning data science, though not so great if you work in a big team and have to share, integrate and update various pieces of Perl code from various coders. Nevertheless, Perl is very powerful, can be blended with other languages (all languages can nowadays), and I still perform all my consulting with Perl. Perl used to be the only language with great string processing functions, and able to handle regular expressions easily - an advantage over other languages, for text processing or text mining. However, other languages have caught up, and Python is now just as good.

Perl is an interpreted language, which means that you don't need to compile Perl programs. This can potentially slow down execution a little bit, but in my experience, most of what I developed in Perl runs 10 to 100 times faster (without loss of accuracy) than what I've seen in the corporate world, mostly thanks to developing better algorithms and using fewer (but better, more predictive) metrics, and fewer observations (samples). These algorithms are listed at the bottom of this article - an example (in the context of feature selection) is testing dozens of features at once rather than one at a time, using smaller samples thanks to better use of data science.

Core elements of scripting languages

Some basic constructs that are used in pretty much any program include:

  • Hash tables are lists of name-value pairs, where insertion or deletion of an element is very fast. They can be described as arrays indexed by strings, and constitute a powerful, fundamental data structure. They can be used to produce efficient joins. See our data dictionary article for a simple illustration. Hash tables store data using a syntax such as $myhash{"Vincent Granville|Data Scientist"} = "yes"; In this case the index is bi-dimensional and is made up of the name and job title; the value is "yes" or "no". If the name or job title is not in your data, no entry is created (that's why this data structure produces efficient joins). See also this article on feature selection, for a more sophisticated application.
  • Associative arrays are just hash tables: arrays indexed by strings rather than integers. In Perl, they are declared using %myhash=() while regular arrays are declared using @myarray=(). Memory allocation for hash tables is automated in Perl. However, you should create a variable $myhashsize that is incremented by 1 each time an entry is added to %myhash (or decremented by 1 in case of deletion). This way, you know how big your hash tables grow. If your program displays (on the screen) the exact time every 300,000 newly created hash entries, you'll have an idea when you run out of memory: at that moment, your Perl script suddenly starts running 20 times slower. When this happens, it's time to think about optimization using Hadoop or Map-Reduce.
  • String processing and regular expressions: the sample code below contains basic string substitutions involving special characters (\n, \:). Many substitutions can be performed in just one tiny bit of code using regular expressions; click here or here for details. One of the most widespread operations is to split a text $text into elements stored in an array @myarray; the syntax is @myarray = split(/\t/,$text); Here we assume that text elements are separated by TABs (the special character \t). The index of the last element is stored in the variable $#myarray, so the number of elements is $#myarray + 1. A short sketch combining these ideas follows this list.
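The following minimal sketch (all names and values are made up) ties these three ideas together: a hash table used as an associative array to count job titles, a $myhashsize counter to monitor its growth, and split on the TAB character to parse tab-separated records.

# hash table (associative array) indexed by strings, with a size counter
%myhash = ();
$myhashsize = 0;
@records = ("Vincent Granville\tData Scientist",
            "John Doe\tData Engineer",
            "Jane Smith\tData Scientist");
foreach $record (@records) {
  ($name, $title) = split(/\t/, $record);            # fields separated by TABs
  $title =~ s/\s+$//;                                # regular expression: trim trailing spaces
  if (!defined($myhash{$title})) { $myhashsize++; }  # new key: increment the size counter
  $myhash{$title}++;                                 # count occurrences of each job title
}
foreach $title (keys %myhash) {
  print "$title: $myhash{$title}\n";
}
print "hash table size: $myhashsize keys\n";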

The easiest way to learn how to code is to look at simple, well-written sample programs of increasing complexity, and to become an expert in Google search to find solutions to coding questions - many answers can be found on StackOverflow. I have learned R, SQL, C, C++ and Perl that way, without attending any training. If you need training, read the section on training in chapter 2 of my book or check out this list of courses. The following are good examples of code to get you started.

Sample scripts to get you started, and even make money!

Here is some sample code. The more stars, the more difficult.

Below is a simple script that performs automated DNS lookups to extract the domain names associated with IP addresses. The input file is a list of IP addresses (ips.txt) and the output file is a text file outip.txt with two fields, tab-separated: IP address and domain name. A temporary file titi.txt is created each time we call the external Cygwin command 'nslookup'. Note that $ precedes variable names. There's some basic string processing here, for instance: $ip=~s/\n//g removes each line feed (the special character \n) from the variable $ip by replacing it with nothing (the empty string). Note that a # means that what follows (in the line in question) is a comment, not code.

`rm -f titi.txt`;                        # remove any leftover temporary file
# $ip="107.2.111.109";                   # sample IP address, for testing

open(IN,"<ips.txt");                     # input: one IP address per row
open(OUT,">outip.txt");                  # output: IP address, TAB, domain name
while ($lu=<IN>) {
  $ip=$lu;
  $n++;                                  # count the IP addresses processed so far
  $ip=~s/\n//g;                          # remove the trailing line feed
  if ($ip eq "") { $ip="na"; }
  `nslookup $ip | grep Name > titi.txt`; # external DNS lookup; keep the Name: row

  open(TMP,"<titi.txt");
  $x="n/a";                              # default if the lookup returned nothing
  while ($i=<TMP>) {
    $i=~s/\n//g;
    $i=~s/Name\://g;                     # strip the "Name:" label
    $i=~s/^\s+|\s+$//g;                  # trim leading/trailing whitespace
    $x=$i;                               # domain name associated with $ip
  }
  close(TMP);
  print OUT "$ip\t$x\n";
  print "$n> $ip | $x\n";                # progress report on the screen

  sleep(0);                              # increase (e.g. sleep(1)) to throttle lookups
}
close(OUT);
close(IN);

Now, you can download big logfiles for free (see section 10), extract IP addresses and traffic statistics per IP address, and run the above script (using a distributed architecture, with 20 copies of your script running on your laptop) to extract domain names attached to IP addresses. Then you can write a program to map each IP address to an IP category using the technique described in my article Internet Topology Mapping. And finally, sell or license the final data to clients.
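Here is a minimal sketch of that poor man's distributed architecture, assuming the lookup script above has been saved as iplookup.pl and modified to read its input and output filenames from the command line (both are assumptions; see the command-line arguments bullet in the next section): it splits ips.txt into 20 chunks and launches 20 background copies.

# split ips.txt into 20 chunks and run 20 copies of the lookup script in parallel
$nchunks = 20;
open(IN, "<ips.txt");
@ips = <IN>;
close(IN);
for ($k = 0; $k < $nchunks; $k++) {
  open(OUT, ">ips_chunk$k.txt");
  for ($i = $k; $i <= $#ips; $i += $nchunks) { print OUT $ips[$i]; }   # round-robin split
  close(OUT);
  # one background copy per chunk (assumes iplookup.pl accepts these two arguments)
  system("perl iplookup.pl ips_chunk$k.txt outip_chunk$k.txt &");
}

Once all copies are done, the 20 output files can be merged with cat outip_chunk*.txt > outip.txt.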

Other important concepts

Four concepts:

  • Functions in Perl are declared using the sub (short for subroutine) reserved keyword. A few examples are found in the sample scripts. Learn how to pass an argument that is a variable, an array or a hash table. Subroutines can return more than one value. Use of global variables is discouraged, but with proper care (naming conventions), you can do it without risk. A short example follows this list.
  • You can write programs that accept command-line arguments. Google 'Perl command-line arguments' for details.
  • Libraries (home-made or external) require an inclusion directive, such as require LWP::UserAgent; in the web robot sample code (see link above) that uses the LWP library. If a library is not available in your Perl distribution, you can download and add it using the ppm command, or even manually (see my book, page 138, where I discuss how to manually install the library Permutor.pl).
  • Perl scripts can be automatically run according to a pre-established schedule, say once a day. Google 'cron jobs' for details, and check this article for running cron jobs on Cygwin.
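Here is a minimal sketch combining the first two bullets, assuming it is saved as wordcount.pl (a hypothetical name): a subroutine declared with sub that receives an array argument, and a filename read from the command line via @ARGV.

# usage: perl wordcount.pl myfile.txt
sub count_words {                      # subroutine declared with the sub keyword
  my (@lines) = @_;                    # arguments are passed in the @_ array
  my $count = 0;
  foreach my $line (@lines) {
    my @words = split(' ', $line);     # split on whitespace
    $count += $#words + 1;
  }
  return $count;                       # subroutines can return one or more values
}

$filename = $ARGV[0];                  # first command-line argument
open(IN, "<$filename") || die "cannot open $filename";
@content = <IN>;
close(IN);
print "$filename contains ", count_words(@content), " words\n";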

Exercise

Write a Perl script that accesses all the text files on your laptop using two steps:

  • recursively using the ls -l > dir.txt Cygwin command from within Perl to create directory listings (one for each folder / subfolder), saved as text files named dir.txt
  • accessing each text file from each of these automatically created directory listings dir.txt in each directory

Then count the number of occurrences for each word (broken down per file creation year) across these files, using a hash table. Purpose: identify keyword trends in your data.

5. R language

R is a very popular language for performing textbook statistical analyses and producing nice graphics. I would not use it for black-box applications; such black-box routines must rely on robust techniques such as jackknife regression or model-free confidence intervals. Large applications, such as text clustering involving 20 million keywords, are performed in Python or Perl. Python libraries for data analysis and machine learning are widely available and discussed in a few O'Reilly books: they offer an alternative to R for big data processing. Note that R does have an extensive collection of sophisticated statistical functions - too many, in my opinion. Finally, R is currently used for exploratory data analysis rather than production-mode development. I am not sure how R compares with Python in terms of speed. For more info, read R versus SAS versus Python.

You can download the open-source R package from The R Project. Installing R, and running R programs via the GUI on a Windows laptop, is straightforward. Memory limitations can be bypassed by running multiple copies of R on multiple machines, using some R packages, or using RHadoop (R + Hadoop). R programs are text files with a .r extension. 

Sample R code

Also check this list of references, many of which are about R.

6. Advanced Excel

This section is under construction. It will focus on some advanced Excel functions such as LINEST (linear regression), VLOOKUP, quantiles, ranks, and random numbers, as well as some data science applications that can easily be performed with Excel, for instance the following analyses (offered with a nice Excel spreadsheet).

7. Visualization

Many visualizations can be performed with R (see the section on R in this article), Excel, or Python or Perl libraries. Specific types of charts (graphs, or nice representations of decision trees) require special software. The most popular software is Tableau. BIRT (an open-source Eclipse project) is popular for dashboards, and Visio (a Microsoft product) for diagrams (e.g., patent diagrams).

Interesting links

8. Machine Learning

To understand the difference between machine learning and data science, read this article. A large list of machine learning references can be found here. It covers the following domains:

  • Support Vector Machines
  • Clustering
  • Dimensionality Reduction
  • Anomaly Detection
  • Recommender Systems
  • Collaborative Filtering
  • Large Scale Machine Learning
  • Deep Learning
  • Sparse Coding

Also check out our list of data science algorithms: many are considered to be machine learning applications. Finally, a number of additional articles are found in our resources section (constantly updated), as well as in our list of top articles (constantly updated).

9. Projects

We offer both research and applied projects for potential data scientists. Research projects involve working with simulated data, while applied projects involve working on real data. Both simulated and real data sets can be quite large. In addition, we offer interesting data science challenges; our most recent one can be found here (time series and spatial processes). The previous one was on random number generation.

Here's our list of projects, as of today.

Applied projects

  1. RSS Feed Exchange. Detect reputable big data, data science and analytics digital publishers that accept RSS feeds (click here for details), and create an RSS feed exchange where publishers can swap or submit feeds.
  2. Analyze 40,000 web pages to optimize content. I can share some traffic statistics about 40,000 pages on DSC, and you work on the data to identify the types of articles and other metrics associated with success (and how do you measure success in the first place?): identifying great content for our audience, forecasting an article's lifetime and pageviews based on subject line or category, assessing the impact of re-tweets, likes, and sharing on traffic, and detecting factors impacting Google organic traffic. Also, designing a tool to identify new trends and hot keywords would be useful. Lots of NLP (natural language processing) is involved in this type of project; it might also require crawling our websites. This project may not be available to all participants; it requires signing an NDA.
  3. URL shortener that correctly counts traffic. Another potential project is the creation of a redirect URL shortener like http://bit.ly, but one that correctly counts the number of clicks. Bit.ly (and also the Google URL shortener) provides statistics that are totally wrong for traffic originating from email clients (e.g. Outlook, which represents our largest traffic source); their numbers are inflated by more than 300%. It's possible that an easy solution consists of counting and reporting the number of users/visitors (after filtering out robots), rather than pageviews. Test your URL redirector and make sure only real human beings are counted (not robots or fake traffic).
  4. Meaningful list and categorization of top data scientists. Another project: create a list of the top 500 data scientists or big data experts using public data such as Twitter, and rate them based on number of followers or better criteria (also identify new stars and trends - note that new stars have fewer followers even though they might be more popular, as it takes time to build a list of followers). Classify top practitioners into a number of categories (unsupervised clustering) based on their expertise (identified by keywords or hashtags in their postings). Filter out automated from real tweets - in short, identify genuine tweets posted by the author rather than feeds automatically blended with the author's tweets (you can try with my account @AnalyticBridge, which is a blend of external RSS feeds with my own tweets - some posted automatically, some manually). Create groups of data scientists. I started a similar analysis a while back; click here for details.
  5. Data science website. Create and monetize (maybe via Amazon books) a blog like ours from scratch, using our RSS feed to provide initial content to visitors: see http://businessintelligence.com/ for an example of such a website - not producing content, but instead syndicating content from other websites. Scoop.it (and many more) have a similar business model.

Research projects

  1. Spurious correlations in big data: how to detect and fix them. You have n = 5,000 variables uniformly distributed on [0,1]. What is the expected number m of correlations that are above p = 0.95? Perform simulations or find a theoretical solution. Try various values of n (from 5,000 to 100,000) and p (from 0.80 to 0.99) and obtain confidence intervals for m (m is a function of n and p). Identify better indicators than correlation to measure whether two time series are really related. The purpose here is twofold: (1) to show that with big data, your strongest correlations are likely to be spurious, and (2) to identify better metrics than correlation in this context. A starting point is my article The curse of big data, also in my book, pages 41-45. Or read my article on strong correlations and answer the questions in sections 5 and 6. A small simulation sketch follows this list.
  2. Robust, simple, multi-usage regression tool for automated data science. The jackknife regression project involves simulated data to create a very robust and simple tool to perform regression and even clustering.
  3. Cracking the maths that make all financial transactions secure. Click here for details.
  4. Great random number generator. Most random number generators use an algorithm a(k+1) = f(a(k)) to produce a sequence of integers a(1), a(2), etc. that behaves like random numbers. The function f is integer-valued and bounded; because of these two conditions, the sequence a(k) eventually becomes periodic for k large enough. This is an undesirable property, and many public random number generators (those built into Excel, Python, and other languages) are poor and not suitable for cryptographic applications, Markov Chain Monte Carlo associated with hierarchical Bayesian models, or large-scale Monte Carlo simulations to detect extreme events (example: fraud detection, in a big data context). Click here for details about this project.
  5. Solve the 'Law of Series' problem. Why do we get 4 deadly plane crashes in 4 months, and then nothing for several years? This is explained by probability laws. Read our article, download our simulations (the password for our Excel spreadsheet is 5150) and provide the mathematical solution, using our numerous hints. This project helps you detect coincidences that are just coincidences, versus those that are not. Useful if you want to specialize in root cause analysis, or data science forensics / litigation.
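For research project 1, here is a minimal simulation sketch in Perl. The number of variables, the number of observations per variable, and the threshold are arbitrary small values chosen so the sketch runs in seconds (the project itself calls for much larger n): it generates independent uniform variables and counts how many of the n(n-1)/2 pairwise correlations exceed the threshold.

# count spurious correlations among independent uniform variables
$n = 500;        # number of variables (the project calls for 5,000 to 100,000)
$nobs = 10;      # observations per variable (an arbitrary choice for this sketch)
$p = 0.80;       # correlation threshold (try values from 0.80 to 0.99)

for ($i = 0; $i < $n; $i++) {
  for ($t = 0; $t < $nobs; $t++) { $x[$i][$t] = rand(); }   # uniform on [0,1]
}

$count = 0;
for ($i = 0; $i < $n; $i++) {
  for ($j = $i + 1; $j < $n; $j++) {
    # sample correlation between variables $i and $j
    $sx = 0; $sy = 0; $sxx = 0; $syy = 0; $sxy = 0;
    for ($t = 0; $t < $nobs; $t++) {
      $a = $x[$i][$t]; $b = $x[$j][$t];
      $sx += $a; $sy += $b; $sxx += $a*$a; $syy += $b*$b; $sxy += $a*$b;
    }
    $num = $nobs*$sxy - $sx*$sy;
    $den = sqrt(($nobs*$sxx - $sx*$sx) * ($nobs*$syy - $sy*$sy));
    $corr = ($den > 0) ? $num/$den : 0;
    if ($corr > $p) { $count++; }
  }
}
print "$count correlations above $p, out of ", $n*($n-1)/2, " pairs\n";

Even though all variables are independent, the count is typically well above zero when the number of observations per variable is small - which is exactly the point of the project.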

10. Data Sets

Here are some data sets to get you started. Some are internal to us, but freely available to the public. Each of these articles has a link to the related data set, or the data set is available as an attachment in some cases.

Here is another list featuring 100 data sets. KDnuggets also maintains a fairly comprehensive list of data sets.

11. Miscellaneous

We will add more content to this section, in particular about SQL/NewSQL, SAS, Hadoop, Scala, Julia, machine learning libraries, Mahout, and other items.

Related links

Comments

Comment by Maloy Manna on Wednesday

Excellent compilation - just one niggle though - avoid using Cygwin on Windows or Homebrew on Mac OS X - it may mess up some big data frameworks. It's probably best to use Linux in a VirtualBox - it also allows you to use nice IDEs like Dataiku.

Great work though - thanks Vincent!

Comment by Milton Labanda on October 10, 2014 at 6:24pm

@Vincent, what do you think about using only open-source office software (Writer, Calc) instead of Excel and Word? Any negative experiences with these?

Comment by Janet Dobbins on October 6, 2014 at 9:53am

Thanks for this list, Vincent, I have listed your blog as a resource on our site too.  http://www.statistics.com/data-science/

Comment by Kalyanaraman K on August 7, 2014 at 3:34pm

A very important and useful post. Thanks.

Comment by Vincent Granville on August 7, 2014 at 10:11am

Jeffrey: Is using a Windows machine critical?

Maybe not. I recommend Windows because I'm used to it and I use Excel quite a lot, as well as Word, and it has many apps such as Perl / Cygwin that are easy to work with. Files are compatible with what I get from clients, though file format issues are easy to address.

If you work with Excel spreadsheets on your Mac, there's no reason to buy a Windows machine. Ideally, a UNIX laptop with Excel is good. Personally, I don't like Microsoft products, don't use IE, don't use Outlook, and don't use most Microsoft features on my laptop.

Comment by Jeffrey Keeton on August 7, 2014 at 9:07am

Planning to purchase a new laptop...what are everybody's thoughts on using a macbook (pro or air)? Is using a Windows machine critical?

Comment by Nasir M. Uddin, Ph.D. on August 5, 2014 at 6:21am

Very useful information - thank you very much. Regards, Nasir 
