The following problems appeared in the first few assignments in the **Udacity course Deep Learning (by Google)**. The descriptions of the problems are taken from the assignments.

Let’s first learn about simple data curation practices, and familiarize ourselves with some of the data that are going to be used for **deep learning** using

First the dataset needs to be downloaded and extracted to a local machine. The data consists of characters rendered in a variety of fonts on a *28×28* image. The labels are limited to ‘A’ through ‘J’ (10 classes). The *training set* has about 500k labelled examples and the *test set* about 19,000. Given these sizes, it should be possible to train models quickly on any machine.

Let’s take a peek at some of the data to make sure it looks sensible. Each exemplar should be an image of a character A through J rendered in a different font. Display a sample of the images downloaded.

Now let’s load the data in a more manageable format. Let’s convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and standard deviation ~0.5 to make training easier down the road. A few images might not be readable; we’ll just skip them. The data is also expected to be balanced across classes, so let’s verify that. The following output shows the dimensions of the *ndarray* for each class.

(52909, 28, 28) (52911, 28, 28) (52912, 28, 28) (52911, 28, 28) (52912, 28, 28) (52912, 28, 28) (52912, 28, 28) (52912, 28, 28) (52912, 28, 28) (52911, 28, 28)
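The normalization step described above can be sketched as follows. This is a minimal sketch, assuming 8-bit grayscale pixels (the constant `PIXEL_DEPTH` and the helper `normalize_images` are illustrative names, not taken from the post) and using random pixels as a hypothetical stand-in for one class folder.

```python
import numpy as np

PIXEL_DEPTH = 255.0  # assumption: 8-bit grayscale images


def normalize_images(raw):
    """Map pixel values from [0, 255] to the range [-0.5, 0.5]."""
    raw = np.asarray(raw, dtype=np.float32)
    return (raw - PIXEL_DEPTH / 2) / PIXEL_DEPTH


# Hypothetical stand-in for the images of one class: 100 random 28x28 images.
rng = np.random.default_rng(0)
fake_class = rng.integers(0, 256, size=(100, 28, 28))

dataset = normalize_images(fake_class)
print(dataset.shape, dataset.mean(), dataset.std())
```

Centering around half the pixel depth keeps the values roughly symmetric around zero; the exact mean and spread depend on the actual images.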

Now some more preprocessing is needed:

- First the training data (for different classes) needs to be merged and pruned.
- The labels will be stored into a separate array of integers from 0 to 9.
- Also let’s create a validation dataset for hyperparameter tuning.
- Finally the data needs to be randomized. It’s important to have the labels well shuffled for the training and test distributions to match.
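The preprocessing steps listed above could look roughly like the sketch below. The function name `merge_and_split`, the per-class sizes and the fake data are all assumptions made for illustration; the real sizes used in the assignment are not stated here.

```python
import numpy as np


def merge_and_split(per_class, train_per_class, valid_per_class, seed=0):
    """Merge per-class image arrays into shuffled training and validation sets."""
    rng = np.random.default_rng(seed)
    train_x, train_y, valid_x, valid_y = [], [], [], []
    for label, images in enumerate(per_class):
        images = rng.permutation(images)  # shuffle within the class (returns a copy)
        valid = images[:valid_per_class]
        train = images[valid_per_class:valid_per_class + train_per_class]  # prune
        valid_x.append(valid)
        valid_y.append(np.full(len(valid), label, dtype=np.int32))
        train_x.append(train)
        train_y.append(np.full(len(train), label, dtype=np.int32))

    def shuffled(xs, ys):
        x, y = np.concatenate(xs), np.concatenate(ys)
        perm = rng.permutation(len(y))  # same permutation for images and labels
        return x[perm], y[perm]

    return shuffled(train_x, train_y), shuffled(valid_x, valid_y)


# Fake data: every image of class k is filled with the value k,
# so that the image/label pairing can be checked after shuffling.
per_class = [np.full((30, 28, 28), k, dtype=np.float32) for k in range(10)]
(train_x, train_y), (valid_x, valid_y) = merge_and_split(per_class, 20, 5)
print(train_x.shape, train_y.shape, valid_x.shape, valid_y.shape)
```

Applying one permutation to both arrays is what keeps each image paired with its label while still randomizing the order.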

After *preprocessing*, let’s take a peek at a few samples from the **training dataset**. The next figure shows how they look.

Here is how the **validation dataset** looks (a few samples chosen).

Here is how the **test dataset** looks (a few samples chosen).

Let’s get an idea of what an *off-the-shelf classifier* can give us on this data. It’s always good to check that there is something to learn, and that it’s a problem that is not so trivial that a canned solution solves it. Let’s first train a simple **LogisticRegression** model from *sklearn* (using default parameters) on this data using **5000** training samples.

The following figure shows the output of the trained logistic regression model: its accuracy on the test dataset, the *confusion matrix*, and a few test instances that the model classified wrongly (**predicted** labels shown along with the **true** labels).
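A baseline of this kind can be sketched with *sklearn* as follows. The data here is synthetic (a random "template" per class plus noise), standing in for the real images, and only 500 training samples are used to keep the sketch fast; neither choice comes from the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
num_classes, side = 10, 28

# Synthetic stand-in for the letters: one fixed random template per class.
templates = rng.normal(size=(num_classes, side, side))


def make_split(n):
    labels = rng.integers(0, num_classes, size=n)
    images = templates[labels] + 0.5 * rng.normal(size=(n, side, side))
    return images.astype(np.float32), labels


train_x, train_y = make_split(500)
test_x, test_y = make_split(200)

clf = LogisticRegression()  # default parameters
clf.fit(train_x.reshape(len(train_x), -1), train_y)  # flatten 28x28 -> 784

pred = clf.predict(test_x.reshape(len(test_x), -1))
accuracy = (pred == test_y).mean()
cm = confusion_matrix(test_y, pred, labels=np.arange(num_classes))
wrong = np.flatnonzero(pred != test_y)  # indices of misclassified test images

print("accuracy:", accuracy)
print(cm)
print("misclassified indices:", wrong[:10])
```

The indices in `wrong` are what one would use to display misclassified images next to their predicted and true labels.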

Now let’s progressively train deeper and more *accurate models* using **TensorFlow**. Again, we need to do the following preprocessing:

- Reformat into a shape that’s more adapted to the models we’re going to train: data as a flat matrix,
- labels as **1-hot encodings**.
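The two reformatting steps above can be sketched in NumPy like this (the names `reformat`, `IMAGE_SIZE` and `NUM_LABELS` are illustrative):

```python
import numpy as np

IMAGE_SIZE = 28
NUM_LABELS = 10


def reformat(images, labels):
    """Flatten each image to a row vector and one-hot encode the labels."""
    flat = images.reshape(-1, IMAGE_SIZE * IMAGE_SIZE).astype(np.float32)
    # Comparing a row of class ids against a column of labels gives the 1-hot rows.
    one_hot = (np.arange(NUM_LABELS) == labels[:, None]).astype(np.float32)
    return flat, one_hot


images = np.zeros((4, IMAGE_SIZE, IMAGE_SIZE))
labels = np.array([0, 3, 9, 3])
flat, one_hot = reformat(images, labels)
print(flat.shape, one_hot.shape)  # (4, 784) (4, 10)
```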

Now let’s train a **multinomial logistic regression** using simple

**TensorFlow** works like this:

First we need to describe the **computation** that is to be performed: what the inputs, the variables, and the operations look like. These get created as *nodes* over a **computation graph**.

Then we can run the operations on this graph as many times as we want by calling **session.run()**, providing the outputs to fetch from the graph; their values are returned.

- Let’s load all the data into **TensorFlow** and build the **computation graph** corresponding to our training.
- Let’s use **stochastic gradient descent** training (with **~3k steps**), which is much faster. The graph will be similar to *batch gradient descent*, except that instead of holding all the training data into a constant node, we create a *Placeholder node* which will be fed actual data at every call of *session.run()*.
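To make the minibatch idea concrete without depending on a particular TensorFlow version, here is a plain NumPy stand-in (not the TensorFlow graph itself). The arguments `batch_x` and `batch_y` play the role of the placeholders fed at each step; the synthetic data, the learning rate and the 200 steps are assumptions chosen to keep the sketch fast.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sgd_step(W, b, batch_x, batch_y, lr):
    """One minibatch update of softmax regression; W and b are updated in place."""
    probs = softmax(batch_x @ W + b)
    loss = -np.mean(np.sum(batch_y * np.log(probs + 1e-12), axis=1))
    grad_logits = (probs - batch_y) / len(batch_x)  # gradient of the mean cross-entropy
    W -= lr * (batch_x.T @ grad_logits)
    b -= lr * grad_logits.sum(axis=0)
    return loss


rng = np.random.default_rng(0)
num_labels, num_features, batch_size = 10, 784, 128

# Synthetic stand-in for the flattened images and 1-hot labels.
templates = rng.normal(size=(num_labels, num_features))
labels = rng.integers(0, num_labels, size=1000)
train_x = templates[labels] + 0.5 * rng.normal(size=(1000, num_features))
train_y = np.eye(num_labels)[labels]

W = np.zeros((num_features, num_labels))
b = np.zeros(num_labels)
losses = []
for step in range(200):
    # Pick a different slice of the training data at each step.
    offset = (step * batch_size) % (len(train_x) - batch_size)
    batch_x = train_x[offset:offset + batch_size]
    batch_y = train_y[offset:offset + batch_size]
    losses.append(sgd_step(W, b, batch_x, batch_y, lr=0.1))

train_accuracy = (softmax(train_x @ W + b).argmax(axis=1) == labels).mean()
print("first loss:", losses[0], "last loss:", losses[-1])
print("training accuracy:", train_accuracy)
```

Each step only touches `batch_size` rows, which is why stochastic gradient descent is so much cheaper per step than computing the gradient over the full training set.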

The following shows the **fully connected computation graph** and the results obtained.

Now let’s turn the **logistic regression** example with **SGD** into a **1-hidden layer neural network** with **rectified linear units** (*nn.relu()*) and **1024 hidden nodes**. As can be seen from the results below, this model improves the validation / test accuracy.
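The forward pass of such a network can be sketched in NumPy as below. This is only an illustration of the architecture (784 inputs, 1024 ReLU hidden units, 10 outputs); the initialization scale and the names `init_params` and `forward` are assumptions, and training is omitted.

```python
import numpy as np

NUM_INPUTS, NUM_HIDDEN, NUM_LABELS = 28 * 28, 1024, 10


def init_params(rng, scale=0.1):
    """Small random weights and zero biases (the scale is an assumed choice)."""
    return {
        "W1": rng.normal(scale=scale, size=(NUM_INPUTS, NUM_HIDDEN)),
        "b1": np.zeros(NUM_HIDDEN),
        "W2": rng.normal(scale=scale, size=(NUM_HIDDEN, NUM_LABELS)),
        "b2": np.zeros(NUM_LABELS),
    }


def forward(params, x):
    """Return the hidden activations and the output logits for a batch x."""
    hidden = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU
    logits = hidden @ params["W2"] + params["b2"]
    return hidden, logits


rng = np.random.default_rng(0)
params = init_params(rng)
batch = rng.normal(size=(5, NUM_INPUTS))  # hypothetical batch of 5 flattened images
hidden, logits = forward(params, batch)
print(hidden.shape, logits.shape)  # (5, 1024) (5, 10)
```

The only change from logistic regression is the extra matrix multiplication followed by the ReLU; the logits still feed a softmax exactly as before.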

Previously we trained a **logistic regression** and a **neural network** model with **TensorFlow**. Let’s now explore the **regularization** techniques.

Let’s introduce and tune L2 **regularization** for both **logistic** and

The right amount of *regularization* improves the validation / test accuracy, as can be seen from the following results.
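As a sketch of what the L2 penalty adds to the loss: each weight matrix contributes its sum of squares, scaled by a coefficient λ. The helper below uses the "half the sum of squares" convention (the one TensorFlow’s *nn.l2_loss* uses); whether the results in this post were computed with exactly that convention is an assumption.

```python
import numpy as np


def l2_loss(w):
    """Half the sum of squared entries of w."""
    return 0.5 * np.sum(np.square(w))


def regularized_loss(data_loss, weights, lambdas):
    """Add one L2 penalty term per weight matrix to the data loss."""
    return data_loss + sum(lam * l2_loss(w) for lam, w in zip(lambdas, weights))


W = np.ones((2, 2))          # toy weight matrix: sum of squares = 4
print(l2_loss(W))            # 2.0
print(regularized_loss(1.0, [W], [0.005]))  # 1.0 + 0.005 * 2.0 = 1.01
```

Only the weights are penalized in this sketch, not the biases. A larger λ pulls the weights harder towards zero.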

The following figure recapitulates the simple network without any hidden layer, with *softmax* outputs.

The next figures visualize the weights learnt for the 10 output neurons at different *steps* using **SGD** and the **L2 regularized loss function** (with λ=0.005). As can be seen below, the weights learnt are gradually capturing the

As can be seen, the **test accuracy** also gets improved to

The following results show the **accuracy** and the **weights** learnt for a couple of different values of **λ** (**0.01** and **0.05** respectively). As can be seen, with higher values of **λ**, the **logistic regression** model tends to *underfit* and the *test accuracy* decreases.

The following figure recapitulates the neural network with a **single hidden layer** with

The next figures visualize the weights learnt for *225* randomly selected input neurons (out of *28×28*) at different steps using *SGD* and the **L2 regularized** loss function (with **λ1 = λ2 = 0.01**). As can be seen below, the weights learnt are gradually capturing (as the SGD steps increase) the different features of the letters at the corresponding output neurons.

If the **regularization parameters** used are **λ1=λ2=0.005**, the *test accuracy* increases to **91.1%**, as shown below.

Let’s demonstrate an *extreme* case of **overfitting**. Restrict your training data to just a *few batches*. What happens?
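One possible way to restrict training to a few batches is to cycle the minibatch offset over a fixed number of batches, as in the sketch below. The helper `batch_offsets` and the batch size of 128 are assumptions for illustration.

```python
def batch_offsets(num_steps, batch_size, num_batches):
    """Start index of the minibatch used at each step, cycling over num_batches."""
    return [(step % num_batches) * batch_size for step in range(num_steps)]


offsets = batch_offsets(num_steps=12, batch_size=128, num_batches=5)
print(offsets)

# Number of distinct training examples the model ever sees.
distinct_examples = len(set(offsets)) * 128
print(distinct_examples)  # 640
```

With only a few hundred distinct examples and many parameters, the model can simply memorize the batches it keeps seeing.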

Let’s restrict the training data to **n=5** batches. The following figures show how it increases the

Introduce **Dropout** on the *hidden layer* of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise our evaluation results would be stochastic as well. **TensorFlow** provides nn.dropout() for that, but you have to make sure it’s only inserted during training.
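The training-only behaviour can be illustrated with a small NumPy sketch of (inverted) dropout. This is a stand-in, not TensorFlow’s implementation; here `rate` is the probability of dropping a unit, whereas the original TensorFlow 1.x *nn.dropout()* took the probability of keeping one. The `training` flag is an assumed way of switching it off for evaluation.

```python
import numpy as np


def dropout(activations, rate, rng, training):
    """Randomly zero a fraction `rate` of the activations, during training only."""
    if not training or rate == 0.0:
        return activations  # evaluation: deterministic, nothing is dropped
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Scale the kept units so the expected activation stays unchanged.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((100, 100))  # hypothetical hidden-layer activations

train_out = dropout(hidden, rate=0.4, rng=rng, training=True)
eval_out = dropout(hidden, rate=0.4, rng=rng, training=False)
print(train_out.mean(), eval_out.mean())
```

Because the kept units are rescaled by `1 / keep_prob`, no extra correction is needed at evaluation time.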

What happens to our *extreme overfitting* case?

As we can see from the results below, introducing a **dropout rate** of **0.4** increases the *validation* and *test accuracy* by reducing the **overfitting**.

Up to this point, the **highest accuracy** on the **test dataset** using a single-hidden-layer neural network is **91.1%**. More hidden layers or other techniques (e.g., *exponential decay of the learning rate*) can be used to improve the accuracy obtained (**to be continued…**).
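For reference, exponential decay of the learning rate follows a simple formula, sketched below in plain Python. The function name and the example values (initial rate 0.5, decay rate 0.9 every 1000 steps) are illustrative assumptions, not settings used in this post.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` steps: initial_rate * decay_rate ** (step / decay_steps)."""
    # With staircase=True the rate drops in discrete jumps every decay_steps steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent


for step in (0, 1000, 1500, 3000):
    smooth = exponential_decay(0.5, step, decay_steps=1000, decay_rate=0.9)
    stepped = exponential_decay(0.5, step, decay_steps=1000, decay_rate=0.9, staircase=True)
    print(step, smooth, stepped)
```

Starting with a larger rate and shrinking it over time lets SGD move quickly at first and settle more carefully later.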

