# 1. Back-propagation

This problem also appeared as an assignment problem in the Coursera online course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself.

In this assignment, we shall train a neural network to draw a curve. The curve takes one input variable, the amount traveled along the curve from 0 to 1, and returns 2 outputs, the 2D coordinates of the position of points on the curve.

The table below shows the first few rows of the dataset, where x is the input variable and y1, y2 are the output variables.

| x    | y1       | y2       |
|------|----------|----------|
| 0.00 | 0.500000 | 0.625000 |
| 0.01 | 0.500099 | 0.627015 |
| 0.02 | 0.500788 | 0.632968 |
| 0.03 | 0.502632 | 0.642581 |
| 0.04 | 0.506152 | 0.655407 |
| 0.05 | 0.511803 | 0.670852 |
| 0.06 | 0.519955 | 0.688198 |
| 0.07 | 0.530875 | 0.706641 |
| 0.08 | 0.544723 | 0.725326 |
| 0.09 | 0.561537 | 0.743386 |

The next figures show how the data looks.

To help capture the complexity of the curve, we shall use two hidden layers in our network, with 6 and 7 neurons respectively. We shall implement functions to calculate the Jacobian of the cost function with respect to the weights and biases of the network. The code will form part of a stochastic steepest-descent algorithm that will train our network.

## Feed forward

The feed-forward equations for the network, with input a0 = x and output a3, are:

• a0 = x
• z1 = W1 a0 + b1,  a1 = σ(z1)
• z2 = W2 a1 + b2,  a2 = σ(z2)
• z3 = W3 a2 + b3,  a3 = σ(z3)

The following Python code (taken from the same assignment) defines functions to set up our neural network: an activation function σ(z), its derivative σ′(z), a function to initialize the weights and biases, and a function that calculates each activation of the network by feeding forward.

In this assignment we shall use the logistic function as our activation function, rather than the more familiar tanh or ReLU.

```python
import numpy as np

sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z/2)**(-2) / 4

# This function initialises the network with its structure; it also resets any training already done.
def reset_network(n1=6, n2=7, random=np.random):
    global W1, W2, W3, b1, b2, b3
    W1 = random.randn(n1, 1) / 2
    W2 = random.randn(n2, n1) / 2
    W3 = random.randn(2, n2) / 2
    b1 = random.randn(n1, 1) / 2
    b2 = random.randn(n2, 1) / 2
    b3 = random.randn(2, 1) / 2

# This function feeds forward each activation to the next layer. It returns all weighted sums and activations.
def network_function(a0):
    z1 = W1 @ a0 + b1
    a1 = sigma(z1)
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    z3 = W3 @ a2 + b3
    a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

# This is the cost function of a neural network with respect to a training set.
def cost(x, y):
    return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size
```
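As a quick sanity check (my addition, not part of the assignment): the cosh form of d_sigma above is exactly the familiar identity σ′(z) = σ(z)(1 − σ(z)) in disguise.

```python
import numpy as np

sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z/2)**(-2) / 4

# 1 / (4 cosh^2(z/2)) = e^(-z) / (1 + e^(-z))^2 = sigma(z) * (1 - sigma(z))
z = np.linspace(-5, 5, 101)
same = np.allclose(d_sigma(z), sigma(z) * (1 - sigma(z)))
```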

## Backpropagation

Next we need to implement the functions for the Jacobian of the cost function with respect to the weights and biases. We will start with layer 3, which is the easiest, and work backwards through the layers. The following Python code shows how the J_W3 function can be implemented.

```python
# Jacobian for the third layer weights.
def J_W3(x, y):
    # First get all the activations and weighted sums at each layer of the network.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    # We'll use the variable J to store parts of our result as we go along, updating it in each line.
    # Firstly, we calculate dC/da3, using the expressions above.
    J = 2 * (a3 - y)
    # Next multiply the result we've calculated by the derivative of sigma, evaluated at z3.
    J = J * d_sigma(z3)
    # Then we take the dot product (along the axis that holds the training examples) with the final partial derivative,
    # i.e. dz3/dW3 = a2,
    # and divide by the number of training examples, for the average over all training examples.
    J = J @ a2.T / x.size
    # Finally return the result out of the function.
    return J
```
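The remaining Jacobians follow the same pattern. As a sketch (the earlier definitions are reproduced here so it runs standalone, and the random seed is my choice), J_b3 replaces the final product with a sum over training examples, and J_W2 propagates the error back through W3 first:

```python
import numpy as np

# Definitions from the feed-forward section, reproduced so this sketch runs standalone.
sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z/2)**(-2) / 4

rng = np.random.RandomState(0)
W1, b1 = rng.randn(6, 1) / 2, rng.randn(6, 1) / 2
W2, b2 = rng.randn(7, 6) / 2, rng.randn(7, 1) / 2
W3, b3 = rng.randn(2, 7) / 2, rng.randn(2, 1) / 2

def network_function(a0):
    z1 = W1 @ a0 + b1; a1 = sigma(z1)
    z2 = W2 @ a1 + b2; a2 = sigma(z2)
    z3 = W3 @ a2 + b3; a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

# Jacobian for the third-layer bias: same chain rule as J_W3, but dz3/db3 = 1,
# so we sum over the training examples instead of taking the product with a2.T.
def J_b3(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    return np.sum(J, axis=1, keepdims=True) / x.size

# One layer back: the error is propagated through W3 before applying
# layer 2's local derivative d_sigma(z2); then dz2/dW2 = a1.
def J_W2(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y)
    J = J * d_sigma(z3)
    J = (W3.T @ J) * d_sigma(z2)
    return J @ a1.T / x.size
```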

The following Python code snippet implements the gradient-descent loop (the parameter aggression represents the learning rate, and noise acts as a regularization parameter here):

```python
while iterations < max_iteration:
    j_W1 = J_W1(x, y) * (1 + np.random.randn() * noise)
    j_W2 = J_W2(x, y) * (1 + np.random.randn() * noise)
    j_W3 = J_W3(x, y) * (1 + np.random.randn() * noise)
    j_b1 = J_b1(x, y) * (1 + np.random.randn() * noise)
    j_b2 = J_b2(x, y) * (1 + np.random.randn() * noise)
    j_b3 = J_b3(x, y) * (1 + np.random.randn() * noise)

    W1 = W1 - j_W1 * aggression
    W2 = W2 - j_W2 * aggression
    W3 = W3 - j_W3 * aggression
    b1 = b1 - j_b1 * aggression
    b2 = b2 - j_b2 * aggression
    b3 = b3 - j_b3 * aggression
    iterations += 1
```
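Putting the pieces together, here is a minimal runnable sketch of the whole training procedure, with the deltas for all three layers computed inline. The noise term is omitted for clarity, and the circle-shaped toy data, seed, learning rate, and iteration count are my choices, not the assignment's:

```python
import numpy as np

sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z/2)**(-2) / 4

rng = np.random.RandomState(1)
W1, b1 = rng.randn(6, 1) / 2, rng.randn(6, 1) / 2
W2, b2 = rng.randn(7, 6) / 2, rng.randn(7, 1) / 2
W3, b3 = rng.randn(2, 7) / 2, rng.randn(2, 1) / 2

def network_function(a0):
    z1 = W1 @ a0 + b1; a1 = sigma(z1)
    z2 = W2 @ a1 + b2; a2 = sigma(z2)
    z3 = W3 @ a2 + b3; a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

def cost(x, y):
    return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size

# Toy data: a circle traced out as the input runs from 0 to 1.
x = np.linspace(0, 1, 100).reshape(1, -1)
y = np.vstack([0.5 + 0.25 * np.cos(2 * np.pi * x),
               0.5 + 0.25 * np.sin(2 * np.pi * x)])

aggression = 1.0
c0 = cost(x, y)
for _ in range(200):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    n = x.size
    d3 = 2 * (a3 - y) * d_sigma(z3)      # error at layer 3
    d2 = (W3.T @ d3) * d_sigma(z2)       # propagated back through W3
    d1 = (W2.T @ d2) * d_sigma(z1)       # propagated back through W2
    W3 -= aggression * (d3 @ a2.T) / n
    b3 -= aggression * d3.sum(axis=1, keepdims=True) / n
    W2 -= aggression * (d2 @ a1.T) / n
    b2 -= aggression * d2.sum(axis=1, keepdims=True) / n
    W1 -= aggression * (d1 @ a0.T) / n
    b1 -= aggression * d1.sum(axis=1, keepdims=True) / n
c_final = cost(x, y)
```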

The next figures and animations show how the prediction curve (pink/orange) produced by the neural net approaches the original curve (green) as the network is trained longer.

With learning rate = 7

The next figures and animations visualize all the curves learnt at different iterations.

With learning rate = 1

We can change the parameters of the steepest-descent algorithm, e.g., how aggressive the learning step is and how much noise to add. The next figure shows the actual and the predicted curves for a few different values of the parameters. We can compute the model error (the sum of squared deviations between the actual and predicted outputs) for different values of the parameters. The next heatmap shows how the model error varies with the aggression and noise parameter values.

# 2. The Kernel Perceptron

This problem appeared in an assignment in the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjoy Dasgupta).

We need to implement the Kernel Perceptron algorithm to classify some datasets that are not linearly separable. The algorithm should allow kernels like the quadratic and RBF kernel.

The next figure describes the theory and the algorithm (in dual form) for the Kernel Perceptron. As can be seen, the kernel trick can be used both at training and at prediction time to avoid basis expansion, by replacing the dot products of the expanded feature vectors with a Mercer kernel. The datasets we are going to classify with the dual perceptron algorithm are 2-dimensional; each datapoint has a label +1 or −1. The first few datapoints of one dataset are shown below:

| X1  | X2  | y   |
|-----|-----|-----|
| 1.0 | 1.0 | 1.0 |
| 2.0 | 1.0 | 1.0 |
| 3.0 | 1.0 | 1.0 |
| 4.0 | 1.0 | 1.0 |
| 5.0 | 1.0 | 1.0 |
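In standard textbook notation, the dual-form kernel perceptron keeps one coefficient α_i per training point (counting how often example i was misclassified), predicts with a kernel-weighted vote, and on each mistake up-votes the offending coefficient:

```latex
\hat{y}(x) = \operatorname{sign}\Big(\sum_{i=1}^{n} \alpha_i \, y_i \, k(x_i, x)\Big),
\qquad
\alpha_j \leftarrow \alpha_j + 1 \quad \text{whenever } y_j \, \hat{y}(x_j) \le 0 .
```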

• The ground truth of the data-points is represented by their color (red and black) and marker type (circle and triangle, respectively).

• The data-point that is misclassified in a particular iteration is shown in blue.

• When a misclassified point is selected, the corresponding α value is up-voted (incremented); this is indicated by an increase in the size of the data-point.

• The decision boundaries for the two classes are shown in green and magenta, respectively.

The following figures and animations show the classification of the datasets using the kernel perceptron with RBF and quadratic kernels. The next Python code snippet implements the kernel functions.

```python
import numpy as np

def kernel(x, z, type, s):
    if type == 'rbf':
        return np.exp(-np.dot(x - z, x - z) / s**2)
    if type == 'quadratic':
        return (1 + np.dot(x, z))**2
    return np.dot(x, z)
```
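A minimal sketch of the dual-form training loop that goes with this kernel function. The function names (train_kernel_perceptron, predict), the epoch cap, and the sequential sweep over the data are my assumptions, not taken from the assignment:

```python
import numpy as np

def kernel(x, z, type, s):
    # Same kernel function as above.
    if type == 'rbf':
        return np.exp(-np.dot(x - z, x - z) / s**2)
    if type == 'quadratic':
        return (1 + np.dot(x, z))**2
    return np.dot(x, z)

def train_kernel_perceptron(X, y, type='rbf', s=1.0, max_epochs=100):
    """Dual perceptron: alpha[i] counts how often example i was misclassified."""
    n = len(y)
    alpha = np.zeros(n)
    # Precompute the Gram matrix K[i, j] = k(x_i, x_j).
    K = np.array([[kernel(X[i], X[j], type, s) for j in range(n)]
                  for i in range(n)])
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(n):
            # Dual-form prediction at training point x_j.
            if y[j] * np.sum(alpha * y * K[:, j]) <= 0:
                alpha[j] += 1          # "up-vote" the misclassified point
                mistakes += 1
        if mistakes == 0:              # converged: a full pass with no mistakes
            break
    return alpha

def predict(X_train, y_train, alpha, x, type='rbf', s=1.0):
    score = sum(alpha[i] * y_train[i] * kernel(X_train[i], x, type, s)
                for i in range(len(y_train)))
    return 1.0 if score > 0 else -1.0
```

On data that no linear (or even quadratic) boundary can separate, such as an XOR-like arrangement, the RBF kernel typically lets this loop converge within a few epochs.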

## Results

The next figures / animations show the result of classification with a python implementation of the (Dual) Kernel Perceptron Algorithm.

Dataset 1

The Kernel Perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian kernel.

Dataset 2

Results with RBF Kernel

Dataset 3

The Kernel Perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian kernel.

• C is the soft-margin parameter (default: 1.0)
• s (for the RBF kernel) is the scaling parameter (default: 1.0)

Dataset 4

Results with RBF Kernel

Dataset 5

The Kernel Perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian kernel.

## Results with Kernel SVM Classifier (sklearn)

The following code and the figures show the decision boundaries and the support vectors (datapoints with larger size) learnt with sklearn SVC.

```python
from sklearn.svm import SVC

x = data[:, 0:2]
y = data[:, 2]
# C and s are set per experiment; gamma = 1/s^2 matches the RBF kernel above.
clf = SVC(kernel='rbf', C=C, gamma=1.0 / (s * s))
clf.fit(x, y)
clf.support_   # indices of the support vectors
```

Dataset 1

With polynomial kernel (degree=2, C=1)

With RBF kernel (C=10, σ = 10)

Dataset 2

With polynomial kernel (degree=2, C=1)

With RBF kernel (C=10, σ = 10)

Dataset 3

With RBF kernel (C=10, σ = 10)
