
A Few Machine Learning Problems (with Python Implementations)

1. Back-propagation

This problem also appeared as an assignment problem in the Coursera online course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself.

In this assignment, we shall train a neural network to draw a curve. The network takes one input variable, the amount travelled along the curve from 0 to 1, and returns two outputs, the 2D coordinates of the corresponding point on the curve.

The table below shows the first few rows of the dataset; x is the input variable and y1, y2 are the output variables.

 

       x        y1        y2
0   0.00  0.500000  0.625000
1   0.01  0.500099  0.627015
2   0.02  0.500788  0.632968
3   0.03  0.502632  0.642581
4   0.04  0.506152  0.655407
5   0.05  0.511803  0.670852
6   0.06  0.519955  0.688198
7   0.07  0.530875  0.706641
8   0.08  0.544723  0.725326
9   0.09  0.561537  0.743386
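
The assignment supplies its own training data; purely so the snippets below have something to run on, a stand-in dataset with the same shapes (x of shape (1, N), y of shape (2, N)) can be generated as follows. The particular curve used here is an assumption for illustration, not the assignment's curve.

import numpy as np

N = 100
x = np.linspace(0, 1, N).reshape(1, N)              # input: position along the curve, in [0, 1]
# A made-up closed curve in the unit square; the assignment's actual curve is different.
y = np.vstack([0.5 + 0.25 * np.sin(2 * np.pi * x[0]),
               0.5 + 0.25 * np.cos(2 * np.pi * x[0])])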

The next figures show how the data looks:


To help capture the complexity of the curve, we shall use two hidden layers in our network with 6 and 7 neurons respectively.


We shall implement functions to calculate the Jacobian of the cost function, with respect to the weights and biases of the network. The code will form part of a stochastic steepest descent algorithm that will train our network.

Feed forward

The feed-forward equations, with the input $a_0 = x$, are

$$z_k = W_k\, a_{k-1} + b_k, \qquad a_k = \sigma(z_k), \qquad k = 1, 2, 3,$$

where $a_3$ is the output of the network.

The following Python code (taken from the same assignment) defines the functions that set up our neural network: an activation function σ(z), its derivative σ′(z), a function to initialise the weights and biases, and a function that calculates each activation of the network using feed-forward.

In this assignment we shall use the logistic function as our activation function, rather than the more familiar tanh or ReLU:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \frac{1}{4}\,\mathrm{sech}^2\!\left(\frac{z}{2}\right) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr).$$

import numpy as np

# Logistic activation function and its derivative.
sigma = lambda z : 1 / (1 + np.exp(-z))
d_sigma = lambda z : np.cosh(z/2)**(-2) / 4

# This function initialises the network with its structure; it also resets any training already done.
def reset_network(n1=6, n2=7, random=np.random):
    global W1, W2, W3, b1, b2, b3
    W1 = random.randn(n1, 1) / 2
    W2 = random.randn(n2, n1) / 2
    W3 = random.randn(2, n2) / 2
    b1 = random.randn(n1, 1) / 2
    b2 = random.randn(n2, 1) / 2
    b3 = random.randn(2, 1) / 2

# This function feeds forward each activation to the next layer. It returns all weighted sums and activations.
def network_function(a0):
    z1 = W1 @ a0 + b1
    a1 = sigma(z1)
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    z3 = W3 @ a2 + b3
    a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

# This is the cost function of a neural network with respect to a training set.
def cost(x, y):
    return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size
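
As a quick sanity check (an illustrative check, not part of the assignment), the network can be initialised and evaluated on the stand-in x and y arrays generated earlier:

reset_network()
a0, z1, a1, z2, a2, z3, a3 = network_function(x)
print(a3.shape)     # (2, N): one 2D output per training example
print(cost(x, y))   # cost of the untrained network with random weights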

 

 

Backpropagation

Next we need to implement the functions for the Jacobian of the cost function with respect to the weights and biases. We will start with layer 3, which is the easiest, and work backwards through the layers.

By the chain rule,

$$\frac{\partial C}{\partial W_3} = \frac{\partial C}{\partial a_3}\,\frac{\partial a_3}{\partial z_3}\,\frac{\partial z_3}{\partial W_3},$$

where

$$\frac{\partial C}{\partial a_3} = 2\,(a_3 - y), \qquad \frac{\partial a_3}{\partial z_3} = \sigma'(z_3), \qquad \frac{\partial z_3}{\partial W_3} = a_2.$$

The following Python code shows how the J_W3 function can be implemented.

# Jacobian for the third layer weights.
def J_W3(x, y):
    # First get all the activations and weighted sums at each layer of the network.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    # We'll use the variable J to store parts of our result as we go along, updating it in each line.
    # Firstly, we calculate dC/da3, using the expressions above.
    J = 2 * (a3 - y)
    # Next multiply the result we've calculated by the derivative of sigma, evaluated at z3.
    J = J * d_sigma(z3)
    # Then we take the dot product (along the axis that holds the training examples) with the final partial derivative,
    # i.e. dz3/dW3 = a2,
    # and divide by the number of training examples, to average over all training examples.
    J = J @ a2.T / x.size
    # Finally return the result out of the function.
    return J
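
The Jacobians for the remaining weights and biases follow the same recipe, propagating the error back through W3 and then W2. The assignment asks the learner to fill these in; a sketch of one possible reconstruction (the function names match those used in the training loop below, but the bodies here are my own) is:

# Jacobian for the third layer bias: same backpropagated error, summed over the training examples.
def J_b3(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y) * d_sigma(z3)
    return np.sum(J, axis=1, keepdims=True) / x.size

# Jacobians for the second layer: propagate the error back through W3.
def J_W2(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y) * d_sigma(z3)
    J = (W3.T @ J) * d_sigma(z2)
    return J @ a1.T / x.size

def J_b2(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y) * d_sigma(z3)
    J = (W3.T @ J) * d_sigma(z2)
    return np.sum(J, axis=1, keepdims=True) / x.size

# Jacobians for the first layer: propagate the error back through W3 and then W2.
def J_W1(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y) * d_sigma(z3)
    J = (W3.T @ J) * d_sigma(z2)
    J = (W2.T @ J) * d_sigma(z1)
    return J @ a0.T / x.size

def J_b1(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y) * d_sigma(z3)
    J = (W3.T @ J) * d_sigma(z2)
    J = (W2.T @ J) * d_sigma(z1)
    return np.sum(J, axis=1, keepdims=True) / x.size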

The following Python code snippet implements the gradient descent algorithm (the parameter aggression represents the learning rate, and noise, which randomly scales each Jacobian, acts as a regularisation parameter here):

# Assumes iterations = 0 and the hyper-parameters max_iteration, aggression and noise
# have been set before entering the loop.
while iterations < max_iteration:
    # Evaluate each Jacobian, perturbed by multiplicative noise.
    j_W1 = J_W1(x, y) * (1 + np.random.randn() * noise)
    j_W2 = J_W2(x, y) * (1 + np.random.randn() * noise)
    j_W3 = J_W3(x, y) * (1 + np.random.randn() * noise)
    j_b1 = J_b1(x, y) * (1 + np.random.randn() * noise)
    j_b2 = J_b2(x, y) * (1 + np.random.randn() * noise)
    j_b3 = J_b3(x, y) * (1 + np.random.randn() * noise)

    # Take a steepest-descent step on every weight and bias.
    W1 = W1 - j_W1 * aggression
    W2 = W2 - j_W2 * aggression
    W3 = W3 - j_W3 * aggression
    b1 = b1 - j_b1 * aggression
    b2 = b2 - j_b2 * aggression
    b3 = b3 - j_b3 * aggression

    iterations += 1

The next figures and animations show how the prediction curve (pink/orange) produced by the neural net approaches the original curve (green) as the network is trained for longer.

With learning rate = 7


The next figures and animations visualize all the curves learnt at different iterations.

With learning rate = 1


We can change the parameters of the steepest descent algorithm, e.g., how aggressive the learning step is and how much noise to add. The next figure shows the actual and the predicted curves for a few different values of the parameters.


We can compute the model error (the sum of squared deviations between the actual and predicted outputs) for different values of the parameters. The next heatmap shows how the model error varies with the aggression and noise parameter values.
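
A minimal sketch of such a parameter sweep (the grids of values, iteration budget and seeding below are illustrative assumptions, not the settings used for the heatmap):

import itertools
import numpy as np

errors = {}
for aggression, noise in itertools.product([1, 2, 4, 7], [0.0, 0.5, 1.0]):
    reset_network(random=np.random.RandomState(0))   # same initial weights for every setting
    for _ in range(1000):                            # illustrative iteration budget
        # Evaluate all six Jacobians with the current weights, then take one descent step.
        grads = [J(x, y) * (1 + np.random.randn() * noise)
                 for J in (J_W1, J_W2, J_W3, J_b1, J_b2, J_b3)]
        W1, W2, W3 = W1 - grads[0] * aggression, W2 - grads[1] * aggression, W3 - grads[2] * aggression
        b1, b2, b3 = b1 - grads[3] * aggression, b2 - grads[4] * aggression, b3 - grads[5] * aggression
    errors[(aggression, noise)] = cost(x, y)

for (a, n), e in sorted(errors.items()):
    print('aggression=%s, noise=%s: error=%.4f' % (a, n, e))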


2. The Kernel Perceptron

This problem appeared in an assignment in the edX course Machine Learning Fundamentals by UCSD (taught by Prof. Sanjoy Dasgupta).

We need to implement the Kernel Perceptron algorithm to classify some datasets that are not linearly separable. The algorithm should allow kernels like the quadratic and RBF kernel.

The next figure describes the theory and the algorithm (in dual form) for the kernel perceptron. As can be seen, the kernel trick can be used both at training and at prediction time to avoid explicit basis expansion, by replacing the dot products of the expanded feature vectors with a Mercer kernel.

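For reference, in the standard dual form the classifier keeps one multiplier $\alpha_i$ per training point and predicts

$$\hat{y}(x) = \mathrm{sign}\!\left(\sum_{i=1}^{n} \alpha_i\, y_i\, k(x_i, x) + b\right),$$

and whenever a training point $(x_j, y_j)$ is misclassified it updates $\alpha_j \leftarrow \alpha_j + 1$ and $b \leftarrow b + y_j$, cycling through the data until no mistakes remain or an iteration budget is exhausted.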

The datasets we are going to classify with the dual perceptron algorithm are 2-dimensional, with each data point labelled +1 or -1. The first few data points of one dataset are shown below:

    X1   X2    y
0  1.0  1.0  1.0
1  2.0  1.0  1.0
2  3.0  1.0  1.0
3  4.0  1.0  1.0
4  5.0  1.0  1.0

 

In the animations that follow:

  • The ground-truth labels of the data points are shown by their color (red and black) and marker type (circle and triangle, respectively).
  • The data point that is misclassified in a particular iteration is shown in blue.
  • When a misclassified point is selected, the corresponding alpha value is up-voted; this is indicated by an increase in the size of that data point.
  • The decision boundaries for the two classes are shown in green and magenta, respectively.

The following figures and animations show the classification of the datasets using the kernel perceptron with RBF and quadratic kernels. The next Python code snippet implements the kernel functions.

import numpy as np

def kernel(x, z, type, s):
    # RBF (Gaussian) kernel with scale parameter s.
    if type == 'rbf':
        return np.exp(-np.dot(x - z, x - z) / s**2)
    # Quadratic (degree-2 polynomial) kernel.
    if type == 'quadratic':
        return (1 + np.dot(x, z))**2
    # Default: linear kernel (ordinary dot product).
    return np.dot(x, z)
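
For context, here is a minimal sketch of how a dual kernel perceptron trainer and predictor might use this kernel function. This is a reconstruction of the standard algorithm rather than the assignment's code, and it assumes data is an n x 3 array with columns X1, X2 and the label y, as in the table above.

def train_kernel_perceptron(data, type='rbf', s=1.0, max_epochs=100):
    X, y = data[:, 0:2], data[:, 2]
    n = len(y)
    # Precompute the Gram matrix K[i, j] = k(x_i, x_j).
    K = np.array([[kernel(X[i], X[j], type, s) for j in range(n)] for i in range(n)])
    alpha, b = np.zeros(n), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(n):
            # Dual-form prediction for the j-th training point.
            if y[j] * (np.dot(alpha * y, K[:, j]) + b) <= 0:
                alpha[j] += 1      # up-vote the multiplier of the misclassified point
                b += y[j]
                mistakes += 1
        if mistakes == 0:          # converged: every training point classified correctly
            break
    return alpha, b

def predict_kernel_perceptron(data, alpha, b, x_new, type='rbf', s=1.0):
    X, y = data[:, 0:2], data[:, 2]
    # Kernel trick at prediction time: no explicit basis expansion needed.
    score = sum(alpha[i] * y[i] * kernel(X[i], x_new, type, s) for i in range(len(y))) + b
    return np.sign(score)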

 

Results

The next figures / animations show the results of classification with a Python implementation of the (dual) kernel perceptron algorithm.


Dataset 1

The kernel perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian (RBF) kernel.

Dataset 2

Results with RBF Kernel

Results with quadratic Kernel

Dataset 3

The kernel perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian (RBF) kernel.

  • C is the soft-margin parameter (default: 1.0).
  • s is the scaling parameter of the RBF kernel (default: 1.0).


Dataset 4

Results with RBF Kernel

Results with quadratic Kernel


Dataset 5

The kernel perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian (RBF) kernel.

Results with Kernel SVM Classifier (sklearn)

The following code and figures show the decision boundaries and the support vectors (drawn as larger data points) learnt with sklearn's SVC.

from sklearn.svm import SVC

# data is an n x 3 array with columns X1, X2 and the label y.
x = data[:, 0:2]
y = data[:, 2]
# RBF kernel with gamma = 1/s^2 (matching the kernel used above); C is the soft-margin parameter.
clf = SVC(kernel='rbf', C=C, gamma=1.0/(s*s))
clf.fit(x, y)
clf.support_   # indices of the support vectors
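
One possible way to visualise the learnt decision boundary and highlight the support vectors (a sketch; the plotting details behind the original figures may differ):

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the decision function on a grid covering the data.
xx, yy = np.meshgrid(np.linspace(x[:, 0].min() - 1, x[:, 0].max() + 1, 200),
                     np.linspace(x[:, 1].min() - 1, x[:, 1].max() + 1, 200))
zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contour(xx, yy, zz, levels=[0])               # decision boundary
plt.scatter(x[:, 0], x[:, 1], c=y, s=20)          # all data points, colored by label
plt.scatter(x[clf.support_, 0], x[clf.support_, 1],
            s=80, facecolors='none', edgecolors='k')   # support vectors, drawn larger
plt.show()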

Dataset 1

With polynomial kernel (degree=2, C=1)
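
For this setting, the corresponding call would be something like the following sketch (sklearn's polynomial kernel is (gamma * x.z + coef0)^degree, so gamma=1 and coef0=1 reproduce the quadratic kernel used earlier):

clf = SVC(kernel='poly', degree=2, gamma=1, coef0=1, C=1)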


With RBF kernel (C=10, σ = 10)


 

Dataset 2

With polynomial kernel (degree=2, C=1)


With RBF kernel (C=10, σ = 10)


Dataset 3

With RBF kernel (C=10, σ = 10)

