This problem also appeared as an assignment problem in the coursera online course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself.
In this assignment, we shall train a neural network to draw a curve. The curve takes one input variable, the amount traveled along the curve from 0 to 1, and returns 2 outputs, the 2D coordinates of the position of points on the curve.
The below table shows the first few rows of the dataset. Here x is the input variable and y1, y2 are the output variables.
      x        y1        y2
0  0.00  0.500000  0.625000
1  0.01  0.500099  0.627015
2  0.02  0.500788  0.632968
3  0.03  0.502632  0.642581
4  0.04  0.506152  0.655407
5  0.05  0.511803  0.670852
6  0.06  0.519955  0.688198
7  0.07  0.530875  0.706641
8  0.08  0.544723  0.725326
9  0.09  0.561537  0.743386
The next figures show what the data looks like:
To help capture the complexity of the curve, we shall use two hidden layers in our network with 6 and 7 neurons respectively.
We shall implement functions to calculate the Jacobian of the cost function, with respect to the weights and biases of the network. The code will form part of a stochastic steepest descent algorithm that will train our network.
The feedforward equations for the network are: a0 = x, then zk = Wk a(k−1) + bk and ak = σ(zk) for layers k = 1, 2, 3, with the network output given by a3.
The following Python code (taken from the same assignment) defines the functions that set up our neural network: an activation function, σ(z); its derivative, σ′(z); a function to initialize the weights and biases; and a function that computes each activation of the network via feedforward.
In this assignment we shall use the logistic function as our activation function, rather than the more familiar tanh or ReLU.
```python
import numpy as np

# Logistic activation function and its derivative.
sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z / 2)**(-2) / 4

# This function initialises the network with its structure; it also resets any training already done.
def reset_network(n1=6, n2=7, random=np.random):
    global W1, W2, W3, b1, b2, b3
    W1 = random.randn(n1, 1) / 2
    W2 = random.randn(n2, n1) / 2
    W3 = random.randn(2, n2) / 2
    b1 = random.randn(n1, 1) / 2
    b2 = random.randn(n2, 1) / 2
    b3 = random.randn(2, 1) / 2

# This function feeds forward each activation to the next layer.
# It returns all weighted sums and activations.
def network_function(a0):
    z1 = W1 @ a0 + b1
    a1 = sigma(z1)
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    z3 = W3 @ a2 + b3
    a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

# This is the cost function of the neural network with respect to a training set.
def cost(x, y):
    return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size
```
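As a quick aside (not part of the assignment), the `d_sigma` expression above can be checked numerically against the more familiar identity σ′(z) = σ(z)(1 − σ(z)), which is algebraically equal to cosh(z/2)⁻²/4:

```python
import numpy as np

# Logistic activation and its derivative, as defined above.
sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z / 2)**(-2) / 4

# The cosh form agrees with the familiar sigma(z) * (1 - sigma(z)).
z = np.linspace(-5, 5, 101)
assert np.allclose(d_sigma(z), sigma(z) * (1 - sigma(z)))
```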
Next we need to implement the functions for the Jacobian of the cost function with respect to the weights and biases. We will start with layer 3, which is the easiest, and work backwards through the layers.
The following python code shows how the J_W3 function can be implemented.
```python
# Jacobian for the third layer weights.
def J_W3(x, y):
    # First get all the activations and weighted sums at each layer of the network.
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    # We'll use the variable J to store parts of our result as we go along, updating it in each line.
    # Firstly, we calculate dC/da3, using the expressions above.
    J = 2 * (a3 - y)
    # Next multiply the result we've calculated by the derivative of sigma, evaluated at z3.
    J = J * d_sigma(z3)
    # Then we take the dot product (along the axis that holds the training examples) with the final
    # partial derivative, i.e. dz3/dW3 = a2, and divide by the number of training examples,
    # for the average over all training examples.
    J = J @ a2.T / x.size
    # Finally return the result out of the function.
    return J
```
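The remaining Jacobians follow the same backward chaining pattern. The sketch below (an illustration of the approach, not the assignment's official solution) shows how `J_b3` and `J_W2` might look: `J_b3` is the same chain as `J_W3` except that dz3/db3 = 1, and `J_W2` chains one layer further back through dz3/da2 = W3 and da2/dz2 = σ′(z2). The setup is repeated here so the sketch runs standalone:

```python
import numpy as np

# Definitions repeated from above so this sketch runs standalone.
sigma = lambda z: 1 / (1 + np.exp(-z))
d_sigma = lambda z: np.cosh(z / 2)**(-2) / 4

def reset_network(n1=6, n2=7, random=np.random):
    global W1, W2, W3, b1, b2, b3
    W1 = random.randn(n1, 1) / 2
    W2 = random.randn(n2, n1) / 2
    W3 = random.randn(2, n2) / 2
    b1 = random.randn(n1, 1) / 2
    b2 = random.randn(n2, 1) / 2
    b3 = random.randn(2, 1) / 2

def network_function(a0):
    z1 = W1 @ a0 + b1; a1 = sigma(z1)
    z2 = W2 @ a1 + b2; a2 = sigma(z2)
    z3 = W3 @ a2 + b3; a3 = sigma(z3)
    return a0, z1, a1, z2, a2, z3, a3

# Jacobian for the third layer biases: dz3/db3 = 1, so we
# just average the upstream term over the training examples.
def J_b3(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y) * d_sigma(z3)
    return np.sum(J, axis=1, keepdims=True) / x.size

# Jacobian for the second layer weights: chain one layer further,
# via dz3/da2 = W3 and da2/dz2 = sigma'(z2), then dz2/dW2 = a1.
def J_W2(x, y):
    a0, z1, a1, z2, a2, z3, a3 = network_function(x)
    J = 2 * (a3 - y) * d_sigma(z3)
    J = (W3.T @ J) * d_sigma(z2)
    return J @ a1.T / x.size
```

Each result has the same shape as the parameter it differentiates with respect to (b3 is 2×1, W2 is 7×6), which is a useful sanity check before plugging these into the descent loop.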
The following python code snippet implements the Gradient Descent algorithm (where the parameter aggression represents the learning rate and noise acts as a regularization parameter here):
```python
iterations = 0
while iterations < max_iteration:
    # Perturb each Jacobian with multiplicative noise (noise acts as a crude regularizer).
    j_W1 = J_W1(x, y) * (1 + np.random.randn() * noise)
    j_W2 = J_W2(x, y) * (1 + np.random.randn() * noise)
    j_W3 = J_W3(x, y) * (1 + np.random.randn() * noise)
    j_b1 = J_b1(x, y) * (1 + np.random.randn() * noise)
    j_b2 = J_b2(x, y) * (1 + np.random.randn() * noise)
    j_b3 = J_b3(x, y) * (1 + np.random.randn() * noise)
    # Step each parameter downhill; aggression is the learning rate.
    W1 = W1 - j_W1 * aggression
    W2 = W2 - j_W2 * aggression
    W3 = W3 - j_W3 * aggression
    b1 = b1 - j_b1 * aggression
    b2 = b2 - j_b2 * aggression
    b3 = b3 - j_b3 * aggression
    iterations += 1
```
The next figures and animations show how the prediction curve (pink/orange) produced by the neural net approaches the original curve (green) as training proceeds.
With learning rate = 7
The next figures and animations visualize all the curves learnt at different iterations.
With learning rate = 1
We can change the parameters of the steepest descent algorithm, e.g., how aggressive the learning step is and how much noise to add. The next figure shows the actual and predicted curves for a few different values of these parameters.
We can compute the model error (the sum of squared deviations between the actual and predicted outputs) for different values of the parameters. The next heatmap shows how the model error varies with the aggression and noise parameter values.
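The error surface behind such a heatmap can be computed with a small helper like the following (a hypothetical sketch; `model_error` and the sweep loop are not part of the assignment code):

```python
import numpy as np

# Hypothetical helper: total squared deviation between the actual output
# curve (y, shape 2 x N) and the network's prediction (y_pred, shape 2 x N).
def model_error(y, y_pred):
    return np.sum((y - y_pred)**2)

# The heatmap is then a grid of such errors (sketch of the sweep):
# for i, aggression in enumerate(aggressions):
#     for j, noise in enumerate(noises):
#         ...train the network with these parameters, then...
#         errors[i, j] = model_error(y, network_function(x)[-1])

# Toy check: all 8 coordinates off by 0.5 gives 8 * 0.25 = 2.0.
assert model_error(np.zeros((2, 4)), np.full((2, 4), 0.5)) == 2.0
```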
This problem appeared in an assignment in the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta).
We need to implement the Kernel Perceptron algorithm to classify some datasets that are not linearly separable. The algorithm should allow kernels like the quadratic and RBF kernel.
The next figure describes the theory and the algorithm (in dual form) for Kernel Perceptron. As can be seen, the kernel trick can be used both at the training and the prediction time to avoid basis expansion (by replacing the dot products of the expanded feature vectors with a Mercer Kernel).
The datasets we are going to classify with the dual perceptron algorithm are 2-dimensional; each datapoint has a label of +1 or −1. The first few datapoints of one dataset are shown below:
   X1   X2    y
0  1.0  1.0  1.0
1  2.0  1.0  1.0
2  3.0  1.0  1.0
3  4.0  1.0  1.0
4  5.0  1.0  1.0
The ground-truth labels of the datapoints are represented by their color (red and black) and marker type (circle and triangle, respectively).
The datapoint that is misclassified in a particular iteration is shown in blue.
When a misclassified point is selected, its corresponding alpha value is incremented; this is indicated by an increase in the size of the datapoint.
The decision boundaries for the two classes are shown in green and magenta, respectively.
The following figures and animations show the classification of the datasets using kernel perceptron with RBF and quadratic kernels. The next python code snippet implements the kernel functions.
```python
import numpy as np

# Kernel functions: RBF, quadratic, and (by default) linear.
# The parameter is named ktype to avoid shadowing the builtin `type`.
def kernel(x, z, ktype, s):
    if ktype == 'rbf':
        return np.exp(-np.dot(x - z, x - z) / s**2)
    if ktype == 'quadratic':
        return (1 + np.dot(x, z))**2
    return np.dot(x, z)  # default: linear kernel
```
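A rough sketch of how the dual-form training loop described earlier might be implemented (this is an illustrative implementation, not the assignment's official code; `train_kernel_perceptron` and `predict` are hypothetical helper names). Each entry of `alpha` counts how many times that point has been misclassified, and both training and prediction use only kernel evaluations, never an explicit basis expansion:

```python
import numpy as np

def kernel(x, z, ktype, s):
    if ktype == 'rbf':
        return np.exp(-np.dot(x - z, x - z) / s**2)
    if ktype == 'quadratic':
        return (1 + np.dot(x, z))**2
    return np.dot(x, z)  # default: linear kernel

# Dual (kernel) perceptron training: alpha[i] counts mistakes on point i.
def train_kernel_perceptron(X, y, ktype='linear', s=1.0, max_epochs=100):
    n = len(y)
    # Precompute the Gram matrix of pairwise kernel values.
    K = np.array([[kernel(X[i], X[j], ktype, s) for j in range(n)]
                  for i in range(n)])
    alpha = np.zeros(n)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # Misclassified if the signed score disagrees with the label.
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break  # converged: all training points classified correctly
    return alpha, b

# Prediction also uses only kernel evaluations (the kernel trick).
def predict(X_train, y_train, alpha, b, x, ktype='linear', s=1.0):
    score = sum(alpha[i] * y_train[i] * kernel(X_train[i], x, ktype, s)
                for i in range(len(y_train))) + b
    return 1 if score > 0 else -1
```

The growing alpha values correspond directly to the growing marker sizes in the animations above: the more often a point is misclassified, the larger its contribution to the decision boundary.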
The next figures / animations show the result of classification with a python implementation of the (Dual) Kernel Perceptron Algorithm.
Dataset 1
The Kernel Perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian (RBF) kernel.
Dataset 2
Results with RBF Kernel
Results with quadratic Kernel
Dataset 3
The Kernel Perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian (RBF) kernel.
Here C is the soft-margin parameter (default: 1.0), and s (for the RBF kernel) is the scaling parameter (default: 1.0).
Dataset 4
Results with RBF Kernel
Results with quadratic Kernel
Dataset 5
The Kernel Perceptron algorithm does not converge on this dataset with the quadratic kernel. The following animation shows the convergence of the algorithm and the decision boundary found with the Gaussian (RBF) kernel.
The following code and figures show the decision boundaries and the support vectors (datapoints drawn with larger markers) learnt with sklearn's SVC.
```python
from sklearn.svm import SVC

x = data[:, 0:2]
y = data[:, 2]
# gamma = 1 / s^2 matches the scaling used in our hand-rolled RBF kernel.
clf = SVC(kernel='rbf', C=C, gamma=1.0 / (s * s))
clf.fit(x, y)
clf.support_  # indices of the support vectors
```
Dataset 1
With polynomial kernel (degree=2, C=1)
With RBF kernel (C=10, σ = 10)
Dataset 2
With polynomial kernel (degree=2, C=1)
With RBF kernel (C=10, σ = 10)
Dataset 3
With RBF kernel (C=10, σ = 10)