Guest blog post by Blaine Batman. Based on two R packages for neural networks.
In this article, I compare two available R packages for using neural networks to model data: neuralnet and deepnet. Through the comparisons I highlight various challenges in finding good hyperparameter values. I show that some needed hyperparameters differ when using these two packages, even with the same underlying algorithmic approach. neuralnet was developed by Stefan Fritsch and Frauke Guenther with contributors Marc Suling and Sebastian M. Mueller. deepnet was created by Xiao Rong. Both packages can be obtained via the R CRAN repository (see links at the end). I will focus on a simple time series example built from two predictors, and on how well each package predicts future data after being trained on past data using a simple 5-neuron network. Note that most of what you read about deep learning with neural networks concerns “classification” problems (more later); nonetheless such networks have promise for predicting continuous data, including time series.
Brief overview of simple neural networks
Briefly, a neural network (also called a multilayer-perceptron etc.) is a connected network of neurons as shown here.
Figure 1. An example neural network (generated using neuralnet).
Note that except for the input layer (where the predictor values are fed in), the inputs to a neuron have weights specific to that neuron, so the output of a neuron is “re-used” as input to all neurons in the next layer, with unique weights. Before moving on to a brief description of how neural networks compute predictions, it is worth reflecting on the number of independent parameters in neural network models as compared to, for example, linear regression.
If we applied linear regression to the problem of figure 1, the model would have the form:
m0 + m1*x1 + m2*x2 + m3*x3 + m4*x4 + m5*x5 = prediction
where m0 is the constant or bias term, and the mi are the coefficients for the input values. The linear regression algorithm is used to determine the values of m that provide the best prediction, a total of six parameters. Linear regression models sometimes can be solved with a closed form method called the normal equation, as well as using linear algebraic approaches. Most machine learning languages have very optimized packages for these types of problems, so linear regression is generally very fast and computationally efficient.
Looking back at figure 1, for the same 5 input model, with two layers each having 3 hidden neurons and one bias node, and the output layer and its bias node, there are a total of 34 weights to be determined by the model, or more than 5 times as many as the linear regression model. It is evident that increasing hidden layers and their size rapidly increases the total number of coefficients. Surprisingly, the neural network model is not over specified, yet it is also evident that since each node impacts all the other nodes, very complex behavior could be modeled. We could try to add terms to the linear regression model, such as all the second order terms (e.g. mi11*x1^2, mi12*x1*x2, etc.); using all the second order terms increases the number of parameters to 21. However, doing so is not guaranteed to lead to a good predictive model; if you fit polynomial terms in a linear model, it may extrapolate very badly. Neural networks may be a better choice.
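The parameter counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python (standard library only); the function names are my own, chosen for illustration, and are not part of neuralnet or deepnet. It assumes a fully connected network in which every neuron also receives one bias weight, as in figure 1.

```python
from math import comb

def count_network_weights(layer_sizes):
    """Count weights in a fully connected network.

    Each neuron receives one weight per node in the previous layer,
    plus one weight from that layer's bias node.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def count_polynomial_terms(n_inputs, degree):
    """Count coefficients (constant included) of a full polynomial model."""
    return comb(n_inputs + degree, degree)

# 5 inputs, two hidden layers of 3 neurons, 1 output neuron (figure 1).
network = count_network_weights([5, 3, 3, 1])
linear = count_polynomial_terms(5, 1)      # m0 .. m5
quadratic = count_polynomial_terms(5, 2)   # adds squares and cross terms

print(network, linear, quadratic)  # prints: 34 6 21
```

Passing a different list, for example [5, 10, 10, 1], shows how quickly the count grows as the hidden layers get wider.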
Neuron function and equations
Neurons are computational nodes—the inputs into a neuron are summed, then an activation function is used to generate the output. Figure 2 shows how this works.
Figure 2. How neuron output is computed. The output of neuron 2,1 is a function of all the neurons in the previous layer, as well as the bias value. The weights of all those inputs are unique to neuron 2,1; the same inputs are fed to other neurons with different weights. In neuron 2,1 the inputs are summed, then fed to f_act.
The function f_act is called the activation function, and can be a nonlinear function of the summed inputs. A common activation function is the sigmoid function (also known as the logistic function):
f(x) = 1 / (1 + e^(-x))
where x is the sum of the inputs. This function mimics biological neurons in that the neuron doesn’t “fire” until the input reaches a threshold, by approximating a 0 to 1 step change. There are other activation functions that can be used; generally it is desirable that the function be differentiable (see below for why), but in principle nearly any function can be used, even a distribution (like a normal distribution). The sigmoid function output vs. inputs is shown in figure 3.
Figure 3. A typical activation function, the sigmoid function varies in a highly non-linear way with respect to the input.
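To make the neuron computation concrete, here is a minimal Python sketch of a single neuron: a weighted sum of the previous layer's outputs plus a bias weight, passed through the sigmoid. The numeric values are hypothetical and chosen only for illustration; the function names are mine and do not come from either package.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias_weight):
    """Sum the weighted inputs and the bias, then apply the activation."""
    total = bias_weight + sum(w * v for w, v in zip(weights, inputs))
    return sigmoid(total)

# Hypothetical outputs of the previous layer, fed to two different neurons.
previous_layer = [0.2, 0.9, 0.5]

# The same inputs are re-used, but each neuron has its own weights.
out_a = neuron_output(previous_layer, [1.5, -2.0, 0.7], bias_weight=0.1)
out_b = neuron_output(previous_layer, [-0.3, 0.8, 2.1], bias_weight=0.1)

print(out_a, out_b)
```

Both outputs fall between 0 and 1, but they differ because the two neurons weight the same inputs differently.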
Linear output neurons
A small but important point is that the output layer, which has just one neuron in figure 2, may use a different activation function. Note that the output layer can have more than one node; for example, if we train a network to recognize digits (such as reading zip codes for the Postal Service sort), there would be 10 nodes, one for each digit. Classification problems (again, those where the target has a finite number of values, like digits or letters or cats vs. non-cats) usually use a non-linear activation function in the output layer, because we want the correct choice to be near 1 and the rest near zero. However, it is possible to use neural networks to model continuous data, including time series. In those cases, which I explore below, the output layer has a single neuron with a linear activation function.
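As an illustration of why the linear output matters for continuous targets, the Python sketch below runs a forward pass through one sigmoid hidden layer of 5 neurons followed by a single linear output neuron. The weights are hypothetical (not fitted values from this article) and the function names are mine. Because the output neuron does no squashing, the prediction is not confined to the 0 to 1 range.

```python
import math

def sigmoid(x):
    """Logistic activation used by the hidden layer."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: one sigmoid hidden layer, then one linear output neuron."""
    hidden = [
        sigmoid(bias + sum(w * v for w, v in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Linear output: the weighted sum is returned as-is.
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))

# Hypothetical weights for 2 predictors and 5 hidden neurons.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8], [0.9, 0.1], [-0.6, -0.7]]
hidden_biases = [0.0, 0.1, -0.1, 0.2, 0.0]
output_weights = [2.0, -1.0, 3.0, 0.5, 1.5]
output_bias = 4.0

prediction = predict([1.0, 2.0], hidden_weights, hidden_biases,
                     output_weights, output_bias)
print(prediction)
```

With a sigmoid on the output neuron instead, the same network could never predict a value above 1, which would be a poor fit for something like a sales series.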
The way the best weights are found is to calculate the error on each iteration, where the errors are the differences between the outputs and target values for the entire data set, or a sample of the data, and then adjust the weights to reduce the error. When using very large data sets and/or very complex networks, the computational cost of using the full data set on every update can be too high, depending on the compute resources available. Using a sample of the data instead of the entire set is usually referred to as mini-batch gradient descent (strictly, batch gradient descent means using the entire set on every update). The batch size is a hyperparameter. deepnet has a default batch size of 100, which I did not adjust in this work, while neuralnet does not use this parameter. As the data sets used here are small, this doesn’t make much difference. Interestingly, a new paper from Google Brain (see https://arxiv.org/abs/1711.00489v1) suggests that tuning the batch size is more effective than adjusting the learning rate to achieve fastest convergence on large data.
In practice the squared errors are used in the error function, hence the very common error function Root Mean Squared Error (RMSE). RMSE is calculated by squaring each error term (the difference between the output and the target for a single instance in the data), summing, dividing by the number of training instances, then taking the square root. Using RMSE penalizes errors regardless of whether they are positive or negative, and there are statistical arguments for using it as the error function. Here, I will use only RMSE as the error function.
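The RMSE recipe described above translates directly into code. This is a minimal Python sketch (the function name is mine); it follows the steps in order: square each error, sum, divide by the number of instances, take the square root.

```python
import math

def rmse(outputs, targets):
    """Root mean squared error between model outputs and target values."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    squared_errors = [(o - t) ** 2 for o, t in zip(outputs, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of +2 and -2 are penalized identically.
print(rmse([2.0, -2.0], [0.0, 0.0]))  # prints: 2.0
```

Because each error is squared before averaging, a few large misses raise the RMSE more than many small ones.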
Training vs. test data
An important nuance is to distinguish between training data and test data. In any machine learning approach, it is important to subset data into training data, for which the algorithm is given the “correct” values (which I call the target values here), and test data. The test data are not used in determining the best weights, so they are often referred to as “unseen data”. For classification problems, the train/test data are often taken as a split, usually done randomly, of the available data. For instance, if we have one million instances of a metric along with the million sets of the predictor data, we might select at random 70% of the data and use it for training, and use the remaining 30% of the data to test the result. At the end of the process, it is the performance on the test data we are concerned with. It is common in machine learning competitions that there is a set of data provided for training, and another set for which the dependent values are not provided, and the competitor is to submit their best predictions of the dependent values in the test set. In such cases, it is common to then further split the provided data with known values into a training and test set to find the best solution, then use that on the unseen data to submit to the competition. Another subtlety is that if we are trying to model a time series, typically the test data are taken as a subset of data of the most recent times. There are many suggested approaches to train/test splits for both classification and time series data, but this is outside our scope in this article.
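The difference between a random split and a time-ordered split can be sketched in a few lines of Python. The function names and the list of day numbers are hypothetical stand-ins for real observations; the 1,000 days split in half mirrors the 500 training points and 500 future days used later in this article.

```python
import random

def random_split(rows, train_fraction, seed=0):
    """Shuffle, then cut: typical for classification data."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def chronological_split(rows, train_fraction):
    """Cut without shuffling: the most recent rows become the test set."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

days = list(range(1, 1001))  # hypothetical time-ordered observations

train_r, test_r = random_split(days, 0.7)
train_t, test_t = chronological_split(days, 0.5)

print(len(train_r), len(test_r))   # prints: 700 300
print(max(train_t), min(test_t))   # prints: 500 501
```

In the chronological split every test day comes after every training day, so the model is always evaluated on a future it has not seen.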
A common misconception is that we cannot “understand” how deep learning with neural networks “works”. This misconception arises from the observation that we can train huge networks with millions of neurons to recognize images or determine factors that can be used to classify other inputs. In such applications, it has been surprising to some how successful these methods are, lending the approach an air of mystery. In fact, the output of a trained network is deterministic, and the training process is mathematically represented by a closed set of equations. I think the challenge is to understand why deep learning works, rather than how.
If we follow one path backward from the output, the impact on the output of the prior layer parameters can be described by partial derivative terms, allowing computation of the partial impact of one weight on the output as a function of that weight. This in turn allows updating the weights to move in the direction of lower error.
As this weight adjustment phase is working backwards, it is called back propagation—the error is propagated back through the network and used to update the weights. In general, all the weights are updated at once in an iteration. This mathematical process can be represented very compactly using matrix methods (called linear algebra in the mathematical sciences). There are very efficient matrix mathematics algorithms in most computer programming languages, such as R, Matlab®, C++, etc. neuralnet and deepnet use features in the R language to do the updates. The process of updating the weights is often referred to as gradient descent. The idea is that the partial derivatives are the local slope, or gradient, of the error function with respect to a weight. Since we want to minimize the error function, we want to walk in a downhill direction, hence the descent.
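For a single sigmoid neuron, the partial derivatives and the resulting update can be written out explicitly. The Python sketch below is an illustration under simplifying assumptions (one input, one weight, one bias, a squared-error function with a factor of one half); it is not the code used by neuralnet or deepnet. It also compares the chain-rule gradient against a numerical estimate as a sanity check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error(w, b, x, target):
    """Squared error of one sigmoid neuron on one training instance."""
    return 0.5 * (sigmoid(w * x + b) - target) ** 2

def gradients(w, b, x, target):
    """Partial derivatives of the error with respect to w and b (chain rule)."""
    out = sigmoid(w * x + b)
    d_error_d_out = out - target        # derivative of the squared error
    d_out_d_sum = out * (1.0 - out)     # derivative of the sigmoid
    delta = d_error_d_out * d_out_d_sum
    return delta * x, delta             # dE/dw, dE/db

# Hypothetical starting point and training instance.
w, b, x, target = 0.5, -0.2, 1.5, 0.9
grad_w, grad_b = gradients(w, b, x, target)

# Central finite difference as an independent check of dE/dw.
eps = 1e-6
numeric_grad_w = (error(w + eps, b, x, target) - error(w - eps, b, x, target)) / (2 * eps)

# One gradient descent step: move against the gradient, scaled by the learning rate.
learning_rate = 0.5
new_w = w - learning_rate * grad_w
new_b = b - learning_rate * grad_b

print(error(w, b, x, target), error(new_w, new_b, x, target))
```

The second printed error is smaller than the first, which is the whole point of the update. In a multi-layer network the same delta term is passed backwards layer by layer, which is where the name back propagation comes from.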
Deep Learning flavors and approaches
Having described the process as deterministic and simple mathematically, it turns out that like most problems, there are methods of solution that are better than others. Standard back propagation proceeds more or less identically to the description above. There are more sophisticated algorithms that have been found to work better in many cases, where better means either more likely convergence and/or faster convergence. I won’t review here the range of advanced approaches to improve performance of neural networks. A useful reference article (although heavy in math!) by Sebastian Ruder is here. (Note that the number of labels can be distracting. You may see any of the following and all are at least partially under the umbrella of neural networks: back propagation (backprop), gradient descent, batch gradient descent, stochastic gradient descent (SGD), deep neural networks (DNNs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), partially connected networks, fully connected networks, momentum, decay, adaptive learning rate, resilient backpropagation, weight backtracking, and a range of specific advanced algorithms including NAG, Adam, RMSprop, Rprop, etc.)
Package features and differences
neuralnet has many attractive options, including selection of several algorithms to update weights, and a nice network visualization function that can be called once a model is built (used to generate the network in figure 1). deepnet has fewer options, but three important options not present in neuralnet—the first is the ability to use dropout to reduce over-training. Dropout is a method that “turns off” a sample of neurons in a given update; typically the fraction of neurons turned off is a hyperparameter (see below for a working definition of hyperparameter). Use of dropout has been shown to reduce over-training (or over-fitting), which is a behavior of many machine learning models wherein using more iterations reduces the error of matching the training data but degrades performance of predicting using new (aka unseen) data.
The second valuable feature available in deepnet is the ability to exclude some neurons from being included in the calculation of the updates. The ability to exclude some neurons allows simulating more complex networks that are not fully connected networks. Different network topologies than fully connected neural networks are an ongoing area of research, and may be very important for certain types of problems.
Lastly, deepnet allows a common hyperparameter to adjust the “batch size” used to update the weights. If I define an “epoch” as the process to use all the available data in one forward pass and one backward pass to update the weights, then a batch describes using a subset of the data instead of all the data. There are a number of approaches in the literature, ranging from using as few as one instance per update (which may be called stochastic gradient descent) to using a small subset (which may be called mini-batch). There are papers describing advantages of using different batch sizes, including arguing for small batches to speed up convergence (especially for large datasets), and another showing the opposite in some cases (see: https://arxiv.org/abs/1711.00489). For the purposes here, I use the default value (100 instances) for most comparisons, then compare performance as a function of batch size for one set of parameters.
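The relationship between batch size and the number of weight updates per pass through the data can be sketched as follows. This Python generator is a hypothetical illustration of batching in general, not deepnet's implementation; the 500 rows and the batch size of 100 match the data set size and default used in this article.

```python
import random

def minibatches(rows, batch_size, seed=None):
    """Yield the rows in shuffled order, batch_size rows at a time."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [rows[i] for i in order[start:start + batch_size]]

rows = list(range(500))  # stand-in for 500 training instances

batches = list(minibatches(rows, batch_size=100, seed=1))
print(len(batches))                                   # prints: 5
print(len(list(minibatches(rows, batch_size=1))))     # prints: 500
print(len(list(minibatches(rows, batch_size=500))))   # prints: 1
```

A batch size of 1 corresponds to stochastic gradient descent (one update per instance), while a batch size equal to the data set size gives a single update per pass.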
Here, I use only the standard backpropagation algorithm in both packages, no dropout, and only fully connected networks to focus on the hyperparameter questions of learning rate and decay. (Look for future articles on the benefits of advanced neural network algorithms and advanced features like dropout and partially connected networks.) Also, I do not address momentum in this article (see below).
From this brief description, I can segue to some practical matters involved in actually training such networks. As already noted, there are a set of factors that are used to tweak the training process, commonly called hyperparameters. Perhaps the most important is the learning rate. The learning rate is simply a factor used to reduce the size of the weight updates. Below I will show why this is important, but in general, since the updates are calculated using partial derivatives, the impact of changes in the other weights isn’t included when estimating the weights for a given neuron. Shrinking the update helps avoid “chasing our tail” where the changes are too large and introduce more error when combined with all the other updates. Another hyperparameter is momentum, which carries forward gradient information from the prior update to improve the convergence rate when moving along an extended narrow “canyon” in the hypersurface. I don’t address momentum in this article. A third important hyperparameter is learning rate decay. This factor is used to reduce the learning rate on each iteration; I will show why this matters below.
Standard backpropagation is the method most often taught in introductory machine learning regarding deep learning or neural networks. It is also very likely to suffer from challenges in converging to a global optimum. A full discussion of these issues is outside the scope here, but essentially standard backpropagation can easily converge to a local minimum of the error function, or not converge at all. Figure 4 idealizes the n-dimensional surface along which we search for weights that converge on the minimum of the error function.
Figure 4. Simplified gradient descent diagram.
The grey line in figure 4 is the “surface” of the error function across the space of all possible neuron weights. In reality, this surface is a hypersurface of many dimensions (hundreds to millions). Here, you can view the grey curve as a cross section in one dimension of the multidimensional surface. Our task in training the network is to find the set of weights which produces the global minimum error function value. If for illustration we assume this cross section is taken to include the global minimum, then the pink path is the only one that will converge to the global minimum. The green path won’t converge at all, and the red path is hard to predict. Only by using learning rate decay (reducing the rate on each iteration) do the pink and blue paths converge. The blue path converges quickly, but to a local minimum, not the global one.
Most implementations of neural networks use a randomized starting point. It is easy to see that given a starting point, it is possible (maybe likely) to take such large steps (changes in the weights) that we randomly jump to another local feature of the hypersurface. On the other hand, if our steps are very small, either the network converges to a local minimum, or it converges very slowly, or both. Also, if we find ourselves in a concave feature of the hypersurface, it is possible to get stuck, bouncing back and forth. These issues are often addressed by reducing the learning rate at each iteration. The learning rate is essentially the size of the step we take in the hyperspace; this is represented by the length of the arrows in figure 4. It is easy to see the intuition that decaying the learning rate over time can help to ensure convergence.
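These step-size behaviors can be reproduced on a toy problem. The Python sketch below runs gradient descent on the one-dimensional error function E(w) = w^2, which is my own simplification (a real error surface is far more complicated), with an optional multiplicative learning rate decay. The decay of 0.9 is deliberately much more aggressive than the values used in this article so that the effect shows up within 50 steps.

```python
def descend(start, learning_rate, decay, steps):
    """Gradient descent on the toy error E(w) = w**2 (gradient 2*w)."""
    w, rate = start, learning_rate
    path = [w]
    for _ in range(steps):
        w = w - rate * 2.0 * w   # step against the gradient
        rate *= decay            # shrink the step size for the next iteration
        path.append(w)
    return path

small = descend(1.0, learning_rate=0.1, decay=1.0, steps=50)    # steady convergence
bounce = descend(1.0, learning_rate=1.0, decay=1.0, steps=50)   # flips between +1 and -1
diverge = descend(1.0, learning_rate=1.1, decay=1.0, steps=50)  # overshoots further each step
decayed = descend(1.0, learning_rate=1.1, decay=0.9, steps=50)  # same start, with decay

for name, path in [("small", small), ("bounce", bounce),
                   ("diverge", diverge), ("decayed", decayed)]:
    print(name, abs(path[-1]))
```

The same initial learning rate of 1.1 that diverges without decay ends up close to the minimum once decay is applied, which is the intuition behind decaying the learning rate. The flip side is that a decay that is too aggressive can shrink the steps to nearly zero before the minimum is reached.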
Figure 4 and the descriptions are only a few of the possible behaviors. Because the system is multi-dimensional, the progress towards convergence can move at different rates over the iterations, and it is possible to move between local minima and other complex behaviors. Below we’ll show a case that may appear non-intuitive but is an actual outcome of trying to train a network.
All of this discussion about learning rate and convergence has so far avoided the question: how do we detect convergence and stop training? The simplest approach is to define a fixed number of iterations, run that many, and inspect the results. The deepnet algorithm is designed in this way. If we don’t like the results, we can change the iterations limit and try again. Another approach is to define the value of the error function we would like to reach or get below, and iterate until that criterion is met. Interestingly, it has been found that in some cases, just doing more and more iterations may not be the best approach. If we calculate the error of the trained network predicting new data, sometimes the error gets better (smaller) as we increase iterations, up to a point, then gets worse. Note that the error on the training data continues to improve, so that is an incomplete criterion. The idea of stopping before the error on the training data is minimized, but rather when the error on the test data is minimized, is called early stopping.
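One common way to implement early stopping is to record the test error at regular checkpoints and stop once it has failed to improve for a set number of checks. The Python sketch below is a minimal illustration of that idea; the function name, the "patience" parameter, and the list of RMSE values are all hypothetical and are not taken from either package or from the results in this article.

```python
def early_stopping_point(test_errors, patience):
    """Return (index, error) of the best checkpoint.

    Scanning stops once `patience` consecutive checkpoints
    have failed to improve on the best error seen so far.
    """
    best_index, best_error = 0, test_errors[0]
    for i, err in enumerate(test_errors[1:], start=1):
        if err < best_error:
            best_index, best_error = i, err
        elif i - best_index >= patience:
            break
    return best_index, best_error

# Hypothetical test RMSE at successive checkpoints: improves, then degrades.
test_rmse = [2.0, 1.4, 1.1, 0.9, 0.85, 0.86, 0.9, 0.97, 1.1]

print(early_stopping_point(test_rmse, patience=3))  # prints: (4, 0.85)
```

In a real run the weights saved at the best checkpoint would be kept, and the later ones discarded.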
neuralnet uses a stopping criterion based not on a fixed number of epochs, and not on the error function, but on the values of the partial derivative terms. An interesting aspect of setting the stopping criterion this way is that the resulting RMSE on the test data tracks the threshold, so using this method is a nice way to set the accuracy you want to achieve and let the algorithm automatically run until that is met.
A value named threshold in neuralnet is the maximum value of all the derivative terms that we will accept as convergence, and the algorithm continues until this criterion is met, or until the max steps is reached. In the latter case, the algorithm does not return a model. neuralnet does not report the error function on each iteration, instead reporting the maximum threshold value. The frequency of reporting is determined by a variable that can be adjusted. Because neuralnet does not report error function values, it can be confusing when first using the package. The reason is that while the error function should monotonically decrease on each iteration (with properly selected hyperparameters), the maximum derivative may not decrease. This can give the appearance of non-convergence, but that may not be the case. For this article, I tweaked the threshold values in neuralnet and the max iterations in deepnet to compare apples to apples.
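The threshold-based stopping rule can be sketched on a toy problem. The Python code below is an illustration of the rule as described above, not neuralnet's source code: training continues until the largest absolute partial derivative drops below the threshold, and if the step limit is reached first, no model is returned. The two-weight quadratic error function and all function names are assumptions made for the example.

```python
def train_until_threshold(weights, gradient_fn, learning_rate, threshold, stepmax):
    """Gradient descent that stops when max |partial derivative| < threshold.

    Returns (weights, steps) on convergence, or (None, stepmax) if the
    step limit is reached first.
    """
    for step in range(1, stepmax + 1):
        grads = gradient_fn(weights)
        if max(abs(g) for g in grads) < threshold:
            return weights, step
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return None, stepmax

def toy_gradient(weights):
    """Gradient of the toy error E = (w0 - 3)**2 + 10 * (w1 + 1)**2."""
    return [2.0 * (weights[0] - 3.0), 20.0 * (weights[1] + 1.0)]

model, steps = train_until_threshold([0.0, 0.0], toy_gradient,
                                     learning_rate=0.01, threshold=0.01, stepmax=1000)
failed, _ = train_until_threshold([0.0, 0.0], toy_gradient,
                                  learning_rate=0.01, threshold=0.01, stepmax=100)

print(model, steps)
print(failed)  # prints: None
```

Raising the threshold lets the loop stop sooner with weights further from the minimum, which is the trade-off between run time and accuracy discussed in this section.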
The problem I use to illustrate convergence behavior in this article is shown in figure 5.
Figure 5. A time series is observed as shown in the black curve, representing, for example, sales over time. Assume we believe the main contributors to sales are a long-term upward trend in the economy (green symbols), and our sales pipeline in the past (the orange symbols).
For this example, I synthesized the sales history as a non-linear combination of the two trends, as shown in the equation on the figure. I then added noise to the two predictors, to challenge the model by including some other unknown factors, incomplete description of the relationships, etc. Noise was added as random values to both the training data and the test data after calculating the values to be predicted.
As a baseline, figure 6 shows predicting 500 future days using a linear model with the training data shown in figure 5.
Figure 6. Results of using a linear regression model on the training data and predicting 500 days into the future. There are periods of significant deviation as well as an apparent phase shift (the peaks and valleys seem shifted from the target data).
Figure 7 shows some results using deepnet. It is evident even at this stage we have improved on the predictive power by using the neural network vs. linear regression.
Figure 7. Baseline convergence with a simple network having one hidden layer with 5 neurons, an initial learning rate of 0.5 and a learning rate decay of 0.999986.
At the top of figure 7 is the resulting prediction for future dates overlaid on the actual (calculated using the same functions as before, projected into the future). It is evident that the network learns major features of the time series behavior, as well as that it learns some of the noise. In the lower portion of figure 7, on the left is the raw error function reported by the algorithm vs. iterations; on the right is the resulting RMSE on the test data (the times > 500). It can be seen that there are some anomalies in the convergence where the error function increases; these are due to the complex hypersurface the algorithm is navigating. In terms of the RMSE, the behavior is more monotonic, but with these starting parameters, the prediction initially gets worse before declining monotonically. This could be taken to indicate less than ideal hyperparameters (learning rate 0.5 and decay 0.999986).
If we explore the learning rate and decay space in more detail, we see the behaviors shown in figure 8.
Figure 8. Behavior of deepnet convergence as a function of initial learning rate, decay, and iterations. The curves are mean log trends across the learning rates in each group.
Here we can see the sensitivity of the solution to parameter choices. In particular, although in most cases the convergence is smoothly downward in error function, in the upper left series we can see that an unfortunate choice of initial learning rate and decay can result in non-convergence. Although the effect is small, the two upper charts also show that there may be an optimum in decay in some cases. Keep in mind that with hundreds of thousands of iterations, even the decay of 0.99999 is very significant compared to no decay.
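A quick calculation shows how significant these decay values are. The Python sketch below assumes the decay is applied multiplicatively once per iteration, so that the rate after n iterations is the initial rate times decay^n; that is my reading of the description in this article, not a statement about deepnet's internals. The initial rate of 0.5 and the decay of 0.999986 are the figure 7 values.

```python
def decayed_rate(initial_rate, decay, iterations):
    """Learning rate after a number of iterations of multiplicative decay."""
    return initial_rate * decay ** iterations

for decay in (1.0, 0.99999, 0.999986):
    rate = decayed_rate(0.5, decay, 300_000)
    print(decay, rate, rate / 0.5)
```

Under this assumption, after 300,000 iterations a decay of 0.99999 leaves roughly 5% of the initial learning rate, and 0.999986 leaves roughly 1.5%, so values that look almost identical to 1 behave very differently over long runs.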
As noted earlier, as the weight values converge to either a local or global minimum, the error function continues to decrease as we increase iterations. However, what we really want is the best prediction on new/unseen data. Figure 9 shows that if we monitor the performance as RMSE on the test data, a different picture emerges.
Figure 9. Trend of root mean squared error (RMSE) on the test data vs. initial learning rate and decay.
We can see that significant differences can occur in results by small changes in the learning rate decay. In addition, the RMSE tends to be minimized by careful choice of the initial learning rate. Making the right choices can improve the prediction error by more than a factor of two. As noted earlier, decreasing error function does not guarantee decreasing RMSE and could even lead to worse RMSE. For deepnet, figure 10 shows results of extending iterations to 600,000.
Figure 10. deepnet convergence behavior at higher iterations.
What we see in figure 10 is that the error function is still decreasing at 600,000 iterations, but the RMSE on the test data largely levels off around 350,000 iterations. Although the error function is decreasing, the performance on the training data varies only slightly. Figure 11 shows that the main impact is a very slight reduction in bias in some ranges of the data. This illustrates the recommended practice of early stopping, where the test RMSE is monitored and training stops when it ceases to significantly improve or begins to get worse (see, for instance, “Deep Learning”, a NIPS 2015 Tutorial by Hinton, Bengio, & LeCun). A challenge is that many of the available R packages don’t facilitate such monitoring. For this work, I wrapped a loop around the function calls to the training algorithms, and stepped through csv files with the run parameters, so that I could stop and capture the intermediate values. This method is highly inefficient, leading to the likelihood that source code must be modified to efficiently use the packages.
Figure 11. When iterations are increased from 300,000 to 600,000 using deepnet,
A similar conclusion can be drawn by viewing the histograms of the residuals. The residuals are the difference between the model predicted values and the actual values. It is always a good idea to review the residuals to see if something is going wrong in the model training; in general a good model should have minimal bias which means the distribution of the residuals should have a maximum at zero. Figure 12 shows the corresponding histograms to figure 11.
Figure 12. Increasing iterations shifts the histogram by an RMSE of about 0.25.
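A residual check of this kind takes only a few lines. The Python sketch below computes residuals as predicted minus actual, reports their mean as a simple bias estimate, and bins them into a coarse histogram. The numbers are hypothetical and the function names are mine; they are not the values behind figure 12.

```python
import math
from collections import Counter

def residuals(predicted, actual):
    """Residual = model prediction minus actual value."""
    return [p - a for p, a in zip(predicted, actual)]

def histogram(values, bin_width):
    """Count values per bin; each key is the left edge of its bin."""
    return Counter(math.floor(v / bin_width) * bin_width for v in values)

actual = [1.0, 2.0, 3.0, 4.0]          # hypothetical target values
predicted = [1.25, 1.75, 3.5, 3.5]     # hypothetical model output
shifted = [p + 0.25 for p in predicted]  # same model with a constant offset

res = residuals(predicted, actual)
bias = sum(res) / len(res)
shifted_bias = sum(residuals(shifted, actual)) / len(actual)
counts = histogram(res, bin_width=0.5)

print(bias, shifted_bias)  # prints: 0.0 0.25
print(sorted(counts.items()))
```

A mean residual near zero suggests little overall bias, while a histogram whose peak sits away from zero, as in the shifted case, suggests the model is systematically over- or under-predicting.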
Now that I have completed the parameter investigation using deepnet, I can compare to neuralnet very easily. Since neuralnet does not offer the learning rate decay hyperparameter, the task is to investigate the learning rate dependence, then compare the performance of the model training algorithm to deepnet.
A quick initial comparison is to look at the ability to fit the training data. We expect this should be similar for similar levels of convergence (resulting RMSE). Figure 13 shows that this is roughly true, but it appears deepnet takes more iterations to achieve the same level of fit error.
Figure 13. Fit by deepnet at 600,000 iterations results in a similar RMSE on the training data as achieved by neuralnet in 300,000 iterations. The fit behaviors are more or less identical.
However, recall that deepnet uses a batch size hyperparameter, which I kept at the default of 100. The way deepnet reports iterations, an iteration is one forward and backward pass of a batch. The dataset was 500 points long, which means five of these iterations are required to complete an epoch, which is one training cycle through all the data. neuralnet does not provide the batch size hyperparameter, which implies that one neuralnet step, which uses all the data, requires about 5 times the calculations of one deepnet iteration with a batch of 100. I took all the data of RMSE values vs. iterations for both algorithms, and plotted the RMSE of deepnet vs. the RMSE of neuralnet, when both are at the same epoch. Figure 14 shows the results.
Figure 14. When compared on equivalent epochs, the two algorithms have a constant proportion to one another. deepnet tends to start higher (about 0.5 units in RMSE), and it moves a bit more slowly (about 75% of the convergence rate, in epochs, of neuralnet).
It is satisfying that the same underlying math generates self-consistent results. Nonetheless, deepnet takes more epochs than neuralnet. This leads to the next comparison, which is the time each algorithm takes.
Figure 15. deepnet epochs take much longer than neuralnet.
What we see is that neuralnet is much faster than deepnet, and the difference grows with the number of epochs. It turns out that neuralnet time is linear in epochs, while deepnet time is roughly cubic in epochs. This difference was quite noticeable on the system used to do the comparisons. (Caveat: the system isn’t very high performance [Intel Core i7 @2.4 GHz, 8 GB RAM, no GPU used]; it is possible the differences could be less on a faster machine.) Figure 16 shows another view of neuralnet behavior, looking at RMSE on the test set vs. the threshold parameter. Recall that neuralnet uses the threshold, which is the largest partial derivative value on an update pass, as the stopping parameter, and that I noted earlier that the RMSE on the test data tracks the threshold roughly linearly.
Figure 16. neuralnet converges to higher RMSE on the test data as the threshold stopping parameter increases. At very low values of the threshold, it appears to reach diminishing returns, and below about 0.005 no further improvement in the predictive performance is noted. On the other hand, at high enough values the performance becomes again constant; this is an artifact of the particular data, and network topology such that above a certain threshold, the derivatives drop below the limit after relatively few iterations, but may not represent a properly trained model.
Based on the behavior in figure 16, if we started a parameter search at too high a threshold, and began lowering it, we could conclude, incorrectly, that we were at an optimum.
In the bulk of the comparisons, I found that neuralnet needed a learning rate of 0.001 to 0.0015 to get good performance. It is evident that the definition of the learning rate differs in the two packages, which is another key point when using various open source packages: you need to explore enough to understand typical parameters. Running neuralnet with a learning rate similar to the one used for deepnet diverged and produced no results.
The convergence behavior of neuralnet on this problem has some interesting aspects. In particular, measured both as the error function and as RMSE, the algorithm initially seemed to plateau, then jumped to lower error values and continued to converge slowly, as shown in figure 17.
Figure 17. neuralnet seemed to converge around 30,000 epochs, then jumped to another path and converged slowly. As noted already, neuralnet converges in fewer epochs and much less time than deepnet, so it was not a penalty to test higher numbers of epochs. Not shown explicitly, neuralnet also tended to converge to lower RMSE on the test data at the limit of high epochs (neuralnet achieved RMSE around 0.6 as compared to 0.8 for deepnet).
Most of the results shown so far for neuralnet were obtained using a learning rate of 0.001. neuralnet has a clear optimum as compared to deepnet. Figure 18 shows the epochs and time needed to reach a fixed level of convergence close to the results shown so far at around 300,000 epochs.
Figure 18. neuralnet shows a clear optimum for learning rate when progress is stopped well past when improvement has leveled off (around 300,000 epochs in this case).
neuralnet was more robust, in general, to changes in learning rate. At any given target convergence RMSE, it would converge to a solution across a range of learning rates, and outside the acceptable range simply would not converge. As another way to look at the convergence optimization, figure 19 shows a 3D surface of the time to converge for neuralnet as a function of the stopping threshold and the learning rate. In the same figure, a similar surface is shown for deepnet, but with slightly different parameters.
Figure 19. Convergence behavior of neuralnet (left) and deepnet (right).
On the left of figure 19, the surface shows the time for neuralnet to converge as a function of the learning rate and the threshold value of the maximum partial derivative term. Red/orange signifies larger time, and purple/grey signifies smaller time. Recalling that the threshold is roughly proportional to the RMSE achieved (using neuralnet) for the test data (the unseen data that was not used for training), this chart shows that if the learning rate is too low for a given target, it takes longer to converge; however, there is a broad region where convergence is fast (the large grey basin). As the threshold is increased, which means allowing a higher final RMSE, the time to converge generally is reduced, except at extremely large values of the learning rate.
It is important to note that there are some regions of the neuralnet parameters where the starting parameters can land in a complicated part of the surface, and can lead to confusing results. This exemplifies the challenge of finding parameters that “work”.
On the right of figure 19 I show a surface that is roughly the equivalent of the neuralnet example, but for deepnet. Learning rate has the same meaning, and as deepnet uses only the steps (epochs) as the stopping criterion, I replace the threshold axis for neuralnet with the steps axis for deepnet. As long as we are in a region where deepnet converges, these are roughly equivalent. The vertical dimension is the resulting RMSE on the test data. I chose this metric because of the observed non-linear (approximately cubic in steps) growth of the time vs. steps using deepnet. In general, as steps increase, accuracy improves, but as already shown, there are diminishing returns. Similarly, as the learning rate increases, convergence to a given RMSE is faster. However, this surface is more complicated as there are regions where the RMSE can fluctuate for a time as steps increase. Depending on the stopping method and where the search was started, it is very easy to end in a local pseudo-optimum when using deepnet. As with neuralnet, if the starting learning rate is very small, there are regions where it can take a long time to converge.
I have shown the behavior of two available R packages used to train and predict with a simple neural network applied to time series data. A synthesized data set was used with a single hidden layer of 5 neurons to train a model, and the model was used to predict unseen data in the test data set. The goal was to illustrate the challenges that can arise with the choice of hyperparameters to achieve an acceptable solution in an acceptable time. The non-linearity of the error function hypersurface requires choices of convergence control parameters that are “good enough” to allow convergence to an acceptable solution. I focused on only a few parameters: learning rate, learning rate decay, and number of epochs. Nonetheless, the behavior is fairly complex and can be counterintuitive if starting in a “bad” location and with “poor” values of hyperparameters.
There are many strategies to deal with these challenges. My hope is that these simple approaches pique your interest to dig deeper and understand the practical behavior of neural networks and that such understanding leads you to consider practical applications of neural networks to your predictive use cases. In my work I have used Rprop in the neuralnet package. If you are concerned that neural networks are too complex to use, be encouraged that approaches like Rprop and others achieve very significant improvements which also make practical applications easier.
neuralnet can be obtained at: https://CRAN.R-project.org/package=neuralnet
deepnet can be obtained at https://CRAN.R-project.org/package=deepnet
Originally posted here.