
Summary: Are you wondering about moving up to Automated Machine Learning (AML)? Or perhaps you've already made the decision but are wondering about the capabilities of individual platforms, their strengths and limitations, and how to choose. Here are some considerations to help guide you.

What's Your Motivation?

This is intended to be a little broader than business case and requirements. Chances are your motives fall into one or more of the following buckets, and they can certainly involve more than one at the same time.

Efficiency

So far the greatest motivation behind AML adoption has come from companies that already deploy large numbers of ML models. If you're creating and managing dozens or even hundreds of models, as is frequently the case in insurance, banking, and ecommerce, then the ability to create more models and keep them refreshed is an obvious issue. Cost savings are a top motivation, since fewer data scientists can now do the work of many. Speed, that is, time to benefit, is also greatly enhanced, especially in the model refresh and deploy cycle.

Broader Participation

Be aware that many of the up-and-coming AML platforms differentiate themselves based on audience. Those that appeal to your existing data science team offer easier and more complete access to choices in data prep, feature selection, model selection, and model tuning with their hyperparameters.

The larger emerging camp seeks to make the process much easier for less experienced modelers. On the one hand, this can mean your first-year data science hires, who will rely more on the automated features than the more experienced team members.
On the other hand, there are platforms so completely automated that they encourage LOB managers, analysts, and other citizen data scientists to participate directly in model building.

Having more people directly participating in model building can seem like a very desirable objective. Be sure you have sufficient controls to prevent putting models into operation that haven't been fully vetted by your experienced data scientists. It's still possible for the operator of a fully automated tool to create a model that's not sufficiently accurate, won't generalize, or, worse, predicts exactly the wrong thing.

Just Getting Started

If you're just getting started on your digital journey and don't yet have a dedicated data scientist or two, you might be tempted to sign up for an AML and give your LOB managers and analysts enough training to get started. Don't go there.

As in the last section, it's still possible for an inexperienced modeler to create a model that will leave you worse off than having no model at all. You're going to need some quality control before you turn new models loose on your customers or processes.

How Much Does Accuracy Count?

In machine learning there is always a practical tradeoff between model accuracy and time to develop. Your data scientists will no doubt be happy to continue delivering incremental gains in model accuracy for days or weeks. Still, it's important to understand the tradeoff between model accuracy and revenue or margin. It's not unusual for small gains in accuracy to create proportionately much larger gains in campaign results.

Your data science team lead no doubt understands this and has already put some controls in place. The real issue is whether the automated output of the AML platform meets your minimum requirements. Determining this will require some benchmarking during the selection process so that you have side-by-side comparisons.
Almost all AML platforms use multiple algorithms, run in competition with one another, to select the winners. Accuracy within the AML may be less than optimal if the number of candidate algorithms is restricted to just a few. It's just as likely, however, that any shortfall in accuracy occurred in the automated data prep, cleansing, feature engineering, or feature selection. You'll need experienced members of your data science team to help you evaluate this issue.

Basic Feature Set

At this stage in market maturity, any AML you consider should offer all of the following automated capabilities:

Data Blending: The combination of data from different sources into a single file. This still requires the operator to specify things like inner or outer joins of data sets. The most advanced platforms may also be able to detect whether data from two different sources with the same name (e.g. 'sales') has the same meaning. At this point, however, it's best either to have really robust data governance (and not many do) or to have modelers sufficiently intimate with the data that they can detect this sort of mismatch.

Data Prep and Cleansing: In this category is automated correction of data in incompatible formats (dates, values with embedded commas, etc.). Most AML platforms do a good job at this. Cleansing is more complex. It involves, for example, the identification of outliers and how they are to be treated, the correction of badly skewed distributions, the conversion of categoricals into independent features, or even the compression of data ranges (typically to -1 to 1) to create data sets as required by some specific types of algorithms like neural nets.

Feature Engineering: In concept feature engineering is simple: for example, converting related variables into ratios (e.g. debt to income) or dates into the number of days since other events occurred (age of the account, days since last purchase, etc.).
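As a minimal sketch of what such engineered features look like in practice (pure Python; the record fields and dates below are invented purely for illustration):

```python
from datetime import date

# A hypothetical customer record; the column names are made up.
record = {
    "debt": 25_000.0,
    "income": 80_000.0,
    "account_opened": date(2015, 3, 1),
    "last_purchase": date(2018, 6, 15),
}

def engineer_features(rec, today=date(2018, 7, 1)):
    """Derive ratio and days-since features from the raw columns."""
    return {
        "debt_to_income": rec["debt"] / rec["income"],
        "account_age_days": (today - rec["account_opened"]).days,
        "days_since_last_purchase": (today - rec["last_purchase"]).days,
    }

features = engineer_features(record)
```

An AML platform would generate candidates like these automatically rather than relying on the modeler to name them.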
In automated form, this frequently requires the AML to create all possible combinations of these artificial features without regard for whether they are logical, and then let the algorithms figure out which are predictive (typically only a small fraction). Depending on how this is handled in the AML, it can add a very large amount of compute overhead. You'll want to examine whether this step creates any unforeseen requirements in time or compute cost.

Feature Selection and Modeling: These are traditionally thought of separately, but I've combined them here as AML platforms might. In traditional modeling, feature selection can be a separate step that precedes model creation to make the modeling process more accurate and efficient. However, it's also possible to have the models consider all possible features and automatically eliminate those which are least predictive. Automated modeling typically involves running parallel contests on the data with different algorithms. During the contests, the AML should also be varying the hyperparameters of the different models to attempt to achieve an optimum result. How feature selection, modeling, and hyperparameter tuning are handled by the platform will require your detailed attention during trials.

Model Deployment: Your AML should be able to automatically generate production code in your choice of language compatible with your operating systems (typically Python, C++, Java, or other popular production languages).

Model Management and Refresh: The first time you deploy a model in your operating systems you will need to define exactly where it goes. Thereafter, a complete AML should be able to monitor the model, determine when a refresh is appropriate, and, with minimum human intervention, refresh the model and automatically redeploy it.
There are human quality control verifications in this process, but once the model has been developed, refresh and redeploy should require only a small fraction of the labor of the original development.

Some Advanced Considerations

Automation of the Entire Process: In a fully automated system, particularly one focused on maintaining and refreshing existing models, it's important that the entire process can be programmatically defined. In this way the entire process, from data capture through deployment and all the customized steps in between, can be captured and repeated, making the end-to-end process truly automated.

Data Types: Depending on your business you may have a variety of data inputs with special needs, including unstructured or semi-structured text, image data, or streaming data. A few AML platforms can handle these more advanced requirements. A few already have the ability to create deep learning CNN and RNN models, though this type of modeling is not yet common in business.

Prepackaged Automation Libraries: During initial model development your data science team will have identified specific steps in the process that need particular attention. These might include data prep, feature selection, or hyperparameter optimization. Ideally your AML platform will include libraries or APIs of callable solutions that can shortcut data scientist labor on these tasks.

Training Data Requirements: Some algorithms that might be considered during the competition for best model may be particularly data hungry. You will want to understand the tradeoffs between including these algorithm types and the availability or cost of acquiring sufficient training data.

On-Premise Solution: Some AML platforms that are particularly compute intensive (as many are) are optimized for a SaaS cloud delivery solution.
If your business requires an on-prem or private cloud solution for data security, you'll need to identify the cost and complexity of this option.

While AMLs are positioned for their simplicity, there are many factors to be considered before jumping in. You'll want help from your data scientist pros in selecting the right one.

About the author: Bill is Contributing Editor for Data Science Central. Bill is also President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist since 2001. His articles have been read more than 2 million times. He can be reached at Bill@DataScienceCentral.com or Bill@Data-Magnum.com.


This post is part of my forthcoming book on the mathematical foundations of Data Science. In the previous blog, we saw how you could use basic high school maths to learn about the workings of data science and artificial intelligence. In this post we extend that idea to learn about Gradient Descent.

We have seen that a regression problem can be modelled as a set of equations which we need to solve. In this section, we discuss ways to solve these equations. Again, we start with ideas you learnt in school and expand them to more complex techniques.

In simple linear regression, we aim to find a relationship between two continuous variables. One variable is the independent variable and the other is the dependent variable (response variable). The relationship between these two variables is not deterministic but statistical. In a deterministic relationship, one variable can be directly determined from the other; for example, conversion of temperature from Celsius to Fahrenheit is a deterministic process. In contrast, a statistical relationship does not have an exact solution, for example, the relationship between height and weight. In this case, we try to find the best fitting line which models the relationship. To find the best fitting line, we aim to reduce the total prediction error over all the points.

Assuming a linear relation exists, the equation can be represented as y = mx + c for an input variable x and a response variable y. For such an equation, the values of m and c are chosen to represent the line which minimises the error, the error being defined as the sum of squared differences between the predicted and actual values of y. (Image source and section adapted from Sebastian Raschka.)

More generally, for multiple linear regression, we have y = m1x1 + m2x2 + m3x3 + ... Expressed in matrix form, this is y = X . m, where X is the input data, m is a vector of coefficients, and y is a vector of output variables, one for each row in X.
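As a concrete sketch (pure Python, with made-up data points), the values of m and c that minimise the sum of squared errors in simple linear regression can be computed directly from the sample means:

```python
# Fit y = m*x + c by minimising the sum of squared errors (SSE).
# The data points below are invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x

def fit_line(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Least-squares slope = covariance(x, y) / variance(x)
    m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
    c = y_mean - m * x_mean  # the intercept follows from the means
    return m, c

m, c = fit_line(xs, ys)  # m is close to 2, c is close to 0
```

These two formulas are themselves the closed-form solution of the minimisation problem for the single-variable case.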
To find the vector m, we need to solve this equation. There are various ways to solve a set of linear equations.

The closed form approach using the matrix inverse

For smaller datasets, this equation can be solved using the matrix inverse. Typically, this is the way you learned to solve a set of linear equations in high school. (Source: matrix inverse, Maths is Fun.) This closed form solution is preferred for relatively small datasets, where it is possible to compute the inverse of a matrix at a reasonable computational cost. The closed form solution based on computing the matrix inverse does not work for large datasets, or where the matrix inverse does not exist.

Instead of computing the matrix inverse, there is another way to solve a set of linear equations. Looking again at the diagram above, the overall problem is to reduce the value of a loss function. The cost function J(w), the sum of squared errors (SSE), can be written as J(w) = sum over all training examples of (actual y - predicted y)^2. Hence, to solve these equations in a statistical manner (as opposed to a deterministic one), we need to find the minima of this loss function. The minima of the loss function can be computed using an algorithm called Gradient Descent.

The Gradient descent approach to solving linear equations

In two dimensions, the Gradient descent algorithm is typically depicted as a hiker (the weight coefficient) who wants to climb down a mountain (the cost function) into a valley (the cost minimum), where each step is determined by the steepness of the slope (the gradient). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows: using the Gradient Descent optimization algorithm, the weights are updated incrementally after each epoch (i.e. each pass over the training dataset). The magnitude and direction of the weight update are computed by taking a step in the opposite direction of the cost gradient.
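This update rule can be sketched in pure Python for a model y = w*x with a single weight (the toy data and learning rate below are arbitrary choices for illustration):

```python
# Gradient descent on the SSE cost J(w) = sum((y - w*x)^2).
# Toy data generated from y = 3x; the learning rate is chosen arbitrarily.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0              # initial weight coefficient
learning_rate = 0.01

for epoch in range(200):  # one epoch = one pass over the training set
    # dJ/dw = sum(-2 * x * (y - w*x)) over all points
    gradient = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys))
    w -= learning_rate * gradient  # step opposite the gradient
```

After the loop, w has converged to 3, the slope the data was generated from. If the learning rate is set too high, the updates overshoot the minimum and the loop diverges instead, which is why this hyperparameter matters.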
In three dimensions, solving this equation involves computing the minima of the loss function expressed in terms of the slope and the intercept, as shown at http://ucanalytics.com/blogs/intuitive-machine-learning-gradient-descent-simplified/.

Use of Gradient descent in multi-layer perceptrons

You saw in the previous section how the Gradient descent algorithm can be used to solve a linear equation. More typically, you are likely to encounter gradient descent solving the equations for multilayer perceptrons, i.e. deep neural networks, where we have a set of non-linear equations. While the same principle applies, the context is different, as we explain below.

In a neural network, as with the linear equation, we also have a loss function. Here, the loss function represents a performance metric reflecting how well the neural network generates values that are close to the desired values. The loss function is, intuitively, the difference between the desired output and the actual output. The goal of the neural network is to minimise the loss function for the whole network of neurons. Hence, the problem of solving the equations represented by the neural network also becomes a problem of minimising the loss function. A combination of the Gradient descent and backpropagation algorithms is used to train a neural network, i.e. to minimise the total loss function. The overall steps are:

1. In forward propagation, the data flows through the network to produce the outputs.
2. The loss function is used to calculate the total error.
3. The backpropagation algorithm is used to calculate the gradient of the loss function with respect to each weight and bias.
4. Gradient descent is used to update the weights and biases at each layer.
5. The above steps are repeated to minimise the total error of the neural network.

In a single sentence: we propagate the total error backward through the connections in the network layer by layer, calculate the contribution (gradient) of each weight and bias to the total error in every layer, then use the gradient descent algorithm to optimize the weights and biases, and eventually minimize the total error of the neural network.

Explaining the forward pass and the backward pass

In a neural network, the forward pass is the set of operations which transform the network input into the output space. During the inference stage, the neural network relies solely on the forward pass. For the backward pass, in order to start calculating error gradients, we first have to calculate the error (i.e. the overall loss). We can view the whole neural network as a composite function (a function comprising other functions). Using the Chain Rule, we can find the derivative of a composite function; this gives us the individual gradients. In other words, we can use the Chain Rule to apportion the total error to the various layers of the neural network. This produces the gradients that Gradient Descent will follow.

A recap of the Chain Rule and Partial Derivatives

We can thus see the process of training a neural network as a combination of backpropagation and Gradient descent. These two algorithms can be explained by understanding the Chain Rule and partial derivatives.

The Chain Rule

The chain rule is a formula for calculating the derivatives of composite functions.
Composite functions are functions composed of other functions. Given a composite function f(x) = h(g(x)), the chain rule gives the derivative of f(x) as f'(x) = h'(g(x)) * g'(x). You can extend this idea to more than two functions. For example, for a function f(x) comprising three functions A, B and C, we have f(x) = A(B(C(x))), and the chain rule tells us that the derivative of this function equals f'(x) = A'(B(C(x))) * B'(C(x)) * C'(x).

Gradient Descent and Partial Derivatives

As we have seen before, Gradient descent is an iterative optimization algorithm used to find a local or global minimum of a function. The algorithm works in the following steps:

1. We start from a point on the graph of the function.
2. We find the direction from that point in which the function decreases fastest.
3. We travel a small step down along the path indicated by this direction to arrive at a new point.

The slope of a line at a specific point is represented by its derivative. However, since we are concerned with two or more variables (weights and biases), we need to consider partial derivatives. A gradient is a vector that stores the partial derivatives of a multivariable function; it lets us calculate the slope at a specific point for functions with multiple independent variables. We need partial derivatives because, for complex (multivariable) functions, we need to determine the impact of each individual variable on the overall derivative. Consider a function of two variables x and z. If we change x but hold all other variables constant, we get one partial derivative. If we change z but hold x constant, we get another partial derivative.
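A minimal sketch of gradient descent on a two-variable function (pure Python; the function and learning rate are invented for illustration) shows how the two partial derivatives are combined into a gradient at each step:

```python
# Minimise f(x, z) = (x - 2)^2 + (z + 1)^2, an invented example
# function whose minimum is clearly at x = 2, z = -1.
def df_dx(x, z):
    return 2 * (x - 2)  # partial derivative with respect to x (z held constant)

def df_dz(x, z):
    return 2 * (z + 1)  # partial derivative with respect to z (x held constant)

x, z = 0.0, 0.0         # starting point on the surface
learning_rate = 0.1

for step in range(200):
    # The gradient vector stores both partial derivatives.
    grad = (df_dx(x, z), df_dz(x, z))
    x -= learning_rate * grad[0]  # step opposite the gradient
    z -= learning_rate * grad[1]
```

After the loop, (x, z) has converged to the minimum at (2, -1). In a neural network the same loop runs over thousands or millions of partial derivatives, one per weight and bias, with backpropagation supplying their values.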
The combination of the partial derivatives represents the full gradient of the multivariable function.

Conclusion

In this post, you saw the application of Gradient descent both to neural networks and back to linear equations. Please follow me on LinkedIn (Ajit Jaokar) if you wish to stay updated about the book.

References

Ken Chen on LinkedIn, "Under the Hood of Neural Networks, Part 1: Fully Connected"
