The objective of this paper is to present the process of building a model that identifies the right combination of inputs for optimizing Concrete Compressive Strength. Multiple machine learning algorithms were evaluated. A process of optimizing the solution using Ensemble Learning was identified and successfully tested. Ensemble Learning methods are metaheuristic techniques that combine and improve the predictions of multiple learning algorithms.
The model building was done in the following stages:
This work is the outcome of a comprehensive prototyping and proof-of-concept exercise conducted by Tirthankar Raychaudhuri, Sankaran Iyer and Avirup Das Gupta at Turing Point (http://www.turingpoint.com/), a consulting company focused on providing genuine Enterprise Machine Learning solutions based on highly advanced techniques such as 3D discrete event simulation, deep learning and genetic algorithms.
Here is the link to the paper
©Copyright Turing Point Pty Ltd
Material in this paper may not be used for commercial purposes without the written permission of Turing Point Pty Ltd.
Machine Learning (ML), a branch of Computer Science that focuses on drawing insights and conclusions by examining data sets, is an increasingly popular discipline for resolving enterprise business issues. However, the field is vast and consists of numerous algorithms and approaches. Data sets are also often complex and need to be preprocessed before an ML algorithm can be 'trained' to learn from them. For a particular problem domain and data set, defining the preprocessing technique and selecting the ML algorithm (or set of algorithms) is still largely 'an art rather than a science', depending on the knowledge and skills of the expert or data scientist in question. As the discipline matures, this will change, and scientific guiding principles and best practices will emerge for preprocessing data and selecting appropriate algorithms for a particular problem domain.
In the meantime, we have conducted a study applying the so-called 'ensemble learning' approach to a dataset.
Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. It has been found that an ensemble can often perform better than a single model. This process is similar to the way we take important decisions in day-to-day situations.
An investment in equity may require consulting multiple analysts for their expert opinion. Each one may look at it from a different angle. It may also be good to consult the opinion of friends and relatives. Finally a consensus decision is taken.
The election of an office bearer from potential candidates is the result of the maximum number of votes cast by the members.
One may need to consult multiple real estate analysts to decide on the right price for a property.
The individual models contributing to a decision may differ due to a number of factors:
Section 5.7 lists the Ensemble Learning algorithms used in this paper. As discussed in Section 4.2, Ensemble Learners address some model issues such as the bias-variance trade-off.
Building a Machine Learning model is a complex process. The right algorithm, or an ensemble of algorithms, needs to be chosen from a plethora of available algorithms.
The output of the model can be a classification, such as identifying the type of car from its features, or it can be a continuous (Regression) value rather than one of a set of discrete items. The solution often depends on the complexity of the problem being addressed. In some situations a simple linear model may be sufficient, but in others a complex combination of models may be warranted.
The Concrete Compressive Strength use case addressed by this paper is a complex Regression problem. Hence only algorithms that can address a problem of this type were considered. The model selection process was addressed in 4 stages.
The purpose of this paper is to build a Regression Model for the Concrete Strengthening Process. The description of the process and the data set can be found at the following link:
http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
This is a free and complex dataset available from the Machine Learning Repository of the Centre for Machine Learning and Intelligent Systems at the University of California, Irvine.
Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. The ingredients are cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate and fine aggregate; together with age they form the inputs. The following is the list of data attributes. The Concrete Compressive Strength is the last attribute and is the desired output, which results from combining the inputs.
Figure 1 illustrates the Concrete Compressive Strength Process
Figure 1 Block Diagram of Concrete Compressive Strength Process
The objective is to model Concrete Compressive Strength as a function of these input variables. The dataset contains 1030 measurements.
The objective of any machine learning system is to emulate real-world behavior as a function of the independent variables or predictors. To do this, the behavior is modeled with some training samples, verified against some test samples, and released in the hope that the resulting solution will accurately predict the outcome for unseen data. Confidence in the model will be high if the training data contains samples representative of all the variations found in the real world. However, there can be practical limitations in obtaining data sets. It may not be possible to get samples of all possible variations, which constrains how good the model can be.
Hence one can only work with what is available, putting the right processes in place to build as good a model as possible. The data is assumed to be independent and identically distributed. The data set is randomly split into training and test data. The test data is used only for verifying performance and must not be used in any part of model building. Typical split ratios are 60:40, 70:30, 80:20 or even 90:10.
For our model this ratio has been deliberately kept at 50:50 in order to increase the confidence in the resulting model. The 1030 tuples of the data set were split randomly into training and test sets, each having 515 tuples. The test data was kept aside and used only for testing purposes. No changes were made to the models after testing.
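The 50:50 random split described above can be sketched as follows. This is a minimal illustration in Python rather than the Java/Weka code used in the study; the function name split_half and the seed value are assumptions made for the example, and the rows are stand-ins for the 1030 tuples.

```python
import random

def split_half(rows, seed=42):
    """Shuffle a copy of rows and split it into two equal-sized halves."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Stand-in for the 1030 tuples; only the row indices matter here.
rows = list(range(1030))
train, test = split_half(rows)
print(len(train), len(test))  # 515 515
```

Fixing the seed keeps the test half identical across runs, which makes it easier to guarantee that test rows never leak into model building.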
Bias is the difference between the expected value of an estimate and the actual value of a variable. This is an important measure in machine learning, which is concerned with predicting dependent or target outcomes from independent variables or predictors. Thus if “y” represents the actual value and “E(y’)” the expected value of the estimate y’, then
Bias(y’) = E(y’) - y
The variance of an estimator y’ is the expected value of the square of its difference from its mean E(y’). Thus
Var(y’) = E[(y’ - E(y’))^{2}]
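To make the two definitions concrete, the sketch below approximates the expectation E(y’) by a sample mean over several estimates of the same target. The five estimates and the actual value are hypothetical numbers chosen for illustration, not results from the study.

```python
from statistics import mean

def bias(estimates, actual):
    """Bias = E(y') - y, with E(y') approximated by the sample mean."""
    return mean(estimates) - actual

def variance(estimates):
    """Var(y') = E[(y' - E(y'))^2], using the population (1/n) form."""
    expected = mean(estimates)
    return mean((e - expected) ** 2 for e in estimates)

# Hypothetical predictions of one strength value (MPa) from five retrained models.
estimates = [38.0, 42.0, 40.0, 44.0, 36.0]
actual = 41.0
print(bias(estimates, actual))  # -1.0
print(variance(estimates))      # 8.0
```

Here the estimates are centred one unit below the actual value (a small bias) but are spread fairly widely around their own mean (a larger variance).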
Ideally one would seek a model with zero bias and zero variance, but this is hardly ever achievable. A model may be trained to have a low bias on the training data yet perform poorly on the test data, resulting in high variance. In such a situation, the estimator is considered to be overfitting the training data, including the noise in it. On the other hand, a model with a high bias may be simpler and underfit the training data, but may have a relatively lower variance on the test data.
Hence there always has to be a bias-variance trade-off: the bias on the training data should not be driven so low that it results in high variance on the test data. This calls for a fairly complex model.
Given these requirements of relatively low bias and not too high variance on test data, the next step is to evaluate models and compare their performance. Multilayer Perceptrons perform well in complex situations, so it was decided to try out deep learning models with various combinations of hidden layers and compare their performance. Ensemble learners are found to improve the performance of the base models and are able to meet the bias-variance requirements. Other algorithms were to be evaluated as well and their performance compared. The following is the summary of the Model Selection Process:
The entire development process was carried out in Java using the Weka libraries, as there was a need to train and store the models and to develop comparison reports.
The training data was cleaned to eliminate noise and other redundant information in order to optimize the training time. After this process the number of tuples was reduced from 515 to 481.
As this paper is mainly concerned with a Regression problem, the scope is limited to the applicable algorithms only. These algorithms were evaluated using the Weka Data Mining tools.
Before going into the algorithms, it is important to establish some broad concepts.
A machine learning model may be parametric or non-parametric.
A parametric model summarizes the data using a fixed set of parameters which will not change with the number of instances of data. For example, a linear Regression Model tries to identify a relationship y as a linear combination of input parameters say x_{1} and x_{2} as follows
y = k_{1}x_{1} + k_{2}x_{2}, where k_{1} and k_{2} are parameters
These models are simpler to develop, learn quickly and require relatively little data. However, they are constrained by the functional form chosen to fit the data, so they may not be suitable for complex data patterns and are likely to fit poorly in such situations.
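As an illustration of a parametric model, the sketch below fits the two parameters k_{1} and k_{2} of y = k_{1}x_{1} + k_{2}x_{2} by ordinary least squares, solving the 2x2 normal equations directly. The function name fit_two_param and the sample data are assumptions for the example; this is not Weka's Linear Regression implementation.

```python
def fit_two_param(xs1, xs2, ys):
    """Least-squares fit of y = k1*x1 + k2*x2 (no intercept) via the normal equations."""
    s11 = sum(a * a for a in xs1)
    s12 = sum(a * b for a, b in zip(xs1, xs2))
    s22 = sum(b * b for b in xs2)
    s1y = sum(a * y for a, y in zip(xs1, ys))
    s2y = sum(b * y for b, y in zip(xs2, ys))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("x1 and x2 are collinear; k1 and k2 are not unique")
    # Cramer's rule on: s11*k1 + s12*k2 = s1y and s12*k1 + s22*k2 = s2y
    k1 = (s1y * s22 - s2y * s12) / det
    k2 = (s11 * s2y - s12 * s1y) / det
    return k1, k2

# Hypothetical data generated from y = 2*x1 + 3*x2.
xs1 = [1, 2, 3, 4]
xs2 = [1, 0, 2, 1]
ys = [5, 4, 12, 11]
k1, k2 = fit_two_param(xs1, xs2, ys)
print(k1, k2)  # 2.0 3.0
```

However many rows are supplied, the fitted model is still summarized by the same two numbers, which is what makes it parametric.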
A non parametric model, on the other hand, makes no assumptions about the mapping function and tries to identify it from the training samples. Models based on K Nearest Neighbor algorithms or Support Vector Machine algorithms belong to this category.
Eager Learners are those learners that try to generalize the target mapping function before being made available for use. For example, Artificial Neural Networks must be trained before they can be queried.
Lazy Learners, on the other hand, do not need to be trained. They only store the data and wait until a query is made. For example, K Nearest Neighbor looks for the closest matching tuples in the training set when mapping a query to an output.
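The lazy behavior of K Nearest Neighbor can be sketched in a few lines: there is no training step, and all the work happens when a query arrives. The function name knn_predict, the choice of Euclidean distance and the sample data are assumptions for illustration; Weka's IBk offers further options not shown here.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to query (Euclidean distance)."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical stored tuples: two input features and one target value each.
train_x = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
train_y = [10, 12, 14, 40, 44]
print(knn_predict(train_x, train_y, (0.2, 0.1)))       # 12.0
print(knn_predict(train_x, train_y, (5.5, 5.0), k=2))  # 42.0
```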
These learning systems identify and evolve rules from the training data and apply them to evaluate test data. For the Concrete Compressive Strength Model, two rule-based systems were evaluated: Decision Table and M5 Rules.
These are non parametric, eager learning systems which create tree like graphs or model of decisions and target values from the training data. In this paper, Decision Stump, M5P, Random Tree and REPTree algorithms were evaluated.
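The simplest of these tree learners, a decision stump, makes a single split. The sketch below chooses the threshold on one feature that minimizes the squared error of predicting each side by its mean. It is an illustrative simplification with hypothetical data and function names, not Weka's DecisionStump code.

```python
def fit_stump(xs, ys):
    """Find the single threshold on a 1-D feature that minimises squared error.

    Returns (threshold, left_mean, right_mean); points with x <= threshold go left.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid split between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]

def predict_stump(stump, x):
    """Return the mean of whichever side of the threshold x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Hypothetical data: age in days against strength in MPa.
ages = [3, 7, 14, 28, 90]
strengths = [10, 12, 11, 30, 32]
stump = fit_stump(ages, strengths)
print(stump)                     # (21.0, 11.0, 31.0)
print(predict_stump(stump, 56))  # 31.0
```

The stump is built once from the training data and then discarded data is no longer needed, which is what makes it an eager learner.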
This is the main focus of this paper. Multiple models are combined with the objective of improving the overall performance. These are also known as metaheuristic algorithms. In this paper the following Ensemble algorithms were evaluated:
The final chosen solution involved 2 parts:
a) An Ensemble Learning model was created by taking the mean of the best performing algorithms: Multilayer Perceptrons with 4 different hidden layer configurations, Random Committee, Random Forest and Bagging.
b) This output was added to the 8 inputs, and multiple Ensemble algorithms were experimented with. The best solution found was a complex Ensemble Chain consisting of Additive Regression with Bagging as its base learner. Bagging in turn used Random Subspace as its base learner, which in turn used REP Tree.
Table 1 lists the algorithms evaluated for building the model using Weka machine learning tool. References are provided for further information for some complex algorithms.
Algorithm | Parametric? | Learning type | Metaheuristic? | Reference for further information
--- | --- | --- | --- | ---
Gaussian Processes | No | Lazy | No | “Introduction to Gaussian Processes” by David J.C. Mackay
Linear Regression | Yes | Eager | No |
Simple Linear Regression | Yes | Eager | No |
Multilayer Perceptron | Yes | Eager | No |
SMOReg (SVM) | Yes | Eager | No | “Improvements to SMO algorithm for SVM Regression” by S.K. Shevade, S.S. Keerthi, C. Bhattacharyya and K.R.K. Murthy
IBK (K Nearest Neighbour) | No | Lazy | No | “Instance based Learning Algorithms” by David W. Aha, Dennis Kibler, Marc K. Albert
K star | No | Lazy | No | “An instance based learner using an entropy distance measure” by John G. Cleary and Leonard E. Trigg
LWL (Locally Weighted Learning) | No | Lazy | No | “Locally Weighted Naive Bayes” by Eibe Frank, Mark Hall, and Bernhard Pfahringer
Decision Tables | No | Eager | No | “The Power of Decision Tables” by Ron Kohavi
M5 Rules | No | Eager | No | “Generating Rule Sets from Model Trees” by Geoffrey Holmes, Mark Hall, Eibe Frank
Decision Stump | No | Eager | No |
M5P | No | Eager | No | “Learning with continuous classes” by J. R. Quinlan
Random Trees | No | Eager | No |
REP (Reduced Error Pruning) | No | Eager | No |
Bagging | No | Eager | Yes | “Bagging Predictors” by Leo Breiman
Additive Regression | No | Eager | Yes | “Stochastic Gradient Boosting” by J.H. Friedman
Random Committee | No | Eager | Yes |
Randomizable Filtered Classifier | No | Eager | No |
Random Subspace | No | Eager | Yes |
Random Forests | No | Eager | Yes |
Regression by Discretization | No | Eager | Yes | “Conditional Density Estimation with Class Probability Estimators” by Eibe Frank and Remco R. Bouckaert
Table 1: Summary of algorithms experimented with for model building
The Concrete Compressive Strength model used 8 predictors: cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate and age. The output is the Concrete Compressive Strength.
As stated in Section 4.3, model building was a 5 step process. Each step is detailed in this section.
The algorithms listed in Section 5 were applied using the Weka machine learning tools. All of them were invoked using default parameters except Multilayer Perceptron.
The Weka default training time for Multilayer Perceptron is 500 epochs. Experiments showed that this is not sufficient for building complex models. These models usually took as many as 25000000 epochs to reach the global minimum of the “error per epoch”.
The learning rate and momentum parameters were each set to 0.01.
The following models were built. All of them used 8 inputs for the predictors and one output for Concrete Compressive Strength. All models other than the one in 6.3.1 are Deep Learning models, i.e. they have more than one hidden layer
Figure 2 and Figure 3 show Multilayer Perceptrons having 3 and 4 hidden layers respectively.
Figure 2: Multilayer Perceptron with 3 hidden layers having 8, 16 and 8 neurons respectively
Figure 3: Multilayer Perceptron with 4 hidden layers having 8 neurons per layer
The MLP models having a mean absolute error of less than 5 were combined into a single Ensemble by taking the mean of their predictions. This was done with a Java program because the models had to be custom trained and tailored, and the Weka GUI based tools could not be used.
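The selection-and-averaging step described above can be sketched as follows. The study used a Java program for this; the Python below is only an illustration, and the model names, predictions and the data on which the error is measured are all hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_ensemble(actual, model_predictions, max_mae=5.0):
    """Average, per data point, the predictions of every model whose MAE is below max_mae."""
    selected = {
        name: preds
        for name, preds in model_predictions.items()
        if mean_absolute_error(actual, preds) < max_mae
    }
    if not selected:
        raise ValueError("no model meets the MAE threshold")
    combined = [sum(column) / len(column) for column in zip(*selected.values())]
    return sorted(selected), combined

# Hypothetical strengths (MPa) and predictions from three candidate models.
actual = [30.0, 40.0, 50.0]
model_predictions = {
    "mlp_a": [32.0, 38.0, 51.0],  # MAE about 1.67, kept
    "mlp_b": [28.0, 44.0, 49.0],  # MAE about 2.33, kept
    "mlp_c": [20.0, 52.0, 60.0],  # MAE about 10.67, dropped
}
names, combined = mean_ensemble(actual, model_predictions)
print(names)     # ['mlp_a', 'mlp_b']
print(combined)  # [30.0, 41.0, 50.0]
```

In this toy case the averaged prediction has a lower mean absolute error than either of the two models it combines, because their errors partly cancel.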
The column chart in Figure 4 sums up the performance of the models:
Figure 4: Performance of MLPs
The numbers following “MLP” represent the number of hidden layers and the neurons in each layer. For example, MLP 8-8-8-8 has 4 hidden layers with 8 neurons in each layer. Similarly, MLP 8-12-8 has 3 hidden layers with 8, 12 and 8 neurons respectively.
The following are the key observations
Figure 5: Comparison of MLP outputs for 20 random test data points
Next, all the algorithms listed in Section 5.8 were applied and tested, and their performance was compared using a program written in Java. An ensemble was created by combining the best performing models, those having a Mean Absolute Error of less than 5. The following models were built besides the Multilayer Perceptrons
The mean of the best performing models (those having a Mean Absolute Error of less than 5) was fed as an additional input into the training sample, and complex Ensemble algorithms were then experimented with. The ensemble that performed best was Additive Regression, which is a Stochastic Gradient Boosting method. Bagging was chosen as its base algorithm. Bagging in turn used the Random Subspace ensemble, which used REP Tree. The performance improved significantly with this Ensemble Chain.
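Feeding the ensemble mean back in as an additional input amounts to appending one column to each row before the second-stage learner is trained. A minimal sketch, with a hypothetical function name and made-up values, is given below; it covers only the data preparation, not the Additive Regression chain itself.

```python
def add_ensemble_feature(rows, ensemble_predictions):
    """Append the ensemble's mean prediction to each input row as an extra column."""
    if len(rows) != len(ensemble_predictions):
        raise ValueError("one ensemble prediction is needed per row")
    return [list(row) + [pred] for row, pred in zip(rows, ensemble_predictions)]

# Two hypothetical rows with the 8 predictors, and the stage-one ensemble mean for each.
rows = [
    [540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 28],
    [332.5, 142.5, 0.0, 228.0, 0.0, 932.0, 594.0, 270],
]
ensemble_predictions = [61.0, 40.5]
augmented = add_ensemble_feature(rows, ensemble_predictions)
print(len(augmented[0]))  # 9
```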
Figure 6: Performance of Multiple Regression Models
The following is the summary of the results:
Figure 7: Comparison of Multi-Regression Models for 20 random data points. The Mean of the selected models includes the best MLP performers as well
Having developed a model, our next step is to find the optimum set of parameters for maximising Concrete Compressive Strength. The entire data set, which includes both the training and test data, is passed to the GA (Genetic Algorithm) optimiser solution developed by Turing Point. For further information on implementation of the GA refer here
The GA selects candidates from the entire data set for “mating” and generating “children” data points. Candidates are chosen on the basis of their fitness level exceeding a threshold: a data point qualifies if it generates a Concrete Compressive Strength at most 5 megapascals below the maximum generated Concrete Compressive Strength (82.6 megapascals).
The exit criterion for the algorithm is set at 10 generations of no improvement in fitness level.
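The selection rule and exit criterion above can be sketched as a small loop. This is not Turing Point's GA optimiser: the uniform crossover, the absence of mutation, the population cap, and the names run_ga and toy_strength are all assumptions made for illustration, and the fitness function is a toy stand-in for the trained strength model.

```python
import random

def run_ga(population, fitness, margin=5.0, patience=10, seed=0):
    """Evolve population (lists of numbers) until patience generations pass with no gain."""
    if len(population) < 2:
        raise ValueError("need at least two candidates to mate")
    rng = random.Random(seed)
    size = len(population)
    best = max(population, key=fitness)
    stale = 0
    while stale < patience:
        # Candidates qualify if their fitness is within margin of the best so far.
        cutoff = fitness(best) - margin
        parents = [p for p in population if fitness(p) >= cutoff]
        if len(parents) < 2:
            parents = sorted(population, key=fitness, reverse=True)[:2]
        children = []
        for _ in range(size):
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each value is copied from one parent at random.
            children.append([rng.choice(pair) for pair in zip(mother, father)])
        # Keep the population at its original size, fittest first.
        population = sorted(parents + children, key=fitness, reverse=True)[:size]
        challenger = population[0]
        if fitness(challenger) > fitness(best):
            best, stale = challenger, 0
        else:
            stale += 1
    return best

def toy_strength(mix):
    """Hypothetical fitness: a stand-in for the model's predicted strength."""
    return sum(mix)

start = [[1, 5, 2], [4, 1, 3], [2, 2, 6], [0, 0, 0]]
best_mix = run_ga(start, toy_strength)
print(best_mix, toy_strength(best_mix))
```

Because children only recombine values already present in the parents, the search cannot leave the range spanned by the starting candidates.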
Figure 8 shows the 10 highest values of the Concrete Compressive Strength found in the supplied data set. The data points associated with these Concrete Compressive Strength values form the Target Candidates for generating the high performing solutions
Figure 8: 10 Best Performing Data Points from the Data sets
Figure 9 shows the generated top performers from the starting candidates. The data was generated with 5 different random seeds. The highest Predicted value (81) however is less than the highest in the initial data set. Hence a better solution could not be found.
Figure 9: Predicted Top performer Concrete Compressive Strength using Genetic Algorithm
This paper successfully presented the process of building a best fit complex machine learning model. The following are some of the key lessons learnt
a) http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
b) https://www.analyticsvidhya.com/blog/2015/08/introductionensemble...
c) “Introduction to Gaussian Processes” by David J.C. Mackay
d) “Improvements to SMO algorithm for SVM Regression” by S.K. Shevade, S.S. Keerthi, C. Bhattacharyya and K.R.K. Murthy
e) “Instance based Learning Algorithms” by David W. Aha, Dennis Kibler, Marc K.Albert
f) “An instance based learner using an entropy distance measure” by John G. Cleary and Leonard E. Trigg
g) “Locally Weighted Naive Bayes” by Eibe Frank, Mark Hall, and Bernhard Pfahringer
h) “The Power of Decision Tables” by Ron Kohavi
i) “Generating Rule Sets from Model Trees” by Geoffrey Holmes, Mark Hall, Eibe Frank
j) “Learning with continuous classes” by J. R. Quinlan
k) “Bagging Predictors” by Leo Breiman (September 1994)
l) “Stochastic Gradient Boosting” by J.H Friedman
m) “Conditional Density Estimation with Class Probability Estimators” by Eibe Frank and Remco R. Bouckaert
n) Building a Production Optimisation Solution using Deep Learning and... by Sankaran Iyer (Turing Point January 2017)
o) Ensemble Methods in Machine Learning by Thomas G Dietterich
p) Error Reduction through Learning Multiple Descriptions by Ali K.M and Pazzani M.J. Machine Learning, 24, 173–202 (1996)
q) An Empirical Comparison of Voting Classification Algorithms: Baggin... by Eric Bauer and Ron Kohavi (1998)
r) Training a 3-Node Neural Network is NP-Complete by Avrim L. Blum and Ronald L. Rivest (1998)
s) Human expert level performance on a scientific image analysis task b... by Kevin J. Cherkauer (1996)
t) An experimental comparison of three methods for constructing ensemb... by Thomas G. Dietterich, Machine Learning (2000)
u) A Decision-Theoretic Generalization of On-Line Learning and an Appl... by Yoav Freund and Robert E. Schapire (December 1996)
v) Experiments with a new boosting algorithm by Yoav Freund and Robert E. Schapire (1996)
w) Neural Network Ensembles by L.K. Hansen and P. Salamon, IEEE Transactions on Pattern Analysis (1990)
x) Universal Approximation of an Unknown Mapping and Its Derivatives U... by Kurt Hornik, Maxwell Stinchcombe and Halbert White (1990)
y) Constructing optimal binary decision trees is NP Complete Informati... by Laurent Hyafil and Ronald L. Rivest (1976)
z) Back propagation is sensitive to initial conditions In Advances in ... by John F. Kolen and Jordan B. Pollack
aa) Multiple Decision Trees by S.W. Kwok and C. Carter (1990)
bb) Probabilistic Inference Using Markov Chain Monte Carlo Methods by Radford M. Neal (1993)
cc) Improving committee diagnosis with resampling techniques by B.Parmanto, P.W. Munro and H.R.Doyle (1993)
dd) Bootstrapping with Noise: An Effective Regularization Technique by Yuval Raviv and Nathan Intrator (1996)
ee) Extending Local Learners with Error-Correcting Output Codes by Francesco Ricci and David W. Aha (1997)
ff) Using output codes to boost multiclass learning problems by Robert E. Schapire (1997)
gg) Boosting the margin: A new explanation for the effectiveness of vot... by Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee (1997)
hh) Improved Boosting Algorithms Using Confidence-rated Predictions by Robert E. Schapire and Yoram Singer (1998)
ii) Error correlation and error reduction in ensemble classifiers by Kagan Tumer and Joydeep Ghosh (1996)