Predicting The Status of Waterpoints

  • SupStat 

Contributed by Gordon Fleetwood. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his fourth class project (due in the 8th week of the program).

The Problem

Kaggle is the most famous organization dedicated to Data Science competitions, but there are others. One of these is Driven Data. Beyond the difference in visibility, a point of divergence between Kaggle and Driven Data is that the latter’s competitions often have a social focus as opposed to the former’s usual business leanings.

One of the currently active competitions on Driven Data is “Pump It Up: Data Mining the Water Table”. Using data from Taarifa and the Tanzania Ministry of Water, participants must build models to predict the status of a waterpoint: functioning, in need of repair, or non-functioning. In short, the goal is classification. The data provided include geographic, structural, and administrative information.

Following the usual paradigm of these competitions, a training set (in this case with features and labels in separate files) and a test set without labels were provided. One’s score on the leaderboard was determined by classification rate, the share of waterpoints classified correctly. Outside of bragging rights, however, the competition seems to be set up to potentially have a real world impact. As it says on the competition page, “A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.”


Exploratory Analysis

As per usual, exploratory data analysis was the first thing on my plate. In this phase of the process I gained a firm understanding of the data’s structure, and built a running checklist of the pre-processing steps to go through before building my models.

The first of these observations was the abundance of categorical variables. Only 11 of the 42 features consisted of numerical data. Another was a baseline accuracy rate. If I were to predict that every waterpoint was functioning, I would be right approximately 54% of the time.
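The baseline above is simply the share of the most common class. A minimal sketch of that calculation, assuming the labels sit in a pandas Series; the helper name `majority_baseline` and the toy labels are invented for illustration and are not the competition data:

```python
import pandas as pd

def majority_baseline(labels: pd.Series) -> tuple:
    """Return the most common class and the accuracy of always predicting it."""
    shares = labels.value_counts(normalize=True)
    return shares.index[0], float(shares.iloc[0])

# Toy labels, invented for illustration only.
toy_labels = pd.Series(
    ["functioning"] * 5 + ["non-functioning"] * 4 + ["needs repair"]
)
majority_class, baseline_accuracy = majority_baseline(toy_labels)
print(majority_class, baseline_accuracy)
```

Any model worth submitting should beat this number on held-out data.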

The nearly 60,000 records had several columns with many missing values, two of which recorded the organization that funded the waterpoint and the one that installed it. A further look showed that most of these columns were heavily skewed toward a single value.

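for col in list(missingness.index):
    counts = df_complete[col].value_counts()
    # The original statement is cut off after the most frequent value;
    # printing that value's count is an assumed completion.
    print(col, ': (', counts.index[0], ',', counts.iloc[0], ')')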

For example, the Government of Tanzania was by far the most frequent funder of waterpoints. With this information in mind, I felt comfortable adding imputation to my list of pre-processing steps.

Another section of my exploratory work was not as fruitful, however. The data held information on when the waterpoint was built, and when information on that waterpoint was collected. Intuitively, the difference between these dates would serve as a good proxy for the waterpoint’s age, which would in turn serve as a good feature in the modeling process. One would think that the older a waterpoint, the more susceptible it would be to wear and tear.

Alas, this potential feature was not to be. Although the date of construction data showed no missing values, almost half of these data points were registered as zero, suggesting that those waterpoints had been around since the switch from B.C. to A.D. To further complicate matters, these ancient wells were almost equally split between functioning, and either non-functioning or in need of repairs.
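Had the construction dates been usable, the age feature could have been derived roughly as follows. This is only a sketch on toy rows: the column names `date_recorded` and `construction_year` are assumptions about the dataset's layout, and a construction year of zero is treated as missing rather than as a real date.

```python
import numpy as np
import pandas as pd

# Toy rows; the column names are assumptions, the values are invented.
toy = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-02-04", "2012-10-01"],
    "construction_year": [1999, 0, 2010],
})

recorded_year = pd.to_datetime(toy["date_recorded"]).dt.year
# Treat a construction year of 0 as missing rather than as a real date.
built_year = toy["construction_year"].replace(0, np.nan)
toy["age_years"] = recorded_year - built_year

print(toy)
print("share of unknown ages:", toy["age_years"].isna().mean())
```

With almost half of the ages unknown, the feature would still need imputation or a separate "unknown" indicator before it could be used.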

I made the decision to put date features on the back burner and progress into the next step of the process.


I started off by dropping the aforementioned date columns, and then filled the missing entries in the features identified as ripe for imputation with each feature’s most frequently occurring value. (A key point of note here is that I transformed the testing data along with the training data, for sanity’s sake.)

columns_to_impute = ['funder', 'installer', 'public_meeting', 'scheme_management', 'permit']

for col in columns_to_impute:
    # Most frequent value in the training data (pandas' mode() ignores NaN)
    most_frequent = features_df[col].mode()[0]
    features_df[col] = features_df[col].fillna(most_frequent)
    test[col] = test[col].fillna(most_frequent)

I then dropped some more columns due to a severe repetition of geographical information – latitude and longitude for example.

At this point I wanted to try three different feature spaces. This decision was brought on by the overwhelming presence of categorical data.

1) Naive: Keep the features as they were and use them in modeling.

2) Label Encoding: Have the strings in the features replaced by numbers to make them more amenable to modeling, while committing the error of imposing a numerical relationship between them. For example, having ‘good’, ‘better’, and ‘best’ replaced by 1, 2, and 3 respectively implies that ‘better’ is halfway between the other two choices.

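def transform_feature(train_df, test_df, column_name):
    # Build a single mapping from the values seen in either set, so each
    # string gets the same integer in both the training and test data.
    unique_values = set(train_df[column_name]) | set(test_df[column_name])
    transformer_dict = {value: index for index, value in enumerate(sorted(unique_values, key=str))}
    train_df[column_name] = train_df[column_name].map(transformer_dict)
    test_df[column_name] = test_df[column_name].map(transformer_dict)
    return train_df, test_df

for column in columns_to_transform:
    features_df, test = transform_feature(features_df, test, column)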

3) One-Hot Encoding: Each feature would be expanded into one column per unique value it contains. To lean on the example from Label Encoding, one column consisting of ‘good’, ‘better’, and ‘best’ would be replaced by three columns, one for each value, each holding a binary value indicating the presence or absence of that choice.
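The difference between the second and third options can be seen on the ‘good’, ‘better’, ‘best’ example. The snippet below is an illustrative sketch on a toy column, not the code used in the project; the column name `quality` is made up.

```python
import pandas as pd

toy = pd.DataFrame({"quality": ["good", "better", "best", "good"]})

# Label encoding: one integer per category, which implies an ordering
# and equal spacing between the categories.
codes = {"good": 1, "better": 2, "best": 3}
label_encoded = toy["quality"].map(codes)

# One-hot encoding: one binary column per category, no implied ordering.
one_hot = pd.get_dummies(toy["quality"], prefix="quality", dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Each row of the one-hot result contains exactly one 1, at the cost of one extra column per distinct value.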

Of the three methods, only Label Encoding provided any results. The first ran into problems with the scikit-learn API, and the third proved to be too computationally expensive. The Naive method was no great loss (its name being an obvious clue), but the failure to complete the one-hot encoding of features is one that I will have to rectify in the future. The problem here is the strain on a computer’s memory brought on by the ballooning of 40 features into the realm of 5000 or so, due to the number of unique values in each.
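One common way to keep memory in check with wide one-hot encodings is to store the result as a sparse matrix, which scikit-learn's `OneHotEncoder` returns by default. The sketch below uses a synthetic high-cardinality column as a stand-in; whether this would have been enough for the real data was not tested here.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Synthetic high-cardinality columns standing in for a feature like `funder`.
train_col = rng.integers(0, 500, size=(2000, 1)).astype(str)
test_col = rng.integers(0, 600, size=(500, 1)).astype(str)

# By default the encoder returns a SciPy sparse matrix; handle_unknown="ignore"
# encodes categories unseen during fit as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_col)
test_encoded = encoder.transform(test_col)

# Compare the memory a dense float64 array would need with the sparse storage.
dense_bytes = train_encoded.shape[0] * train_encoded.shape[1] * 8
sparse_bytes = (train_encoded.data.nbytes + train_encoded.indices.nbytes
                + train_encoded.indptr.nbytes)
print(train_encoded.shape, dense_bytes, sparse_bytes)
```

Fitting the encoder on the training data and reusing it on the test data also keeps the two column layouts identical.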

I moved forward with the second feature space to see how well it would prove to work.


My first stab at this multi-class classification problem was to try the Multinomial Naive Bayes algorithm from the scikit-learn API. Even bolstered by cross validation, it proved to be woefully inadequate, performing at about 30% on both the training and validation data.

Out of curiosity, I also tried the Tree-based Pipeline Optimization Tool, also known as TPOT. This new library uses genetic algorithms to explore many possibilities of modeling data, before choosing the one it thinks to be optimal. Unlike Multinomial Naive Bayes, it managed to beat the baseline accuracy, but approximately 65% accuracy on the testing data was hardly newsworthy.

Next I tried a Random Forest Classifier, one of the most robust algorithms for classification currently available.

import sklearn.ensemble
clf_forest = sklearn.ensemble.RandomForestClassifier()
clf_forest.fit(X_train, y_train)
print(clf_forest.score(X_train, y_train))
print(clf_forest.score(X_test, y_test))

Its accuracy on the training data was around 96%, which fell to 79% on the validation data. Here I felt comfortable making my first submission. Predictably, I wasn’t too high on the leaderboard.
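A gap like the one between the training and validation scores is typical of a fully grown forest, which can nearly memorize its training data. The sketch below reproduces the pattern on synthetic three-class data (not the competition data) and adds a cross-validated score as a less optimistic estimate; the split sizes and `n_estimators` value are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic three-class data standing in for the label-encoded features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
train_accuracy = forest.score(X_train, y_train)
val_accuracy = forest.score(X_val, y_val)

# Cross-validation on the training split estimates out-of-sample accuracy.
cv_scores = cross_val_score(forest, X_train, y_train, cv=5)
print(train_accuracy, val_accuracy, cv_scores.mean())
```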

Back at the drawing board, I decided to continue with this feature set, and try to do some hyperparameter tuning on the Random Forest Classifier. As a recent article I read extolled the superiority of using a randomized grid search over a normal grid search for this optimization, I decided to try it out.

import sklearn.ensemble
# RandomizedSearchCV lives in sklearn.model_selection in current scikit-learn
# releases (the older sklearn.grid_search module has been removed).
import sklearn.model_selection
from scipy.stats import randint as sp_randint

clf_forest2 = sklearn.ensemble.RandomForestClassifier()
n_iter_search = 10
# min_samples_split must be at least 2, so its range starts there.
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

random_search = sklearn.model_selection.RandomizedSearchCV(clf_forest2,
                                                           param_distributions=param_dist,
                                                           n_iter=n_iter_search)

random_search.fit(X_train, y_train)
print(random_search.score(X_train, y_train))
print(random_search.score(X_test, y_test))

This did better than the un-tuned Random Forest Classifier on the data I had, but failed to beat the leaderboard score that the un-tuned model had achieved. It’s back to the drawing board.
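When a tuned model disappoints, it helps to look at what the search actually selected. The sketch below runs a small randomized search on synthetic data (not the competition data) and prints the chosen parameters and their cross-validated score; the parameter ranges, `n_iter`, and `cv` values are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           n_classes=3, random_state=0)

param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 11),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 11),
}
# random_state makes the sampled parameter settings reproducible.
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions=param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

With only a handful of sampled settings, the best one found may still be worse than the defaults, which is one plausible reason a tuned model can trail an un-tuned one.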

Closing Thoughts

My biggest next step is to build a scalable environment that will handle the more robust feature set that I can create with one-hot encoding. (I have tried a few AWS instances to achieve this goal, but they all seemed to fall short.) More hyperparameter tuning with both grid search and randomized grid search is also in my plans. I also want to try other classification algorithms that aren’t off the table due to the number of categorical features: Linear Discriminant Analysis, for one. The Kaggle algorithm hammer known as xgboost is also on the table, but I would like to shy away from it for one simple reason: interpretability.

Outside of predictive power, any model to be used in a real world solution to this problem has to be an open box. I want to be cognizant of that as I go forward.