Predicting Smart Meters’ Next Non-Communicative Day

Smart meters are used to bill electricity, gas, water, and heat at residential, commercial, and production sites. Technological advances in smart meters are driven by utility companies' demand for a smarter, more efficient grid that reduces non-revenue losses. However, smart meters are challenging to maintain, and their problems are hard to identify.

Identifying Faults/ Maintenance Challenges

Smart meters go non-communicative for many reasons, including excess electric energy measurement, battery undervoltage, communication failure, timing errors, abnormal display, burnout, and weather. In addition, a meter may stop working for a while, and then do so again and again with increasing frequency. Eventually, this produces periods of zero consumption in the data. When a meter does not communicate, the result is not only data loss but also inaccurate billing, more maintenance trips, and unhappy customers.
Once a certain percentage of faults has occurred among the smart meters in operation, the electric company compiles statistics on the fault information to grade the severity of the situation and issues an operational risk warning so the faults can be fixed. The follow-up analysis builds on that basis. However, this process places high demands on confirming and locating the fault phenomenon, and on the technical ability and experience of the personnel who classify the faults.
Thus, for both the provincial branches of the State Grid and the power meter manufacturers, an important way to solve the problem is to develop a standardized process for smart meter fault classification and fault statistics, and especially an automatic sorting device that diagnoses and sorts removed smart meters using fault diagnosis and fault table sorting techniques. Predictive analytics helps a lot here. One of the many opportunities it provides is predicting the next non-communicative day of a smart meter.
What if you knew that a smart meter was likely to go offline and suffer data loss in the next 15 days? The objective is to predict a meter's failure before it actually fails. A machine learning model that predicts when a smart meter will go non-communicative, and therefore need an in-person check, is essential for saving costs.
We can build our strategy on top of that and come up with tactical actions. For example, when fixing other permanently stopped meters, service technicians could also check nearby meters that are likely to fail.
In this article, I will explain the smart meter data, its characteristics, and how I approached this problem using machine learning algorithms.
Developing a predictive model involves the following steps:
       Data Preparation
       Feature Engineering
       Selecting a Machine Learning Model
       Hyperparameter Tuning
       Deploying the model (not discussed in this article)

Data Preparation
To illustrate the solution approach, I built the data set for this model by merging and transforming publicly available datasets (see the reference for the data). I cleaned the data and removed features for simplicity. The cleaned data holds daily electricity consumption readings collected by smart meters over more than two years, from 23-Nov-2011 to 28-Feb-2014.
Data Schema
Field description
Unique Alphanumeric smart meter id
Unique Smart meter id (numeric)
Reading date
Total reading count for the day
Median value of consumption reading
Mean value of consumption reading
Standard deviation of consumption reading
Total consumption reading
Minimum value of consumption reading
Maximum value of consumption reading
Non-communicative meter definition
A meter is considered non-communicative in two scenarios. First, the meter records zero consumption for any time interval, so the value of the energy_min field is zero. Second, the reading count for the day is incomplete: since smart meters read data either every hour or every 30 minutes, a normal day has a reading count of 24 or 48 (depending on meter type), and any other count marks the day as non-communicative.
Here is what the full meter reading data looks like:

In this project we are trying to predict the next non-communicative (NC) occurrence from the series of NC dates, so I need to filter the non-communicative reading dates from the full data set. Here is the Python code (see the reference for the full Python code):
nc_meter_data = meter_data_full[(meter_data_full.energy_min ==0) | ((meter_data_full.reading_count != 24) & (meter_data_full.reading_count != 48))]
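To make the filter concrete, here is a minimal sketch on a toy dataframe. The column names follow the article, but the five rows are made-up values, and `isin` is used only as an equivalent, slightly more readable form of the two inequality checks.

```python
import pandas as pd

# Toy data: column names follow the article, values are invented for illustration.
meter_data_full = pd.DataFrame({
    "meter_nbr": [1, 1, 2, 2, 3],
    "reading_date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-01", "2013-01-02", "2013-01-01"]),
    "reading_count": [48, 47, 24, 24, 48],
    "energy_min": [0.05, 0.04, 0.0, 0.10, 0.07],
})

# Scenario 1: at least one interval with zero consumption.
zero_reading = meter_data_full["energy_min"] == 0
# Scenario 2: the day has neither 24 (hourly) nor 48 (half-hourly) readings.
incomplete_day = ~meter_data_full["reading_count"].isin([24, 48])

nc_meter_data = meter_data_full[zero_reading | incomplete_day]
print(nc_meter_data[["meter_nbr", "reading_date", "reading_count", "energy_min"]])
```

Only the second row (47 readings) and the third row (zero minimum) survive the filter; the other days count as normal.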

Data structure for training the model
We use nine months of daily energy consumption data per smart meter to predict the next non-communicative date in the following three months. If there is no non-communicative event in the next three months, we predict that too. Let's assume our cutoff date is Nov 30, 2013, and split the data:
from datetime import datetime
# history window before the cutoff, and the three months after it
nc_meter_data_9m = nc_meter_data[(nc_meter_data.reading_date < datetime(2013,12,1,0,0)) & (nc_meter_data.reading_date >= datetime(2013,1,1,0,0))].reset_index(drop=True)
nc_meter_data_next = nc_meter_data[(nc_meter_data.reading_date >= datetime(2013,12,1,0,0)) & (nc_meter_data.reading_date < datetime(2014,3,1,0,0))].reset_index(drop=True)
nc_meter_data_9m represents the nine months of data, and nc_meter_data_next holds the meter data for the next three months. We also create a dataframe called nc_meters to hold a meter-level feature set for the prediction model:
nc_meters = pd.DataFrame(nc_meter_data_9m['meter_nbr'].unique())
nc_meters.columns = ['meter_nbr']
Using the data in nc_meter_data_next, we calculate our label: the number of days between the last NC (non-communicative) day before the cutoff date and the first NC day after it.
NextNCDay is the number of days from a meter's last NC event in the nine-month window to its next NC event. Now, nc_meters looks like this:
As you can see, we have NaN values: those are good meters that did not become non-communicative in the following three months. We fill NaN with 999, a value that marks a good meter not expected to become non-communicative soon, so we can quickly identify these meters later.
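The label calculation described above can be sketched as follows. This is a minimal version on toy data, not the article's exact code: the intermediate names `last_nc`, `next_nc`, `MaxNCDate`, and `MinNextNCDate` are assumptions introduced for illustration, while `nc_meters`, `NextNCDay`, and the 999 fill value come from the text.

```python
import pandas as pd

# Toy NC dates before and after the cutoff; values are illustrative only.
nc_meter_data_9m = pd.DataFrame({
    "meter_nbr": [1, 1, 2],
    "reading_date": pd.to_datetime(["2013-11-01", "2013-11-20", "2013-10-15"]),
})
nc_meter_data_next = pd.DataFrame({
    "meter_nbr": [1, 1],
    "reading_date": pd.to_datetime(["2013-12-05", "2014-01-10"]),
})
nc_meters = pd.DataFrame({"meter_nbr": nc_meter_data_9m["meter_nbr"].unique()})

# Last NC date before the cutoff and first NC date after it, per meter.
last_nc = (nc_meter_data_9m.groupby("meter_nbr")["reading_date"].max()
           .rename("MaxNCDate").reset_index())
next_nc = (nc_meter_data_next.groupby("meter_nbr")["reading_date"].min()
           .rename("MinNextNCDate").reset_index())

# A left merge keeps meters that had no NC event after the cutoff (NaT).
nc_dates = last_nc.merge(next_nc, on="meter_nbr", how="left")
nc_dates["NextNCDay"] = (nc_dates["MinNextNCDate"] - nc_dates["MaxNCDate"]).dt.days

nc_meters = nc_meters.merge(nc_dates[["meter_nbr", "NextNCDay"]],
                            on="meter_nbr", how="left")
# 999 marks "good" meters with no NC event in the following three months.
nc_meters["NextNCDay"] = nc_meters["NextNCDay"].fillna(999)
print(nc_meters)
```

In this toy case meter 1 gets a label of 15 days (20 November to 5 December), and meter 2, which has no later event, gets 999.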
Now we have meter numbers and their corresponding labels in a pandas dataframe named nc_meters. Let's enrich it with our feature set to build our machine learning model.

Feature Engineering
For this project, we built 11 variables to detect the consumption behavior of every smart meter. Those are:

  • Days between the last three non-communication events
  • Mean and std. of the difference between non-communication events in days
  • Recency – total days from reading_date to latest date (max reading_date)
  • Recency Cluster Index – Recency score is divided into 4 clusters using unsupervised machine learning algorithm (kmean)
  • Consumption – daily total energy consumption on non-communicative days
  • Consumption Cluster Index – Consumption score is divided into 4 clusters using unsupervised machine learning algorithm (kmean)
  • Frequency – total non-communication events occurred for 9 months for a meter
  • Frequency Cluster Index – Frequency score is divided into 4 clusters using unsupervised machine learning algorithm (kmean)
  • Overall Cluster Score – Combined index score from Recency, Consumption, and Frequency cluster index after rearrange index by mean of each cluster.
  • Next NC Day Range – Category of the meter by next non-communicative day range: 2 when the next NC event is within 15 days, 1 for 16–30 days, and 0 otherwise.
Many more features are possible, based on customer type, weather, meter type, holidays, peak hours, day of week, transmitter, transformer, collector, etc. To illustrate the problem and the solution simply, I used only the features listed above.
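The cluster index features in the list above are described but not shown in code. Here is a hedged sketch of how a Recency Cluster Index could be built with scikit-learn's `KMeans`, followed by the "rearrange index by mean of each cluster" step. The helper name `order_cluster`, the toy recency values, and the choice to give the most recent meters the highest index are all assumptions for illustration, not the article's confirmed implementation.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy recency values in days (illustrative only).
meters = pd.DataFrame({"meter_nbr": range(1, 9),
                       "Recency": [1, 2, 30, 32, 90, 95, 200, 210]})

# Split the recency score into 4 clusters, as described in the feature list.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
meters["RecencyCluster"] = kmeans.fit_predict(meters[["Recency"]])

def order_cluster(df, cluster_col, value_col, ascending):
    """Relabel clusters 0..k-1 so the labels follow the order of cluster means."""
    means = df.groupby(cluster_col)[value_col].mean().sort_values(ascending=ascending)
    mapping = {old: new for new, old in enumerate(means.index)}
    out = df.copy()
    out[cluster_col] = out[cluster_col].map(mapping)
    return out

# Assumption: a lower recency (a more recent NC event) gets a higher index.
meters = order_cluster(meters, "RecencyCluster", "Recency", ascending=False)
print(meters)
```

KMeans labels are arbitrary, so the reordering step is what makes the index comparable across features and lets the three indices be summed into an overall score.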
After adding these features, we need to deal with the categorical features by applying the get_dummies method.
Let's focus on how we can add the first two features. We will use the shift() method a lot in this feature engineering process.
First, we create a dataframe with meter_nbr and reading_date.

#create a dataframe with meter nbr and NC date (copy to avoid SettingWithCopyWarning)
meter_nc_day_order = nc_meter_data_9m[['meter_nbr','reading_date']].copy()
#truncate reading datetime to the day, keeping the datetime64 dtype so date arithmetic and .dt work
meter_nc_day_order['nc_date'] = meter_nc_day_order['reading_date'].dt.normalize()
meter_nc_day_order = meter_nc_day_order.sort_values(['meter_nbr','reading_date'])
Next, by using shift, we create new columns with the dates of the last 3 non-communicative events.

#shifting last 3 nc dates
meter_nc_day_order['PrevNCDate'] = meter_nc_day_order.groupby('meter_nbr')['nc_date'].shift(1)
meter_nc_day_order['T2NCDate'] = meter_nc_day_order.groupby('meter_nbr')['nc_date'].shift(2)
meter_nc_day_order['T3NCDate'] = meter_nc_day_order.groupby('meter_nbr')['nc_date'].shift(3)

Let’s calculate the difference in days between each NC (non-communicative) date and the previous ones:
meter_nc_day_order['DayDiff'] = (meter_nc_day_order['nc_date'] - meter_nc_day_order['PrevNCDate']).dt.days
meter_nc_day_order['DayDiff2'] = (meter_nc_day_order['nc_date'] - meter_nc_day_order['T2NCDate']).dt.days
meter_nc_day_order['DayDiff3'] = (meter_nc_day_order['nc_date'] - meter_nc_day_order['T3NCDate']).dt.days

For each meter_nbr, we use the pandas .agg() method to find the mean and standard deviation of the difference between NC dates:
meter_nc_day_diff = meter_nc_day_order.groupby('meter_nbr').agg({'DayDiff': ['mean','std']}).reset_index()
meter_nc_day_diff.columns = ['meter_nbr', 'DayDiffMean','DayDiffStd']
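To see what the shift and aggregation steps produce, here is a small self-contained sketch on made-up NC dates for two meters. It mirrors the column names used above (`PrevNCDate`, `DayDiff`, `DayDiffMean`, `DayDiffStd`); the dates themselves are illustrative.

```python
import pandas as pd

# Toy NC dates: meter 1 has four events, meter 2 has two.
df = pd.DataFrame({
    "meter_nbr": [1, 1, 1, 1, 2, 2],
    "nc_date": pd.to_datetime(["2013-01-01", "2013-01-05", "2013-01-15",
                               "2013-02-01", "2013-03-01", "2013-03-11"]),
}).sort_values(["meter_nbr", "nc_date"])

# shift(1) within each meter puts the previous NC date next to the current one;
# the first event of every meter has no predecessor and gets NaT.
df["PrevNCDate"] = df.groupby("meter_nbr")["nc_date"].shift(1)
df["DayDiff"] = (df["nc_date"] - df["PrevNCDate"]).dt.days

# Mean and standard deviation of the gaps, one row per meter.
summary = df.groupby("meter_nbr")["DayDiff"].agg(["mean", "std"]).reset_index()
summary.columns = ["meter_nbr", "DayDiffMean", "DayDiffStd"]
print(df)
print(summary)
```

Meter 2 has a single gap, so its standard deviation is NaN. This is one reason why meters with very few events are filtered out in the next step.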
Here we have to make a tactical decision about which meters to select as having a higher chance of becoming non-communicative, based on the frequency of their non-communicative events. The calculation above is quite useful for meters that have become non-communicative many times, but we can’t say the same for the ones with 1–2 non-communicative occurrences. For instance, it is too early to tag a meter as frequent when it has only 2 non-communicative events in the past 9 months.
We keep only meters that have more than 3 non-communicative events. First we take each meter's most recent row; the dropna() call that follows then removes the meters with too few events:
meter_nc_day_order_last = meter_nc_day_order.drop_duplicates(subset=['meter_nbr'],keep='last')
Finally, we drop NA values, merge the new dataframes with nc_meters, and apply .get_dummies() to convert categorical values:
meter_nc_day_order_last = meter_nc_day_order_last.dropna()
meter_nc_day_order_last = pd.merge(meter_nc_day_order_last, meter_nc_day_diff, on='meter_nbr')
nc_meters = pd.merge(nc_meters, meter_nc_day_order_last[['meter_nbr','DayDiff','DayDiff2','DayDiff3','DayDiffMean','DayDiffStd']], on='meter_nbr')
#create nc_meter_class as a copy of nc_meters before applying get_dummies
nc_meter_class = nc_meters.copy()
nc_meter_class = pd.get_dummies(data=nc_meter_class,columns=['Segment'])
Now the feature set is ready for building a classification model. Let us proceed.
Selecting a Machine Learning Model
Before choosing the model, we need to take two actions. First, we need to identify the classes for our model. Generally, percentiles give a good indication for that. Let’s use the .describe() method to see them in NextNCDay:

Deciding the boundaries is a question of both statistics and business needs. The boundaries should make sense statistically and be easy to act on and communicate. Considering these two, we will have three classes:
·       0–15: Meters that will lose data in 0–15 days. Class name: 2
·       16–30: Meters that will lose data in 16–30 days. Class name: 1
·       > 30: Meters that will lose data in more than 30 days. Class name: 0
nc_meter_class['NextNCDayRange'] = 2
nc_meter_class.loc[nc_meter_class.NextNCDay>15,'NextNCDayRange'] = 1
nc_meter_class.loc[nc_meter_class.NextNCDay>30,'NextNCDayRange'] = 0
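As a cross-check of the thresholds in the code above, the same mapping can be expressed with `pd.cut`. This is only an equivalent sketch on a few hand-picked values (including the 999 fill value), not a replacement for the article's code.

```python
import pandas as pd

# Boundary values plus the 999 sentinel used for "good" meters.
next_nc_day = pd.Series([3, 15, 16, 30, 31, 999], name="NextNCDay")

# Same thresholds as the .loc assignments: <=15 -> 2, 16-30 -> 1, >30 -> 0.
# pd.cut uses right-closed bins by default, so 15 falls in the first bin.
day_range = pd.cut(next_nc_day,
                   bins=[-float("inf"), 15, 30, float("inf")],
                   labels=[2, 1, 0]).astype(int)
print(pd.DataFrame({"NextNCDay": next_nc_day, "NextNCDayRange": day_range}))
```

Note that the 999 sentinel lands in class 0 together with every meter whose next event is more than 30 days away.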
The last step is to see the correlation between our features and label. The correlation matrix is one of the cleanest ways to show this:
import matplotlib.pyplot as plt
import seaborn as sns
corr = nc_meter_class[nc_meter_class.columns].corr()
plt.figure(figsize = (30,20))
sns.heatmap(corr, annot = True, linewidths=0.2, fmt=".2f")

Looks like Overall Score has the highest positive correlation (0.71) and Recency has the highest negative (-0.40).
For this particular problem, we want to try Logistic Regression, Naïve Bayes, Random Forest, Support Vector Classifier, Decision Tree, and Nearest-Neighbors classifier models and choose the model that gives the highest accuracy. Let’s split the data into train and test sets and measure the accuracy of the different models:
Accuracy of each model:

From this result, we see that Random Forest is the best performing one (~94% accuracy). But before we move on, let’s look at what exactly we did. We applied a fundamental concept in machine learning: cross validation.
How can we be sure of the stability of our machine learning model across different datasets? And what if there is noise in the test set we selected?
Cross validation is a way of measuring this. It scores the model on several different test sets. If the deviation between the scores is low, the model is stable. In our case, the deviations between scores are acceptable (except for the Decision Tree Classifier).
Let’s move forward with Random Forest to see how we can improve an existing model with some advanced techniques.
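The model comparison code is not shown in the article, so here is a minimal sketch of the idea under clear assumptions: the data is synthetic (generated with `make_classification` as a stand-in for the meter features and the three-class label), logistic regression is used as the linear classifier, and 5 folds are an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the meter feature matrix and the 3-class label.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(random_state=42),
    "SVC": SVC(),
    "DTree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

# Score every model on the same 5 folds of the training data.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.3f}, std {scores.std():.3f}")
```

The standard deviation across folds is the stability measure discussed above: a model with a high mean but a large spread is less trustworthy than one with a slightly lower mean and a small spread.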

Hyperparameter Tuning
To build our model, we will follow the same steps as before. But to improve it further, we’ll do hyperparameter tuning: programmatically, we will find the parameters that give our model the best accuracy.
Let’s start by coding our model:
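from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
# (random_grid is the parameter dictionary defined later in this section)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
With the default parameters, our baseline accuracy on the test set is 95%: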
base_model = RandomForestClassifier()
base_model.fit(X_train, y_train)
base_accuracy = evaluate(base_model, X_test, y_test)
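The `evaluate` helper is called above but not defined in the article. A minimal sketch, assuming it simply returns the accuracy of the model on the test set, could look like this. The `MajorityClassModel` class is a hypothetical stand-in so the example runs on its own without a trained forest.

```python
import numpy as np

def evaluate(model, X_test, y_test):
    """Return the share of correct predictions (accuracy) on the test set."""
    predictions = np.asarray(model.predict(X_test))
    accuracy = float(np.mean(predictions == np.asarray(y_test)))
    print(f"Accuracy = {accuracy:.2%}")
    return accuracy

class MajorityClassModel:
    """Hypothetical stand-in model: always predicts the most frequent training label."""
    def fit(self, X, y):
        values, counts = np.unique(np.asarray(y), return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

# Usage with toy labels: the majority class is 0, and 3 of 4 test labels are 0.
toy_model = MajorityClassModel().fit([[0], [1], [2]], [0, 0, 1])
toy_accuracy = evaluate(toy_model, [[0], [1], [2], [3]], [0, 0, 1, 0])
```

Any fitted scikit-learn classifier can be passed in the same way, since the helper only relies on a `predict` method.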

RandomForestClassifier() has many parameters. You can find the list of them here. For this example, we will select n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf, and bootstrap.
The code below finds the best values for these parameters:
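from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf
# and bootstrap are lists of candidate values to try, defined beforehand
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
# the iid argument was removed in scikit-learn 0.24, so it is not passed here
gsearch1 = GridSearchCV(estimator = RandomForestClassifier(),
param_grid = random_grid, scoring='accuracy',n_jobs=-1, cv=2)
# fit the grid search before reading the best parameters and score
gsearch1.fit(X_train, y_train)
gsearch1.best_params_, gsearch1.best_score_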

When we used the best parameters from GridSearchCV, the accuracy score increased from 94% to 96%. You can find the Jupyter Notebook and the data for this article on my GitHub.
Knowing which meters will become non-communicative in the next week or month is a valuable resource for planning preventive maintenance. When I solved this problem for a client, I considered many more features, such as the properties of transformers and collectors, customer type, meter model, holiday calendar, peak hours, weekends, and day or night. I also faced challenges in defining non-communicative status, for example when meters were non-communicative because of transformer failures. I hope this article gives you a high-level idea of the problem and a way to solve it.
*image taken from: www.safemeters.org
