In many cases, you may think that you have a Big Data problem when in reality you just have a lot of data, and a simple sample can give great accuracy. In today's blog, I decided to use the office room occupancy dataset provided by "Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Veronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39." The dataset has 6 independent variables (predictors): date with timestamp; temperature of the room in Celsius; relative humidity in percent; light in Lux; CO2 in ppm; and humidity ratio, a quantity derived from temperature and relative humidity. Occupancy is a categorical variable with 2 levels: 0 for not occupied and 1 for occupied. Occupancy was measured every minute for the period of February 11, 2015 to February 18, 2015, and the dataset size is 9,752. The question I want to investigate is: can a small random sample produce performance as good as a large sample? For the model, I will build a Deep Feed Forward (DFF) learning model.
Method:
The occupancy dataset has a one minute interval timestamp for each of the 7 recorded days. I decided to remove it. While time may allow us to predict if a room is occupied, it is a "flawed" variable. The company may decide to have a party on a weekend or after hours on a day not in our list, or a holiday may fall on a business day. Furthermore, from the following table, one can see that the daily occupancy frequencies don't match except for 2/14/15 and 2/15/15, which represent the weekend. Hence, the date is ignored during the occupancy modeling.
From the following table, one may come to the conclusion that we have an unbalanced class problem: we have more records of non occupancy than of occupancy. In fact, as one explores the matrix scatterplot, it becomes apparent that one is not dealing with unbalanced classes; we simply have a larger sample for non occupancy than for occupancy. In this example, in a 24 hour period, most people spend less time at work than outside work. Still, this table shows that 2/14/15, 2/15/15, and 2/18/15 are not suitable as training sets, given that the model is a NN model.
| Date | Non Occupancy | Occupancy |
|---|---|---|
| 2/11/15 | 338 | 214 |
| 2/12/15 | 1196 | 244 |
| 2/13/15 | 946 | 494 |
| 2/14/15 | 1440 | 0 |
| 2/15/15 | 1440 | 0 |
| 2/16/15 | 889 | 551 |
| 2/17/15 | 903 | 537 |
| 2/18/15 | 551 | 9 |
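A frequency table like this one can be produced by grouping the records by calendar date and counting each occupancy level. The sketch below is in Python rather than the Mathematica used for this post, and the `records` list is a made-up stand-in for the real file, so treat the names and values as assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records in the layout (timestamp, occupancy).
# These are made-up values, not rows copied from the real dataset.
records = [
    ("2015-02-11 14:48:00", 1),
    ("2015-02-11 14:49:00", 1),
    ("2015-02-11 23:10:00", 0),
    ("2015-02-14 10:00:00", 0),
    ("2015-02-14 10:01:00", 0),
]

def daily_occupancy_counts(rows):
    """Return {date: Counter({0: n_not_occupied, 1: n_occupied})}."""
    counts = {}
    for stamp, occupied in rows:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
        counts.setdefault(day, Counter())[occupied] += 1
    return counts

daily = daily_occupancy_counts(records)
for day, c in sorted(daily.items()):
    # A Counter returns 0 for a level that never appears on that day.
    print(day, "non occupancy:", c[0], "occupancy:", c[1])
```

Days whose occupancy count is zero or close to zero stand out immediately in such a summary.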

From the matrix scatterplot, especially the boxplots comparing the predictor values between non occupancy and occupancy, we can see that, except for humidity and humidity ratio, the values of temperature, light, and CO2 differ depending on whether the room is occupied. From the last row in the matrix scatterplot, the distributions of temperature, light, and CO2 appear to differ if the room is occupied or not, confirming that these variables should be kept. While not shown here, I did perform a Wilcoxon Rank Sum Test to confirm the differences between the medians. Hence, I can remove humidity ratio, and even humidity. However, I decided to remove only humidity ratio.
Furthermore, one notices a lot of outliers (the black dots). If we assume 100% accuracy of measurement, outliers imply special cases, not errors, and they should not be discarded just to improve the model performance. Outliers, in this instance, also imply that a nonlinear or nonparametric machine learning model will perform better than a linear model.
None of the distributions are normal, but light appears to be extremely skewed, and values beyond 1,000 imply occupancy. On the other hand, extreme CO2 or humidity ratio values imply the room is not occupied. While these observations are interesting, they are irrelevant when one wants the model to infer. Scaling and centering the variables is suggested in deep learning (it helps get to the global minimum). In this case, I want to see if a good model can be produced even when the variables' distributions are not scaled or centered.
Now that I have decided on the model, I have also decided not to scale or center the variables.
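For readers who want to see what the rank sum test computes, here is a minimal Python sketch (the post's own analysis was done in Mathematica). It uses the normal approximation with a tie correction and no continuity correction, and it assumes the values are not all identical. The light readings are invented for illustration and are not taken from the dataset.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def rank_sum_test(x, y):
    """Return (U statistic of x, z score, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, ties = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in ties) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u1, z, p

# Invented light readings (Lux) for an empty and an occupied room.
light_empty = [0, 0, 0, 5, 10, 0, 3, 0]
light_occupied = [420, 450, 433, 470, 510, 440, 465, 480]
u, z, p = rank_sum_test(light_empty, light_occupied)
print("U =", u, "z =", round(z, 2), "p =", round(p, 5))
```

A small p-value suggests the two groups come from different distributions, which is the kind of evidence used here to keep temperature, light, and CO2.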
Designing the Model:
In addition to building a Random Forest and a K Nearest Neighbor model, I will also use a deep feed-forward Neural Network.
The model I created has 5 layers. The first 2 layers have 10 nodes each, the third layer has 8 nodes, the fourth layer has 6 nodes, and the last (output) layer has 2 nodes. In all the layers except the last, I used ReLU, the Rectified Linear Unit, as the activation function. As I'm building a classifier, the last layer uses the Softmax function to calculate the probability of occupancy and non occupancy. The model chooses the outcome with the highest probability.
The model uses the Cross Entropy loss function and is trained with Stochastic Gradient Descent with an initial learning rate of 0.001, a momentum of 0.93, and a polynomial function for scaling the learning rate.
I will also build a Random Forest with 50 trees, and a K Nearest Neighbor model with K equal to 10.
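The layer sizes described above can be sketched as a plain forward pass. This is a Python illustration, not the Mathematica network used in the post. The 4 inputs (temperature, humidity, light, CO2), the random weight initialization, and the sample reading are all assumptions, and the weights are untrained, so the output probabilities carry no meaning.

```python
import math
import random

# 4 assumed inputs, then the layer widths from the text: 10, 10, 8, 6, 2.
LAYER_SIZES = [4, 10, 10, 8, 6, 2]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def init_layers(sizes, seed=0):
    """Random (untrained) weights; the initialization scheme is an assumption."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(layers, x):
    """ReLU on every layer except the last, which uses softmax."""
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, biases)]
        x = softmax(x) if idx == len(layers) - 1 else relu(x)
    return x

layers = init_layers(LAYER_SIZES)
# Invented, unscaled reading: temperature, humidity, light, CO2.
probs = forward(layers, [21.5, 27.0, 430.0, 700.0])
predicted = max(range(2), key=lambda k: probs[k])  # class with highest probability
print(probs, predicted)
```

The last line mirrors the decision rule in the text: the class with the highest softmax probability is the prediction.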
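The training ingredients named above can be written out in a few lines of Python. The exact form of the polynomial schedule is not given in the post, so the `(1 - step/total)^power` shape and the `power` value below are assumptions. The momentum update is the standard one, and the toy objective `w^2` is only there to exercise the update rule.

```python
import math

def cross_entropy(probs, true_class):
    """Cross entropy loss for one observation given predicted class probabilities."""
    return -math.log(probs[true_class])

def polynomial_lr(step, total_steps, initial_lr=0.001, power=2.0):
    """Assumed schedule: initial_lr * (1 - step / total_steps) ** power."""
    return initial_lr * (1.0 - step / total_steps) ** power

def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.93):
    """One SGD update with classical momentum."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy run: minimize f(w) = w^2, whose gradient is 2w.
weights, velocity = [1.0], [0.0]
total_steps = 200
for step in range(total_steps):
    grads = [2.0 * w for w in weights]
    lr = polynomial_lr(step, total_steps)
    weights, velocity = sgd_momentum_step(weights, grads, velocity, lr)
print("final weight:", weights[0])
```

With a learning rate this small, momentum does much of the work of moving the weights, and the decaying schedule shrinks the steps toward the end of training.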
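As an illustration of the K Nearest Neighbor idea with K = 10, here is a minimal Python sketch using Euclidean distance and a majority vote. The distance metric is an assumption, since the post does not state one, and the training points are invented rather than drawn from the dataset.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=10):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    # On a tied vote, the label seen first among the nearest points wins.
    return votes.most_common(1)[0][0]

# Invented readings: (temperature, humidity, light, CO2).
not_occupied = [(20.0 + 0.1 * i, 25.0, float(i), 440.0 + i) for i in range(12)]
occupied = [(22.0 + 0.1 * i, 27.0, 430.0 + i, 850.0 + 5 * i) for i in range(12)]
train_x = not_occupied + occupied
train_y = [0] * len(not_occupied) + [1] * len(occupied)

print(knn_predict(train_x, train_y, (22.5, 27.0, 435.0, 870.0)))
print(knn_predict(train_x, train_y, (20.5, 25.0, 3.0, 445.0)))
```

Because the variables are not scaled, light and CO2 dominate the distance here, which is worth keeping in mind when reading the KNN results.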
Training the Model
In this problem, do we train on one day, test on another day, and validate on yet another? Does it matter if one chooses non sequential dates for the testing and training sets? The second question does not matter, since I have already decided that I will not consider time. For the first question, I will try it, but with the test data, datatest.txt, provided on GitHub by the authors of the research paper. This dataset covers occupancy for 2/2/15 to 2/4/15.
Understanding the context always helps speed up decisions. Training and testing by date looks promising, but the frequency table shows that 2/14/15, 2/15/15, and 2/18/15 have no or very few occupancy records and cannot be used for training or testing the model. Hence, I'm left with only 5 potential days. Unfortunately, this approach is different from the approach I will present next. Hence, to compare performance, I'll have to use datatest.txt as the validation set.
While the number of recorded non occupancy observations is much larger than the number of occupancy observations, one can see that, in spite of outliers, one can distinguish between occupancy and non occupancy by looking at temperature, light, and CO2. Hence, training and testing can be set up so that an equal number of occupancy and non occupancy observations are sampled. Now, the big question is the size of the training set and the testing set. Using the rule of thumb that a sample of at least 30 is considered large for one continuous variable, with 4 predictors I need to sample at least 120 (30 * 4) for occupancy and 120 for non occupancy. I have a large dataset, so I decided to sample 200 for each level.
I sample 400 observations for each of the training set and the testing set from the large one week dataset. The two samples share no observations (very important to ensure no bias). Whatever is left in the dataset I use as a validation set.
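The sampling scheme can be sketched as follows in Python (the post's implementation is in Mathematica). The function name, the seed, and the stand-in label list are assumptions; the class totals of 7,703 and 2,049 are simply the column sums of the frequency table.

```python
import random

MIN_PER_CLASS = 30 * 4  # rule of thumb from the text: 30 per predictor, 4 predictors

def balanced_disjoint_split(labels, per_class=200, seed=42):
    """Draw disjoint, class-balanced training and testing index sets.

    Everything not drawn becomes the validation set.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        if len(idx) < 2 * per_class:
            raise ValueError(f"class {cls} has too few observations")
        # Sampling without replacement guarantees train and test never overlap.
        picked = rng.sample(idx, 2 * per_class)
        train += picked[:per_class]
        test += picked[per_class:]
    used = set(train) | set(test)
    validation = [i for i in range(len(labels)) if i not in used]
    return train, test, validation

# Stand-in labels with the class totals from the frequency table.
labels = [0] * 7703 + [1] * 2049
train, test, validation = balanced_disjoint_split(labels)
print(len(train), len(test), len(validation))
```

Drawing both samples in a single call per class is a simple way to guarantee that no observation appears in both the training and testing sets.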
Results
Performance with Sampling for 2/11/15 to 2/18/15
Deep Feed Forward
Sampling
The performance is very good, with an accuracy of 98.44%. The model is even able to predict the 8 minutes the room was occupied on 2/18/15.

| | Non Occupancy | Occupancy |
|---|---|---|
| Precision | 98.13% | 99.51% |
| Recall | 99.86% | 93.78% |
| Fscore | 98.99% | 96.56% |
Using this model on datatest.txt, containing 2/02/15 to 2/04/15, its performance was still good: 97.94%. As one can see, the model was able to predict occupancy with 100% precision.
From this sample problem, one can see that having a lot of data doesn't mean you need a large sample to model it accurately.
Date as a Sample
The rows represent the training set date, and the columns represent the testing set date. While they all performed well, using February 17 as the testing set and February 13 or February 12 as the training set gave the highest accuracy of 97.94%, and using February 17 as the training set resulted in consistent and high accuracy. Still, the performance was not consistent for all days. The worst performing combination was training on February 12 and testing on February 13: the accuracy was only 87.24%. This implies that choosing the most convenient sampling approach may not always provide the best accuracy. One can also notice that one has to train the model 20 times, a time consuming approach.
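| Training/Testing | 11Feb | 12Feb | 13Feb | 16Feb | 17Feb |
|---|---|---|---|---|---|
| 11Feb | | 89.49% | 89.57% | 89.53% | 92.53% |
| 12Feb | 90.77% | | 87.24% | 89.76% | 97.94% |
| 13Feb | 97.90% | 90.43% | | 97.49% | 97.94% |
| 16Feb | 91.33% | 91.93% | 91.14% | | 93.02% |
| 17Feb | 96.06% | 97.90% | 97.82% | 97.82% | |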
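To make the table easier to interrogate, the accuracies can be loaded into a small Python dictionary and scanned for the best, worst, and most consistent combinations. The values are copied from the table; the variable names are my own.

```python
from statistics import mean

# ACCURACY[training_day][testing_day], in percent, copied from the table.
ACCURACY = {
    "11Feb": {"12Feb": 89.49, "13Feb": 89.57, "16Feb": 89.53, "17Feb": 92.53},
    "12Feb": {"11Feb": 90.77, "13Feb": 87.24, "16Feb": 89.76, "17Feb": 97.94},
    "13Feb": {"11Feb": 97.90, "12Feb": 90.43, "16Feb": 97.49, "17Feb": 97.94},
    "16Feb": {"11Feb": 91.33, "12Feb": 91.93, "13Feb": 91.14, "17Feb": 93.02},
    "17Feb": {"11Feb": 96.06, "12Feb": 97.90, "13Feb": 97.82, "16Feb": 97.82},
}

# Flatten to {(training_day, testing_day): accuracy}.
pairs = {
    (train, test): acc
    for train, row in ACCURACY.items()
    for test, acc in row.items()
}

worst_pair = min(pairs, key=pairs.get)
best_accuracy = max(pairs.values())
best_pairs = sorted(p for p, acc in pairs.items() if acc == best_accuracy)
mean_by_training_day = {train: mean(row.values()) for train, row in ACCURACY.items()}

print("worst:", worst_pair, pairs[worst_pair])
print("best:", best_pairs, best_accuracy)
for day, avg in mean_by_training_day.items():
    print(day, round(avg, 2))
```

Averaging each row shows how much the result depends on which day is used for training.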

Random Forest and K Nearest Neighbors
I also built a K Nearest Neighbors (KNN) model and a Random Forest model. The KNN with K = 10 had an accuracy of 97.82%, and the Random Forest with 50 trees had an accuracy of 97.86%.
KNN Performance

| | Non Occupancy | Occupancy |
|---|---|---|
| Precision | 96.93% | 99.39% |
| Recall | 99.64% | 94.89% |
| Fscore | 98.26% | 97.08% |
Random Forest Performance

| | Non Occupancy | Occupancy |
|---|---|---|
| Precision | 96.75% | 99.79% |
| Recall | 99.87% | 94.63% |
| Fscore | 98.29% | 97.15% |
These results are consistent with the DFF, which had an accuracy of 97.94%. The DFF performance is slightly better than that of the other models. In addition, every occupancy prediction the DFF made was correct (100% precision), and it recalled every non occupied minute (100% recall).

| | Non Occupancy | Occupancy |
|---|---|---|
| Precision | 96.75% | 100% |
| Recall | 100% | 94.64% |
| Fscore | 98.35% | 97.24% |
These results are very impressive, knowing that the accuracy in the research paper was only 85%.
Discussion and Conclusion
Many assume they need a lot of data to build a model or make a prediction, and that more is always better. Here, one can see that models trained with a sample size of 400 performed better than a model trained with a sample size of 8,143. Even the combined size of the training set and testing set, 800, is still smaller than the sample size used in the research paper.
One may state that even when assigning some dates to training and others to testing, I got good results. Unfortunately, the accuracy depends on the right combination. While the best combinations resulted in accuracy equal to that of the DFF, others were significantly below it, as low as 87.24%. Furthermore, one has to train using all combinations to figure out the best combination; not a problem with just 5 days, but what if one has 10 days or 20 days?
The performance of the models is directly related to understanding the problem and the data at hand. While it appeared the dataset had unbalanced classes, the imbalance could be considered inconsequential. While the researchers kept timestamps to the minute, time was inconsequential. Preanalysis of the data also helped detect unnecessary variables like humidity ratio. In short, understanding the dataset will help speed up model building and, most importantly, improve model prediction with little work.
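The growth in the number of training runs is easy to quantify: with n usable days there are n * (n - 1) ordered training/testing pairs. A short Python sketch, with function names of my own choosing:

```python
from itertools import permutations

def ordered_train_test_pairs(days):
    """Every (training_day, testing_day) pair with two different days."""
    return list(permutations(days, 2))

def runs_needed(n_days):
    """Number of ordered pairs for n_days usable days."""
    return n_days * (n_days - 1)

five_days = ["11Feb", "12Feb", "13Feb", "16Feb", "17Feb"]
print(len(ordered_train_test_pairs(five_days)))
for n in (5, 10, 20):
    print(n, "days:", runs_needed(n), "training runs")
```

The count grows roughly with the square of the number of days, so the exhaustive search quickly becomes expensive compared with drawing one random sample.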
Note: The code is written in Mathematica. I will upload it to GitHub in the next few days.