
Predicting Flights Delay Using Supervised Learning, Logistic Regression

1. Introduction

In this post, we’ll use a supervised machine learning technique called logistic regression to predict delayed flights. But before we proceed, I would like to offer condolences to the families of the victims of the Germanwings tragedy.

This analysis is conducted using a public data set that can be obtained here:

  1. https://catalog.data.gov/dataset/airline-on-time-performance-and-ca...
  2. http://stat-computing.org/dataexpo/2009/the-data.html

Note: This is a common data set used in the machine learning community to test out algorithms and models, since it is publicly available and sizable.

In this blog, we will look at a small sample snapshot (2,201 flights in January 2004). In another post, we can explore using Big Data technologies such as Hadoop MapReduce or Spark machine learning libraries to do large-scale predictive analytics and data mining.

Let’s load in our small sample set here and see the first 5 rows of data:
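The loading step itself isn’t shown in the original; a minimal sketch would be the following ("FlightDelays.csv" is an assumed file name, so adjust it to wherever your copy of the data set lives):

```r
# Load the January 2004 sample and inspect the first five rows.
# "FlightDelays.csv" is a hypothetical file name for the downloaded data.
flights <- read.csv("FlightDelays.csv", stringsAsFactors = TRUE)
head(flights, 5)
```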

## CRS_DEP_TIME CARRIER DEP_TIME DEST DISTANCE FL_DATE FL_NUM ORIGIN
## 1 1455 OH 1455 JFK 184 37987 5935 BWI
## 2 1640 DH 1640 JFK 213 37987 6155 DCA
## 3 1245 DH 1245 LGA 229 37987 7208 IAD
## 4 1715 DH 1709 LGA 229 37987 7215 IAD
## 5 1039 DH 1035 LGA 229 37987 7792 IAD
## Weather DAY_WEEK DAY_OF_MONTH TAIL_NUM Flight.Status
## 1 0 4 1 N940CA ontime
## 2 0 4 1 N405FJ ontime
## 3 0 4 1 N695BR ontime
## 4 0 4 1 N662BR ontime
## 5 0 4 1 N698BR ontime

We see that the following variables are collected:

  • Scheduled Departure Time (CRS_DEP_TIME) and Actual Departure Time (DEP_TIME)
  • Carrier (CARRIER)
  • Destination (DEST)
  • Distance (DISTANCE)
  • Flight Date (FL_DATE)
  • Flight Number (FL_NUM)
  • Origin (ORIGIN)
  • Weather (a code of 1 indicates a weather-related delay)
  • Day of the Week (DAY_WEEK)
  • Day of the Month (DAY_OF_MONTH)
  • Tail Number (TAIL_NUM)
  • Flight Status (whether the flight was ontime or delayed)

The goal here is to identify flights that are likely to be delayed. In the machine learning literature this is called binary classification using supervised learning: we are bucketing flights into one of two classes, delayed or ontime (hence "binary"). (Note: prediction and classification are the two main goals of data mining and data science. On a deeper philosophical level, they are two sides of the same coin: classifying an observation is itself a form of prediction.)

Logistic regression gives us the probability of belonging to one of the two classes (delayed or ontime). Since a probability ranges from 0 to 1, we will use a 0.5 cutoff to decide which bucket each estimate falls into: if the probability estimate from the logistic regression is greater than or equal to 0.5, we classify the flight as ontime; otherwise, as delayed. We’ll explain the theory behind logistic regression in another post.
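As a sketch, the cutoff rule looks like this (prob is a hypothetical vector of fitted probabilities of being ontime):

```r
# Convert fitted probabilities of the "ontime" class into class labels
# using the 0.5 cutoff described above.
prob <- c(0.91, 0.42, 0.50, 0.08)   # example fitted probabilities
pred <- ifelse(prob >= 0.5, "ontime", "delayed")
pred  # "ontime" "delayed" "ontime" "delayed"
```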

But before we start our modeling exercise, it’s good to take a visual look at what we are trying to predict. Since we are trying to predict delayed flights from historical data, let’s do a simple histogram to see the distribution of delayed vs. ontime flights:
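The plot itself hasn’t survived here; a simple way to reproduce it, assuming the flights data frame loaded earlier, is:

```r
# Bar plot of delayed vs. ontime counts, plus each class's share.
counts <- table(flights$Flight.Status)
barplot(counts, main = "Flight Status", ylab = "Number of flights")
round(prop.table(counts), 2)   # roughly 0.19 delayed, 0.81 ontime
```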

We see that most flights are ontime (81%), as expected. But we need delayed flights in our data set so that the machine can learn from this delayed subset to predict whether future flights will be delayed.

2. Exploratory Data Analysis (EDA):

The next step in predictive analytics is to explore the underlying data. Let’s do a few plots of our explanatory variables to see how they look against delayed flights.

Carriers Distribution in the Data Set

Carrier  Count  Percentage
CO 94 4.3%
DH 551 25%
DL 388 17.6%
MQ 295 13.4%
OH 30 1.4%
RU 408 18.5%
UA 31 1.4%
US 404 18.4%

Please note the following:

  • CO: Continental
  • DH: Atlantic Coast
  • DL: Delta
  • MQ: American Eagle
  • OH: Comair
  • RU: Continental Express
  • UA: United
  • US: US Airways
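The carrier distribution above can be reproduced with a couple of lines, again assuming the flights data frame from earlier:

```r
# Counts and percentages of flights per carrier.
carrier.counts <- table(flights$CARRIER)
carrier.pct <- round(100 * prop.table(carrier.counts), 1)
cbind(Count = carrier.counts, Percentage = carrier.pct)
```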

Let’s examine the Day of the Week effect. We see that Mondays and Sundays have the most delayed flights and Saturdays have the least. Note: 1 is Monday and 7 is Sunday.
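A quick cross-tabulation shows this day-of-week effect (a sketch, assuming the flights data frame from earlier):

```r
# Delayed vs. ontime counts by day of week (1 = Monday, 7 = Sunday).
table(flights$DAY_WEEK, flights$Flight.Status)
```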

Destination airport effect.

Origin airport effect.

3. Data Transformation & Pre-Processing:

One of the main steps in predictive analytics is data transformation. Data is rarely in the shape you want it. You often have to apply some kind of transformation to get it into the form you need, whether because the data is dirty, of the wrong type, out of bounds, or for a host of other reasons.

The first transformation we’ll need to do is to convert the categorical variables into dummy variables.

The four categorical variables of interest are: 1) Carrier, 2) Destination (airport codes), 3) Origin (airport codes), and 4) Day of the Week. For simplicity of model building, we’ll NOT use Day of the Month, because of the combinatorial explosion in the number of dummy variables. The reader is free to do this as an exercise on his/her own. :)

Here are the first five rows of the categorical-to-dummy transformation. There’s a handy function in R called model.matrix that does this for us.

flights.dummy <- model.matrix(~CARRIER+DEST+ORIGIN+DAY_WEEK, data=flights)
flights.dummy <- flights.dummy[,-1]
head(flights.dummy, 5)

## CARRIERDH CARRIERDL CARRIERMQ CARRIEROH CARRIERRU CARRIERUA CARRIERUS
## 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0
## DESTJFK DESTLGA ORIGINDCA ORIGINIAD DAY_WEEK2 DAY_WEEK3 DAY_WEEK4
## 1 1 0 0 0 0 0 1
## 2 1 0 1 0 0 0 1
## 3 0 1 0 1 0 0 1
## 4 0 1 0 1 0 0 1
## 5 0 1 0 1 0 0 1
## DAY_WEEK5 DAY_WEEK6 DAY_WEEK7
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0


Then we need to cut/segment DEP_TIME into sensible buckets; in this case, I’ve divided it into hourly buckets. Those buckets then need to be converted into dummy variables as well.
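The bucketing code isn’t shown; one way to sketch it is with R’s cut function (HourBlockDeptTime is the assumed column name, chosen to match the dummy-variable names in the output below):

```r
# Segment actual departure time (hhmm format) into hourly buckets,
# then expand those buckets into dummy variables with model.matrix.
flights$HourBlockDeptTime <- cut(flights$DEP_TIME %/% 100,
                                 breaks = 0:24, right = FALSE,
                                 labels = paste0("HourBlock", 0:23))
hour.dummy <- model.matrix(~HourBlockDeptTime, data = flights)[, -1]
head(hour.dummy, 3)
```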

## HourBlockDeptTimeHourBlock1 HourBlockDeptTimeHourBlock10
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock11 HourBlockDeptTimeHourBlock12
## 1 0 0
## 2 0 0
## 3 0 1
## HourBlockDeptTimeHourBlock13 HourBlockDeptTimeHourBlock14
## 1 0 1
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock15 HourBlockDeptTimeHourBlock16
## 1 0 0
## 2 0 1
## 3 0 0
## HourBlockDeptTimeHourBlock17 HourBlockDeptTimeHourBlock18
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock19 HourBlockDeptTimeHourBlock20
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock21 HourBlockDeptTimeHourBlock22
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock23 HourBlockDeptTimeHourBlock5
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock6 HourBlockDeptTimeHourBlock7
## 1 0 0
## 2 0 0
## 3 0 0
## HourBlockDeptTimeHourBlock8 HourBlockDeptTimeHourBlock9
## 1 0 0
## 2 0 0
## 3 0 0

Let’s join all the variables into one big data frame that we can later feed into our logistic regression.
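The join can be a simple column bind of the dummy matrices with the remaining variables (a sketch; flights.dummy and hour.dummy are the matrices built in the earlier steps):

```r
# Combine the carrier/destination/origin/day dummies, the hour-block
# dummies, the Weather flag, and the target into one modeling data frame.
flights.model <- data.frame(flights.dummy, hour.dummy,
                            Weather = flights$Weather,
                            FlightStatus = flights$Flight.Status)
head(flights.model, 2)
```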

## CARRIERDH CARRIERDL CARRIERMQ CARRIEROH CARRIERRU CARRIERUA CARRIERUS
## 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 0 0
## DESTJFK DESTLGA ORIGINDCA ORIGINIAD DAY_WEEK2 DAY_WEEK3 DAY_WEEK4
## 1 1 0 0 0 0 0 1
## 2 1 0 1 0 0 0 1
## DAY_WEEK5 DAY_WEEK6 DAY_WEEK7 HourBlockDeptTimeHourBlock1
## 1 0 0 0 0
## 2 0 0 0 0
## HourBlockDeptTimeHourBlock10 HourBlockDeptTimeHourBlock11
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock12 HourBlockDeptTimeHourBlock13
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock14 HourBlockDeptTimeHourBlock15
## 1 1 0
## 2 0 0
## HourBlockDeptTimeHourBlock16 HourBlockDeptTimeHourBlock17
## 1 0 0
## 2 1 0
## HourBlockDeptTimeHourBlock18 HourBlockDeptTimeHourBlock19
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock20 HourBlockDeptTimeHourBlock21
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock22 HourBlockDeptTimeHourBlock23
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock5 HourBlockDeptTimeHourBlock6
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock7 HourBlockDeptTimeHourBlock8
## 1 0 0
## 2 0 0
## HourBlockDeptTimeHourBlock9 Weather FlightStatus
## 1 0 0 ontime
## 2 0 0 ontime

4. Model Building: Logistic Regression

Now, it’s generally NOT a good idea to use your ENTIRE data sample to fit the model. What we want to do is train the model on a sample of the data and then see how it performs outside of the training sample. Splitting our data set into training and test sets lets us evaluate the performance of our models on unseen data. Using the entire data set to build a model and then using the same data to evaluate how well it does is a bit of cheating, or at least careless analytics.

We use a RANDOM sample comprising 60% of the data set as the training set. Let’s take a peek at the first 5 rows of the training set.
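A typical way to do the 60/40 split (a sketch, assuming the flights.model data frame from the previous step; the seed is arbitrary and only there for reproducibility):

```r
set.seed(123)                              # arbitrary seed for reproducibility
n <- nrow(flights.model)
train.idx <- sample(n, size = round(0.6 * n))  # 60% of rows at random
train <- flights.model[train.idx, ]
test  <- flights.model[-train.idx, ]
head(train, 5)
```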

## CARRIERDH CARRIERDL CARRIERMQ CARRIEROH CARRIERRU CARRIERUA CARRIERUS
## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
## 7 1 0 0 0 0 0 0
## 8 1 0 0 0 0 0 0
## DESTJFK DESTLGA ORIGINDCA ORIGINIAD DAY_WEEK2 DAY_WEEK3 DAY_WEEK4
## 2 1 0 1 0 0 0 1
## 3 0 1 0 1 0 0 1
## 4 0 1 0 1 0 0 1
## 7 1 0 0 1 0 0 1
## 8 1 0 0 1 0 0 1
## DAY_WEEK5 DAY_WEEK6 DAY_WEEK7 HourBlockDeptTimeHourBlock1
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## HourBlockDeptTimeHourBlock10 HourBlockDeptTimeHourBlock11
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock12 HourBlockDeptTimeHourBlock13
## 2 0 0
## 3 1 0
## 4 0 0
## 7 1 0
## 8 0 0
## HourBlockDeptTimeHourBlock14 HourBlockDeptTimeHourBlock15
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock16 HourBlockDeptTimeHourBlock17
## 2 1 0
## 3 0 0
## 4 0 1
## 7 0 0
## 8 1 0
## HourBlockDeptTimeHourBlock18 HourBlockDeptTimeHourBlock19
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock20 HourBlockDeptTimeHourBlock21
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock22 HourBlockDeptTimeHourBlock23
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock5 HourBlockDeptTimeHourBlock6
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock7 HourBlockDeptTimeHourBlock8
## 2 0 0
## 3 0 0
## 4 0 0
## 7 0 0
## 8 0 0
## HourBlockDeptTimeHourBlock9 Weather FlightStatus
## 2 0 0 ontime
## 3 0 0 ontime
## 4 0 0 ontime
## 7 0 0 ontime
## 8 0 0 ontime

5. Results with Training Data

Now, let’s feed the training data (60% of our total data set) into our logistic regression model:
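The fitting call matches the Call line in the summary below. Since FlightStatus is a factor with levels "delayed" < "ontime", the model estimates the probability of being ontime:

```r
# Fit the logistic regression on the training set using all predictors.
model <- glm(FlightStatus ~ ., family = binomial, data = train)
summary(model)
```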

##
## Call:
## glm(formula = FlightStatus ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9293 0.2287 0.4632 0.6330 1.4940
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.00948 2399.54481 -0.008 0.994012
## CARRIERDH 1.22308 0.59115 2.069 0.038549 *
## CARRIERDL 1.85836 0.54340 3.420 0.000627 ***
## CARRIERMQ 0.62141 0.51715 1.202 0.229517
## CARRIEROH 2.21278 0.95203 2.324 0.020111 *
## CARRIERRU 1.05671 0.44686 2.365 0.018043 *
## CARRIERUA 1.59265 0.98043 1.624 0.104283
## CARRIERUS 2.10368 0.55575 3.785 0.000154 ***
## DESTJFK -0.02558 0.36198 -0.071 0.943666
## DESTLGA -0.28915 0.36769 -0.786 0.431636
## ORIGINDCA 1.48927 0.42527 3.502 0.000462 ***
## ORIGINIAD 0.50949 0.40285 1.265 0.205968
## DAY_WEEK2 0.27817 0.28915 0.962 0.336039
## DAY_WEEK3 0.43535 0.28308 1.538 0.124070
## DAY_WEEK4 0.48144 0.27512 1.750 0.080136 .
## DAY_WEEK5 0.43374 0.27570 1.573 0.115672
## DAY_WEEK6 1.28499 0.37066 3.467 0.000527 ***
## DAY_WEEK7 -0.02538 0.28445 -0.089 0.928917
## HourBlockDeptTimeHourBlock1 NA NA NA NA
## HourBlockDeptTimeHourBlock10 18.07704 2399.54479 0.008 0.993989
## HourBlockDeptTimeHourBlock11 17.45879 2399.54487 0.007 0.994195
## HourBlockDeptTimeHourBlock12 18.09730 2399.54478 0.008 0.993982
## HourBlockDeptTimeHourBlock13 16.47483 2399.54476 0.007 0.994522
## HourBlockDeptTimeHourBlock14 17.78484 2399.54477 0.007 0.994086
## HourBlockDeptTimeHourBlock15 15.96003 2399.54477 0.007 0.994693
## HourBlockDeptTimeHourBlock16 16.99874 2399.54476 0.007 0.994348
## HourBlockDeptTimeHourBlock17 16.99070 2399.54476 0.007 0.994350
## HourBlockDeptTimeHourBlock18 16.69408 2399.54477 0.007 0.994449
## HourBlockDeptTimeHourBlock19 15.49419 2399.54478 0.006 0.994848
## HourBlockDeptTimeHourBlock20 16.30981 2399.54479 0.007 0.994577
## HourBlockDeptTimeHourBlock21 17.16271 2399.54477 0.007 0.994293
## HourBlockDeptTimeHourBlock22 -0.95332 2538.73457 0.000 0.999700
## HourBlockDeptTimeHourBlock23 -0.51317 2680.37215 0.000 0.999847
## HourBlockDeptTimeHourBlock5 17.46843 2399.54490 0.007 0.994192
## HourBlockDeptTimeHourBlock6 17.90277 2399.54478 0.007 0.994047
## HourBlockDeptTimeHourBlock7 17.51392 2399.54479 0.007 0.994176
## HourBlockDeptTimeHourBlock8 17.94252 2399.54478 0.007 0.994034
## HourBlockDeptTimeHourBlock9 16.07116 2399.54479 0.007 0.994656
## Weather -17.74165 493.32979 -0.036 0.971312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1312.7 on 1319 degrees of freedom
## Residual deviance: 1034.1 on 1282 degrees of freedom
## AIC: 1110.1
##
## Number of Fisher Scoring iterations: 15


The following variables are significant in predicting flight delays according to the model output above:
  • DAY_WEEK6 (Saturday)
  • Origin airport is DCA (Reagan National)
  • Carrier is US Airways
  • Carrier is Delta
  • Carrier is Atlantic Coast
  • Carrier is Comair
  • Carrier is Continental Express

Interestingly, the Hour of the Day shows no statistical significance in predicting flight delays. (Note the very large standard errors on the hour-block coefficients, though, which hint at quasi-complete separation, so this result should be interpreted with caution.)

6. Model Evaluation: Logistic Regression

The real test of a good model is to test the model with data that it has not fitted. Here’s where the rubber meets the road. We apply our model to unseen data to see how it performs.

6.1 Prediction using out-of-sample data.

Let’s feed the test data(unseen) to our logistic regression model.
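A sketch of the scoring step, using the caret package (whose confusionMatrix output matches what is shown below):

```r
library(caret)

# Score the held-out test set and apply the 0.5 cutoff from earlier.
test.prob <- predict(model, newdata = test, type = "response")
test.pred <- factor(ifelse(test.prob >= 0.5, "ontime", "delayed"),
                    levels = c("delayed", "ontime"))
confusionMatrix(test.pred, test$FlightStatus, positive = "delayed")
```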

Confusion Matrix

We use the confusion matrix to assess the performance of the binary classifier, which is how logistic regression is being used in this example. Note that logistic regression can also be extended beyond binary classification to multiple classes (multinomial logistic regression).

Please check out this nice Wikipedia explanation of the Confusion Matrix.

The diagonal of the confusion matrix holds the true positives and true negatives: the model correctly predicted 42 delayed flights and 706 ontime flights.

On the other hand, it predicted 125 flights to be delayed that were actually on time (false positives), and 8 flights to be ontime that were actually delayed (false negatives).

There are three metrics that people look at:

  1. sensitivity (true positive rate or recall): True Positive / (True Positive + False Negative)
  2. specificity (true negative rate): True Negative / (False Positive + True Negative)
  3. accuracy: (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
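Plugging in the counts from the confusion matrix output below (TP = 42, FN = 8, FP = 125, TN = 706) reproduces the reported statistics:

```r
TP <- 42; FN <- 8; FP <- 125; TN <- 706
sensitivity <- TP / (TP + FN)                    # 42/50   = 0.84000
specificity <- TN / (FP + TN)                    # 706/831 = 0.84958
accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # 748/881 = 0.849
c(sensitivity, specificity, accuracy)
```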

## Warning: package 'ROCR' was built under R version 3.1.3
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Confusion Matrix and Statistics
##
## Reference
## Prediction delayed ontime
## delayed 42 125
## ontime 8 706
##
## Accuracy : 0.849
## 95% CI : (0.8237, 0.872)
## No Information Rate : 0.9432
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3284
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.84000
## Specificity : 0.84958
## Pos Pred Value : 0.25150
## Neg Pred Value : 0.98880
## Prevalence : 0.05675
## Detection Rate : 0.04767
## Detection Prevalence : 0.18956
## Balanced Accuracy : 0.84479
##
## 'Positive' Class : delayed
##


Lift Chart (ROC: Receiver Operating Characteristic curve):

This is a graphical representation of the relationship between the sensitivity and the false positive rate. The sensitivity is the proportion of correct positive results among all positive samples available. The false positive rate, on the other hand, is the proportion of incorrect positive results among all negative samples available.

The BEST possible prediction model would yield a point in the upper left corner (0,1). This would represent perfect classification: 100% sensitivity (no false negatives) and 100% specificity (no false positives). So the higher the ROC curve sits above the diagonal, the better the model; the closer the curve is to the diagonal, the closer the model is to random guessing.
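The curve can be drawn with the ROCR package loaded in the warnings above (a sketch; test.prob is the hypothetical vector of fitted ontime probabilities from the scoring step, so 1 - test.prob is the probability of delay, the positive class):

```r
library(ROCR)

# ROC curve: true positive rate vs. false positive rate,
# with "delayed" treated as the positive class.
rocr.pred <- prediction(1 - test.prob, test$FlightStatus == "delayed")
rocr.perf <- performance(rocr.pred, measure = "tpr", x.measure = "fpr")
plot(rocr.perf)
abline(0, 1, lty = 2)                         # diagonal = random guessing
performance(rocr.pred, "auc")@y.values[[1]]   # area under the curve
```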

Accuracy

This represents the overall accuracy of the classifier. It can be misleading when there are many more negatives than positives; here, for example, our accuracy of 0.849 is actually below the No Information Rate of 0.9432 reported above.

7. Conclusion

Hope you enjoyed this and are excited about applying predictive analytics models to your own problem space.

In follow-on posts I’ll explain the theory behind these methods in more detail, along with their differences and similarities.
