The objective of my final project at Metis (weeks 9 to 12) is to categorize drivers based on their behaviour on the road: their driving style and the types of roads they follow.
The challenge associated with this objective is to uniquely identify a driver (and hence his own "driving behaviour") from the GPS log of a mobile phone located inside the car.
My idea to solve this issue is to experiment with Topic Modeling techniques, especially Latent Semantic Indexing/Analysis (LSI/LSA) and Latent Dirichlet Allocation (LDA), and to explain the observed trips by the unobserved behaviour of the drivers.
The following is an executive summary … you can also browse through the ppt that I am presenting at Metis on the 7th of April 2015 during the Career Event, or check the Python code available on my blog: http://nasdag.org
The raw data received for each trip is a csv file of (x, y) coordinates logged once per second.
My approach consists of first preprocessing the data using statistical smoothing and compression algorithms:
- Kalman Filtering
then extracting Road and Driving Style features:
- per Segment: Length, Slip Angle, Convexity, Radius
- per Meter: Speed, Accelerations (tangential and normal), Jerk, Yaw, Pauses
then binning the output to generate the "Driving Alphabet" (ex: d0, d1, d2… v0, v1, v2… a0, a1, a2… etc),
and finally, building the Driving Vocabulary, made of "Driving Slides" (ex: d3L4v2n3y1), for various preprocessing sensitivities or feature combinations (the languages).
Then I translate trips from GPS logs into documents; tokenize, filter, … and the data is ready!
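The tokenization step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the bin edges, the two features used (speed and tangential acceleration), and the token format are all assumptions made for the example.

```python
import numpy as np

def trip_to_document(xy, v_edges=(5.0, 10.0, 15.0, 20.0),
                     a_edges=(-1.0, -0.2, 0.2, 1.0)):
    """Turn an (N, 2) array of 1 Hz GPS fixes into 'Driving Slide' tokens.

    Hypothetical bin edges: v_edges in m/s, a_edges in m/s^2.
    """
    xy = np.asarray(xy, dtype=float)
    # Speed (m/s): displacement between consecutive 1-second fixes.
    dxy = np.diff(xy, axis=0)
    speed = np.hypot(dxy[:, 0], dxy[:, 1])
    # Tangential acceleration (m/s^2): change in speed per second.
    accel = np.diff(speed)
    # Bin each signal into a small alphabet (v0, v1, ... / a0, a1, ...).
    v_bin = np.digitize(speed[1:], v_edges)
    a_bin = np.digitize(accel, a_edges)
    # A 'slide' concatenates one symbol per feature into a single word.
    return ["v%da%d" % (v, a) for v, a in zip(v_bin, a_bin)]

doc = trip_to_document([[0, 0], [3, 4], [6, 8], [12, 16]])
# → ['v1a2', 'v2a4']
```

In the real project more features (jerk, yaw, segment geometry, pauses) would be folded into each slide, and several edge sets would produce the different "languages".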
I will use the GENSIM library to transpose trips into an LDA or LSI space where each trip becomes a combination of “Driving Behaviours” made of “Driving Slides”.
In order to validate my model I am using it to compete in the AXA Kaggle challenge, where I need to come up with a "telematic fingerprint" capable of distinguishing when a trip was driven by a given driver, knowing that among the 200 trips provided for each of the 2736 drivers, a small number of trips were not driven by that driver.
Submissions are judged on area under the ROC curve calculated in a global manner (all predictions together).
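Concretely, all predicted probabilities are pooled into a single ranking and one AUC is computed over it (the labels and scores below are made up for illustration):

```python
from sklearn.metrics import roc_auc_score

# Pooled predictions: 1 = trip driven by the claimed driver.
y_true = [1, 1, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.3, 0.35, 0.4, 0.7]

# 7 of the 8 (positive, negative) pairs are ranked correctly: AUC = 0.875.
auc = roc_auc_score(y_true, y_score)
```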
My approach is the following:
- transpose all trips into the new Driving Behaviours Space
- take one by one each trip from a selected Driver
- build a prediction model trained with all other trips in the dataset:
True if they belong to the selected Driver,
False if they do not
- predict, with the trained model, whether the selected Trip belongs to the Driver, then ensemble several predictions using various sensitivities to enhance the score …
For performance reasons I will proceed in batches of 10 or 20 selected trips and compare each time against a randomly selected, limited number of False trips.
Other outlier detection / clustering techniques appear to perform worse.
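The leave-one-out loop described above can be sketched like this. It is a simplified, assumption-laden version: a logistic regression stands in for whatever classifier is actually used, trips are already feature vectors in the Driving Behaviours space, and batching is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def score_driver_trips(driver_X, other_X, seed=0):
    """For each trip of the selected driver, train on the remaining trips
    (True) plus a random sample of other drivers' trips (False), then
    score the held-out trip."""
    rng = np.random.RandomState(seed)
    scores = []
    for i in range(len(driver_X)):
        rest = np.delete(driver_X, i, axis=0)
        # Random limited subset of 'False' trips, as in the batched setup.
        idx = rng.choice(len(other_X), size=len(rest), replace=False)
        X = np.vstack([rest, other_X[idx]])
        y = np.r_[np.ones(len(rest)), np.zeros(len(idx))]
        clf = LogisticRegression().fit(X, y)
        scores.append(clf.predict_proba(driver_X[i:i + 1])[0, 1])
    return scores

# Toy data: the driver's trips cluster away from other drivers' trips.
driver_X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2]])
other_X = np.array([[-1.0, -1.0], [-1.1, -0.9], [-0.9, -1.1],
                    [-1.2, -1.0], [-1.0, -1.2]])
scores = score_driver_trips(driver_X, other_X)
```

Repeating this over several preprocessing sensitivities and averaging the scores gives the ensembling step.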
The 3.3M generated documents are kept in MongoDB, and parallel processing is set up on 4 DigitalOcean Droplets with 8 CPUs each.
An AUC of 0.9 has been measured by Kaggle without any ensembling technique, which confirms the robustness of this approach …