Contributed by David Letzler, Kyle Gallatin and Christopher Capozzola. They enrolled in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place between January 9th, 2017 and March 31st, 2017. The original article can be found here.
For this project, we took on the Two Sigma Connect: Rental Listing Inquiries Challenge on Kaggle. The rental website Renthop provided us with a csv of data from 120,000 listings and asked us to produce a model to predict whether a given listing would receive "low," "medium," or "high" interest. The model would be judged by predicting a test set, with the log-loss formula determining its effectiveness.
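For readers unfamiliar with the metric, multi-class log-loss is the average negative log of the probability a model assigns to each listing's true class. The snippet below is a minimal sketch of that calculation, not the contest's scoring code; the class order and the clipping constant `eps` are our own assumptions for illustration.

```python
import math

# Assumed column order for the predicted probabilities (illustrative only)
CLASSES = ["low", "medium", "high"]

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = probs[CLASSES.index(label)]
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -math.log(p)
    return total / len(y_true)

# Two toy listings with made-up predictions
y_true = ["low", "high"]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
score = multiclass_log_loss(y_true, y_prob)
```

A model that always predicts one third for each class scores ln(3), roughly 1.10, so anything meaningfully below that reflects real signal. Lower is better.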
We approached the challenge ready to explore different machine-learning algorithms and creatively engineer the dataset. In the process, we learned:
We created several new features from the dataset immediately. These included –
1. Price per room: Price emerged quickly as a key predictor, with lower-rent apartments drawing more interest. We reasoned, though, that a 3-bedroom apartment would be more attractive than a 1-bedroom at a given price, so we mutated a price-per-room variable for each listing.
2. Number of features/photos/words: We also reasoned that listings with more material would draw more interest. Consequently, we created variables that counted the number of photos, the number of words in the description, and the number of listed features. (We also wrote a script to evaluate photo size, in lieu of downloading all 300,000 photos, but did not have time to include this in the final model.)
3. Date: Each listing had a time-stamp including both date and time of posting. Since we figured that the time of day might affect how visible the listing would be and that seasonal cycles would influence interest, we split these into two separate columns.
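The three kinds of features above can be sketched in a few lines of pandas. This is an illustration on toy rows rather than our original script: the column names are assumed, and counting one extra room per listing (so that studios with zero bedrooms do not divide by zero) is our own assumption about how "rooms" might be defined.

```python
import pandas as pd

# Toy rows; the column names are assumptions for illustration
listings = pd.DataFrame({
    "price": [3000, 2400],
    "bedrooms": [3, 0],
    "photos": [["a.jpg", "b.jpg"], []],
    "features": [["Elevator", "Dishwasher"], ["Prewar"]],
    "description": ["Sunny three bedroom near the park", "Cozy studio"],
    "created": ["2016-06-24 07:54:24", "2016-04-12 22:10:03"],
})

# Assumption: add one room so studios (0 bedrooms) do not divide by zero
listings["price_per_room"] = listings["price"] / (listings["bedrooms"] + 1)

# Counts of photos, listed features, and words in the description
listings["n_photos"] = listings["photos"].apply(len)
listings["n_features"] = listings["features"].apply(len)
listings["n_words"] = listings["description"].str.split().str.len()

# Split the timestamp into a date column and a time-of-day column
created = pd.to_datetime(listings["created"])
listings["created_date"] = created.dt.date
listings["created_hour"] = created.dt.hour
```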
The "features" variable, which contained every distinct feature (e.g., Elevator, Cats/Dogs Allowed, etc.) for each apartment, proved more challenging. To analyze them separately, we unlisted and tabled the features, then exported a csv for visual inspection.
Though there were nearly 1,300 distinct feature tokens, many of these were effectively the same, whether due to the use of synonyms or alternate spellings of the same words. There were 13,000 “fitness center”s, but also a couple hundred “gym”s and a few dozen “health club”s. There were “live in super”s, “live-in superintendent”s, and “on-site super”s, and so on.
To correctly assign features to apartments, we made 54 new variables and used regular expressions to match as many features as we could to the listings. For example, we assigned “balcony” and “terrace” (both in the Top 20) to the same "private outdoor space" variable. We also pasted together the “description” and “features” variables, since over 1400 listings had no separate feature breakdown. Based on the table, we estimate our approach matched at least 99.5% of features listed in the original data to their listings.
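To give a flavor of the matching, here is a small sketch using three of the groupings mentioned above. The patterns and variable names are hypothetical stand-ins, not the 54 expressions we actually wrote.

```python
import re

# Hypothetical patterns; each maps several spellings to one indicator variable
PATTERNS = {
    "fitness": re.compile(r"\b(fitness center|gym|health club)\b", re.I),
    "live_in_super": re.compile(r"\b(live[- ]in|on[- ]site) super(intendent)?\b", re.I),
    "private_outdoor_space": re.compile(r"\b(balcony|terrace)\b", re.I),
}

def flag_features(features, description):
    """Return a 0/1 flag per variable, searching features and description together."""
    text = " ".join(features) + " " + description
    return {name: int(bool(pattern.search(text))) for name, pattern in PATTERNS.items()}

flags = flag_features(["Gym", "On-site Super"], "Bright unit with a private terrace.")
```

Searching the pasted-together text means a listing with an empty feature list can still be flagged if the description mentions the amenity.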
To determine which features had any effect upon log-loss, we constructed a saturated Random Forest model with the new features (alongside the all-important price variable) on a subset of the training set. Next, we ordered the features by Gini importance in that model, then gradually pruned, tabulating the model’s log-loss on the rest of the set as we went.
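The pruning loop looks roughly like the sketch below, written with scikit-learn on synthetic data. The dataset, the number of trees, and the step size of three features are all assumptions for illustration; `feature_importances_` is scikit-learn's impurity-based (Gini) importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature flags and three interest levels
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)

# Saturated model on the fitting subset, then rank features by Gini importance
full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)
order = np.argsort(full.feature_importances_)[::-1]

# Refit on the top-k features and record log-loss on the rest of the data
results = {}
for k in range(len(order), 0, -3):
    keep = order[:k]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_fit[:, keep], y_fit)
    proba = model.predict_proba(X_rest[:, keep])
    results[k] = log_loss(y_rest, proba, labels=model.classes_)
```

Plotting `results` against `k` gives the kind of curve discussed next.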
As the plot shows, about 40 of the features manage to lower log-loss, though the reduction tapers at about 25. The decline, though, is quite incremental after more than a few predictors have been included. Proportionately, the features that removed the most log-loss were Prewar, Eat-In Kitchen, Dining Room, Dishwasher, No Fee, Hardwood Floors, Newly Renovated, Elevator, Marble Bathroom, and Cats Allowed.
To account for multicollinearity and efficiently incorporate as many of the useful features as possible, we conducted a Principal Component Analysis.
This scree-plot shows that while most of the features are independent, a handful seem to be co-variant, allowing us to get the predictive power of close to 40 features with only 30 principal components. For the final model, we incorporated a 30-PC feature-set.
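As a rough illustration of the idea, the sketch below runs scikit-learn's PCA on random binary flags with one deliberately duplicated column. The data and the 95% variance cutoff are assumptions; we chose our own 30 components by reading the scree plot, not by a fixed threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy stand-in: 500 listings x 10 binary feature flags
flags = rng.integers(0, 2, size=(500, 10)).astype(float)
flags[:, 1] = flags[:, 0]  # make two flags perfectly co-variant

# Fit all components and inspect the cumulative explained variance
pca = PCA().fit(flags)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Assumed cutoff: keep the fewest components that explain 95% of the variance
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
reduced = PCA(n_components=n_components).fit_transform(flags)
```

Because two of the columns carry identical information, fewer than ten components are needed to cover the variance, which is the same effect we saw at larger scale.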
We assumed that residential area would have a large effect on interest. Living closer to the center of the city is desirable, so long as the price isn't too high. As such, we decided to convert the latitude and longitude variables to neighborhoods. By physically listing out the name of the area, such as “East Village, NY,” you can use the ggmap library to get the lat-longs for the center of the neighborhood. Once you have a list of areas and their respective latitude and longitude, you can simply use KNN (k nearest neighbors) to assign the lat-longs in the data to neighborhoods. A count of listings by neighborhood – note that this is not normalized for size of that area – shows that Renthop listings are concentrated in midtown and downtown Manhattan.
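With k = 1, the KNN step reduces to assigning each listing to the closest neighborhood center. Below is a small NumPy sketch of that assignment; the three centers are approximate coordinates typed in by hand for illustration (in practice they came from geocoding), and plain Euclidean distance on degrees is a simplification that is only reasonable over a small area like New York City.

```python
import numpy as np

# Approximate, hand-entered centers for illustration only
names = np.array(["East Village", "Midtown", "Williamsburg"])
centers = np.array([
    [40.7265, -73.9815],
    [40.7549, -73.9840],
    [40.7081, -73.9571],
])

def assign_neighborhood(latlon):
    """Label each (lat, lon) row with the name of the nearest center."""
    latlon = np.atleast_2d(np.asarray(latlon, dtype=float))
    # Squared distance from every listing to every center
    d2 = ((latlon[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return names[d2.argmin(axis=1)]

assigned = assign_neighborhood([[40.7280, -73.9850], [40.7100, -73.9600]])
```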
Next, we can map the neighborhoods back to general areas of New York City and New Jersey. Below is a histogram of areas for the training set, with interest level as the fill.
Unfortunately, these area assignments did not improve our score much. The histogram explains why: there seems to be little correlation between area and interest level. All general areas (and even the smaller neighborhoods, for that matter) seem to have equivalent ratios of high, medium and low interest apartments.
Given that price was a strong predictor, we also attempted to assign a median price for each neighborhood. After grouping by both neighborhood and number of bedrooms, we summarized with a median price and joined it back with the original data frame. Finally, we assigned a binary variable (“expensive”) that was "True" if the price of an apartment was above the neighborhood median and "False" otherwise.
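In pandas terms, the group, summarize, and join steps might look like the sketch below. The toy data and column names are made up, and we assume here that the grouping keys are neighborhood and bedroom count, so that each listing is compared with similar apartments nearby.

```python
import pandas as pd

# Toy listings; names and prices are illustrative
df = pd.DataFrame({
    "neighborhood": ["East Village"] * 3 + ["Midtown"] * 3,
    "bedrooms": [1, 1, 1, 2, 2, 2],
    "price": [2500, 3000, 3500, 4000, 4500, 6500],
})

# Median price per (neighborhood, bedrooms) group
medians = (
    df.groupby(["neighborhood", "bedrooms"], as_index=False)["price"]
      .median()
      .rename(columns={"price": "median_price"})
)

# Join the medians back and flag listings priced above their group's median
df = df.merge(medians, on=["neighborhood", "bedrooms"], how="left")
df["expensive"] = df["price"] > df["median_price"]
```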
Ultimately, it didn’t improve the score significantly, but it helped more than the neighborhoods themselves. While this work may have been unnecessary in the end, it was interesting to see a feature that seemed important turn out to have a small impact.
Finally, we were curious as to whether the effusiveness of the apartment descriptions affected interest level. Did it matter whether a listing's description emphasized the "stunning" view? To find out, we used the NRC sentiment lexicon (as implemented in Matthew Jockers's syuzhet package) to evaluate each listing on metrics like "trust" and "positivity." However, these seemed to have only a small effect in predicting interest.
Reasoning that apartment listings have their own specific sentiment language, we also tried to generate a real-estate-specific sentiment lexicon. By using keyness to compare the total word frequencies in the descriptions against the Brown Corpus of Standard American English (1960), we were able to highlight words that are particularly common in listing language, as seen in the WordCloud to the right.
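One common way to compute keyness is the log-likelihood (G2) statistic, which compares a word's observed frequency in each corpus with the frequency expected if both corpora used it at the same rate. The sketch below assumes that measure and uses two tiny made-up "corpora"; it is not necessarily the exact statistic or tooling behind our word cloud.

```python
import math
from collections import Counter

def keyness(word, target, reference):
    """Log-likelihood (G2) of a word's frequency in target vs. reference counts."""
    a, b = target[word], reference[word]
    c, d = sum(target.values()), sum(reference.values())
    # Expected counts if the word were equally common in both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Made-up token counts standing in for listings and a general reference corpus
listing_counts = Counter("spacious sunny spacious apartment with great light".split())
reference_counts = Counter("the committee met in the morning with great care".split())

scores = {w: keyness(w, listing_counts, reference_counts) for w in listing_counts}
```

G2 measures how surprising the difference is, not its direction, so in practice one would also check that a high-scoring word is over-represented in the listings rather than in the reference corpus.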
In addition to general positive language like “beautiful” and “great,” words emphasizing size (e.g., “spacious”) and culture/convenience (e.g. “central”) stand out in the visualization. However, despite several efforts to score and/or classify these words, they did not improve the final model much. Apartment-seekers likely see through agents' spin.
Our most elaborate attempt at analyzing the language of listings involved making a sparse matrix of the high-frequency words in the text, then performing a logistic PCA to try to see whether there were any predictive linguistic patterns. Again, this seemed to have a minimal effect. Most of what the description had to tell us, it seems, was already in the model.
Our model selection process went through multiple stages. Initially, we ran a collection of untuned algorithms to establish a baseline for how each would perform in terms of time, accuracy, and precision. For the Gradient Boosting Classifier, Random Forest Classifier, Support Vector Machine, and a Multi-class Logistic Regression, we used the grid search tools in Python's scikit-learn to vary both the parameter values and the set of features to be incorporated. Using its cross-validation tools, we were able to tune each algorithm specifically to minimize log-loss.
We noticed immediately that the tree-based algorithms, like the Gradient Boosting Classifier and Random Forest, performed significantly better at the baseline, so we devoted more time to tuning them. Of the four models used in the exploratory analysis, the Gradient Boosting Classifier performed the best, with an initial log-loss score of about 0.59. Given that success, we wanted to see if we could push our results further by using the XGBoost algorithm in R. Ultimately, XGBoost did improve our results, and we chose it for our final model.
The caret package has a grid search similar to the one in Python. Using this, it was easy to search through a few parameters. However, given the scale of the model and the number of final feature columns, it was computationally expensive to run over many hyperparameters. Consequently, we didn’t use it to run a wide search but to narrow down options for parameters such as the learning rate.
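A grid search scored on log-loss can be set up in scikit-learn roughly as follows. The synthetic data and the small parameter grid are assumptions chosen to keep the example quick; they are not the grids we actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the listings
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Illustrative grid; real searches would cover more values
param_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}

# scoring="neg_log_loss" makes the search pick the lowest cross-validated log-loss
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="neg_log_loss",
    cv=3,
)
search.fit(X, y)

best_log_loss = -search.best_score_  # flip the sign back to a positive log-loss
```

scikit-learn maximizes scores, which is why the metric is exposed as a negated log-loss.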
Caret can also create cross-validation folds. Instead of running a time-consuming cross-validation, you can pull a single fold out of the full training set and use it as your subtest in XGBoost. By adding it to the watchlist, you can assess the accuracy of your model on a smaller test set while training it. Of course, since this doesn’t compare all folds against each other like actual cross-validation, it doesn’t account for variance between folds and is prone to some degree of error. Often, the test score could be lowered to 0.55 in script, but our submitted model received a Kaggle score closer to 0.6.
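The same single-fold idea can be sketched in Python. Note the swap: rather than XGBoost's watchlist in R, this example uses scikit-learn's GradientBoostingClassifier and its `staged_predict_proba` method to score the held-out fold after every boosting round. The data, fold count, and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic three-class data standing in for the training set
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Pull out the first of five folds as a fixed "subtest" set
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_idx, sub_idx = next(folds.split(X, y))

model = GradientBoostingClassifier(n_estimators=60, learning_rate=0.1, random_state=0)
model.fit(X[train_idx], y[train_idx])

# Log-loss on the held-out fold after each boosting round
subtest_loss = [
    log_loss(y[sub_idx], proba, labels=model.classes_)
    for proba in model.staged_predict_proba(X[sub_idx])
]
best_round = int(np.argmin(subtest_loss)) + 1
```

Because `best_round` is chosen on one fold only, the corresponding score tends to be optimistic, which is consistent with the gap we saw between our in-script and submitted scores.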
Our final model, then, was an XGBoost that included a) basic features like price, b) mutated features, like price-per-room and number of photos, and c) engineered features, including neighborhood designations and principal-component stand-ins for various features. This netted us a final score of 0.5625. It's a solid figure, though, since lower log-loss is better, it is still well behind the contest leaders' scores. We came up with some good ideas, then, but we still have a ways to go before we're machine-learning gurus.