Have you ever wondered why some URLs start with ‘https’ while others start with ‘http’ or why some websites have this padlock before the URL?
The technology behind https relies on the Secure Sockets Layer certificate (SSL certificate). SSL, and its successor Transport Layer Security (TLS), is a security protocol used to secure data between the web server and browser using encryption, and the certificate is what lets the browser check which server it is talking to. Millions of websites use these certificates to establish secure connections and keep their customers’ data safe.
SSL certificates contain the website’s public key and identity, along with other related information.
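As a rough illustration of what a certificate carries, the sketch below summarises a certificate given in the dictionary form returned by `ssl.SSLSocket.getpeercert()` in Python's standard library. The helper names `fetch_cert` and `summarise_cert` are our own for this example, the sample certificate is made up, and note that this dictionary form exposes the identity and validity fields but not the public key itself.

```python
import socket
import ssl


def fetch_cert(hostname, port=443, timeout=5.0):
    """Fetch a server's certificate as a dict (requires network access)."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()


def _flatten_name(name):
    """Turn the nested tuples of relative distinguished names into a dict."""
    return {key: value for rdn in name for key, value in rdn}


def summarise_cert(cert):
    """Pull the identity-related fields out of a getpeercert()-style dict."""
    subject = _flatten_name(cert.get("subject", ()))
    issuer = _flatten_name(cert.get("issuer", ()))
    dns_names = [v for kind, v in cert.get("subjectAltName", ()) if kind == "DNS"]
    return {
        "common_name": subject.get("commonName"),
        "organisation": subject.get("organizationName"),
        "issuer": issuer.get("organizationName") or issuer.get("commonName"),
        "dns_names": dns_names,
        "not_before": cert.get("notBefore"),
        "not_after": cert.get("notAfter"),
    }


# Made-up certificate in the shape getpeercert() returns (not real data).
SAMPLE_CERT = {
    "subject": ((("commonName", "example.test"),),),
    "issuer": ((("organizationName", "Example CA"),), (("commonName", "Example CA R1"),)),
    "subjectAltName": (("DNS", "example.test"), ("DNS", "www.example.test")),
    "notBefore": "Jan  1 00:00:00 2024 GMT",
    "notAfter": "Mar 31 00:00:00 2024 GMT",
}

summary = summarise_cert(SAMPLE_CERT)
```

Calling `summarise_cert(fetch_cert("your-domain"))` would do the same for a live site, assuming the host is reachable and presents a certificate that validates against the system trust store.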
Any business that wants to secure its website has to obtain an SSL certificate issued by a trusted Certificate Authority (CA), an entity that issues SSL certificates after background checks into the business and its owner.
SSL certificates make it possible to encrypt the data that flows between the client and the web server. The connection starts with an authentication process called a handshake, during which the client verifies the server’s identity using the certificate and the server’s public and private key pair (the server can optionally verify the client in the same way), and the two sides agree on the keys used to encrypt the session.
In the past users were taught to look for the https and padlock to ensure that the website is safe. But what if someone was able to perfectly mock a website and could get hold of a certificate as well? This is the approach that phishers take: they dupe users with sites that have valid SSL certificates in order to make their attacks appear more legitimate, all thanks to the emergence of free CAs and self-signed certificates.
The key to these attacks is the visual cue in the browser that will tell victims that the site they’re on is ok. In fact, the browser is just indicating that the site has a valid digital certificate and that the connection is encrypted; it doesn’t mean the site isn’t malicious.
Two major companies that have been targeted by phishing attacks using SSL certificates are PayPal and Apple. PhishLabs found in its research that more than a quarter of PayPal phishing attacks and 18% of Apple phishing attacks used SSL certificates.
It is surprisingly easy for anyone to acquire a certificate, and browsers have no way of knowing how trustworthy a given certificate provider is. Luckily, in the last few years providers have been issuing certificates that expire sooner, in line with industry standards. This forces users and organisations to renew more often. However, it can also cause outages and major financial loss if the SSL certificates do expire [Microsoft example] [Spotify example], which highlights how important it is for organisations to stay on top of their certificate deadlines. Shorter lifetimes also make attacks like the one we outlined above, or man-in-the-middle attacks, harder. As ever, though, attackers tend to find a way through.
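Staying on top of certificate deadlines can be automated. The following is a minimal sketch that computes the days left before a certificate's `notAfter` timestamp using `ssl.cert_time_to_seconds` from Python's standard library. The function names and the 30-day warning window are assumptions chosen for illustration, not a recommendation from this research.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the "Mon DD HH:MM:SS YYYY GMT" format found in
    getpeercert() output. A negative result means it has already expired.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400.0


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within `warn_days` (hypothetical default)."""
    return days_until_expiry(not_after, now=now) <= warn_days


# Example with fixed, made-up dates so the result does not depend on today.
check_time = datetime(2024, 3, 1, tzinfo=timezone.utc)
remaining = days_until_expiry("Mar 31 00:00:00 2024 GMT", now=check_time)
```

Run on a schedule against each of an organisation's domains, a check like this could raise an alert well before an expiry turns into an outage.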
Wandera already has an established phishing detection algorithm which, at the time of this research, did not take into account information from the certificate. As we gathered feedback from our customers and through our Threat Intelligence team, we saw patterns emerging in our missed detections that could have been found through some automatic analysis of the certificates. This prompted us to investigate this set of features to extend our phishing detection.
As with all Data Science (DS) and Machine Learning (ML) projects we had to consider what data we had to solve our problem. In this instance we were looking for both safe and compromised SSL certificates. Thanks again to our Threat Intelligence team outputs and our live detections we had a good basis of known records with and without certificates.
Additionally we leverage the detections from third parties like Openphish and Phishtank. Specifically for this classification task we considered another data source from Cali Dog Security who provide access to newly registered certificates in an effort to provide transparency across the Internet.
Combining these sources initially gave us both volume and variation of data to try to find these risky certificates. However, we still needed to collect any missing certificates from the domains themselves, and if a domain had already been taken down we lost that record. As you can see below, we saw the number of records drop 38%, but that was still enough to continue our analysis.
To detect these forms of attack we used the existing feature set from our phishing detection algorithm, along with additional pieces of information extracted from the SSL certificates themselves.
The features we engineered from SSL certificates can be divided into various categories.
Temporal features — examples
Text features — examples
These initial features can be used to create various categories that may indicate risk:
We can then extract features like:
and many more. Finally there are other certificate metadata that we can use:
These can be engineered into various new features both statistical and categorical in nature. This generated a large number of features that we could train and test against as seen below in our correlation matrix. We have blurred out the feature names to maintain our intellectual property but the highly correlated features are those that indicated missing or expiration information within the certificate.
Early indications that our feature space might have some good linear dependencies.
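To make the feature categories above more concrete, here is a small sketch that derives a few temporal, text and metadata features from a certificate in the dictionary form returned by Python's `ssl.SSLSocket.getpeercert()`. These particular features and their names are illustrative assumptions only; they are not the features used in our model, which remain undisclosed.

```python
import ssl
from datetime import datetime, timezone


def _to_datetime(cert_time):
    """Convert a certificate timestamp string into an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(cert_time), tz=timezone.utc)


def certificate_features(cert, observed_at):
    """Build a flat feature dict from a getpeercert()-style certificate."""
    subject = {key: value for rdn in cert.get("subject", ()) for key, value in rdn}
    common_name = subject.get("commonName", "")
    not_before = _to_datetime(cert["notBefore"])
    not_after = _to_datetime(cert["notAfter"])
    san = [v for kind, v in cert.get("subjectAltName", ()) if kind == "DNS"]
    return {
        # Temporal features
        "validity_days": (not_after - not_before).days,
        "age_days": (observed_at - not_before).days,
        "is_expired": observed_at > not_after,
        # Text features
        "cn_length": len(common_name),
        "cn_digit_count": sum(ch.isdigit() for ch in common_name),
        "cn_hyphen_count": common_name.count("-"),
        "cn_label_count": len(common_name.split(".")) if common_name else 0,
        # Other certificate metadata, including missing-information flags
        "san_count": len(san),
        "missing_organisation": "organizationName" not in subject,
    }


# Made-up certificate and observation time, for illustration only.
example_cert = {
    "subject": ((("commonName", "secure-login-1.example.test"),),),
    "subjectAltName": (("DNS", "secure-login-1.example.test"),),
    "notBefore": "Jan  1 00:00:00 2024 GMT",
    "notAfter": "Mar 31 00:00:00 2024 GMT",
}
features = certificate_features(example_cert, datetime(2024, 1, 11, tzinfo=timezone.utc))
```

A dictionary like `features` can then be stacked row by row into a table for training, with the boolean flags acting as the kind of missing or expiration indicators mentioned above.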
The eternal Data Science issue is getting data labelled. By combining the feedback from our customers, the scores of our existing solution, community scoring and a handful of manual labelling exercises, we gathered enough labelled data, covering both clean and dirty records, to use in our ML algorithms. This included having enough to perform 5-fold cross validation and so reduce the risk of overfitting. The final labelled data consisted of 20,483 dirty certificates and 19,275 clean certificates. As you may notice, we were able to synthesise an approximately 50:50 class split, so whichever algorithm analysed the data would have enough information per class to separate them. Of course, this isn’t the expected population in reality, where we see many more clean certificates, but that is how we then test the model: using a lifelike class split to ensure our model can handle a real-life batch of records.
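The idea of training on a balanced set but testing on a lifelike one can be sketched as follows. The helper `resample_to_ratio` is hypothetical, and the 1% dirty fraction is purely an assumed figure for illustration, since the real-world ratio is not stated here; only the two record counts come from the text above.

```python
import random


def resample_to_ratio(clean, dirty, dirty_fraction, seed=0):
    """Keep every clean record and downsample dirty ones to `dirty_fraction`.

    Returns (clean, sampled_dirty). `dirty_fraction` is the share of dirty
    records wanted in the combined evaluation set.
    """
    if not 0 < dirty_fraction < 1:
        raise ValueError("dirty_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    wanted = round(len(clean) * dirty_fraction / (1 - dirty_fraction))
    wanted = min(wanted, len(dirty))
    return clean, rng.sample(dirty, wanted)


# Stand-in record ids sized like the labelled data described above.
clean_ids = list(range(19_275))
dirty_ids = list(range(20_483))

# Share of dirty records in the balanced training data (roughly 51.5%).
training_dirty_share = len(dirty_ids) / (len(clean_ids) + len(dirty_ids))

# Assumed 1% dirty share for a lifelike evaluation set (hypothetical figure).
eval_clean, eval_dirty = resample_to_ratio(clean_ids, dirty_ids, dirty_fraction=0.01)
```

Evaluating on a set shaped like `eval_clean` plus `eval_dirty` gives a more honest picture of how many false positives a customer would actually see.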
We decided to experiment with a small selection of tree or ensemble based algorithms that we understood well and that we knew were less likely to overfit and would generalise well enough to our problem.
Decision Tree: a good basic algorithm that is highly explainable but the weakest here in terms of generalisability.
Bagging with Decision Tree: adding the meta estimation of bagging makes this a great addition here but not as optimised as the industry favourite…
Random Forest: well known for its amazing results on Kaggle competitions and can be optimised even further to ensure very little over or under fitting.
XGBoost: another DS favourite that builds on the classic ensembles with a version of gradient boosting. It has been shown to be very effective at many tasks but is harder to tune, and noisy or poorly engineered data can hurt the quality of the fit.
To compare these models we held out a validation data set with a real-world class balance and used the F-beta score with beta = 0.5, which weights precision more heavily than recall, since we want the fewest possible False Positives (FPs) in this problem: we are ok with fewer detections if it means we do not flood the customer with blocked domains that should simply be accessible. Hence we also looked at the number of FPs across various probability thresholds, not just the default p=0.5, so we could adjust our level of risk given the detections.
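The metric and the threshold sweep described above can be sketched in a few lines of plain Python. The labels and scores below are toy values invented for the example, not results from our models, and the function names are our own.

```python
def confusion_counts(y_true, scores, threshold):
    """Count (tp, fp, fn, tn) when flagging records with score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label:
            tp += 1
        elif flagged and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP).

    With beta < 1 a false positive costs more than a false negative.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


# Toy data: 1 = dirty (phishing), 0 = clean. Invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.65, 0.40, 0.70, 0.55, 0.30, 0.20, 0.10, 0.05]

# Sweep a few probability thresholds and record FPs alongside F0.5.
sweep = []
for threshold in (0.5, 0.6, 0.8):
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    sweep.append((threshold, fp, f_beta(tp, fp, fn, beta=0.5)))
```

On this toy data, raising the threshold from 0.5 to 0.8 removes both false positives at the cost of one missed detection, and F0.5 rises because of its emphasis on precision.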
For each algorithm we performed a Stratified k-fold cross validation to ensure we had the optimal hyper-parameters and then compared these best outcomes across all the algorithms.
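For readers unfamiliar with the technique, the following is a simplified, dependency-free sketch of stratified k-fold splitting and of using it to choose between hyper-parameter candidates. It is not our production code: `stratified_k_fold`, `select_best` and the `evaluate` callback are hypothetical names, and in practice a maintained implementation such as scikit-learn's `StratifiedKFold` would be the more usual choice.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class ratios per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx


def select_best(candidates, evaluate, labels, k=5, seed=0):
    """Return the candidate with the highest mean score across the folds.

    `evaluate(params, train_idx, test_idx)` is a caller-supplied function that
    trains on the training indices and returns a score (for example F0.5) on
    the test indices.
    """
    splits = list(stratified_k_fold(labels, k=k, seed=seed))

    def mean_score(params):
        return sum(evaluate(params, tr, te) for tr, te in splits) / len(splits)

    return max(candidates, key=mean_score)


# Toy labels: 10 dirty and 20 clean records, invented for illustration.
toy_labels = [1] * 10 + [0] * 20
toy_splits = list(stratified_k_fold(toy_labels, k=5))
```

Because every fold keeps the same class ratio as the full data set, the score averaged over folds is less sensitive to an unlucky split, which matters most when one class is rare.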
As you can see below the XGBoost and Random Forest algorithms performed best in terms of maximising F0.5 yet minimising the number of FPs, especially if we were to configure the detection to be raised only for high probability records.
Since we are already detecting phishing through other ML mechanisms, to get the most value from this new mechanism we were most interested in the domains that we would previously have classified as False Negatives (marked as clean but in fact dirty), plus those our existing mechanisms are less sure about. This data acted as our final test data here, since it has been confirmed as potential phishing with some level of confidence.
As you can see above, over a six month period using this false negative test data Wandera detected that 528,967 unique phishing domains were registered. Of these almost 12,000 were later seen in use within our customer traffic. And then of these customer attacks only 47 were verified by third parties in the community. This shows how widespread this attack vector actually is and that we can detect them.
The results of both the modelling and testing above have shown that we can effectively detect phishing records from their SSL certificates, and that when we added this to our detection mechanisms we got a serious uplift in new detections. We also made sure our model got the feedback it needed from our QAs, Threat Intelligence team and customers, so that when the model is retrained it has the most up to date labels and hence makes better detections once deployed.
The only advice we can give is that you shouldn’t always trust that little lock in your browser. Rest assured, though, that many companies out there, us included, are fighting for a more secure internet!
Thanks to David Pryce as co-author