Analysis of 2 Million Hijacked Passwords (in Python)

Posted by Jianhua Li on GitHub. This analysis was proposed as a data science project on Data Science Central, to challenge your data science skills on a real data set. Below is an overview.

The project asks you to answer the following three questions:

  • What are the most common patterns found in passwords?
  • Based on these patterns, how can one build robust yet easy-to-remember passwords?
  • Does this password data set look sound, or is it inaccurate or unrepresentative of the password universe? If it is flawed, can we still draw valid conclusions from it, and how?

Data is available here

Step 1. Load Packages

In [282]: 

Step 2. Load Data

In [2]:

Server: nginx
Date: Tue, 22 Nov 2016 00:39:04 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 20163399
Last-Modified: Sun, 27 Mar 2016 05:04:06 GMT
Connection: close
Vary: Accept-Encoding
ETag: "56f769c6-133ab47"
Expires: Tue, 29 Nov 2016 00:39:04 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes

In [283]:

#
# This is a list of 2,151,220 unique ASCII passwords in sorted order according
# to their native byte values using UNIX sort command.
#
# This list (also known as wordlist, password dictionary or password list)
# is useful for password recovery tools such as John the Ripper, oclHashcat
# and Aircrack-ng. To use this file, be sure to first remove these comment
# lines, i.e. the lines starting with # character.
#
# If you are looking for a better password dictionary,
# see http://dazzlepod.com/uniqpass/
#
# $DateTime: 2016/03/27 16:04:06 $
#
# Comments/Questions? Send to [email protected]
#
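
The file header above asks readers to remove the comment lines before using the list. The following is a minimal sketch of one way to do that in Python, not the code from the original notebook. The function name `read_passwords` and the sample lines are hypothetical, and the sketch assumes the comment block appears only at the top of the file, so that a password which itself begins with `#` further down is kept.

```python
from itertools import dropwhile


def read_passwords(lines):
    """Return the passwords from an iterable of text lines.

    Assumption: the comment block sits only at the top of the file, so only
    the leading run of '#' lines is dropped. A later line that starts with
    '#' is treated as a real password.
    """
    # Strip line endings only, so trailing spaces inside a password survive.
    stripped = (line.rstrip("\r\n") for line in lines)
    body = dropwhile(lambda s: s.startswith("#"), stripped)
    return [s for s in body if s]  # also skip empty lines


# Tiny in-memory stand-in for the real file (hypothetical sample values).
sample = [
    "#\n",
    "# header comment\n",
    "#\n",
    "123456\n",
    "#hashtag\n",
    "\n",
    "password\n",
]
passwords = read_passwords(sample)
print(passwords)
```

With the real data, the same function could be passed an open file object instead of the sample list.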

Besides the first two steps, the original article covers the following:

Step 3. Explore the Data
Step 4. Preprocess the Data
Step 5. Analyze the Data

  • 5.1. Analysis of the password length
  • 5.2. Analysis of the basic composition of passwords
  • 5.3. Analysis of the detailed composition of each password
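
As a rough illustration of what the three analyses in Step 5 might involve, here is a small sketch using only the standard library. It is not the original article's code: the helper names `composition` and `mask`, the class labels, and the five sample passwords are assumptions made for this example.

```python
from collections import Counter
import string


def composition(password):
    """Basic composition: which character classes appear, e.g. 'lower+digit'."""
    classes = []
    if any(c in string.ascii_lowercase for c in password):
        classes.append("lower")
    if any(c in string.ascii_uppercase for c in password):
        classes.append("upper")
    if any(c in string.digits for c in password):
        classes.append("digit")
    if any(c not in string.ascii_letters + string.digits for c in password):
        classes.append("symbol")
    return "+".join(classes)


def mask(password):
    """Detailed composition: one letter per character (L, U, D, or S)."""
    out = []
    for c in password:
        if c in string.ascii_lowercase:
            out.append("L")
        elif c in string.ascii_uppercase:
            out.append("U")
        elif c in string.digits:
            out.append("D")
        else:
            out.append("S")
    return "".join(out)


# Hypothetical sample; the real list has about 2 million entries.
sample = ["123456", "password", "abc123", "Passw0rd!", "qwerty"]

length_counts = Counter(len(p) for p in sample)            # 5.1
composition_counts = Counter(composition(p) for p in sample)  # 5.2
mask_counts = Counter(mask(p) for p in sample)              # 5.3

print(length_counts.most_common())
print(composition_counts.most_common())
print(mask_counts.most_common())
```

Counting masks rather than raw strings is one way to surface patterns such as "letters followed by digits" without listing individual passwords.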

Step 6. Classification

  • 6.1. Calculate the score
  • 6.2. Classification
  • 6.3. Plot the data according to label
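
This overview does not reproduce the scoring rule used in Step 6, so the sketch below uses a purely hypothetical one: password length plus two points per character class, with made-up thresholds for the labels. It is meant only to show the shape of a score-then-label step, and it ignores dictionary words and common substitutions.

```python
import string


def score(password):
    """Hypothetical strength score: length plus 2 points per character class."""
    classes = [
        any(c in string.ascii_lowercase for c in password),
        any(c in string.ascii_uppercase for c in password),
        any(c in string.digits for c in password),
        any(c not in string.ascii_letters + string.digits for c in password),
    ]
    return len(password) + 2 * sum(classes)


def label(password, weak_below=10, strong_from=16):
    """Map a score to a label; both thresholds are arbitrary assumptions."""
    s = score(password)
    if s < weak_below:
        return "weak"
    if s < strong_from:
        return "medium"
    return "strong"


# Hypothetical sample passwords.
for p in ["123456", "abc123", "Passw0rd!"]:
    print(p, score(p), label(p))
```

A rule this simple rates "Passw0rd!" as strong even though it is a well-known pattern, which is one reason a real classifier would need more than length and character classes.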

Summary

The picture below is from the original (long) article.

To read the original article with source code, analysis and conclusions, click here


Comment by Jianhua/Jason Li on December 4, 2016 at 4:12pm

The labels of some panels were reversed; please check the source file. Thanks.
