There seems to be very little overlap currently between the worlds of infosec and machine learning. If a data scientist attended Black Hat and a network security expert went to NIPS, they would be equally at a loss. 

This is unfortunate: infosec can definitely benefit from a probabilistic approach, but applying ML methods there requires a significant amount of domain expertise.

Machine learning practitioners face a few challenges when working in this domain, including understanding the datasets, doing feature engineering in a generalizable way, and creating labels.

Available Datasets

A variety of datasets can be collected as a precursor to creating a training set for a machine learning model:

  • Log files from systems, firewalls, proxies, routers and switches, which capture network activity and user behavior in semi-structured formats
  • Application-level logging and diagnostics that record user/system access information and application usage
  • Monitoring tools, intrusion detection systems (IDS) and SIEMs
  • Network packet capture (PCAP), a rather compute- and storage-intensive process of recording the raw Ethernet frames

Some of these sources (like log formats) are readily available and fairly standardized while others will require extensive tooling and software modifications (e.g. application logging), and yet others will require a significant hardware footprint and a monitoring network that could rival the size of the real network.
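As a rough illustration of what "semi-structured" means for the log sources above, the sketch below parses one line in the Common Log Format used by many web servers and proxies. The format choice, the sample line and the function name parse_clf_line are assumptions made for this example; real deployments will have their own formats and fields.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the string fields that have a natural type.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Illustrative sample line (not taken from a real system).
sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
record = parse_clf_line(sample)
```

Lines that do not match return None rather than raising, since real log files routinely contain truncated or malformed entries that a pipeline needs to count and skip.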

Feature Engineering

Bearing in mind that the whole point of machine learning is generalization beyond the training set, thoughtful feature engineering is required to go from the identity information of IP addresses, hostnames and URLs to something that can turn into a useful representation within the machine learning model.

For example, the following might be a useful feature space created from proxy logs (Franc):

  • length
  • digit ratio
  • lower case ratio
  • upper case ratio
  • vowel changes ratio
  • has repetition of '&' and '='
  • starts with a number
  • number of non-base64 characters
  • has a special character
  • max length of consonant stream
  • max length of vowel stream
  • max length of lower case stream
  • max length of upper case stream
  • max length of digit stream
  • ratio of a character with max occurrence
  • (session) duration
  • HTTP request status
  • is URL encrypted
  • is protocol HTTPS
  • number of bytes up
  • number of bytes down
  • is URL in ASCII
  • client port number
  • server port number
  • user agent length
  • MIME-Type length
  • number of '/' in path
  • number of '/' in query
  • number of '/' in referrer
  • is the second-level domain raw IP
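A handful of the features listed above can be sketched in a few lines of Python. This is only a minimal illustration, not the cited method: the function names url_features and longest_run are hypothetical, the features are computed over the whole URL string (the original work may define them over specific URL parts), and "is the second-level domain raw IP" is approximated here by checking whether the hostname is an IP literal.

```python
import ipaddress
import re
import string
from collections import Counter
from urllib.parse import urlsplit

def longest_run(text, char_class, flags=0):
    """Length of the longest run of characters matching the regex class."""
    runs = re.findall(f"{char_class}+", text, flags)
    return max((len(run) for run in runs), default=0)

def host_is_ip(hostname):
    """True if the hostname is a raw IPv4/IPv6 literal (an assumption
    about what the 'raw IP' feature means)."""
    if not hostname:
        return False
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False

def url_features(url):
    """Compute a subset of the lexical features for a single URL."""
    parts = urlsplit(url)
    n = len(url)
    counts = Counter(url)

    def ratio(chars):
        # Share of characters in the URL that belong to `chars`.
        return sum(c in chars for c in url) / n if n else 0.0

    return {
        "length": n,
        "digit_ratio": ratio(string.digits),
        "lower_case_ratio": ratio(string.ascii_lowercase),
        "upper_case_ratio": ratio(string.ascii_uppercase),
        "starts_with_number": url[:1] in string.digits if url else False,
        "max_consonant_run": longest_run(url, "[b-df-hj-np-tv-z]", re.IGNORECASE),
        "max_vowel_run": longest_run(url, "[aeiou]", re.IGNORECASE),
        "max_lower_run": longest_run(url, "[a-z]"),
        "max_upper_run": longest_run(url, "[A-Z]"),
        "max_digit_run": longest_run(url, "[0-9]"),
        "max_char_ratio": max(counts.values(), default=0) / n if n else 0.0,
        "is_https": parts.scheme == "https",
        "slashes_in_path": parts.path.count("/"),
        "slashes_in_query": parts.query.count("/"),
        "host_is_ip": host_is_ip(parts.hostname),
    }

# Illustrative URL, made up for this example.
features = url_features("https://192.168.0.1/a/b?x=1/2")
```

None of these features depends on the identity of a particular host or URL, which is the point of the exercise: a model trained on them has a chance of generalizing to domains it has never seen.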

© 2018 Data Science Central ®