There seems to be very little overlap currently between the worlds of infosec and machine learning. If a data scientist attended Black Hat and a network security expert went to NIPS, they would be equally at a loss.
This is unfortunate: infosec can definitely benefit from a probabilistic approach, but applying ML methods to it requires a significant amount of domain expertise.
Machine learning practitioners face a few challenges when working in this domain: understanding the datasets, engineering features in a generalizable way, and creating labels.
A variety of datasets can be collected as a precursor to creating a training set for a machine learning model:
Some of these sources (like log formats) are readily available and fairly standardized, others require extensive tooling and software modifications (e.g. application logging), and yet others require a significant hardware footprint and a monitoring network that could rival the size of the real network.
Because the whole point of machine learning is generalization beyond the training set, thoughtful feature engineering is required to turn identity information such as IP addresses, hostnames and URLs into a representation the model can actually learn from.
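To make this concrete, here is a minimal Python sketch of that idea: it maps a raw URL to a handful of numeric features that describe its shape rather than its identity. The function names (`shannon_entropy`, `host_is_ip`, `url_features`) and the particular features chosen are illustrative assumptions, not a feature set taken from any specific paper or product.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def host_is_ip(host: str) -> bool:
    """True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Hypothetical feature map: describes the structure of a URL, not which URL it is."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    is_ip = host_is_ip(host)

    # Dot-separated labels only make sense for DNS names, not for IP literals.
    if is_ip or not host:
        label_count = 0
    else:
        label_count = len(host.split("."))

    return {
        "url_length": len(url),
        "host_is_ip": is_ip,
        "host_label_count": label_count,
        "host_entropy": shannon_entropy(host),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "query_param_count": len(parse_qsl(parts.query, keep_blank_values=True)),
        "digit_ratio": sum(c.isdigit() for c in url) / len(url) if url else 0.0,
    }


if __name__ == "__main__":
    for example in ("http://a.b.example.com/x/y/z.php?id=1&t=2",
                    "http://192.168.0.1/login"):
        print(example, url_features(example))
```

None of these features memorizes a particular hostname or address, so a model trained on them has at least a chance of saying something about URLs it has never seen. Whether they carry any signal for a given detection problem is a separate, empirical question.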
For example, the following might be a useful feature space created from proxy logs (Franc):