What is Cybersecurity Data Science?

Cybersecurity Data Science (CSDS) is a rapidly emerging profession focused on applying data science to prevent, detect, and remediate expanding and evolving cybersecurity threats. CSDS is increasingly formally recognized as a cybersecurity job specialty, for instance in the NIST NICE Cybersecurity Workforce Framework.

A proposed CSDS definition derived from multiple sources:

CSDS is the practice of data science to assure the continuity of digital devices, systems, services, software, and agents in pursuit of the stewardship of systemic cybersphere stability, spanning technical, operational, organizational, economic, social, and political contexts.

Some general aspects of CSDS as an emerging professional practice:

  • CSDS is data-focused: it applies quantitative, algorithmic, and probabilistic methods, attempts to quantify risk, aims to produce targeted and efficacious alerts, promotes inferential methods to categorize behavioral patterns, and ultimately seeks to optimize cybersecurity operations.
  • CSDS represents a partial paradigm shift from traditional cybersecurity approaches, which are rule-and-signature-based and focus on boundary protection.
  • CSDS seeks situational awareness and assumes persistent and prolific threats which may be human, automated, or ‘cyborg’ in origin.
  • CSDS goals connect historically with cybersecurity continuous monitoring and forensics functions in particular.
  • CSDS has emerged from two parent domains which themselves are undergoing rapid transformation. As such, the ‘body of theory’ surrounding CSDS is still evolving.
  • CSDS has evidenced pragmatic successes using analytics and machine learning in focused use-cases such as spam filtering, phishing email detection, malware and virus detection, network monitoring, and endpoint protection.
  • Applied CSDS involves addressing cybersecurity challenges with data science prescriptions and implies a gap analysis is conducted.
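
To make the contrast between rule-based and probabilistic detection concrete, here is a minimal sketch in Python. It is an illustration only, not a method taken from any particular product or from the research described here: the function names, the fixed threshold of 100, the 3.5 cutoff, and the failed-login counts are all hypothetical. It compares a fixed rule against a robust z-score computed from an account's own history.

```python
import statistics


def rule_based_alert(failed_logins, threshold=100):
    """Rule/signature style: alert only when a fixed threshold is crossed."""
    return failed_logins > threshold


def anomaly_score(value, history):
    """Robust z-score: distance from the historical median, in MAD units."""
    median = statistics.median(history)
    mad = statistics.median([abs(x - median) for x in history])
    if mad == 0:
        # No spread in the history: any deviation is maximally unusual.
        return 0.0 if value == median else float("inf")
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return 0.6745 * (value - median) / mad


# Hypothetical hourly failed-login counts for one account (made-up data).
history = [3, 5, 4, 6, 2, 5, 4, 3, 5, 4]
observed = 40

print("fixed rule fires:", rule_based_alert(observed))
print("statistical baseline fires:", anomaly_score(observed, history) > 3.5)
```

In this made-up case the fixed rule stays silent because 40 is below the global threshold, while the baseline flags the value as far outside this account's normal behavior. The design choice is the point: the data-driven approach adapts to each entity's history rather than relying on one static boundary.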

Research on CSDS by this author has revealed that practitioners perceive key challenges which must be addressed to advance the domain. Among the central challenges is the perception that the field must develop more rigorous scientific methods. Many CSDS practitioners work in high-pressure, time-driven environments, whereas advancement of the domain demands the development of best practices resulting from experimentation, testing, and core research.

Other perceived challenges relate to data management: gathering, integrating, cleaning, transforming, and extrapolating key measures from the fragmented, voluminous, and fast-streaming sources that underlie modern cyber infrastructure.

Finally, practitioners report that the combined cybersecurity and data science domains have become so broad and complex that even seasoned professionals can only hope to gain expertise in specific areas combined with a general understanding of others. This ultimately necessitates that institutions and organizations develop approaches to ensure cross-domain collaboration and process-driven teamwork across hybridized teams of CSDS professionals.

A key point of advocacy related to CSDS is that it can be seen as a central mechanism for measuring cyber risk, which is a prerequisite to controlling and preventing exposure, improving alerting and triage, and generally optimizing cybersecurity detection and remediation operations. In terms of quantifying cyber risks, there is a broad range of attacks and attackers: everything from accidental incidents and company insiders, through extortion and cyber-fraud, to corporate espionage and state-sponsored cyber actors.

To understand the risks of exposure, frequency, and impact from events, it is important to develop a contextual understanding of at-risk targets and their relative susceptibilities. From here, one can match an understanding of the permissible means of attack, opportunities for staging attacks and incursions, and the corresponding motivations of prospective attackers (e.g. competitive advantage, revenge, personal gain, ‘just a job’).

‘How to Measure Anything in Cybersecurity Risk’ (Hubbard & Seiersen, 2016) is a good overview of quantifying cyber risk.
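
As a rough illustration of what quantifying cyber risk can look like in practice, the following Python sketch runs a simple Monte Carlo simulation of annual losses. It is a minimal sketch under stated assumptions, not a definitive implementation: each risk is described by an assumed annual probability of occurrence and an assumed 90% confidence interval for the loss, which is modeled as lognormal. The function name, the three risks, and all figures are hypothetical and for illustration only.

```python
import math
import random


def simulate_annual_loss(risks, trials=10_000, seed=42):
    """Simulate total annual loss across independent risks.

    Each risk is (annual_probability, low, high), where low..high is an
    assumed 90% confidence interval for the loss if the event occurs.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = 0.0
        for prob, low, high in risks:
            if rng.random() < prob:  # does the event occur this year?
                # Lognormal parameters from the 90% interval:
                # the interval spans about 3.29 standard deviations in log space.
                mu = (math.log(low) + math.log(high)) / 2
                sigma = (math.log(high) - math.log(low)) / 3.29
                total += rng.lognormvariate(mu, sigma)
        totals.append(total)
    return totals


# Hypothetical risk register (made-up probabilities and loss ranges).
risks = [
    (0.10, 50_000, 2_000_000),   # e.g. ransomware incident
    (0.05, 100_000, 5_000_000),  # e.g. large data breach
    (0.25, 5_000, 200_000),      # e.g. spear-phishing fraud
]

losses = simulate_annual_loss(risks)
mean_loss = sum(losses) / len(losses)
p_exceed = sum(loss > 1_000_000 for loss in losses) / len(losses)

print(f"Simulated mean annual loss: {mean_loss:,.0f}")
print(f"Estimated chance of losing more than 1,000,000: {p_exceed:.1%}")
```

The useful output is less the mean than the exceedance probability: an estimate of how likely losses are to pass a given tolerance, which can then be compared before and after a proposed control.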

Some attacks are more or less ‘pure’ cyber attacks (incidents intending to interrupt or destroy computer and communications infrastructure), whereas, beyond this, cyber is fast becoming a common mechanism and medium for perpetrating fraud and crime of many types.

Common and growing examples of cyber-fraud include focused spear phishing (targeted compromise of people and/or systems through their identity and computer/systems access to perpetrate and facilitate financial fraud), ransomware (commandeering and ‘locking’ systems and data until a ransom is paid: Global Ransomware Damage Costs Predicted To Hit $11.5 Billion By 2019), and cryptojacking (surreptitiously commandeering computing resources to mine crypto currencies: Criminals’ Cryptocurrency Addiction Continues).

In terms of tracking trends, the Ponemon Institute has a set of reports on data breaches. Tracking trends depends upon specifying the particular type of attack and knowing the target. There is a broad range of online reports tracking trends for different types of cyber threats.

In terms of the growing scope of the vulnerabilities, Bruce Schneier’s ‘Click Here to Kill Everybody: Security and Survival in a Hyper-conn…’ is a sobering account of the risks we increasingly face.

To overcome these growing risks, CSDS must evolve and mature as a professional practice. This will involve developing more rigorous approaches with a basis in science. In the longer term, it is expected CSDS will generate domain theory and standard practices based upon experimentation and field testing.

Concerning CSDS academic research, much is currently focused on technical case studies and framing / testing advanced analytical and machine learning methods.

For a practitioner new to the domain, more substantial guidance comes in book form. The following books have come out in the last decade and can be considered to represent a key CSDS corpus:

  1. Intrusion Detection: A Machine Learning Approach (Yu & Tsai, 2011)
  2. Data Mining and Machine Learning in Cybersecurity (Dua & Du, 2011)
  3. Network Anomaly Detection: A Machine Learning Perspective (Bhattacharyya & Kalita, 2013)
  4. Applied Network Security Monitoring: Collection, Detection, and Ana…(Sanders & Smith, 2013)
  5. Network Security Through Data Analysis (Collins, 2014)
  6. Data Analysis for Network Cyber-Security (Adams & Heard, 2014)
  7. Data-Driven Security (Jacobs & Rudis, 2014)
  8. Fraud Analytics Using Descriptive, Predictive, and Social Network T…(Baesens et al., 2015)
  9. Essential Cybersecurity Science (Dykstra, 2016)
  10. Dynamic Networks and Cyber-Security: Security Science and Technolog… (Adams & Heard, 2016)
  11. Cybersecurity and Applied Mathematics (Metcalf & Casey, 2016)
  12. How to Measure Anything in Cybersecurity Risk (Hubbard & Seiersen, 2016)
  13. Data Analytics and Decision Support for Cybersecurity (Carrascosa, Kalutarage, & Huang, 2017)
  14. Introduction to Machine Learning with Applications in Information S…(Stamp, 2017)
  15. Information Fusion for Cyber-Security Analytics (Alsmadi, Karabatis, & AlEroud, 2017)
  16. Machine Learning & Security: Protecting Systems with Data and A…(Chio & Freeman, 2018)
  17. Data Science for Cybersecurity (Heard, Adams, Rubin-Delanchy, & Turcotte, 2018)
  18. AI in Cybersecurity (Sikos, 2018)
  19. Malware Data Science: Attack Detection and Attribution (Saxe & Sanders, 2018)
  20. Machine Learning for Computer and Cyber Security (Gupta & Sheng, 2019)

In closing, a unique and concerning aspect of CSDS is that its mechanisms are as suitable for black hat (adversarial) activities as they are for white hat (defender) counter-actions.

Adversarial machine learning (seeking to compromise ML systems and data) is a growing area of research and there are reports that attackers are already examining and testing machine learning methods to speed and automate attacks, for instance to automate vulnerability scanning and command and control of malware.

The field will likely evolve towards applying CSDS to create and manage AI / machine learning systems for both attack and defense, potentially leading to quite sophisticated semi-autonomous platforms engaging one another.


These points and more will be expanded upon by the author in a forthcoming book ‘Cybersecurity Data Science (CSDS)’.

#CSDS2020 #SARK7 #LinkedIn/SMongeau

Scott Allen Mongeau