Change is inevitable and so are failures
Technology is changing at an incredible pace. Every day, new programming languages, system processes, cloud platforms and software capabilities are introduced to the digital world. This new technology is expected to integrate seamlessly with technology built yesterday.
It is impressive, even breathtaking, when these various systems work together in harmony. However, they will fail sooner or later. They will miss their service level agreement (SLA), and a Root Cause Analysis (RCA) will be initiated to determine the root cause. The investigating Engineers will rely on tools, techniques and experience built on yesterday's problems to solve a problem first discovered today and to prevent it from happening again tomorrow.
The speed at which a software problem is resolved depends on how quickly the available forensic data reaches the right system or person. For example, a Windows event log may have enough information to resolve a given problem, but it may not be reviewed by the right Engineer in a timely manner. An Operations team may overlook a minor network failure, and it may then take days for the Engineers investigating an application failure to connect it to a costly software failure. It is my contention that the information is always there, but it is not always presented to the right resource, in the right format, in a timely manner.
Data science and software forensics
When investigating a software failure, the Developer is no longer acting as a Developer but as a Data Scientist. They explore and quantitatively analyze all available data, structured and unstructured, to develop understanding, extract knowledge and formulate actionable results. This is fundamentally what happens when troubleshooting software problems:
Gather as much data as possible around the software failure
Clean and/or transform the data
Organize the data into meaningful categories
State-based and time-based
Directly related or indirectly related
Causative and symptomatic
Build relationships based on the data and filter out the noise
Build a supported hypothesis about the failure from the relevant data
Deliver results that can be used to provide value and improve the software design
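The first few steps above can be sketched in code. The following is a minimal Python illustration, not a definitive implementation: the `Record` type, the `clean` and `categorize` helpers, and the five-minute window are all hypothetical choices made for this example. It assumes each piece of forensic data has already been reduced to a timestamp, a source name and a message.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    """One piece of gathered forensic data (hypothetical shape)."""
    timestamp: datetime
    source: str
    message: str


def clean(records):
    """Drop blank messages, normalize names and whitespace, sort by time."""
    cleaned = [
        Record(r.timestamp, r.source.strip().lower(), " ".join(r.message.split()))
        for r in records
        if r.message.strip()
    ]
    return sorted(cleaned, key=lambda r: r.timestamp)


def categorize(records, failure_time, failing_source, window=timedelta(minutes=5)):
    """Split records into directly related, indirectly related, and noise.

    Assumption for this sketch: a record is relevant if it falls within
    `window` of the failure; it is 'direct' if it also comes from the
    failing component, otherwise 'indirect'.
    """
    categories = {"direct": [], "indirect": [], "noise": []}
    for r in records:
        near_failure = abs(r.timestamp - failure_time) <= window
        if near_failure and r.source == failing_source:
            categories["direct"].append(r)
        elif near_failure:
            categories["indirect"].append(r)
        else:
            categories["noise"].append(r)
    return categories
```

With records from an application log and a network log, `categorize(clean(records), failure_time, "app")` would place a network error logged two minutes before the application failure in the indirect bucket. That is the kind of connection an Engineer would otherwise have to make by hand.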
While this kind of data analysis is largely qualitative research, it can include statistical procedures. Investigating a software failure is commonly an ongoing, iterative process in which data is collected and analyzed almost simultaneously. The analysis focuses on finding patterns in observations across all of the available data.
Change is difficult
Cloud platforms are making it trivial to scale up hardware, while our tools and processes for troubleshooting software running on these platforms are lagging. Companies are quickly rolling out PaaS solutions and then struggling when it comes time to perform problem isolation and/or RCA on them.
Historically, the general focus of software supportability has been on presenting the data as needed in disparate ways. This creates an ever-evolving need for specialized Engineers, tools and processes. Development teams focus on creating new features and capabilities while software maintenance is an afterthought.
The bright spot is that this is changing; treating maintenance as an afterthought is no longer acceptable in today’s fast-moving businesses. The gap is being closed with the advent of DevOps, cloud platforms, etc. However, there is still a need to go even further. The elephant in the room is data analysis: automating what Engineers do by hand today.
Normalizing the various tracing and monitoring outputs into a common format is the first step to getting a holistic view of a given software solution running in any environment. With easy access to cloud processing, primitives, storage and an ever-evolving feature set, it is easy to envision building a diagnostic capability on top of today’s software diagnostics in a non-intrusive way. This would be one of the building blocks of a unified diagnostics strategy/platform that provides rudimentary data analysis across the platform systems and then vertically up the application stack.
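As a rough illustration of what normalizing into a common format could look like, here is a minimal Python sketch. Both input formats are invented for this example (a plain-text line such as `2024-01-15 10:32:01 ERROR network: link down` and a JSON line with `ts`, `severity`, `app` and `msg` fields), as are the function names and the four-field common schema. Real tracing and monitoring outputs would each need their own parser.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical text format: "<date> <time> <LEVEL> <source>: <message>"
TEXT_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\w+): (.*)$"
)


def from_text_line(line):
    """Parse one text log line into the common schema, or None if it does not match."""
    match = TEXT_LINE.match(line.strip())
    if match is None:
        return None
    stamp, level, source, message = match.groups()
    # Assumption: text timestamps are already in UTC.
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {"timestamp": ts, "level": level.upper(), "source": source, "message": message}


def from_json_line(line):
    """Parse one JSON log line (hypothetical field names) into the common schema."""
    raw = json.loads(line)
    ts = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)  # epoch seconds
    return {
        "timestamp": ts,
        "level": raw["severity"].upper(),
        "source": raw["app"],
        "message": raw["msg"],
    }


def normalize(text_lines, json_lines):
    """Merge both inputs into one time-ordered list of common-format events."""
    events = [e for e in map(from_text_line, text_lines) if e is not None]
    events += [from_json_line(line) for line in json_lines]
    return sorted(events, key=lambda e: e["timestamp"])
```

Once every source produces the same four fields, events from the platform and from the application can be read on a single timeline, and any later analysis only has to understand one shape of data.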
We are going to have to think outside the box, and holistically, about software supportability/diagnostics. If we want to keep up with the pace of change and get the best ROI, then thinking about software supportability in a silo has to change.