The healthcare industry was a pioneer in consistently applying data mining techniques and analytics procedures to identify areas for optimization and potential improvements in clinical practice. The research methodology typically focused on accepting or rejecting an initial hypothesis.
Figure 1. Traditional clinical research paradigm.
With big data becoming increasingly popular in the technology market, significant investments have been made to realize the potential of these new technologies in healthcare. The ultimate goal has been to shift from the current research paradigm to a mature data management model that can provide a higher and faster return on investment.
Figure 2. Research paradigm based on big data principles.
To achieve this objective, it is key that domain experts (doctors) be able to perform their own data mining activities in a simple, intuitive, flexible and agile way. In this context, Visual Analytics plays a key role.
By using a visual environment, a doctor can pose questions to the data and obtain an answer immediately. These questions, and the answers returned by the environment, can lead to new insights and non-intuitive connections among clinical variables, or reveal trends that remain invisible under traditional research methodologies.
In the next section, a Hadoop-based Visual Analytics environment is presented. This environment was designed, implemented and tested as part of a real-life clinical research exercise.
The use cases described next illustrate how an interactive visual approach can be leveraged to easily extract insights from data.
Our visual environment is based on Cloudera’s (http://www.cloudera.com) distribution of Hadoop (http://hadoop.apache.org), and uses Hadoop User Experience (Hue, http://gethue.com) as the framework for designing and implementing an interactive dashboard.
The dashboard provides access to clinical data of elderly patients admitted into the Acute Unit of a hospital from 2006 to 2015. Data includes patient demographics, physical and mental status before and after admission, length of stay (LOS), diagnosis, treatment, clinical complications during admission, and administered drugs.
Data was stored in an HDFS cluster and indexed with Solr (http://lucene.apache.org/solr), in order to increase the speed of real-time queries.
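As an illustration of the kind of request a dashboard widget might send to the index, the following minimal sketch builds a Solr facet query URL. The parameters (q, fq, rows, wt, facet, facet.field) are standard Solr query parameters; the host, the collection name "admissions" and the field names are assumptions made for this example and are not taken from the actual environment.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and collection name; adjust to the actual deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/admissions/select"


def build_facet_query(icd9_code, facet_fields):
    """Return a Solr select URL that filters by diagnosis and facets on the given fields."""
    params = [
        ("q", "*:*"),                      # match every document
        ("fq", f"icd9_code:{icd9_code}"),  # filter query (hypothetical field name)
        ("rows", "0"),                     # only facet counts are needed, not documents
        ("wt", "json"),                    # ask for a JSON response
        ("facet", "true"),                 # enable faceting
    ]
    # One facet.field parameter per chart that needs counts.
    params += [("facet.field", field) for field in facet_fields]
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


# Hypothetical field names for the gender, age and year charts.
url = build_facet_query("428", ["gender", "age_group", "admission_year"])
print(url)
```

Requesting zero rows keeps the response small, since charts of this kind only need the aggregated counts per facet value.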
The web-based user interface is shown in Figures 3 and 4.
Figure 3. User interface for patients admitted from any source.
Figure 4. User interface for patients admitted from nursing homes.
The user interface allows users to explore the relationships among data and identify potential connections. In this process, it is also possible to detect which initial insights are worth a more detailed statistical analysis or the use of more sophisticated data mining techniques.
Next, several use cases are presented. These examples can be used to measure our environment’s ability to answer relevant clinical questions.
To profile the patients admitted with heart failure, the value 428 (corresponding to heart failure) is selected from the list of ICD-9 codes, which relate to patient diagnosis.
Figure 5. Visual profiling of a patient admitted with heart failure.
The environment automatically updates the charts and indicators related to gender, age, year of admission, CRF (functional status), CRM (mental status) and Barthel index (disability level).
Given these values, it becomes apparent that the most frequent profile for this diagnosis is a female patient, 85 to 95 years old, with a Barthel index higher than average.
Now, the value 486, which corresponds to pneumonia, is selected from the list of ICD-9 codes.
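The profiling step amounts to filtering records by diagnosis and counting the most common combination of attributes. The sketch below shows that idea outside the dashboard; the records, the field names and the age banding are made up for illustration and are not taken from the hospital data set.

```python
from collections import Counter

# Made-up records for illustration only; not taken from the hospital data set.
patients = [
    {"icd9": "428", "gender": "F", "age": 88},
    {"icd9": "428", "gender": "F", "age": 91},
    {"icd9": "428", "gender": "M", "age": 79},
    {"icd9": "486", "gender": "M", "age": 86},
]


def age_band(age, width=10, start=65):
    """Map an age to a band label such as '85-94' (hypothetical banding)."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"


def most_frequent_profile(records, icd9):
    """Return ((gender, age band), count) for the most common profile, or None."""
    selected = [r for r in records if r["icd9"] == icd9]
    if not selected:
        return None
    profiles = Counter((r["gender"], age_band(r["age"])) for r in selected)
    return profiles.most_common(1)[0]


print(most_frequent_profile(patients, "428"))  # (('F', '85-94'), 2)
```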
The line chart representing the number of admissions per year is updated to reflect the trend related to this disease, as shown in Figure 6.
Figure 6. Evolution of number of patients admitted with pneumonia.
The number of admissions related to pneumonia was 81 in 2007 and 230 in 2014, which means that it almost tripled (a factor of about 2.8) over those 7 years.
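The growth between the two quoted years can be computed directly from the admission counts. The short sketch below derives the overall growth factor and, as an additional assumption-based summary, the equivalent compound annual growth rate; the function names are illustrative.

```python
def growth_factor(initial, final):
    """Ratio between the final and the initial count."""
    return final / initial


def annual_growth_rate(initial, final, years):
    """Compound annual growth rate that would produce the same overall growth."""
    return (final / initial) ** (1 / years) - 1


# Admission counts quoted in the text for 2007 and 2014.
factor = growth_factor(81, 230)
rate = annual_growth_rate(81, 230, 2014 - 2007)

print(f"growth factor: {factor:.2f}")  # growth factor: 2.84
print(f"compound annual growth: {rate:.1%}")
```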
First, Hospital A is selected from the list of admission sources. As shown in Figure 7, the number of readmissions was 548 out of 4,248 admissions. Therefore, the readmission ratio is approximately 13%.
For Hospital B, there were 164 readmissions out of 1,830 admissions. Therefore, the readmission ratio is only about 9%.
Figure 7. Comparison of readmission ratios between two hospitals.
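The readmission ratios are simple quotients of the counts read from the dashboard. A minimal sketch of the calculation, using the counts quoted above (the function and variable names are illustrative):

```python
def readmission_ratio(readmissions, admissions):
    """Fraction of admissions that were readmissions."""
    if admissions <= 0:
        raise ValueError("admissions must be a positive number")
    return readmissions / admissions


# (readmissions, admissions) per admission source, as quoted in the text.
counts = {
    "Hospital A": (548, 4248),
    "Hospital B": (164, 1830),
}

ratios = {name: readmission_ratio(*pair) for name, pair in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
# Hospital A: 12.9%
# Hospital B: 9.0%
```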
The year 2014 is selected from the list of years, and the charts are updated to display the values for that year. As seen in Figure 8, the workload was far from evenly distributed among doctors.
As a result, some doctors might not be able to spend the right amount of time with each patient, while others might not have enough work to fill a full-time position.
Beyond this particular insight, other indicators reveal interesting details as well. For example, most patients are admitted by noon.
This suggests that more resources should be available within that timeframe to process admissions. Likewise, staffing during off-peak hours could be reduced to compensate for rush hours.
Another relevant fact is that two of the most common admission reasons are UTI (Urinary Tract Infection) and respiratory issues. This information can be very valuable when designing continuous training programs for doctors. Moreover, it might explain the unequal distribution of workload among them, since doctors with more experience in those areas are likely to be assigned to these patients more often.
Figure 8. Indicators related to patients’ admissions in the Acute Unit.
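The workload and admission-time indicators boil down to counting admissions per doctor and per hour. The sketch below shows one possible way to quantify both, expressing each doctor's workload relative to the average; the records and doctor identifiers are made up for illustration, and the choice of metric is an assumption rather than the one used in the dashboard.

```python
from collections import Counter
from statistics import mean

# Made-up admission records (doctor id, hour of admission), for illustration only.
admissions = [
    ("dr_a", 10), ("dr_a", 11), ("dr_a", 11), ("dr_a", 12),
    ("dr_b", 11), ("dr_b", 16),
    ("dr_c", 9),
]


def workload_share(records):
    """Admissions per doctor divided by the average; 1.0 means an average workload."""
    per_doctor = Counter(doctor for doctor, _ in records)
    average = mean(per_doctor.values())
    return {doctor: count / average for doctor, count in per_doctor.items()}


def busiest_hour(records):
    """Hour of the day with the highest number of admissions."""
    per_hour = Counter(hour for _, hour in records)
    return per_hour.most_common(1)[0][0]


for doctor, share in workload_share(admissions).items():
    print(f"{doctor}: {share:.2f}x the average workload")
print("busiest hour:", busiest_hour(admissions))
```

Values well above or below 1.0 point to the kind of imbalance visible in the charts, and the busiest hour indicates where additional admission resources might be needed.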
The use of visual analytics environments to dissect information accelerates the insight discovery process, which in turn makes knowledge generation activities (or conceptual learning) more efficient.
Undoubtedly, a more rigorous analysis is required to provide reliable and generally applicable answers that could be safely applied to clinical practice.
However, the ability to obtain preliminary results in such a short timeframe represents a significant advantage over traditional research.
Lastly, the fact that doctors can perform these data mining activities by themselves represents another significant advantage over more sophisticated approaches.
To realize the full potential of visual analytics, it is key to maximize the scope of data stored in the environment.
With more variables, the number of potential insights can grow exponentially.
With more observations, the results obtained would be much more robust and less subject to overfitting, making it easier to extrapolate them to a broader population.
The features and use cases described in this article, as well as the areas of improvement outlined above, suggest that visual analytics tools could come to be seen as one of the upcoming revolutions in clinical research and practice.