Data science to understand and fight cancer

For higher-resolution, interactive Tableau charts, read the original article. This version shows only static screenshots, which do not do Tableau justice.

Coming up with a topic for today's blog post was tough. My last post, about wine, got attention from wine entrepreneurs who ran local wineries and collected sales data. Some of them contacted me and wanted to hire me to analyze their data. While it may sound like a nice side project, I could not accept any of the offers because I did not care all that much about working with wine data. Trying it out was a fun experience, but I was not interested enough to continue. So as I was debating topics for today's post, I decided to write about something I am truly passionate about. No wonder it was healthcare: I work in healthcare, I understand it better than any other topic, and there is still a ton for me to learn in this field.

One of the most discussed topics in healthcare today is cancer. Hundreds of companies work hard to invent new diagnostic methods, offer new treatment options and develop tests that can help monitor this nasty disease. Recently I was looking for the latest clinical trials in breast cancer for a work project and found an awesome public clinical trial registry supported by the NIH. ClinicalTrials.gov currently lists about 190,000 studies conducted across the US and around the world.

Cancer is a horrific and aggressive disease. According to the National Cancer Institute, 1.6M people will be diagnosed with some type of cancer in 2015, and only 1M of them will be alive by 2020. Cancer takes half a million lives per year and, despite all the efforts, we still have not found a cure.

As the saying goes, “In order to know your enemy, you must become your enemy”. So my goal for this blog post is to understand cancer a little better and share my learning experience with you.


Although ClinicalTrials.gov releases data on its website, I found it a little hard to download files from the site. To get the data you have to navigate through trial categories and topics, and once you see a list of trials in a specific category or topic you can download a .txt file with only a few fields summarizing trial name, recruitment status, interventions and whether results were released. Even if you are OK with this level of information, it is still very time-consuming to pick every category and topic you are interested in, download each file manually and build one master database covering all of them. So I looked around and found Enigma.io. It is a pretty cool public data aggregator that lets users find datasets, preview them online and do some basic manipulations on the data (filtering, sorting, descriptive stats) before deciding to download a file. It is free, so go ahead and check it out ☺

After a little searching I found the Clinical Trials data on the Enigma.io website. If you are signed up for Enigma, you can check out the raw file here.

For me the biggest challenge with this file was that it was very poorly structured. While the dataset itself was comma-separated, some fields contained multiple values (e.g. the Condition field could list multiple conditions), and other fields had to be split into several fields (e.g. the Intervention field listed multiple intervention types, and within each type there were multiple interventions). The only good thing about this table was that each clinical trial was represented by exactly one row. Since I was only focusing on clinical trials in cancer, I filtered the file by cancer types and started manipulating the data.
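The filtering step can be illustrated with a small keyword match on the condition field. This is only a sketch: the keyword list, the field names and the "|" separator in the sample rows are assumptions made for illustration, since the post does not say which terms or schema were actually used.

```python
# Hypothetical keyword list; the real filter terms are not given in the post.
CANCER_TERMS = ("cancer", "carcinoma", "leukemia", "lymphoma", "melanoma", "tumor")


def is_cancer_trial(condition_field):
    """Return True if any cancer keyword appears in the condition text."""
    text = condition_field.lower()
    return any(term in text for term in CANCER_TERMS)


# Made-up sample rows; field names and the "|" separator are assumptions.
sample_rows = [
    {"trial_id": "T001", "conditions": "Breast Cancer|Depression"},
    {"trial_id": "T002", "conditions": "Type 2 Diabetes"},
    {"trial_id": "T003", "conditions": "Acute Myeloid Leukemia"},
]

cancer_trials = [row for row in sample_rows if is_cancer_trial(row["conditions"])]
```

A keyword match like this is crude (it would miss conditions named without any of the listed terms), so the list would need tuning against the real data.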

Data Manipulations

There were a few things I had to do to my file in order to make it usable:

  • Split fields that listed multiple values into multiple fields. One trial could be conducted for multiple:
    • conditions
    • cancer phases
    • age groups
  • Create multiple fields within fields:
    • Trial Interventions were all coded within one field. For example, if a trial used more than one intervention, the field would list intervention 1: x, intervention 2: y, etc.
    • The best way to deal with this situation was to create two new fields, one of which would list intervention type and the other would contain intervention method.
  • Build a relational database of clinical trials where I had:
    • Main table with unique trial identifiers - Clinical Trial Overview Table - with the following fields:
      • Trial ID
      • Trial Name
      • Trial Recruitment Status
      • Study Results
      • Participants Gender
      • Enrollment volume
      • Start Date
      • Completion Date
      • Update Date
    • A table listing every condition a trial was conducted for - Condition Table - with the following fields:
      • Trial ID
      • Condition
    • A table listing study phases for each clinical trial - Condition Study Phase Table - with the following fields:
      • Trial ID
      • Study Phase
    • A table listing age groups of participants - Condition Age Group Table - with the following fields:
      • Trial ID
      • Participant Age Group
    • A table listing trial interventions - Condition Interventions Table - with the following fields:
      • Trial ID
      • Intervention Type
      • Intervention
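
The splitting steps listed above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the field names, the "|" separator between values and the "Type: name" layout of the intervention field are not taken from the actual Enigma file, and the sample rows are made up.

```python
# Made-up raw rows. Field names, the "|" separator and the "Type: name"
# intervention layout are assumptions, not the real file schema.
raw_trials = [
    {
        "trial_id": "T001",
        "conditions": "Breast Cancer|Ovarian Cancer",
        "phases": "Phase 1|Phase 2",
        "interventions": "Drug: drug_a|Procedure: surgery",
    },
    {
        "trial_id": "T002",
        "conditions": "Leukemia",
        "phases": "Phase 3",
        "interventions": "Biological: antibody_b",
    },
]


def split_values(cell, sep="|"):
    """Split a multi-valued cell into a list of trimmed, non-empty values."""
    return [part.strip() for part in cell.split(sep) if part.strip()]


def explode(rows, source_field, target_field):
    """Build a child table with one (trial_id, value) row per listed value."""
    return [
        {"trial_id": row["trial_id"], target_field: value}
        for row in rows
        for value in split_values(row[source_field])
    ]


def explode_interventions(rows):
    """Split each 'Type: name' entry into intervention type and intervention."""
    table = []
    for row in rows:
        for entry in split_values(row["interventions"]):
            kind, _, name = entry.partition(":")
            table.append({
                "trial_id": row["trial_id"],
                "intervention_type": kind.strip(),
                "intervention": name.strip(),
            })
    return table


condition_table = explode(raw_trials, "conditions", "condition")
phase_table = explode(raw_trials, "phases", "study_phase")
intervention_table = explode_interventions(raw_trials)
```

Each child table keeps the trial ID, so it can be joined back to the overview table, which is the relationship Tableau needs between the tables.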

Finally I loaded all tables into Tableau and started my relational database visualization.

Descriptive Analysis

Let's take a look at prevalent interventions across tumor types. In the first visualization below, the top heatmap breaks down cancers by intervention type. Not surprisingly, drug intervention was the most common across the board. Procedure interventions were also used for most cancer types. Traditional surgery was the most common procedure intervention (for resectable cancer types), followed by adjuvant therapy (e.g. chemotherapy) and blood stem cell transplantation, which is basically a way to restore stem cells after high doses of chemotherapy destroy them.

Biological interventions were the next most common. The majority of patients were treated with monoclonal antibodies, which help slow down cancerous cell growth, and/or leukocyte growth factor, which is used to prevent infections during chemotherapy.

Interestingly, Behavioral intervention (e.g. questionnaires) was mostly used in trials for breast, colon and skin cancers. Research suggests that women who have undergone mastectomy are more likely to have a stronger immunological response if they get psychological support than those who do not. Patients with colon cancer have to monitor their dietary behavior in order to improve their survival rates. Finally, skin cancer patients need to monitor their sun exposure in order to have a better treatment response.

Radiation was most common in head & neck, lung and prostate tumors (tumors with a high risk of spreading through the body and developing metastases).

Device and Genetic interventions were the least common across all cancer types.

For those interested in which interventions were used for each tumor type, I added a table on the left where you can specify cancer and intervention types and see the specific interventions applied in the selected condition.

Depending on the condition, some trials may take significantly longer to complete than expected. The longest trials were conducted in rectal, breast, head & neck and blood cancers, with average trial duration ranging from 4.7 to 5 years. Brain cancer had the shortest trials, with an average duration of 2.8 years. Another interesting metric to look at was the difference between a trial's expected completion date and its actual completion date. In other words, I wanted to see how late studies were in getting completed and how that varied by condition. Rectal, cervical and bladder cancers had the biggest gap between expected and actual completion dates (an average difference of 1 to 1.2 years). Melanoma trials had the smallest gap, 2.4 months.
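
Both metrics are simple date arithmetic averaged per condition. The sketch below uses made-up rows and assumes each trial carries a start date, an expected completion date and an actual completion date. The expected completion field is an assumption here, since it is not among the overview table fields listed earlier.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up rows: (condition, start, expected completion, actual completion).
# The expected completion column is an assumed field.
trials = [
    ("Brain", date(2008, 1, 1), date(2010, 1, 1), date(2010, 7, 1)),
    ("Brain", date(2009, 1, 1), date(2012, 1, 1), date(2012, 1, 1)),
    ("Rectal", date(2005, 1, 1), date(2009, 1, 1), date(2010, 1, 1)),
]

DAYS_PER_YEAR = 365.25


def years_between(earlier, later):
    """Elapsed time between two dates, expressed in years."""
    return (later - earlier).days / DAYS_PER_YEAR


durations = defaultdict(list)  # start -> actual completion
delays = defaultdict(list)     # expected -> actual completion
for condition, start, expected, actual in trials:
    durations[condition].append(years_between(start, actual))
    delays[condition].append(years_between(expected, actual))

avg_duration = {cond: mean(values) for cond, values in durations.items()}
avg_delay = {cond: mean(values) for cond, values in delays.items()}
```

A trial that finishes early would produce a negative delay under this definition, so whether to clip those to zero is a choice that affects the averages.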

Select a condition and an intervention type and see a breakdown of specific interventions for the selected tumor type

Demographics was another big factor in cancer clinical trials that I wanted to explore. The top heatmap of the visualization below displays the percentage of clinical trials conducted within an age group for each condition. For example, 76.39% of all trials conducted on children were in leukemia, 40.9% of adult trials were in breast cancer and 33.96% of trials among the senior population were in lung cancer.
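
The percentages in such a heatmap are counts normalized within each age group. A minimal sketch with made-up rows follows; it assumes the age group and condition tables have already been joined on trial ID, and the counts are not the article's data.

```python
from collections import Counter

# Made-up (trial_id, age_group, condition) rows, as if the age group and
# condition tables had already been joined on trial ID.
rows = [
    ("T1", "Child", "Leukemia"),
    ("T2", "Child", "Leukemia"),
    ("T3", "Child", "Leukemia"),
    ("T4", "Child", "Kidney"),
    ("T5", "Adult", "Breast"),
    ("T5", "Senior", "Breast"),
    ("T6", "Senior", "Lung"),
]

pair_counts = Counter((age, condition) for _, age, condition in rows)
age_totals = Counter(age for _, age, _ in rows)

# Percent of each age group's rows that fall in each condition.
share = {
    (age, condition): 100 * count / age_totals[age]
    for (age, condition), count in pair_counts.items()
}
```

Note that a trial listed under several conditions contributes one row per condition, so the denominator is trial-condition pairs rather than distinct trials.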

This heatmap confirms our understanding of cancer prevalence. The adult population is most often recruited for trials in breast cancer (41%), leukemia (27%) and cervical cancer (8.8%). Children are more likely to be recruited for leukemia (76%) and kidney (7%) trials. Senior patients end up in trials for lung (34%), colon (18%) and breast (17%) cancers. Although we see an overlap between age groups in leukemia (children and adults), for kids this type of cancer is the most prevalent. This may be explained by risk factors children are predisposed to from birth, e.g. genetic risk factors such as inherited syndromes or immune system problems. Similarly, breast cancer is prevalent in both adult and senior patients, but adults seem to exhibit higher prevalence than seniors. This one is a little easier to explain: women in their reproductive years are more likely to be diagnosed with breast cancer than those who have gone through menopause.

At the bottom you can see a breakdown of clinical trials by condition and trial phase. Most cancer trials (40-60%) are conducted in Phase II, which tests the safety of a new treatment and how well it works against a specific type of cancer. Brain cancer, however, stands out. There are more brain cancer clinical trials (32%) in Phase I than in any other phase. According to Cancer.net, Phase I clinical trials are used to show that a new treatment is safe for a small group of people and to find the best dose and schedule for future research on the drug or drug combination. This may explain why brain cancer trials are much shorter than trials in other tumor types, and it may also imply that brain cancer is the least developed area of research.

In the visualization below you also have an option to slice trial data by phase depending on the patients' age group. When I introduced this filter I was hoping to see more leukemia trials conducted in Phase III for children. Phase III trials compare a new treatment or treatments with the standard treatment in a large group of people. This would have indicated that more advances are happening in childhood leukemia treatment. Similarly, I was hoping to see the same picture for breast cancer in the adult population and for lung cancer in senior patients. Unfortunately, I could not pick up on this pattern. In children only 17% of trials were in Phase III compared to 40% in Phase II. In adults 20% of trials were in Phase III compared to 46% in Phase II. Finally, in the senior population 16% of trials were in Phase III compared to 50% in Phase II.
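
Slicing phases by age group amounts to joining the age group table to the phase table on trial ID and normalizing the counts. The sketch below uses made-up tables, and the table layouts and phase labels are assumptions made for illustration.

```python
from collections import Counter

# Made-up child tables keyed by trial ID (layouts are assumptions).
age_group_table = [
    ("T1", "Child"), ("T2", "Child"),
    ("T3", "Adult"), ("T4", "Adult"), ("T4", "Senior"),
]
phase_table = [
    ("T1", "Phase 2"), ("T2", "Phase 3"),
    ("T3", "Phase 2"), ("T4", "Phase 1"), ("T4", "Phase 2"),
]

# Index phases by trial ID so the join is a dictionary lookup.
phases_by_trial = {}
for trial_id, phase in phase_table:
    phases_by_trial.setdefault(trial_id, []).append(phase)


def phase_share(age_group):
    """Percent of (trial, phase) pairs in each phase for one age group."""
    counts = Counter(
        phase
        for trial_id, group in age_group_table
        if group == age_group
        for phase in phases_by_trial.get(trial_id, [])
    )
    total = sum(counts.values())
    return {phase: 100 * count / total for phase, count in counts.items()}


child_share = phase_share("Child")
adult_share = phase_share("Adult")
```

Trials that span two phases are counted once in each, which is one reason such shares may not match a count of distinct trials.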

Click on an age group and filter clinical trials by condition and trial phase for the selected age group


From a preliminary look at the clinical trial data we can already learn something about cancer. Apart from drug interventions, some intervention methods were more prevalent in certain cancer types. For example, behavioral interventions were more common in cancers where a patient's lifestyle could improve or worsen treatment outcomes (e.g. breast, colon and skin cancer). Biological interventions were more prevalent in aggressive cancers with low survival rates at the late stages when they are typically found, e.g. ovarian cancer (17% 5-year survival rate at stage IV) and pancreatic cancer (1% 5-year survival rate at stage IV). Radiation was more common in cancers that typically spread to other parts of the body faster (e.g. lung, prostate, rectal and head & neck cancers).

Most clinical trials last somewhere between 3.7 and 5 years and take on average 8.5 months longer than initially planned. Brain cancer is the only one that stands out: on average a brain cancer clinical trial lasts 2.8 years and ends only 4.8 months later than expected. But the reason does not seem to be that experimental brain cancer treatments get patients into remission faster. On the contrary, the primary goal of these trials is to prove that an experimental drug is safe for patients. Whether it works is not where the science of brain cancer is yet.

From the way patients are recruited for clinical trials we can observe general trends in diagnosis patterns. The prevalence of kids, adults and senior patients in trials by cancer type is consistent with the breakdown of age groups by tumor type. For example, leukemia is the most common cancer in children, breast and cervical cancers are prevalent in adults, and lung and colon cancers are more common in senior patients.

Finally, we saw that most clinical trials are conducted in Phase II, which assesses an experimental treatment for safety and effectiveness but does not compare it to existing standard treatments (that is what Phase III does). The conditions with the most trials in Phase III are anal, bladder, cervical, colon and rectal cancers, but even there the share of trials in Phase III ranges from 20% to 24%, which is still pretty low.

How to find actively recruiting clinical trials?

Cancer research is one of the most rapidly growing areas of clinical research in the US. According to the NIH Funding Estimates, about $5.6 billion will be spent on cancer research in 2016. This should mean that in the upcoming years there will be more studies conducted, more treatment options available and better survival rates achieved. If cancer has affected you, your family or people close to you and you are looking for new clinical trials, there is a clinical trial search tool for you below. Click on the cancer type in both filters in the middle of the visualization to find a breakdown of trials by recruitment status. If you or someone you care for is looking for an actively recruiting clinical trial, take a look at the list on the left and click on a trial ID. You will be directed to the site of an actively recruiting clinical trial in the cancer type you are interested in.


