Data science plays a pivotal role in the pharmaceutical industry, from revolutionizing drug discovery and development to delivering safer and more personalized healthcare solutions. With the vast amounts of data generated in the healthcare system — from patient records to clinical trials — data-driven insights can help identify potential drug candidates more efficiently, predict patient responses, optimize treatment regimens, and enhance medical research. 

Improving patient outcomes is always the goal, but integrating data science into the pharmaceutical sector is not without challenges, particularly in the handling of sensitive and confidential data. Striking the right balance between harnessing the potential of data and ensuring patient privacy requires robust data governance and strict regulatory compliance. Here, data science can also be instrumental in supporting advanced encryption techniques, anonymizing data, and developing secure platforms. As new solutions are integrated, close collaboration among pharmaceutical companies, data scientists, doctors, caregivers, and government is crucial to ensuring the responsible and ethical use of life-changing data.

These opportunities and challenges inspired Shruti Kaushal (M.S. Data Science ’23) to build her career in pharmaceutical data science. With a strong background in biology and mathematics, Kaushal completed an internship at Vertex Pharmaceuticals while enrolled in the MSDS program. In her current position as a Senior Data Scientist at AbbVie, a medical research and development company, her work ranges from predicting when a trial will finish enrolling patients to identifying anomalies within trial data before the results are submitted to the U.S. Food and Drug Administration (FDA).

At the heart of her work is helping to shorten the development life cycles of new drugs and medical devices, ultimately helping patients receive the treatments they need faster. She says the MSDS program’s focus on real-world approaches helped her better understand and work toward this important social impact. DSI caught up with Kaushal to talk about where data science in healthcare is heading.

This alumni interview series is part of our year-long celebration to mark the 10 Year Anniversary of the M.S. in Data Science Program. Follow more updates with #MSDS10.

Could you highlight the key trends, challenges, and opportunities that data scientists in this field should focus on?

It’s important to understand that the pharmaceutical industry is not as technically advanced as you’d expect a finance or tech company to be. This is due to the sheer amount of pressure and accountability it has and should have. However, it is catching up — pharma companies around the globe are moving towards aggregating all in-house datasets into an internal cloud for ease of access by cross-functional teams, which wasn’t possible until recently due to compliance policies and laws. This means having sequencing data, electronic health records (EHR), data from clinical trials, and commercial data all in one place for teams to use. 

The biggest challenge, however, remains: helping stakeholders trust data science techniques enough to adopt them. The next few years will see pharma adopt natural language processing (NLP) for mining healthcare personnel (HCP) and trial data; network analysis to develop personalized medication profiles; causal inference to optimize clinical trials; and machine learning in ways we can only imagine right now but that will soon become a reality!

How are you addressing these areas in your current role?

The types of data science challenges in pharma are extremely diverse and highly dependent on the stage of a drug’s development life cycle.

The impact of shortening a drug's development life cycle, even by a couple of years, by using data science as an effective and responsible tool is groundbreaking for patients and society. I would like my work to be a part of that effort.

One such example from clinical trials is accurately predicting when a trial will finish enrolling patients. This problem has many layers, as we have to consider the therapeutic area, number of contributing sites, performance of sites, patient population, and healthcare personnel (HCP) network to even begin modeling, not to mention frequent delays due to regulations. Most time series algorithms fail at predicting the number of health screens or enrollments due to the nature of the series.

Healthcare data is much sparser and more intermittent than many other types of data (like bank transactions), so traditional algorithms often fail. I am currently developing an ensemble model that picks up nuances at the therapeutic-area level before moving into regression, using random forests and various forecasting algorithms for intermittent time series.
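To illustrate what forecasting an intermittent series can look like, here is a minimal sketch of Croston's method, a classic technique for series that are mostly zeros. It is not the ensemble model described in the interview. The function name `croston_forecast`, the smoothing parameter `alpha`, and the weekly enrollment counts are all hypothetical and chosen only for illustration.

```python
def croston_forecast(series, alpha=0.1):
    """Estimate the mean count per period for an intermittent series.

    Croston's method smooths two quantities separately: the size of the
    non-zero observations and the number of periods between them.
    The forecast is their ratio.
    """
    size = None          # smoothed size of non-zero observations
    interval = None      # smoothed gap (in periods) between non-zero observations
    periods_since = 1    # periods elapsed since the last non-zero observation

    for value in series:
        if value > 0:
            if size is None:
                # Initialise both estimates at the first non-zero observation.
                size = float(value)
                interval = float(periods_since)
            else:
                # Update only when something was actually observed.
                size += alpha * (value - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1

    if size is None:
        return 0.0  # no non-zero observations at all
    return size / interval


# Hypothetical weekly enrollment counts for one trial site.
weekly_enrollments = [0, 0, 2, 0, 0, 0, 1, 0, 3, 0, 0, 1]
expected_per_week = croston_forecast(weekly_enrollments, alpha=0.2)
print(f"Estimated enrollments per week: {expected_per_week:.2f}")
```

Because zeros never update the estimates, a long quiet stretch does not drag the forecast toward zero the way ordinary exponential smoothing would. That is one reason methods of this family are often tried on sparse count data.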

Another example comes much later in the life cycle: before the findings of a completed trial are submitted to the FDA, data from every site is evaluated for quality and for numerous other things, such as drug efficacy. One of the biggest concerns any pharma company has is data fraud in a clinical trial, which leads to dismissal of findings. A project I’ve been working on for the past few months is aimed at finding anomalies in clinical trials using all kinds of data recorded during the trial, such as lab test results, vital signs, medical history, and adverse events. The algorithm uses statistical models to flag sites whose distribution differs from that of all other sites, while accounting for the randomness inherent in enrolling different patients.
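One simple way to picture "flagging a site whose distribution differs from all other sites" is a leave-one-site-out comparison with a permutation test. The sketch below is an assumption-laden illustration, not the algorithm used in the project described above. It compares each site with the pooled remaining sites using the two-sample Kolmogorov-Smirnov statistic, then reshuffles site labels to estimate how often chance alone would produce a gap that large. The names `ks_statistic` and `flag_sites`, the threshold, and the blood pressure readings are hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def flag_sites(values_by_site, n_permutations=500, threshold=0.01, seed=0):
    """Return {site: p_value} for sites that look unlike the rest."""
    rng = random.Random(seed)
    flagged = {}
    for site, values in values_by_site.items():
        # Pool the measurements from every other site.
        others = [v for s, vs in values_by_site.items() if s != site for v in vs]
        observed = ks_statistic(values, others)

        # Permutation test: shuffle the labels to see how large a gap arises
        # purely from which patients happened to enroll at which site.
        pooled = list(values) + others
        n = len(values)
        extreme = 0
        for _ in range(n_permutations):
            rng.shuffle(pooled)
            if ks_statistic(pooled[:n], pooled[n:]) >= observed:
                extreme += 1
        p_value = (extreme + 1) / (n_permutations + 1)

        if p_value < threshold:
            flagged[site] = p_value
    return flagged


# Hypothetical systolic blood pressure readings: five similar sites and
# one site whose readings sit far from everyone else's.
readings = {f"site_{k}": [100 + 5 * i + k for i in range(8)] for k in range(5)}
readings["site_F"] = [160 + i for i in range(8)]

suspicious = flag_sites(readings)
print(sorted(suspicious))
```

A flag from a check like this is a prompt for closer review, not evidence of fraud. A real pipeline would also need to handle many variables at once, correct for testing many sites, and account for legitimate differences in patient populations.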

What lessons from the MSDS program are you currently using in your career? What are some practical applications that you are using every day?

I loved the MSDS program because it is so well-planned. Since the focus of the MSDS program is on developing a strong foundation with in-depth knowledge of different algorithms and data management, I could jump straight to solving real-world problems, such as in my internships with Vertex Pharmaceuticals. In courses like Graphical Models for Complex Health Data, Applied Machine Learning, and Statistical Inference, I learned techniques that I use daily, like random forests, greedy equivalence search (GES) algorithms, and survival models (Cox proportional hazards models).

One of the biggest lessons I learned from the program is how to divide any data science problem or project into smaller, more nuanced questions and answer them in a structured manner using exploratory data analysis and intelligent visualizations before moving on to modeling.

What non-technical lessons did you find particularly valuable and how have they proven beneficial in your professional life?

I remember how John Hyde, DSI Assistant Director of Career Development and Alumni Services, used to stress creating and owning my personal brand in every conversation we had, and I’m so thankful that he did. I usually took his advice as a pep talk, but it was one of my most valuable takeaways. Working with non-technical stakeholders can be challenging, especially when it comes to healthcare. It takes vulnerability and confidence to build professional relationships when the stakes are so high. As Columbia graduates, exceptional work quality is expected, but what gives a person an edge is that human touch, as cliché as it may sound.

By bringing this to my work, I have been able to break down the barriers that stakeholders have around sharing important information and data. Hard work, accountability, and humility are the foundation of my approach as a data scientist. The reputation I have built with clinical trials teams as a data science expert who is upfront and solution-driven has given me the opportunity to expand my support to multiple programs, each with 3+ ongoing trials.

During your time at Columbia, you served as a Teaching Assistant for the Data Analysis for the Social Sciences course. Could you share your favorite learnings from this experience, and has it influenced your perspective on the societal impact of data science?

My favorite learnings from the course were from the final projects that students submitted. It was intriguing to see so many different perspectives on an eclectic mix of problems with potential societal impact. It strengthened my belief that an algorithm is only as strong as the person wielding it. I believe that every data scientist should focus on accountability and gaining contextual knowledge in order to wield this power responsibly. This cannot be said enough for people working in the pharma industry, which is highly regulated solely because of the impact it has on society and patients’ lives.