This report is a portion of the AMIA 2015 Tutorial on Using R for Healthcare Data Science. All code and data available at my GitHub page.
This report will walk you through the data scientist’s workflow and how recent R packages make data science easier and more intuitive. First, let’s start with a couple of disclaimers:
To illustrate how packages released over the past few years have made these tasks easier we will walk through an entire analysis plan using data published by the International Warfarin Pharmacogenomics Consortium available on the PharmGKB website.
Starting a new patient on warfarin can be a complicated process as many providers select a starting warfarin dose based on complex clinical algorithms. We know that genetics play a role in final warfarin dose and many groups have started to include genomic markers in their algorithms used to advise starting warfarin dose.
Our goal is to ultimately create a web app that a provider could use to input clinical and genetic data about a patient and get back a recommended starting dose of warfarin. One group that has already completed this task is the IWPC (International Warfarin Pharmacogenomics Consortium).
The main data set for the IWPC study is available on PharmGKB.
We will download the data from the original paper, and take it through the steps of the data scientist’s workflow - preparing, analyzing, and reporting. Ultimately we will create an interactive web app for our model much like the one produced by the IWPC.
Taking a cue from David Rob1, Data Scientist at Stack Overflow, and Philip Guo2, Assistant Professor of Computer Science at University of Rochester, here is my view of the primary computational data science workflow:
The preparation phase of the workflow involves:
The analysis phase consists of:
Finally, the dissemination phase to share the results of their work:
Over the past few years the growth in tools aiding these steps has been phenomenal. We will cover each of these as we move through the workflow steps, but here is a summary of the different packages I’ve found useful for these steps: