This report is a portion of the AMIA 2015 Tutorial on Using R for Healthcare Data Science. All code and data are available on my GitHub page.

Introduction

This report will walk you through the data scientist’s workflow and how recent R packages make data science easier and more intuitive. First, let’s start with a couple of disclaimers:

  1. This tutorial aims to give you a sense of what is possible with R and to motivate you to learn more, not to teach you every detail of the code or packages available.
  2. We will not spend extensive time on data modeling. This tutorial is intended to work through data janitor tasks and reporting, which in my experience are some of the most time-consuming tasks in data science.

To illustrate how packages released over the past few years have made these tasks easier, we will walk through an entire analysis plan using data published by the International Warfarin Pharmacogenomics Consortium, available on the PharmGKB website.

Motivation

Starting a new patient on warfarin can be a complicated process, as many providers select a starting dose using complex clinical algorithms. We know that genetics plays a role in the final warfarin dose, and many groups have started to include genomic markers in the algorithms they use to recommend a starting dose.

Our goal is to ultimately create a web app that a provider could use to input clinical and genetic data about a patient and get back a recommended starting dose of warfarin. One group that has already completed this task is the IWPC (International Warfarin Pharmacogenomics Consortium).

The main data set for the IWPC study is available on PharmGKB.

We will download the data from the original paper, and take it through the steps of the data scientist’s workflow - preparing, analyzing, and reporting. Ultimately we will create an interactive web app for our model much like the one produced by the IWPC.

The Data Scientist’s Workflow

Taking a cue from David Robinson [1], Data Scientist at Stack Overflow, and Philip Guo [2], Assistant Professor of Computer Science at the University of Rochester, here is my view of the primary computational data science workflow:

The preparation phase of the workflow involves:

  1. Getting data into R
  2. Data Tidying
    1. following the principles of tidy data [3]
    2. ensuring correct data types
  3. Data Manipulation to prepare for analysis
    1. adjusting date/times
    2. parsing strings
    3. creating/combining variables
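
To make the preparation steps concrete, here is a minimal sketch using only base R so that it runs without extra packages. The toy data and every column name (`subject_id`, `visit_date`, `genotype`, and so on) are invented for illustration and are not taken from the IWPC data set.

```r
# Toy data standing in for a real file; all column names are hypothetical.
raw_csv <- "subject_id,visit_date,height_cm,weight_kg,genotype
P001,2015-01-15,170,70,CYP2C9 *1/*1
P002,2015-02-03,160,NA,CYP2C9 *1/*3
P003,2015-03-20,182,95,CYP2C9 *2/*2"

# 1. Getting data into R (keep strings as character rather than factor)
patients <- read.csv(text = raw_csv, stringsAsFactors = FALSE)

# 2. Tidying: make sure each column has the intended type
patients$visit_date <- as.Date(patients$visit_date, format = "%Y-%m-%d")
patients$height_cm  <- as.numeric(patients$height_cm)
patients$weight_kg  <- as.numeric(patients$weight_kg)

# 3a. Adjusting dates: pull the visit month out of the date
patients$visit_month <- as.integer(format(patients$visit_date, "%m"))

# 3b. Parsing strings: split "GENE diplotype" into two columns
patients$gene      <- sub(" .*$", "", patients$genotype)
patients$diplotype <- sub("^[^ ]+ ", "", patients$genotype)

# 3c. Creating variables: body mass index from height and weight
patients$bmi <- patients$weight_kg / (patients$height_cm / 100)^2

str(patients)
```

The missing weight for the second subject carries through as a missing BMI, which is usually what you want at this stage: decisions about imputing or dropping records belong in an explicit later step.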

The analysis phase consists of:

  1. Data Modeling (e.g., statistics, machine learning)
  2. Model Tidying and Manipulation
    1. turn R model objects into clean tables
    2. compare different models
  3. Data Visualization
    1. graphs and tables of data
    2. graphs and tables of model results
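
The modeling and model-tidying steps can be sketched in base R as follows. The data are simulated, and the variable names and effect sizes are made up for the example; they are not estimates from the IWPC study. Plotting is left out to keep the sketch short.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated data; variable names and effect sizes are invented for illustration.
n <- 200
toy <- data.frame(
  age             = round(runif(n, 20, 90)),
  weight_kg       = round(rnorm(n, 80, 15)),
  variant_carrier = rbinom(n, 1, 0.3)
)
toy$dose <- 40 - 0.2 * toy$age + 0.1 * toy$weight_kg -
  8 * toy$variant_carrier + rnorm(n, sd = 3)

# 1. Data modeling: two candidate linear models
fit_clinical <- lm(dose ~ age + weight_kg, data = toy)
fit_genetic  <- lm(dose ~ age + weight_kg + variant_carrier, data = toy)

# 2a. Model tidying: turn the coefficient matrix into a plain data frame
coef_mat <- coef(summary(fit_genetic))
coef_table <- data.frame(
  term      = rownames(coef_mat),
  estimate  = coef_mat[, "Estimate"],
  std_error = coef_mat[, "Std. Error"],
  p_value   = coef_mat[, "Pr(>|t|)"],
  row.names = NULL
)
print(coef_table)

# 2b. Comparing models: one row per model with common fit statistics
comparison <- data.frame(
  model         = c("clinical", "clinical + genetic"),
  adj_r_squared = c(summary(fit_clinical)$adj.r.squared,
                    summary(fit_genetic)$adj.r.squared),
  aic           = c(AIC(fit_clinical), AIC(fit_genetic))
)
print(comparison)
```

Once model output is in ordinary data frames like `coef_table` and `comparison`, it can be filtered, joined, and plotted with the same tools used for the raw data.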

Finally, the dissemination phase shares the results of the work:

  1. Writing Reports (e.g., technical reports that show analysis steps, which are great for sharing with analysts and for reproducible research)
  2. Publishing (either as formatted journal articles or reports for non-technical readers)
  3. Web Applications (interactive visualization tools)

Over the past few years the growth in tools aiding these steps has been phenomenal. We will cover each of these as we move through the workflow steps, but here is a summary of the different packages I’ve found useful for these steps: