This report is a portion of the AMIA 2015 Tutorial on Using R for Healthcare Data Science. All code and data available at my GitHub page.

Introduction

This report will walk you through the data scientist’s workflow and how recent R packages make data science easier and more intuitive. First, let’s start with a couple of disclaimers:

  1. This tutorial is to give you a sense of what is possible with R and motivate you to learn more - not to teach you every detail of code or packages available.
  2. We will not spend extensive time on data modeling. This tutorial is intended to work through data janitor tasks and reporting - in my experience some of the most time consuming tasks of data science.

To illustrate how packages released over the past few years have made these tasks easier we will walk through an entire analysis plan using data published by the International Warfarin Pharmacogenomics Consortium available on the PharmGKB website.

Motivation

Starting a new patient on warfarin can be a complicated process as many providers select a starting warfarin dose based on complex clinical algorithms. We know that genetics play a role in final warfarin dose and many groups have started to include genomic markers in their algorithms used to advise starting warfarin dose.

Our goal is to ultimately create a web app that a provider could use to input clinical and genetic data about a patient and get back a recommended starting dose of warfarin. One group that has already completed this task is the IWPC (International Warfarin Pharmacogenomics Consortium).

The main data set for the IWPC study is available on PharmGKB.

We will download the data from the original paper, and take it through the steps of the data scientist’s workflow - preparing, analyzing, and reporting. Ultimately we will create an interactive web app for our model much like the one produced by the IWPC.

The Data Scientist’s Workflow

Taking a cue from David Rob1, Data Scientist at Stack Overflow, and Philip Guo2, Assistant Professor of Computer Science at University of Rochester, here is my view of the primary computational data science workflow:

The preparation phase of the workflow involves:

  1. Getting data into R
  2. Data Tidying
    1. following principles of tidy data3
    2. ensuring correct data types
  3. Data Manipulation to prepare for analysis
    1. adjusting date/times
    2. parsing strings
    3. creating/combining variables

The analysis phase consists of:

  1. Data Modeling (e.g., statistics, machine learning etc.)
  2. Model Tidying and Manipulation
    1. turn R model objects into clean tables
    2. compare different models
  3. Data Visualization
    1. graphs and tables of data
    2. graphs and table of model results

Finally, the dissemination phase to share the results of their work:

  1. Writing Reports (e.g., technical reports that show analysis steps - great for sharing with analysts, and reproducible research)
  2. Publishing (either as formatted journal articles or reports for non-technical readers)
  3. Web Applications (interactive visualization tools)

Over the past few years the growth in tools aiding these steps has been phenomenal. We will cover each of these as we move through the workflow steps, but here is a summary of the different packages I’ve found useful for these steps:

Preparation Phase

Getting Data into R

Let’s load up our IWPC data! We will be using a slightly modified form of the main data set, that I have manually turned into a tab delimited text file. Although there are a number of libraries to read in excel files, the non-standard column names in the data set make it easier to work with a tsv. We are going to use read.delim() as opposed to readr’s read_tsv() for two reasons:

  1. The non-standard column names (contains spaces, returns, and symbols)
  2. Changing data types. This data is from a consortium and contains different types of data in each column based on the study site.

This last reason is the deal breaker for readr. Readr interpolates the variable type (column, date, number, etc.) based on the first 100 rows or via manual specification. Given the large number of columns (22) this becomes annoying at best. However, since we can’t take advantage of readr automatically making a tbl_df() object, so we will have to do so manually.


iwpc_data <- read.delim(file = "iwpc_data_7_3_09_revised3.txt") %>% tbl_df()


Let’s take a look at the type of data we are working with.



It is also important in R to know what data types are being used, as this can affect the behaviour of some functions.

iwpc_data %>% 
  map(~class(.x)) %>% 
  t() %>% 
  as.data.frame() %>% 
  mutate(Variable_Name = rownames(.), Variable_Type = V1) %>% 
  select(Variable_Name, Variable_Type) %>% 
  datatable(rownames = FALSE, options = list(paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')


Tidying your Data

Looking at our data above we see there are a number of problems:

  1. Column names are not standard.
  2. Estimated Target INR an excel adverse reaction.

Let’s first fix the column names - which is easy to do with dplyr.

iwpc_data %<>% 
  rename(subject_id = PharmGKB.Subject.ID,
         sample_id = PharmGKB.Sample.ID,
         project_site = Project.Site,
         gender = Gender,
         race_reported = Race..Reported.,
         race_omb = Race..OMB.,
         ethnicity_reported = Ethnicity..Reported.,
         ethnicitiy_omb = Ethnicity..OMB.,
         age = Age,
         height = Height..cm.,
         weight = Weight..kg.,
         indication = Indication.for.Warfarin.Treatment,
         comorbidities = Comorbidities,
         medications = Medications,
         target_inr = Target.INR,
         target_inr_estimated = Estimated.Target.INR.Range.Based.on.Indication,
         reached_stable_dose = Subject.Reached.Stable.Dose.of.Warfarin,
         therapeutic_warfarin_dose = Therapeutic.Dose.of.Warfarin,
         inr_on_warfarin = INR.on.Reported.Therapeutic.Dose.of.Warfarin,
         smoker = Current.Smoker,
         cyp2c9_consensus = CYP2C9.consensus,
         vkorc1_1639_consensus = VKORC1..1639.consensus)

To fix the Target INR problem, we will need to use some basic string functions from stringr. First though let’s get a good look at the extent of the problem by looking at all the distinct values in this field.

iwpc_data %>% 
  count(target_inr_estimated) %>% 
  datatable(rownames = FALSE, colnames = c("Target INR", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')


From this it looks like we have two values that need to be changed: 3-Feb and 4-Mar. These correspond to the ranges 2-3 and 3-4 respectively. We can use mutate the make this change, but it will require a bit of effort.

As of the writing of this tutorial there is not a conditional mutate function in dplyr (though it’s being discussed by the development team). Because of that, we must make our own conditionals with ifelse().

iwpc_data %<>% 
  mutate(target_inr_estimated = as.character(target_inr_estimated)) %>% 
  mutate(target_inr_estimated = ifelse(target_inr_estimated == "3-Feb",
                                       yes = "2-3", 
                                       no = ifelse(target_inr_estimated == "4-Mar",
                                                   yes = "3-4", 
                                                   no = target_inr_estimated)))

And then checking our work:

iwpc_data %>% 
  count(target_inr_estimated) %>% 
  datatable(rownames = FALSE, colnames = c("Target INR", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')

Manipulating Data

In this case since we are trying to replicate an existing analysis plan we have the easy job of just trying to massage our data into the variables used the previous model. In this case they used a lot of dummy coded variables in their analysis.

They Used:

  • Age in decades = 1 for 10-19, etc…
  • VKORC1 G/A = 1 if heterozygous
  • VKORC1 A/A = 1 if homozygous for A
  • VKORC1 genotype unknown = 1
  • CYP2C9 *1/*2 = 1 if *1/*2
  • CYP2C9 *1/*3 = 1 if *1/*3
  • CYP2C9 *2/*2 = 1 if homozygous *2
  • CYP2C9 *2/*3 = 1 if *2/*3
  • CYP2C9 *3/*3 = 1 if homozygous *3
  • CYP2C9 genotype unknown = 1
  • Asian Race = 1
  • Black/African American = 1
  • Missing or Mixed race = 1
  • Amiodarone status = 1
  • Enzyme inducer status = 1 if any of: rifampin, carbamazepine, phenytoin, rifampicin

We Have:

  • Age: 10-19, 20-29, 30-39 etc.
  • VKORC1: A/A, A/G, G/G
  • CYP2C9: combinations of: *1, *2, *3, *5, *6, *8, *11, etc.
  • Race: Asian, Black or African America, White, Other
  • Medications: character list of medications, lack of medications, etc.

Based on this it looks like we will need to:

  1. Adjust age to extract decade only
  2. Dummy Code VKORC1 Genotype
  3. Dummy Code CYP2C9 Genotype
  4. Dummy Code Race
  5. Extract from medications:
    1. Amiodarone
    2. Enzyme Inducers - specifically: rifampin, carbamazepine, phenytoin, rifampicin

Age Processing

Again, always look at your data to make sure it follows the formatting you’re expecting.

iwpc_data %>% 
  count(age) %>% 
  datatable(rownames = FALSE, colnames = c("Age", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')


Ok let’s fix the excel nightmare:

iwpc_data %<>% 
  mutate(age = as.character(age)) %>% 
  mutate(age = ifelse(age == "19-Oct", 
                      yes = "10 - 19", no = age))

Confirm it’s fixed:

Now to the fun stuff. In this case we really only need the first number of the string. This is not the safest transformation, but it makes the processing easier and code less complex than coding ifelse() statements for each case. I always like to look at the impact of my work before confirming the change.

iwpc_data %>% 
  count(age, 
        substr(age,1,1), 
        as.numeric(substr(age,1,1))) %>% 
  datatable(rownames = FALSE, colnames = c("Age", "Substring of Age", "Numeric Version of Substring", "N"),  options = list(order = list(3, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')

iwpc_data %<>% 
  mutate(age_decades = as.numeric(substr(age,1,1)))


#### Dummy Code VKORC1 Genotypes

Again first look at the genotypes.

iwpc_data %>% 
  count(vkorc1_1639_consensus) %>% 
  datatable(rownames = FALSE, colnames = c("VKORC1 Genotype", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')


We can follow the same logic as before with the ifelse() statements.

iwpc_data %>% 
  mutate(vkorc1_1639_ag = ifelse(str_detect(vkorc1_1639_consensus,"A/G"),
                                 yes = 1, no = 0),
         vkorc1_1639_aa = ifelse(str_detect(vkorc1_1639_consensus, "A/A"),
                                 yes = 1,no = 0),
         vkorc1_1639_unknown = ifelse(is.na(vkorc1_1639_consensus),
                                      yes = 1,no = 0)) %>% 
  count(vkorc1_1639_consensus,vkorc1_1639_ag,vkorc1_1639_aa,vkorc1_1639_unknown) %>% 
  datatable(colnames = c("VKORC1 1639","VKORC1 A/G","VKORC1 A/A","VKORC1 Unknown","N"), rownames = FALSE, options = list(pageLength = 12, bFilter = FALSE, info = FALSE, paging = FALSE))


Wait! I thought we set the values for VKORC1 A/G and VKORC1 A/A to 0 if it didn’t match the regular expression! Why are those fields blank when the VKORC1 genotype was missing?

Well that has to do with how R handles NA values. If you don’t know what the value for that field is, R has no idea if the regex matches. Hypothetically that person could match that genotype we just don’t know. Because of this we have to adjust our approach. Now instead of defining you by what you are, we’ll group you by what you are not: If you are not NA and you are not equal to A/G we set you to 0 otherwise (e.g., you equal A/G) we set you to one.

iwpc_data %<>% 
  mutate(vkorc1_1639_ag = ifelse(is.na(vkorc1_1639_consensus) | 
                                   !str_detect(vkorc1_1639_consensus,"A/G"),
                                 yes = 0,  no = 1),
         vkorc1_1639_aa = ifelse(is.na(vkorc1_1639_consensus) | 
                                   !str_detect(vkorc1_1639_consensus, "A/A"),
                                 yes = 0, no = 1),
         vkorc1_1639_unknown = ifelse(is.na(vkorc1_1639_consensus),
                                      yes = 1, no = 0))

And checking our work:


Dummy Code CYP2C9 Genotype

Based on the same logic as VKORC1, we will use the exclusionary strategy for dummy coding CYP2C9 genotype. Note that here we have a lot more genotypes than we use in the model.

iwpc_data %>% 
  count(cyp2c9_consensus) %>% 
  datatable(rownames = FALSE, colnames = c("CYP2C9 Genotype", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')


iwpc_data %<>% 
  mutate(cyp2c9_1_2 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*1/\\*2"),
                             yes = 0, no = 1),
         cyp2c9_1_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*1/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_2_2 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*2/\\*2"),
                             yes = 0, no = 1),
         cyp2c9_2_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*2/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_3_3 = ifelse(is.na(cyp2c9_consensus) |
                               !str_detect(cyp2c9_consensus,"\\*3/\\*3"),
                             yes = 0, no = 1),
         cyp2c9_unknown = ifelse(is.na(cyp2c9_consensus),
                                 yes = 1,no = 0))

Checking our work:

Dummy Code Race

iwpc_data %>% 
  count(race_omb) %>% 
  datatable(rownames = FALSE, colnames = c("Race", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')


This is clean and easy to fix with mutate and ifelse().

iwpc_data %<>% 
  mutate(asian = ifelse(str_detect(race_omb, "Asian"),
                        yes = 1,
                        no = 0),
         african_american = ifelse(str_detect(race_omb, "Black or African American"),
                                   yes = 1, 
                                   no = 0),
         missing_or_mixed_race = ifelse(str_detect(race_omb, "Unknown"),
                                        yes = 1,
                                        no = 0))

Checking our work:

iwpc_data %>% 
  count(race_omb, asian, african_american, missing_or_mixed_race) %>% 
  datatable(colnames = c("Race OMB","Asian","African American","Missing/Mixed Race","N"), rownames = FALSE, options = list(pageLength = 12, bFilter = FALSE, info = FALSE, paging = FALSE))


Process Medications

The medications column is an ugly beast. Remember how this data came from multiple studies? Well that means this field has a lot going on. Some sites asked about specific drugs and only include if they took those drugs or not. Others are from EHR linked databases and they simply exported the medication list of the patient (don’t even begin to ask which medication list - timepoint, copy/paste hold overs etc.!).

Let’s take a quick look at the format to see what we’re getting ourselves into.

iwpc_data %>% 
  count(medications) %>% 
  datatable(rownames = FALSE, colnames = c("Medications", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE, scrollY = '300px'))


Amiodarone

Let’s filter the medications to look at those matching amiodarone

iwpc_data %>% 
  filter(str_detect(medications, "amiodarone")) %>% 
  count(medications) %>% 
  datatable(rownames = FALSE, colnames = c("Medications", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE, scrollY = '300px'))

From this look we know that there are a lot of qualifiers and other textual clues. A really great trick for regex development is to use str_extract to get just the snippet of text mentioning amiodarone. Before we get there though, always make sure your more complex regex is grabbing the same number of rows as your general regex.

iwpc_data %>% filter(str_detect(medications, "amiodarone")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1  1160
iwpc_data %>% filter(str_detect(medications, "(^|;)[a-z ]*amiodarone[a-z ]*($|;)")) %>% count()
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1  1160

Then look at the text snippet:

iwpc_data %>% 
  mutate(amiodarone_text = str_extract(medications, "(^|;)[a-z ]*amiodarone[a-z ]*($|;)")) %>% 
  count(amiodarone_text) %>% 
  datatable(rownames = FALSE, colnames = c("Amiodarone_Snippet", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')

Now let’s write a regex to extract amiodarone only where it does not say, “not” or “no” amiodarone.

iwpc_data %>% 
  mutate(amiodarone_text = str_extract(medications, "(^|;)[a-z ]*amiodarone[a-z ]*($|;)"), 
         amiodarone_bool = ifelse( !is.na(medications) & str_detect(medications, "(?<!not? )amiodarone"), 
                                   yes = 1, 
                                   no = 0)) %>% 
  count(amiodarone_text, amiodarone_bool) %>% 
  datatable(rownames = FALSE, colnames = c("Amiodarone_Snippet", "Amiodarone_Detector", "N"),  options = list(order = list(1, "dsc"),  paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')

Great! Our regex works let’s implement it!

iwpc_data %<>% 
  mutate(amiodarone = ifelse( !is.na(medications) & str_detect(medications, "(?<!not? )amiodarone"), 
                              yes = 1, 
                              no = 0))
Enzyme Enducers

I will leave it an excercise for you to go through the regex development process with these drugs, but here are the finished regexes.

iwpc_data %<>% 
  mutate(carbamazepine = ifelse(!is.na(medications) & str_detect(medications,"(?<!not )carbamazepine"), yes = 1, no = 0),
         phenytoin = ifelse(!is.na(medications) & str_detect(medications,"(?<!not )phenytoin"),yes = 1,no = 0),
         rifampin = ifelse(!is.na(medications) & str_detect(medications,"(?<!not )rifampin"),yes = 1,no = 0),
         rifampicin = ifelse(!is.na(medications) & str_detect(medications,"(?<!not )rifampicin"),yes = 1,no = 0))

Remember though, we only need enzyme inducer status - i.e. did the patient take any of these drugs? Thankfully an easy way to create this variable is just take add up the four medications columns - if it is greater than 1 they took at least one of the medications!

iwpc_data %<>% 
  mutate(enzyme_inducers = ifelse((carbamazepine + phenytoin + rifampin + rifampicin) > 0, yes = 1, no = 0))

Checking our data:

Analysis Phase

Phew! Now that our data cleaning is finished, let’s get down to the fun - modeling our data!

I’m a big proponent of visualizing your data to make sure there’s nothing wonky happening. Let’s take a look at the outcome of interest: stable warfarin dose.

iwpc_data %>% 
  ggplot(aes(x = 1, y = therapeutic_warfarin_dose)) + geom_boxplot()

Oh, that’s not pretty. In fact it is common for warfarin dose to use the sqrt of the final dose.

iwpc_data %>% 
  ggplot(aes(x = 1, y = sqrt(therapeutic_warfarin_dose))) + geom_boxplot()

Let’s make a transformed outcome variable that is the square root of the therapeutic dose.

iwpc_data %<>% mutate(sqrt_warfarin_dose = sqrt(therapeutic_warfarin_dose))

Modeling

We can use the lm() function to run a linear model

iwpc_data %>% 
  lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag + vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 + cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + cyp2c9_unknown + asian + african_american + missing_or_mixed_race + amiodarone + enzyme_inducers)
## 
## Call:
## lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag + 
##     vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 + 
##     cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + cyp2c9_unknown + asian + 
##     african_american + missing_or_mixed_race + amiodarone + enzyme_inducers, 
##     data = .)
## 
## Coefficients:
##           (Intercept)            age_decades         vkorc1_1639_ag  
##               8.29530               -0.28463               -0.80405  
##        vkorc1_1639_aa    vkorc1_1639_unknown             cyp2c9_1_2  
##              -1.58282               -0.58581               -0.47377  
##            cyp2c9_1_3             cyp2c9_2_2             cyp2c9_2_3  
##              -0.90187               -1.09126               -1.87136  
##            cyp2c9_3_3         cyp2c9_unknown                  asian  
##              -2.50973               -0.38949               -0.67080  
##      african_american  missing_or_mixed_race             amiodarone  
##              -0.06755               -0.34883               -0.67218  
##       enzyme_inducers  
##               0.54552
iwpc_data %>% 
  lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag + vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 + cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + cyp2c9_unknown + asian + african_american + missing_or_mixed_race + amiodarone + enzyme_inducers) %>% 
  summary()
## 
## Call:
## lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag + 
##     vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 + 
##     cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + cyp2c9_unknown + asian + 
##     african_american + missing_or_mixed_race + amiodarone + enzyme_inducers, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9270 -0.6938 -0.0181  0.6221 11.3955 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8.29530    0.07208 115.085  < 2e-16 ***
## age_decades           -0.28463    0.01026 -27.752  < 2e-16 ***
## vkorc1_1639_ag        -0.80405    0.04491 -17.902  < 2e-16 ***
## vkorc1_1639_aa        -1.58282    0.05456 -29.013  < 2e-16 ***
## vkorc1_1639_unknown   -0.58581    0.04437 -13.204  < 2e-16 ***
## cyp2c9_1_2            -0.47377    0.04653 -10.182  < 2e-16 ***
## cyp2c9_1_3            -0.90187    0.05349 -16.862  < 2e-16 ***
## cyp2c9_2_2            -1.09126    0.15021  -7.265 4.26e-13 ***
## cyp2c9_2_3            -1.87136    0.13740 -13.620  < 2e-16 ***
## cyp2c9_3_3            -2.50973    0.24490 -10.248  < 2e-16 ***
## cyp2c9_unknown        -0.38949    0.10217  -3.812 0.000139 ***
## asian                 -0.67080    0.04474 -14.992  < 2e-16 ***
## african_american      -0.06755    0.05795  -1.166 0.243853    
## missing_or_mixed_race -0.34883    0.05453  -6.396 1.72e-10 ***
## amiodarone            -0.67218    0.08104  -8.294  < 2e-16 ***
## enzyme_inducers        0.54552    0.22410   2.434 0.014953 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.091 on 5473 degrees of freedom
##   (211 observations deleted due to missingness)
## Multiple R-squared:  0.4043, Adjusted R-squared:  0.4027 
## F-statistic: 247.7 on 15 and 5473 DF,  p-value: < 2.2e-16

Model Tidying and Manipulation

So if we were running multiple models or trying to do any processing of these models it would be really annoying to work with the above output. In fact R does not even store the p-value for each coefficient. Those are all calculated on fly when you view the summary of the model. Thankfully David Rob made a package called Broom that fixes this beautifully!

model <-  iwpc_data %>% lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag + vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 + cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + cyp2c9_unknown + asian + african_american + missing_or_mixed_race + amiodarone + enzyme_inducers)

warfarin_pharmacogenomic_model <- tidy(model)
warfarin_pharmacogenomic_model


We can also see the overall model fit information in a clean dataframe:

glance(model) %>% 
  datatable(options = list(paging = FALSE, bFilter = FALSE, info = FALSE, scrollX = TRUE, columnDefs = list(list(className = "dt-center", targets = c(0:11)))))

Data and Model Visualization

Broom makes it easy to make plots with the model details. Let’s use a forest plot to visualize the betas of the model.

warfarin_pharmacogenomic_model %>% 
  filter(term != "(Intercept)") %>% 
  mutate(variable = factor(term, levels = rev(c("age_decades", "asian","african_american","missing_or_mixed_race", "amiodaron","enzyme_inducers","vkorc1_1639_unkown","vkorc1_1639_ag","vkorc1_1639_aa","cyp2c9_unknown", "cyp2c9_1_2","cyp2c9_1_3","cyp2c9_2_2","cyp2c9_2_3","cyp2c9_3_3")))) %>% 
  ggplot() +
  geom_pointrange(aes(x = variable, y = estimate, ymin = estimate - std.error, ymax = estimate + std.error)) + 
  coord_flip()

We can also visualize the model fit easily with the package ggfortify.

autoplot(model)

Even if you are ultimately going to put a table with the summary statistics describing your dataset, I strongly recommend plotting your data at somepoint during your data QC and analysis. I have discovered a number of problems over the years that were only visible in plots, but looked fine otherwise.

Let’s make plots for each of the variables in our model and then use cowplot to put them into a single figure.

warfarin_dose <- iwpc_data %>% 
  ggplot(aes(x = therapeutic_warfarin_dose)) + geom_histogram() + 
  xlab("Weekly Warfarin Dose") + ylab("Count")

age <- iwpc_data %>% 
  ggplot(aes(x = age)) + geom_histogram() + 
  xlab("Age") + ylab("Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

race <- iwpc_data %>% 
  ggplot(aes(x = race_omb)) + geom_histogram() +
  xlab("Race") + ylab("Count") +
  scale_x_discrete(breaks = c("Asian", "Black or African American", "White", "Unknown"), labels = c("Asian", "Black", "White", "Unk."))

vkorc1 <- iwpc_data %>% 
  ggplot(aes(x = vkorc1_1639_consensus)) + geom_histogram() +
  xlab("VKORC1 Genotype") + ylab("Count")

cyp2c9 <- iwpc_data %>% 
  ggplot(aes(x = cyp2c9_consensus)) + geom_histogram() +
  xlab("CYP2C9 Genotype") + ylab("Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


ggdraw() +
  draw_plot(warfarin_dose, x = 0, y = 0.75, width = 1, height = 0.25) +
  draw_plot(age, x = 0, y = 0.5, width = 1, height = 0.25) +
  draw_plot(cyp2c9, x = 0, y = 0.25, width = 1, height = 0.25) +
  draw_plot(vkorc1, x = 0, y = 0, width = 0.5, height = 0.25) +
  draw_plot(race, x = 0.5, y = 0, width = 0.5, height = 0.25)

Dissemination Phase

Writing Reports and Publishing

RMarkdown is a great tool that offers a lot of flexibility for report generation. You can knit the same document into html, pdf, or a word doc. You can also even use R Markdown to make slides! It’s easy to change output document type in R Studio.

There are great resource documents online - including a 2 page cheatsheet and a complete reference guide that covers almost everything you need to know.

Deploying Online Tools

Let’s make a Shiny Application that takes inputs, uses the model weights we developed and gives out a predicted warfarin dose for the visitor:

You can run the Shiny App locally:

library(shiny)
runGist("cafba2c579b6922c4956")

Here is the code powering that Application. The details of this code are beyond the scope of the tutorial, but I highly recommend the Shiny tutorial RStudio has created.

shinyApp(
  
  ui = pageWithSidebar(
    headerPanel("Warfarin Pharmagenomic Dose Predictor"),
    sidebarPanel(selectInput("age", "Age in Decades:", choices = c("10-19","20-29","30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")),
                 radioButtons("race", "Race:", choices = c( "Other or Unknown", "Asian", "African American", "White")),
                 radioButtons("vkorc1", "VKORC1 Genotype:", choices = c("Unknown", "G/G", "G/A", "A/A")),
                 radioButtons("cyp2c9", "CYP2C9 Genotype:", choices = c("Unknown", "*1/*2", "*1/*3", "*2/*2", "*2/*3", "*3/*3","Other")), 
                 checkboxInput('amiodarone', "Taking Amiodarone", FALSE),
                 checkboxInput("enzyme_inducers", "Taking an Enzyme Inducer (rifampin, carbamazepine, phenytoin or rifampicin)", FALSE),
                 actionButton("calc","Calculate")
                 ),
    mainPanel(
      strong(em("THIS IS A PROGRAMMING EXAMPLE ONLY - DO NOT USE FOR PATIENT CARE!")),
      br(),br(),
      strong(em("IF YOU HAVE QUESTIONS ABOUT YOUR WARFARIN DOSE, PLEASE CONTACT YOUR DOCTOR.")),
      br(),br(),
      p("You Selected: "),
      tableOutput("selectedvalues"),
      br(),br(),
      textOutput("warfarindose")
    )
  ),
  
  server = function(input, output){
    library(dplyr)
    library(tidyr)
    
    input_model <- eventReactive(input$calc, {
      data.frame(Age = input$age,
                 Race = input$race,
                 VKORC1 = input$vkorc1,
                 CYP2C9 = input$cyp2c9,
                 On_Amiodarone = input$amiodarone,
                 On_Enzyme_Inducers = input$enzyme_inducers)
    })
    
    output$selectedvalues <- renderTable({input_model() %>% 
        gather(key = Variable, value = Selection)})
    

    
    output$warfarindose <- renderText({
    warfarin_model <- structure(list(term = c("intercept", "age_decades", "vkorc1_1639_ag", "vkorc1_1639_aa", "vkorc1_1639_unknown", "cyp2c9_1_2", "cyp2c9_1_3", "cyp2c9_2_2", "cyp2c9_2_3", "cyp2c9_3_3", "cyp2c9_unknown", "asian", "african_american", "missing_or_mixed_race", "amiodarone", "enzyme_inducers"), 
                                     estimate = c(8.29529623468211, -0.284625991543081, -0.804050018868126, -1.58281937447931, -0.585810870878652, -0.473767057887023, -0.901866452197908, -1.09125866824621, -1.87135980567376, -2.50972717609401, -0.389492614878111, -0.670801757840009, -0.0675483297437883, -0.348827044586359, -0.672182905952411, 0.54552327241645), 
                                     std.error = c(0.072079981014331, 0.0102560287706372, 0.0449131398786503, 0.0545551821972256, 0.0443661947089462, 0.0465310858226987, 0.0534852345292653, 0.150214643924848, 0.137395891813837, 0.244898760990986, 0.102165232807463, 0.0447428773165641, 0.0579547790963726, 0.0545341398105267, 0.0810398751212533, 0.224099351950369), 
                                     statistic = c(115.084606265821, -27.7520664097549, -17.9023337277369, -29.0131809798961, -13.2039917942416, -10.1817322658718, -16.8619706005858, -7.26466235071039, -13.6202020378406, -10.2480190832259, -3.81237926224998, -14.9923696925874, -1.16553510852768, -6.39648935141037, -8.29447114703323, 2.43429205693226), 
                                     p.value = c(0, 1.03038776788502e-158, 1.064652478523e-69, 2.79523401421071e-172, 3.29186868870607e-39, 3.92172887720116e-24, 3.11993851496051e-62, 4.26388815028701e-13, 1.43668477234196e-41, 2.00539244089927e-24, 0.00013914798310481, 7.94475752194028e-50, 0.24385325010913, 1.72192992300359e-10, 1.35992655573032e-16, 0.0149527114251593)), 
                                .Names = c("term", "estimate", "std.error", "statistic", "p.value"), 
                                row.names = c(NA, -16L), 
                                class = "data.frame")
    
    predicted_dose <- input_model() %>% 
      mutate(intercept = 1,
             age_decades = as.numeric(substr(Age, 1, 1)),
             vkorc1_1639_ag = ifelse(VKORC1 == "G/A", 1, 0),
             vkorc1_1639_aa = ifelse(VKORC1 == "A/A", 1, 0),
             vkorc1_1639_unknown = ifelse(VKORC1 == "Unknown", 1, 0),
             cyp2c9_1_2 = ifelse(CYP2C9 == "*1/*2", 1, 0),
             cyp2c9_1_3 = ifelse(CYP2C9 == "*1/*3", 1, 0),
             cyp2c9_2_2 = ifelse(CYP2C9 == "*2/*2", 1, 0), 
             cyp2c9_2_3 = ifelse(CYP2C9 == "*2/*3", 1, 0), 
             cyp2c9_3_3 = ifelse(CYP2C9 == "*3/*3", 1, 0), 
             cyp2c9_unknown = ifelse(CYP2C9 == "Unknown", 1, 0), 
             asian = ifelse(Race == "Asian", 1, 0),
             african_american = ifelse(Race == "African American", 1, 0),
             missing_or_mixed_race = ifelse(Race == "Other or Unknown", 1, 0), 
             amiodarone = ifelse(On_Amiodarone, 1, 0),
             enzyme_inducers = ifelse(On_Enzyme_Inducers, 1, 0)) %>% 
      select(-c(Age:On_Enzyme_Inducers)) %>% 
      gather(key = term, value = value) %>% 
      mutate(term = as.character(term)) %>% 
      inner_join(warfarin_model) %>% 
      mutate(weighted = value * estimate) %>% 
      summarise(round(sum(weighted)^2))
    
    
    paste0("Based on the values entered, the predicted warfarin dose is: ", predicted_dose, "mg per week, or ~", round(predicted_dose/7),"mg per day.")
  })
  },
  
  options = list(height = 1000)
)

  1. http://varianceexplained.org/r/broom-slides/

  2. http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

  3. http://www.jstatsoft.org/v59/i10/paper