November 14, 2015

Setting Expectations

Expect to:

  1. Get a sense of what is possible with R.
  2. Set up important frameworks around how to do data science (forewarning - I'm biased!)
  3. Be warned of potential minefields

Do Not Expect to:

  1. Completely understand all the code presented on your first try.
  2. Be exposed to all of the features of the mentioned packages.

The Data Scientist's Workflow

R Solutions

Important Operators

The magrittr package introduced the concept of a pipe (%>%) - an operator that takes the data on the left and passes it as the first argument to the function on the right.

Pre-Magrittr:
foo_foo <- little_bunny()

bop_on(scoop_up(hop_through(foo_foo, forest), field_mouse), head)

Post-Magrittr:
foo_foo <- little_bunny()

foo_foo %>% 
  hop_through(forest) %>% 
  scoop_up(field_mouse) %>% 
  bop_on(head)
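
The bunny functions above are made up, so here is a minimal sketch of the same idea with real base R functions. The vector x is just an illustrative input; both forms should give the same answer.

```r
library(magrittr)

x <- c(4, 9, 16)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read top to bottom, each result becomes the
# first argument of the next function
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```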

Preparation: Getting Data into R


Warning:
Each package has slightly different undesirable "features" so always check your data frame to make sure you're using the best option for your data set!
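
One way to run that check, sketched with base R's read.csv() standing in for whichever import package you use. The file and its columns (zip, visits) are invented for illustration; the point is to look at the column types with str() right after import.

```r
# Write a tiny CSV to a temporary file (hypothetical data)
csv <- tempfile(fileext = ".csv")
writeLines(c("zip,visits", "01234,3", "90210,5"), csv)

# Default import: zip is guessed to be an integer,
# so the leading zero is silently dropped
raw <- read.csv(csv)
str(raw)

# Telling the reader that zip is text keeps the value intact
fixed <- read.csv(csv, colClasses = c(zip = "character"))
str(fixed)
```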

Preparation: Data Manipulation

Especially with health care data (such as EHR data) or any other data put to secondary use, meaning it was not collected primarily for analysis, the vast majority of my time is spent on data cleaning and data manipulation tasks.


Preparation: Data Manipulation

tbl_df()


filter()

select()

summarise()

mutate()
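
A minimal sketch of these verbs chained together, assuming the built-in mtcars data set. The threshold and the new column name (wt_lb) are arbitrary choices for illustration.

```r
library(dplyr)

efficient <- mtcars %>%
  filter(mpg > 25) %>%          # keep only rows that meet a condition
  select(mpg, cyl, wt) %>%      # keep only the named columns
  mutate(wt_lb = wt * 1000)     # add a column; wt is recorded in 1000 lb

# summarise() collapses the data to a single row of statistics
overall <- mtcars %>%
  summarise(mean_mpg = mean(mpg), n = n())
```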

Preparation: Data Manipulation

dplyr also supports grouping, which alters the behavior of the previous verbs. To group by a variable, simply add a group_by() call before the verb.

group_by() %>% summarise()


group_by() %>% mutate()
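
A short sketch of both patterns, again assuming mtcars. The column name mpg_vs_group is made up for the example.

```r
library(dplyr)

# group_by() %>% summarise(): one row per group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

# group_by() %>% mutate(): every row is kept, but mean(mpg)
# is computed within each cyl group rather than over all rows
centered <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_vs_group = mpg - mean(mpg)) %>%
  ungroup()
```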


Preparation: Data Manipulation

Primary Functions:

gather()


spread()


separate()




unite()
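
These four functions come from tidyr. Here is a rough sketch of each on a tiny invented data frame (the patient and visit columns are hypothetical). Note that newer tidyr releases point users to pivot_longer() and pivot_wider() in place of gather() and spread(), although the older functions still work.

```r
library(tidyr)

wide <- data.frame(
  patient = c("a", "b"),
  visit_2015_01 = c(120, 135),
  visit_2015_02 = c(118, 130)
)

# gather(): wide to long, one row per patient per visit
long <- gather(wide, key = "visit", value = "sbp", -patient)

# separate(): split one column into several on a delimiter
parts <- separate(long, visit, into = c("label", "year", "month"), sep = "_")

# unite(): paste several columns back into one
joined <- unite(parts, "period", year, month, sep = "-")

# spread(): long back to wide
back <- spread(long, key = visit, value = sbp)
```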


Preparation: Data Manipulation


stringr and lubridate ease working with strings and dates, respectively.



Although the string functions in R are actually not bad, some of them behave in unexpected ways. stringr tries to fix these, making string handling easier and more intuitive.
Dates in R can be a mess. lubridate makes it easy to manipulate dates and to do hard things like working with date ranges and comparing dates.

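Please note that stringr uses ICU regular expressions, which means patterns can behave differently than the base R documentation would lead you to expect. The regex() function lets you adjust matching options such as case sensitivity, but as far as I know it does not switch engines; if you need Perl-compatible regular expressions, base R functions such as grepl() and gsub() accept perl = TRUE.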
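
A small sketch of the two packages working together. The notes vector is invented for illustration; the functions are ones I believe exist in current stringr and lubridate releases.

```r
library(stringr)
library(lubridate)

notes <- c("  Admit: 2015-11-01 ", "discharge: 2015-11-14")

# stringr: trim whitespace, normalise case, match and extract
clean <- str_to_lower(str_trim(notes))
has_admit <- str_detect(clean, "^admit")
any_admit <- str_detect(notes, regex("admit", ignore_case = TRUE))
dates <- ymd(str_extract(clean, "\\d{4}-\\d{2}-\\d{2}"))

# lubridate: date arithmetic, ranges, and comparisons
length_of_stay <- as.numeric(dates[2] - dates[1])   # in days
stay <- interval(dates[1], dates[2])
in_stay <- ymd("2015-11-10") %within% stay
```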

R Solutions

Analysis: Model Tidying

mtcars %>% 
  lm(formula = mpg ~ cyl + disp) %>% 
  summary()
## 
## Call:
## lm(formula = mpg ~ cyl + disp, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4213 -2.1722 -0.6362  1.1899  7.0516 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
## cyl         -1.58728    0.71184  -2.230   0.0337 *  
## disp        -0.02058    0.01026  -2.007   0.0542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.055 on 29 degrees of freedom
## Multiple R-squared:  0.7596, Adjusted R-squared:  0.743 
## F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

Analysis: Model Tidying



mtcars %>% 
  lm(formula = mpg ~ cyl + disp) %>% 
  tidy()

mtcars %>% 
  lm(formula = mpg ~ cyl + disp) %>% 
  glance()
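
The payoff is that tidy() and glance() (both from the broom package) return ordinary data frames, so the usual tools apply. A minimal sketch, assuming the same model as above; the 0.05 cutoff is only for illustration.

```r
library(broom)

fit <- lm(mpg ~ cyl + disp, data = mtcars)

coefs <- tidy(fit)     # one row per model term
stats <- glance(fit)   # one row of model-level statistics

# Because coefs is a data frame, subsetting works as usual
significant <- coefs[coefs$p.value < 0.05, c("term", "estimate", "p.value")]
```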

Analysis: Graphs



ggplot(data, aes(x = F, y = a)) + geom_point()

Analysis: Graphs



ggplot(data, aes(x = F, y = a, color = F, size = A)) + geom_point()
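
The calls above use placeholder names. A runnable sketch of the same two steps, assuming mtcars and an arbitrary choice of columns, might look like this:

```r
library(ggplot2)

# Basic scatter plot: map two columns to x and y
p_basic <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point()

# Same plot with two more aesthetics mapped to data;
# factor() makes cyl a discrete colour scale
p_rich <- ggplot(mtcars, aes(x = disp, y = mpg,
                             color = factor(cyl), size = wt)) +
  geom_point() +
  labs(color = "cyl")

print(p_rich)
```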

R Solutions

Dissemination: Reports

knitr takes an R Markdown document (Markdown-formatted text interspersed with R code), runs the code, and turns the result into a finished document.
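
A rough sketch of that round trip, done entirely from R so it is self-contained. The report text is invented, and the chunk delimiter is built with strrep() only so the example can be shown inside a code block; in practice you would write the .Rmd file by hand.

```r
library(knitr)

fence <- strrep("`", 3)   # the three-backtick chunk delimiter

# A tiny R Markdown document: prose plus one R chunk
rmd <- c(
  "# Fuel economy",
  "",
  "Mean mpg by cylinder count:",
  "",
  paste0(fence, "{r}"),
  "aggregate(mpg ~ cyl, data = mtcars, FUN = mean)",
  fence
)

input <- tempfile(fileext = ".Rmd")
output <- tempfile(fileext = ".md")
writeLines(rmd, input)

# knit() runs each chunk and weaves code and results into Markdown
knit(input, output = output, quiet = TRUE)
report <- readLines(output)
```

From there, a tool such as the rmarkdown package can convert the Markdown into other formats.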



Dissemination: Web Apps