November 14, 2015

Setting Expectations

Expect to:

  1. Get a sense of what is possible with R.
  2. Set up important frameworks around how to do data science (forewarning: I'm biased!).
  3. Be warned of potential minefields.

Do Not Expect to:

  1. Completely understand all the code presented on your first try.
  2. Be exposed to all of the features of the mentioned packages.

The Data Scientist's Workflow

R Solutions

Important Operators

The magrittr package brought the pipe to R: %>% is an operator that takes the value on its left and passes it as the first argument to the function on its right.

Pre-Magrittr:
foo_foo <- little_bunny()

bop_on(scoop_up(hop_through(foo_foo, forest), field_mouse), head)

Post-Magrittr:
foo_foo <- little_bunny()

foo_foo %>% 
  hop_through(forest) %>% 
  scoop_up(field_mouse) %>% 
  bop_on(head)
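
The bunny functions above are illustrative and are not defined anywhere, so here is a minimal runnable sketch of the same idea using base R functions. The vector `x` is made up for illustration; it assumes the magrittr package is installed.

```r
library(magrittr)

# Hypothetical input values, chosen only for illustration
x <- c(1, 4, 9, 16)

# Nested style: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped style: read from left to right, top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both styles compute the same value
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is the main readability argument for the operator.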

Preparation: Getting Data into R


Warning:
Each import package has slightly different undesirable "features", so always check your data frame after loading it to make sure you're using the best option for your data set!
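
As one illustration of why checking matters, here is a small sketch using base R's read.csv() (not necessarily one of the packages this slide had in mind). The CSV text and column names are made up. By default the reader guesses column types, so a ZIP code with a leading zero comes back as an integer.

```r
# Hypothetical CSV content, supplied inline so the example is self-contained
csv_text <- "id,zip,visit_date
1,02134,2015-11-14
2,10001,2015-11-15"

# Default import: column types are guessed from the values
raw <- read.csv(text = csv_text)
str(raw)  # inspect the types before doing anything else

# The zip column was read as an integer, so the leading zero is gone
raw$zip

# One possible fix: state the type of that column explicitly
fixed <- read.csv(text = csv_text, colClasses = c(zip = "character"))
fixed$zip
```

Calling str() (or head() and summary()) right after import is a cheap habit that catches this kind of surprise early.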

Preparation: Data Manipulation

Especially with health care data (e.g., EHR data) or other secondary uses of data that were not collected primarily for analysis, the vast majority of my time is spent on data cleaning and data manipulation tasks.


Preparation: Data Manipulation

tbl_df()


filter()

select()

summarise()

mutate()
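
A minimal sketch of these verbs on a small, made-up data set (the `visits` table and its columns are hypothetical). It assumes dplyr is installed. Note that tbl_df() is deprecated in current dplyr releases in favor of as_tibble(), so the sketch uses the latter.

```r
library(dplyr)

# Hypothetical visit-level data, for illustration only
visits <- as_tibble(data.frame(
  patient = c("a", "a", "b", "b", "c"),
  age     = c(34, 34, 61, 61, 47),
  charge  = c(120, 80, 300, 50, 210),
  stringsAsFactors = FALSE
))

# filter(): keep rows that satisfy a condition
expensive <- visits %>% filter(charge > 100)

# select(): keep only the named columns
slim <- visits %>% select(patient, charge)

# summarise(): collapse the table to one row of summary values
overall <- visits %>% summarise(total = sum(charge), n_visits = n())

# mutate(): add a new column computed from existing ones
with_k <- visits %>% mutate(charge_k = charge / 1000)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them chain cleanly with %>%.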

Preparation: Data Manipulation

dplyr also supports grouping, which alters the behavior of the previous verbs so that they operate within each group. To group by a variable, simply add a group_by() call before the verb.

group_by() %>% summarise()


group_by() %>% mutate()
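
A short sketch of both grouped patterns, again on made-up data (the `visits` table and its columns are hypothetical) and assuming dplyr is installed. With summarise() the result has one row per group; with mutate() every original row is kept and the computation is done within each group.

```r
library(dplyr)

# Hypothetical visit-level data, for illustration only
visits <- data.frame(
  patient = c("a", "a", "b", "b", "c"),
  charge  = c(120, 80, 300, 50, 210),
  stringsAsFactors = FALSE
)

# group_by() %>% summarise(): one row per patient
by_patient <- visits %>%
  group_by(patient) %>%
  summarise(total = sum(charge))

# group_by() %>% mutate(): all rows kept, sum() is computed per patient
with_share <- visits %>%
  group_by(patient) %>%
  mutate(share = charge / sum(charge)) %>%
  ungroup()  # drop the grouping so later verbs act on the whole table
```

Ending a grouped mutate() with ungroup() is a defensive habit: a grouping left in place silently changes what later verbs do.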


Preparation: Data Manipulation

Primary Functions:

gather()


spread()
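
gather() and spread() come from the tidyr package. Below is a minimal sketch on a made-up blood pressure table (the data and column names are hypothetical), assuming tidyr is installed. In newer tidyr releases these two functions are superseded by pivot_longer() and pivot_wider(), but they still work.

```r
library(tidyr)

# Hypothetical wide table: one row per patient, one column per measurement
wide <- data.frame(
  patient = c("a", "b"),
  sbp     = c(120, 135),
  dbp     = c(80, 85),
  stringsAsFactors = FALSE
)

# gather(): wide to long, one row per patient-measurement pair
long <- gather(wide, key = "measure", value = "value", sbp, dbp)

# spread(): long back to wide, one column per distinct value of `measure`
back <- spread(long, key = measure, value = value)
```

Long format tends to suit grouped summaries and plotting, while wide format is often easier to read or to hand to modeling functions.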