Load the dataset

R ships with a set of built-in datasets. One of them, ChickWeight, records chick weights from a weight gain experiment that tested different diets.

This worked example showcases some features of the tidyverse.

data("ChickWeight")
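Before changing anything, it can help to check what was loaded. The following is a minimal sketch that uses only base R functions to look at the size and structure of the dataset:

```r
data("ChickWeight")

# Dimensions: one row per chick per measurement
dim(ChickWeight)

# Column names and types
str(ChickWeight)

# Number of chicks, number of diets, and the distinct time points
length(unique(ChickWeight$Chick))
nlevels(ChickWeight$Diet)
sort(unique(ChickWeight$Time))
```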

The “Vanilla R” way

Let's see how we usually work with R:

myWeightData <- ChickWeight

# Add column
myWeightData$randomNumber <- rnorm(nrow(myWeightData))

# Bin the new column
myWeightData$bins <- cut(myWeightData$randomNumber, breaks = 3, labels = c("Small", "Medium", "Large"))

head(myWeightData)
##   weight Time Chick Diet randomNumber   bins
## 1     42    0     1    1   0.55060789 Medium
## 2     51    2     1    1  -0.79899546  Small
## 3     59    4     1    1   0.02919268 Medium
## 4     64    6     1    1   0.38677273 Medium
## 5     76    8     1    1  -1.11642777  Small
## 6     93   10     1    1   0.94456266 Medium

The “Tidyverse” way

First we need to load the tidyverse package:

library(tidyverse)

Now we can build the same kind of data frame using the %>% operator (the pipe).

myTidyWeight <- ChickWeight %>%
  add_column(randomNumber = rnorm(nrow(ChickWeight))) %>%
  mutate(bucket = cut_number(randomNumber, n = 3, labels = c("Small", "Medium", "Large")))

What does it look like?

head(myTidyWeight)
##   weight Time Chick Diet randomNumber bucket
## 1     42    0     1    1   -0.8122840  Small
## 2     51    2     1    1   -1.1642838  Small
## 3     59    4     1    1   -1.3037687  Small
## 4     64    6     1    1    0.5271401  Large
## 5     76    8     1    1   -0.9956681  Small
## 6     93   10     1    1   -1.1930407  Small
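The two tables are not expected to match row for row. rnorm() draws fresh numbers on every call, and the binning functions differ: cut() with breaks = 3 makes three equal-width intervals, whereas cut_number() from ggplot2 makes three bins with roughly equal counts. The sketch below illustrates the difference on a throwaway vector; the seed value and the names x, equalWidth and equalCount are arbitrary choices made for this illustration.

```r
library(ggplot2)  # provides cut_number()

set.seed(42)      # arbitrary seed, only so the draw is repeatable
x <- rnorm(300)

binLabels <- c("Small", "Medium", "Large")

# Equal-width bins: the range of x is split into 3 intervals of equal length
equalWidth <- cut(x, breaks = 3, labels = binLabels)

# Equal-count bins: breaks are placed at quantiles of x
equalCount <- cut_number(x, n = 3, labels = binLabels)

table(equalWidth)  # counts usually differ a lot between bins
table(equalCount)  # counts are (nearly) the same in each bin
```

Calling set.seed() before rnorm() in the examples above would likewise make their output repeatable.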

Piping hot

We can use some verbs from the tidyverse to manipulate our dataset, after converting it to a tibble.

tibbleChicks <- as_tibble(myTidyWeight)
tibbleChicks %>%
  filter(Time == 0 & Diet == 2) %>%
  arrange(weight, desc(randomNumber) )
## # A tibble: 10 x 6
##    weight  Time Chick Diet  randomNumber bucket
##     <dbl> <dbl> <ord> <fct>        <dbl> <fct> 
##  1     39     0 27    2           0.720  Large 
##  2     39     0 28    2           0.444  Medium
##  3     39     0 29    2          -1.31   Small 
##  4     40     0 21    2           1.76   Large 
##  5     40     0 25    2          -0.359  Medium
##  6     41     0 22    2           0.608  Large 
##  7     42     0 26    2           0.622  Large 
##  8     42     0 30    2           0.559  Large 
##  9     42     0 24    2           0.0306 Medium
## 10     43     0 23    2          -1.06   Small
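For comparison, here is a sketch of the same filter-and-sort step written once with dplyr and once with base R indexing. It sorts by weight only, so that it does not depend on the randomNumber column; the names viaDplyr, viaBase and keep are made up for this example.

```r
library(dplyr)

# dplyr: conditions separated by commas are combined with AND
viaDplyr <- ChickWeight %>%
  as_tibble() %>%
  filter(Time == 0, Diet == 2) %>%
  arrange(weight)

# Base R: build a logical index, subset the rows, then reorder them
keep <- ChickWeight$Time == 0 & ChickWeight$Diet == 2
viaBase <- ChickWeight[keep, ]
viaBase <- viaBase[order(viaBase$weight), ]

viaDplyr
viaBase
```

Both versions select the same rows. The piped version reads top to bottom as a sequence of steps, which tends to help once more verbs are chained together.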

Graphical Exploration

We will start by looking at the raw data graphically, using some relatively simple plots from the ggplot2 package. At this stage, don’t worry too much about the details of the commands; just try to build your own understanding.

ggplot(ChickWeight) +
  aes(Time, weight) +
  geom_point()

From the above scatter plot we can see that in general chick weights (vertical axis) increase over time (horizontal axis).

What’s the effect of the diet? We can colour the dots according to the Diet and add some “jitter”, to reduce the overlap in dense areas.

ggplot(ChickWeight) +
  aes(Time, weight, colour = Diet) +
  # geom_jitter() draws the points itself, so geom_point() is not needed as well
  geom_jitter()
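The amount of jitter can be tuned. The sketch below restricts the random offset to the horizontal direction, so that the plotted weights stay exact; the width, alpha and seed values are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

set.seed(1)  # jitter is random; a seed makes the plot repeatable

p <- ggplot(ChickWeight) +
  aes(Time, weight, colour = Diet) +
  # width/height set the maximum offset in data units; height = 0 keeps y exact
  geom_jitter(width = 0.5, height = 0, alpha = 0.6)

p
```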

Overlapping is not a major issue here, but the plot looks like four hives of bees spreading out, so it is still not easy to see the effect of diet. facet_wrap() allows us to split our data into panels. And since the split is done on the Diet column, we can hide the legend.

ggplot(ChickWeight) +
  aes(Time, weight, color = Diet) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~Diet)

If we want a line graph, we need to change the geometry. Since we are interested in the growth of each individual, we need to change the coloring too.

ggplot(ChickWeight) +
  aes(Time, weight, color = Chick) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~Diet)

This is better. Another way to get one line per chick without a legend is to map only the group aesthetic:

ggplot(ChickWeight, aes(Time, weight, group = Chick)) + geom_line() + 
  facet_wrap(~Diet)

Oh… but now we’ve lost the colours.

ggplot(ChickWeight, aes(Time, weight, group = Chick, colour=Chick)) + geom_line() + 
  facet_wrap(~Diet)

Perhaps the aesthetics (aes) need to be in the geom_line part.

ggplot(ChickWeight) +
  aes(Time, weight) + 
  geom_line(aes(group = Chick, colour=Chick)) + 
  facet_wrap(~Diet)
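Mapping colour to Chick brings the legend back along with the colours. One possible way to keep the colours while hiding the legend is to switch it off in the theme, as in this sketch:

```r
library(ggplot2)

p <- ggplot(ChickWeight) +
  aes(Time, weight) +
  geom_line(aes(group = Chick, colour = Chick)) +
  facet_wrap(~Diet) +
  theme(legend.position = "none")  # hides every legend on the plot

p
```

Setting show.legend = FALSE inside geom_line(), as in the earlier example, hides the legend for that one layer only, whereas the theme setting applies to the whole plot.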

Combining manipulation and plotting

Here we collapse the tibble by Diet and Time, adding the mean and standard error.

# We'll create a dataframe that summarizes (i.e., mean and standard error) the data by both diet group and time point

ChickSummary <-  ChickWeight  %>%
  group_by(Diet, Time) %>%
  summarise(Mean = mean(weight), StdErr = sd(weight)/(sqrt(n()))) 

# Let's take a peek at the data
head(ChickSummary)
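With two grouping variables, summarise() by default returns a result that is still grouped by the first one (Diet) and prints a message saying so. A sketch of one way to get a plain, ungrouped tibble, assuming dplyr 1.0.0 or later, is to pass .groups = "drop". The extra N column and the name ChickSummaryFlat are additions made for this example.

```r
library(dplyr)

ChickSummaryFlat <- ChickWeight %>%
  group_by(Diet, Time) %>%
  summarise(
    Mean    = mean(weight),
    StdErr  = sd(weight) / sqrt(n()),  # standard error of the mean
    N       = n(),                     # chicks measured in this group
    .groups = "drop"                   # return an ungrouped tibble
  )

head(ChickSummaryFlat)
```

Keeping N next to the standard error makes it easy to spot groups where chicks dropped out before the end of the experiment.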

Dodging, done with position_dodge(), preserves the vertical position of a geom while adjusting its horizontal position, so the bars sit side by side instead of being stacked:

ggplot(ChickSummary) +
  aes(x=Diet, y=Mean, fill=as.factor(Time) ) + 
  geom_bar(stat="identity", position=position_dodge()) +
  scale_fill_brewer(palette="Paired") +
  ggtitle("Chick Weight by Group") + 
  guides(fill=guide_legend(title="Time points"))
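geom_bar(stat = "identity") can also be written as geom_col(), which uses the y values as bar heights directly. The sketch below rebuilds the summary so that it runs on its own, and sets the title and legend title in one labs() call; it is meant as an equivalent spelling, not a different plot.

```r
library(dplyr)
library(ggplot2)

ChickSummary <- ChickWeight %>%
  group_by(Diet, Time) %>%
  summarise(Mean = mean(weight), .groups = "drop")

p <- ggplot(ChickSummary) +
  aes(x = Diet, y = Mean, fill = as.factor(Time)) +
  geom_col(position = position_dodge()) +   # same as geom_bar(stat = "identity")
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Chick Weight by Group", fill = "Time points")

p
```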

Let’s filter the first and last time points:

# Same summary (mean and standard error by diet group and time point), restricted to the first (0) and last (21) time points

ChickExtremeSummary <-  filter(ChickWeight, (Time == 0 ) | (Time == 21) ) %>%
  group_by(Diet, Time) %>%
  summarise(Mean = mean(weight),se = sd(weight)/(sqrt(n()))) 

# Let's take a peek at the data
head(ChickExtremeSummary)

And plot the result again, this time adding the error bars:

ggplot(ChickExtremeSummary) +
  aes(x=Diet, y=Mean, fill=as.factor(Time) ) + 
  geom_bar(stat="identity", position=position_dodge()) +
  scale_fill_brewer(palette="Paired") +
  geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2,
                 position=position_dodge(.9)) +
  ggtitle("Chick Weight by Group") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill=guide_legend(title="Time points"))
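The manipulation and the plot can also be written as a single chain, without storing the intermediate summary. This is a sketch of that style; note that the data steps are joined with %>% while the ggplot2 layers are joined with +.

```r
library(dplyr)
library(ggplot2)

p <- ChickWeight %>%
  filter(Time %in% c(0, 21)) %>%              # first and last time points
  group_by(Diet, Time) %>%
  summarise(Mean = mean(weight),
            se   = sd(weight) / sqrt(n()),
            .groups = "drop") %>%
  ggplot(aes(x = Diet, y = Mean, fill = as.factor(Time))) +
  geom_col(position = position_dodge(0.9)) +
  geom_errorbar(aes(ymin = Mean - se, ymax = Mean + se),
                width = 0.2, position = position_dodge(0.9)) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Chick Weight by Group", fill = "Time points")

p
```

The bars and the error bars use the same dodge width (0.9) so that each error bar stays centred on its bar.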

To continue your exploration of the tidyverse, you can use some resources from DataCamp that use the Gapminder dataset. We knitted the first two here: