R ships with a set of built-in datasets. One of them, ChickWeight, records chick weights from a weight-gain experiment that tested different diets.
This worked example showcases some features of the tidyverse, in particular the pipe operator %>%.
data("ChickWeight")
Let’s see how we usually work with base R:
myWeightData <- ChickWeight
# Add column
myWeightData$randomNumber <- rnorm(nrow(myWeightData))
# Bin the new column
myWeightData$bins <- cut(myWeightData$randomNumber, breaks = 3, labels = c("Small", "Medium", "Large"))
head(myWeightData)
## weight Time Chick Diet randomNumber bins
## 1 42 0 1 1 0.55060789 Medium
## 2 51 2 1 1 -0.79899546 Small
## 3 59 4 1 1 0.02919268 Medium
## 4 64 6 1 1 0.38677273 Medium
## 5 76 8 1 1 -1.11642777 Small
## 6 93 10 1 1 0.94456266 Medium
First we need to load the tidyverse library:
library(tidyverse)
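Now we can reproduce the data frame using the %>% operator (the pipe). Note that cut_number() makes bins holding roughly equal numbers of observations, whereas cut() with breaks = 3 makes three equal-width intervals, so the two versions will not always label a value the same way.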
myTidyWeight <- ChickWeight %>%
add_column(randomNumber = rnorm(nrow(ChickWeight))) %>%
mutate(bucket = cut_number(randomNumber, n = 3, labels = c("Small", "Medium", "Large")))
What does it look like?
head(myTidyWeight)
## weight Time Chick Diet randomNumber bucket
## 1 42 0 1 1 -0.8122840 Small
## 2 51 2 1 1 -1.1642838 Small
## 3 59 4 1 1 -1.3037687 Small
## 4 64 6 1 1 0.5271401 Large
## 5 76 8 1 1 -0.9956681 Small
## 6 93 10 1 1 -1.1930407 Small
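The random numbers in the two tables differ because rnorm() draws fresh values on every call. As a minimal sketch (the seed value 42 and the object names base_version and piped_version are illustrative assumptions, not part of the original example), fixing the seed before each draw shows that the base R and piped versions build the same column:

```r
library(dplyr)
library(tibble)

# Base R: copy the data and add a column with `$<-`.
set.seed(42)  # arbitrary seed, chosen only for illustration
base_version <- ChickWeight
base_version$randomNumber <- rnorm(nrow(base_version))

# Tidyverse: the pipe passes ChickWeight as the first argument of add_column().
set.seed(42)  # same seed, so rnorm() returns the same draws
piped_version <- ChickWeight %>%
  add_column(randomNumber = rnorm(nrow(ChickWeight)))

# Compare the two new columns.
all.equal(base_version$randomNumber, piped_version$randomNumber)
```

In general, x %>% f(y) is evaluated as f(x, y), which is why the piped code reads top to bottom instead of inside-out.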
We can use some verbs from the tidyverse to manipulate our dataset after converting it to a tibble: as_tibble(), filter(criteria) and arrange().
tibbleChicks <- as_tibble(myTidyWeight)
tibbleChicks %>%
filter(Time == 0 & Diet == 2) %>%
arrange(weight, desc(randomNumber) )
## # A tibble: 10 x 6
## weight Time Chick Diet randomNumber bucket
## <dbl> <dbl> <ord> <fct> <dbl> <fct>
## 1 39 0 27 2 0.720 Large
## 2 39 0 28 2 0.444 Medium
## 3 39 0 29 2 -1.31 Small
## 4 40 0 21 2 1.76 Large
## 5 40 0 25 2 -0.359 Medium
## 6 41 0 22 2 0.608 Large
## 7 42 0 26 2 0.622 Large
## 8 42 0 30 2 0.559 Large
## 9 42 0 24 2 0.0306 Medium
## 10 43 0 23 2 -1.06 Small
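The following sketch looks a little closer at the two verbs used above. It works on the plain ChickWeight data so that it runs on its own; the names chicks, with_comma and with_amp are illustrative. In filter(), conditions separated by commas are combined with a logical AND, so they should select the same rows as a single condition joined with &.

```r
library(dplyr)

chicks <- as_tibble(ChickWeight)

# Comma-separated conditions are ANDed together, like `&`.
with_comma <- chicks %>% filter(Time == 0, Diet == 2)
with_amp   <- chicks %>% filter(Time == 0 & Diet == 2)
all.equal(with_comma, with_amp)

# arrange() sorts ascending by default; desc() reverses the order for one column.
with_comma %>% arrange(desc(weight))
```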
We will start by looking at the raw data graphically, using some relatively simple plots from the ggplot2 package. At this stage, don’t worry too much about the details of the commands; just try to build your own understanding.
ggplot(ChickWeight) +
aes(Time, weight) +
geom_point()
From the above scatter plot we can see that in general chick weights (vertical axis) increase over time (horizontal axis).
What’s the effect of the diet? We can colour the dots according to the Diet and add some “jitter”, to reduce the overlap in dense areas.
ggplot(ChickWeight) +
aes(Time, weight, colour = Diet) +
geom_jitter()
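By default geom_jitter() moves points a little in both directions, which also shifts the plotted weights slightly. A possible refinement, sketched here with arbitrary illustrative values for width and alpha, is to jitter only along the time axis and make the points semi-transparent:

```r
library(ggplot2)

p_jitter <- ggplot(ChickWeight) +
  aes(Time, weight, colour = Diet) +
  # Jitter horizontally only; height = 0 leaves the measured weights unchanged.
  geom_jitter(width = 0.5, height = 0, alpha = 0.6)

p_jitter
```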
Overlapping is not a major issue here, but the plot looks like four hives of bees spreading out, so it is still not easy to see the effect of diet. facet_wrap() allows us to split our data into panels. Since the split is done on the Diet column, we can hide the legend.
ggplot(ChickWeight) +
aes(Time, weight, color = Diet) +
geom_point(show.legend = FALSE) +
facet_wrap(~Diet)
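facet_wrap() also takes layout and labelling arguments. As a small sketch (the object name p_facets is illustrative), nrow = 1 places the four panels side by side, which makes the weight axes easy to compare, and label_both adds the variable name to each panel title:

```r
library(ggplot2)

p_facets <- ggplot(ChickWeight) +
  aes(Time, weight, colour = Diet) +
  geom_point(show.legend = FALSE) +
  # One row of panels; strip labels show both the variable name and its value.
  facet_wrap(~Diet, nrow = 1, labeller = label_both)

p_facets
```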
If we want a line graph, we need to change the geometry. Since we are interested in the growth of each individual, we need to change the coloring too.
ggplot(ChickWeight) +
aes(Time, weight, color = Chick) +
geom_line(show.legend = FALSE) +
facet_wrap(~Diet)
This is better. Another way to draw one line per chick without needing a legend is to use the group aesthetic instead of colour.
ggplot(ChickWeight, aes(Time, weight, group = Chick)) + geom_line() +
facet_wrap(~Diet)
Oh… but now we’ve lost the colours.
ggplot(ChickWeight, aes(Time, weight, group = Chick, colour=Chick)) + geom_line() +
facet_wrap(~Diet)
Perhaps the aesthetics (aes) need to be in the geom_line part.
ggplot(ChickWeight) +
aes(Time, weight) +
geom_line(aes(group = Chick, colour=Chick)) +
facet_wrap(~Diet)
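Aesthetics set in the top-level aes() are inherited by every layer, while those set inside a geom apply to that layer only. The sketch below uses this to draw the individual chicks in grey and overlay a coloured mean line per diet. The grey shade and the name p_layers are illustrative, and the fun argument of stat_summary() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

p_layers <- ggplot(ChickWeight) +
  aes(Time, weight) +                                 # inherited by all layers
  geom_line(aes(group = Chick), colour = "grey70") +  # one grey line per chick
  # Mean weight at each time point, coloured by diet; applies to this layer only.
  stat_summary(aes(colour = Diet), fun = mean, geom = "line",
               show.legend = FALSE) +
  facet_wrap(~Diet)

p_layers
```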
## Combining manipulation and plotting
Here we collapse the tibble by Diet and Time, adding the mean and standard error.
# We'll create a dataframe that summarizes (i.e., mean and standard error) the data by both diet group and time point
ChickSummary <- ChickWeight %>%
group_by(Diet, Time) %>%
summarise(Mean = mean(weight), StdErr = sd(weight)/(sqrt(n())))
# Let's take a peek at the data
head(ChickSummary)
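The standard error used here is the sample standard deviation divided by the square root of the group size. As a quick sanity check, one cell of the summary (Diet 1 at Time 0) can be recomputed by hand with base R. This is only a sketch: the names w, by_hand and from_summary are illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

ChickSummary <- ChickWeight %>%
  group_by(Diet, Time) %>%
  summarise(Mean = mean(weight),
            StdErr = sd(weight) / sqrt(n()),
            .groups = "drop")  # return an ungrouped tibble

# Recompute one cell by hand: Diet 1 at Time 0.
w <- ChickWeight$weight[ChickWeight$Diet == 1 & ChickWeight$Time == 0]
by_hand <- sd(w) / sqrt(length(w))

from_summary <- ChickSummary %>%
  filter(Diet == 1, Time == 0) %>%
  pull(StdErr)

all.equal(by_hand, from_summary)
```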
Dodging preserves the vertical position of a geom while adjusting its horizontal position (position_dodge()).
ggplot(ChickSummary) +
aes(x=Diet, y=Mean, fill=as.factor(Time) ) +
geom_bar(stat="identity", position=position_dodge()) +
scale_fill_brewer(palette="Paired") +
ggtitle("Chick Weight by Group") +
guides(fill=guide_legend(title="Time points"))
Let’s filter the first and last time points:
# Keep only the first and last time points, then summarise (mean and standard error) by diet group and time point
ChickExtremeSummary <- filter(ChickWeight, (Time == 0 ) | (Time == 21) ) %>%
group_by(Diet, Time) %>%
summarise(Mean = mean(weight),se = sd(weight)/(sqrt(n())))
# Let's take a peek at the data
head(ChickExtremeSummary)
And plot the result again, this time adding the error bars:
ggplot(ChickExtremeSummary) +
aes(x=Diet, y=Mean, fill=as.factor(Time) ) +
geom_bar(stat="identity", position=position_dodge()) +
scale_fill_brewer(palette="Paired") +
geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=.2,
position=position_dodge(.9)) +
ggtitle("Chick Weight by Group") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill=guide_legend(title="Time points"))
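The error bars line up with the bars because both layers are dodged by the same amount: bars are 0.9 wide by default, which matches position_dodge(.9). One way to make that link explicit, sketched here with the illustrative names extremes, dodge and p_bars, is to create a single dodge object and pass it to both layers. The sketch also uses geom_col(), which is shorthand for geom_bar(stat = "identity"), and assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)
library(ggplot2)

extremes <- ChickWeight %>%
  filter(Time %in% c(0, 21)) %>%
  group_by(Diet, Time) %>%
  summarise(Mean = mean(weight),
            se = sd(weight) / sqrt(n()),
            .groups = "drop")

# One shared dodge object keeps bars and error bars aligned.
dodge <- position_dodge(width = 0.9)

p_bars <- ggplot(extremes) +
  aes(x = Diet, y = Mean, fill = as.factor(Time)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = Mean - se, ymax = Mean + se),
                width = 0.2, position = dodge) +
  labs(fill = "Time points")

p_bars
```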
To continue your exploration of the tidyverse, you can use some resources from DataCamp that use the Gapminder data set. We knitted the first two here: