In this tutorial we will explore and analyse the results of kraken2 read profiling data using the re-estimatd abundances of bracken. These exercises are meant to show how to conceptually approach your data analysis but there are many more and different ways to explore your data. The most important thing to keep in mind is that you have to understand your own data and analyses. One way to achieve this is to perform visual explorations that help you to judge whether the data are appropriate for your question.

Now let’s start the fun!

Load libraries

library(tidyverse)
library(vegan)
library(coin)
library(pheatmap)

1. Load data

First we need to load our data. Usually the biggest bottleneck between raw data and analyses is to get the data in the right shape for your purpose. Often this requires a little bit of data mingling. On this road - google is your best friend to master the R universe :)

Let’s first load the relative abundance table of the bracken results.

brel<-read_csv(file = "merged_rel_abund_extended.csv") 

Then we shorten the column names to just keep the sample IDs.

colnames(brel)<-gsub(".breport", "", colnames(brel))

And finally, we are replacing all NAs with 0. This is because by meging the samples into a single dataframes we introduced these NAs.

brel[is.na(brel)] <- 0

Let’s check how the dataframe (tibble) looks like

head(brel)

Looking good!

Since we want to get most from our data and ideally interpret the patterns, we need to import the metadata.

meta<-read_delim(file="./metadata_extended.csv", delim="\t")
head(meta)

2. Basic stats

Before we start anything, let’s just check out or data a little bit. Never go blind into your analyses.

Q1: How many samples do we have in the matadata and in out data tibble

nrow(meta)

# excluding the columns not referring to samples
ncol(brel%>%select(-c("taxlevel","taxid","taxa")))

Since the report file contains all taxonomic levels - this is not really easy to understand. Let’s explore how many taxa we have at each taxonomic level. To do that we will first specify the taxonomic levels we are interested in.

# number of taxa for each taxon level
taxcat<-c("S","G","F","O","C","P","D")

The we will use sapply to iterate a function over our data for each taxonomic level. To understand sapply better check out ?sapply.

Q2: How many families were detected in our dataset?

taxcount<-sapply(taxcat, function(tax){
  nrow(subset(brel, taxlevel == tax) %>% # subset only rows that correspond to tax level
        select(-c("taxid","taxlevel")) %>% # remove the columns taxid and taxlevel and only keep the rest
        filter(rowSums(select_if(., is.numeric)) != 0)) # remove all taxa that are 0 in all samples
})

# check out the number of taxa per taxonomic level
taxcount

3. Pre-process data

Now that we know a little bit about our data we can start selecting what we want to look at. Let’s extract a separate tibble for species, family and phylum level. At each tibble we remove all the taxa that are not present in any of the samples.

Let’s start with species

brel_spec<-subset(brel, taxlevel == "S") %>% 
  select(-c("taxid","taxlevel")) %>%
  filter(rowSums(select_if(., is.numeric)) != 0)
head(brel_spec)

Can you come up with the command for family (F) and phylum (P) level? Tip: make sure you save it to a different variable. Otherwise you are overwriting your species data.

brel_fam<-subset(brel, taxlevel == "F") %>% 
  select(-c("taxid","taxlevel")) %>%
  filter(rowSums(select_if(., is.numeric)) != 0)

brel_phy<-subset(brel, taxlevel == "P") %>% 
  select(-c("taxid","taxlevel")) %>%
  filter(rowSums(select_if(., is.numeric)) != 0)

4. Plot relative abundances

Let’s plot and explore microbial composition and relative abundances in our samples by visualizing them.

As a first overview of coarse differences we can create a stacked bar plot of phyla. We need to mingle our data structure a bit to make it compliant with ggplot and add the metadata to get an extra level of information.

# data mingling
brel_phy_gg1 <- brel_phy %>% gather(key="Sample",value="rel_abun",-taxa)
# now join with metadata by column 'Sample'. We are using left join in case the metadata file contains additional samples not included in our dataset
brel_phy_gg1 <- left_join(brel_phy_gg1,meta,by="Sample") 
# check how it looks
head(brel_phy_gg1) 

# bar plot phyla relative abundances
ggplot(brel_phy_gg1, aes(x=Sample, y=rel_abun)) +
  geom_bar(stat="identity",position="stack", aes(fill=taxa)) + # chose bar plot
  theme(axis.text.x = element_text(angle=45, hjust=1)) + # put x-axis label at 45 degree angle
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x") # produce two panels according to metatadata category 'mocktreat'

Q3: Do the phyla profiles look similar between samples? Can you spot any trends?

If you have time you can also visualize the other taxonomic levels (e.g. species) with the same approach. Try to come up with the code yourself. Hint: Omit legend using legend.position.

# bar plot phyla relative abundances
brel_spec_gg1 <- brel_spec %>% gather(key="Sample",value="rel_abun",-taxa)
brel_spec_gg1 <- left_join(brel_spec_gg1,meta,by="Sample")

ggplot(brel_spec_gg1, aes(x=Sample, y=rel_abun)) +
  geom_bar(stat="identity",position="stack", aes(fill=taxa)) +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        legend.position = "none") +
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x")

Q4: Does the profile change according to taxonomic level? Is the stacked bar plot helpful in all scenarios?

As you might realize, when there are too many taxa it becomes very difficult to spot anything in the stacked bar plot. Another way to visualize the relative abundances is by creating a bubble plot. Let’s do that for the family composition.

# bubble plot
brel_fam_gg1 <- brel_fam %>% gather(key="Sample",value="rel_abun",-taxa)
brel_fam_gg1 <- left_join(brel_fam_gg1,meta,by="Sample")

ggplot(brel_fam_gg1, aes(x=Sample, y=taxa)) +
  geom_point(aes(size=rel_abun, color=timepoint), alpha=0.7) + # this time we use points
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  scale_size_continuous(limits = c(0.00001,max(brel_phy_gg1$rel_abun))) + # sets minimum above '0' 
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x")

Q5: Did you notice that here we added an extra level of information? Can you spot what it is?

If you have time you can play with the different information levels and see what this can tell you about your data.

When we want to look at species composition, the communities are often very complex and looking at abundance profiles is not as informative. One way to explore species-level community composition is to filter you data so you focus only on the most abundant taxa across samples. What cutoff and approach you use here of course depends on your question and data.

Let’s create a heatmap of the 20 most abundant spexies in our samples.

# sort species by abundance across samples and select top 20
brel_spec_gg1<-brel_spec[order(rowSums(select_if(brel_spec, is.numeric)),decreasing=T),] %>%
  head(20) %>%
  column_to_rownames("taxa")

# shape the metadata
meta_h <- subset(meta, Sample %in% colnames(brel_spec_gg1)) %>% 
  column_to_rownames("Sample") # shoft the 'Sample' column to rownames

# plot the heatmap
pheatmap::pheatmap(brel_spec_gg1,
                   cluster_rows = TRUE,
                   cluster_cols = TRUE,
                   annotation_col = meta_h[,c(1,3,4)],
                   annotation_names_col=TRUE)

Q6: what can you learn from the heatmap. Are there any informative clusters?


5. Beta diversity

Often we want to know whether the microbiomes are different between conditions or groups. One way to explore this is to look at the beta-diversity in an ordination. There are different distances and approaches that can be done and explored. We will perform an NMDS on bray curtis dissimilarities of the species profiles.

# To ensure reproducibility we can fix the seed here. This will ensure you always get the same result each time you run your data.
set.seed(34521)

# Data mingling
brel_spec_gg2 <- brel_spec %>% 
  column_to_rownames("taxa") %>% 
  t() # transpose

# Calculate distance matrix
brel_spec_gg2_dist <- vegdist(brel_spec_gg2, method = "bray")

# Perform NMDS on distance matrix
nmds_spec<-metaMDS(brel_spec_gg2_dist,distance = "bray",k = 2)

# Check the output
nmds_spec

Here you see a kind of summary of the analysis. For example, you can see that you used 2 dimensions and the stress was approx. 0.15. In general if a stress is above 0.2 then the clustering is not reliably representing the data and should be interpreted with caution. But here the stress is below 0.2, so we are okay.

Now let’s look at the ordination. To plot the data with ggplot, we need to extract the coordinaties of each point from nmds_spec$points.

# Extract and reshape the data to plot ordination as ggplot  and add the metadata
nmds_spec_gg<-as.data.frame(nmds_spec$points) %>%
  rownames_to_column("Sample") %>%
  left_join(meta, by="Sample")

Then we can greate the plot easily and color according to the metadata. We are choosing timepoint and mocktreat for the coloring respectively. But feel free to explore other parameters.

# Let's plot and color according to time point
ggplot(nmds_spec_gg, aes(x=MDS1,y=MDS2)) +
  geom_point(aes(color=timepoint), size=3, alpha=0.5) +
  ggtitle("NMDS colored according to timepoint")
# Now let's plot according to mocktreat
ggplot(nmds_spec_gg, aes(x=MDS1,y=MDS2)) +
  geom_point(aes(color=mocktreat), size=3, alpha=0.5)+
  ggtitle("NMDS colored according to mocktreat")

Q7: Can you spot clusters? Do they correspond to metadata groupings or conditions?

If we want to test whether the variables significantly explain some of the variance we see, we can perform an Permutational analysis of variance (PERMANOVA) using the adonis2 package. Check out ?adonis2 is you want to learn more.

To test the overall effect of the timepoint we can run the following

set.seed(34521) # set seed for reproducibility

# PERMANOVA
adonis2(brel_spec_gg2_dist ~ timepoint, data = as.data.frame(nmds_spec_gg))

How would you run the PERMANOVA for mocktreat?

set.seed(34521)
adonis2(brel_spec_gg2_dist ~ mocktreat, data = as.data.frame(nmds_spec_gg))

Q8: Explore both PERMANOVA outputs. Does any of the two factors timepoint and mocktreat significantly explain the variance observed in the data?

6. Differential abundance

Differential abundance analysis is a huge discussion field in the statistical community and we cannot cover this in this course. We will give you an example of a very simple non-parametric approach. However, most of the times your data are much more complex, which will require you to use more sophisticated methods. It is important to discuss this with a statistician or someone with the statistical knowledge about these time of anaylsis. Most importantly, if you want to perform a differential abundance analysis you need to start planning your experimental set up accordingly from the beginning and include proper controls. Always make sure you understand what you are doing, to be able to judge the aproprietness of the method and interpret the results correctly.

6.1. Extract species relative abundance table

First we need to extract the species abundance table.

# create a df with all taxa and all variables
taxtab<-as.data.frame(t(brel_spec)) # transpose the data
names<-as.character(unlist(taxtab["taxa",])) # save column names to variable 
taxtab<-taxtab[!rownames(taxtab) %in% "taxa", ] # only keep rows with sample names
sampnames<-rownames(taxtab) # save rownames to variable
taxtab<-sapply(taxtab, as.character) # transform class of values to characters
taxtab<-as.data.frame(matrix(as.numeric(taxtab), ncol = ncol(taxtab))) # transform to numeric matrix
names(taxtab)<-names # add back the column names
rownames(taxtab)<-sampnames # add back the rownames

# add metadata
taxtab<-taxtab %>%
  rownames_to_column("Sample") %>%
  left_join(meta, by="Sample") %>% 
  mutate_if(is.character,as.factor) # make sure variables are factors

Now let’s pick one taxon Muribaculum intestinale and check how the data are distributed.

Q9: Are the data normally distributed?

We can quickly check this for one taxon.

# create a Q-Q plot
qqnorm((taxtab[,3]))
qqline((taxtab[,3]), col = "steelblue")
# plot the data distribution
ggplot(taxtab, aes(x=`Muribaculum intestinale`)) +
  geom_density(alpha = 0.5)

Microbiome data often are not normally distributed, which means assumptions like data normality may not be appropriate. One possibility is to perform a non-parametric test which does not assume normailty of the data. For two-group comparisons this can be a Mann-Whitney-Wilcoxon test (wilcoxon) and for more than 2 groups you could use a Kruskal-Wallis test. These tests can be paired if this information (e.g. the same individual before and after treatment) is avaible but in our case we don’t have this Information.

Let’s practise and perform a wilcoxon test from the coin package on one taxon.

# run wilcoxon test
wtsingle <- coin::wilcox_test(taxtab[,"Muribaculum intestinale"] ~ mocktreat, data = taxtab, confint=TRUE)

# get the p-value
coin::pvalue(wtsingle)

Q10: Is this taxon significantly different between timepoints?

Why is this the case? Let’s look at the box plot to see.

Mint <- taxtab %>% 
  select(c("Muribaculum intestinale","mocktreat")) %>% 
  gather(key=taxa,value = rel_abund,-mocktreat)

ggplot(Mint, aes(x=mocktreat,y=rel_abund)) +
  geom_boxplot(aes(color=mocktreat)) +
  geom_jitter(aes(x=mocktreat,color=mocktreat))

6.2. Perform Wilcoxon test to compare two groups

First we create an empty dataframe to collect the results from the multiple tests for each species.

# save all taxa names
taxnames<-brel_spec$taxa

# empty dataframe, setting up all required columns
df_w<-data.frame(pvalue= vector(length = length(taxnames)), 
                 padjust = vector(length = length(taxnames)))
rownames(df_w)<-taxnames # set rownames to taxa

Then we run the wilcoxon test in a for-loop for each species and save the p-value to our dataframe.

for (taxa in taxnames) {
    wt <- coin::wilcox_test(taxtab[,taxa] ~ mocktreat, data = taxtab)
    df_w[taxa,"pvalue"] <- coin::pvalue(wt)
}

Since now we performed many statistical tests, we need to adjust our p-value for multiple testing. Find out more here ?p.adjust. Here we do this according to the Benjamini-Hochberg (BH) correction.

# Correct p-values for multiple testing (using BH)
df_w$padjust <- p.adjust(df_w$pvalue, method = "BH")
head(df_w)

You can see that we have unadjusted and adjusted p-values for each species.

Q11: How many species were differentially abundant after p-vajue correction? Which species?

# only significant taxa
df_w_sig <- subset(df_w, padjust<0.05)
head(df_w_sig)

sigtax<-rownames(df_w_sig)
sigtax

Now plot the significantly different species as boxplot to see the shift.

sigtax<-taxtab %>%
  select(c(all_of(sigtax),mocktreat)) %>%
  gather(key=taxa,value = rel_abund,-mocktreat)

ggplot(sigtax, aes(x=mocktreat,y=rel_abund)) +
  geom_boxplot(aes(color=mocktreat)) +
  theme(axis.text.x = element_text(angle=45,hjust = 1)) +
  geom_jitter(aes(color=mocktreat), alpha=0.5, size=2) +
  ggtitle("Significantly different taxa") +
  facet_grid(. ~ taxa)

Q12: What do you see in the box plots?