In this tutorial we will explore and analyse the results of kraken2 read profiling data using the re-estimatd abundances of bracken. These exercises are meant to show how to conceptually approach your data analysis but there are many more and different ways to explore your data. The most important thing to keep in mind is that you have to understand your own data and analyses. One way to achieve this is to perform visual explorations that help you to judge whether the data are appropriate for your question.

Now let’s start the fun!

Load libraries

library(tidyverse)
library(vegan)
library(coin)
library(pheatmap)

1. Load data

First we need to load our data. Usually the biggest bottleneck between raw data and analyses is to get the data in the right shape for your purpose. Often this requires a little bit of data mingling. On this road - google is your best friend to master the R universe :)

Let’s first load the relative abundance table of the bracken results.

brel<-read_csv(file = "merged_rel_abund_extended.csv")

Then we shorten the column names to just keep the sample IDs.

colnames(brel)<-gsub(".breport", "", colnames(brel))

And finally, we are replacing all NAs with 0. This is because by meging the samples into a single dataframes we introduced these NAs.

brel[is.na(brel)] <- 0

Let’s check how the dataframe (tibble) looks like

head(brel)

## # A tibble: 6 x 35
##   taxlevel  taxid taxa  Sample9 Sample8 Sample7 Sample6 Sample5 Sample4 Sample32
##   <chr>     <dbl> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
## 1 R        1.00e0 root    100     100     100     100     100     100      100  
## 2 R1       1.32e5 cell…   100     100     100     100     100     100      100  
## 3 D        2.00e0 Bact…    99.8    99.8    99.1    98.9    99.1    99.2     98.9
## 4 D1       1.78e6 FCB …    72.8    76.0    35.3    77.3    65.2    57.4     85.0
## 5 D2       6.83e4 Bact…    72.8    76.0    35.3    77.3    65.2    57.4     85.0
## 6 P        9.76e2 Bact…    72.8    76.0    35.3    77.3    65.2    57.4     85.0
## # … with 25 more variables: Sample31 <dbl>, Sample30 <dbl>, Sample3 <dbl>,
## #   Sample29 <dbl>, Sample28 <dbl>, Sample27 <dbl>, Sample26 <dbl>,
## #   Sample25 <dbl>, Sample24 <dbl>, Sample23 <dbl>, Sample22 <dbl>,
## #   Sample21 <dbl>, Sample20 <dbl>, Sample2 <dbl>, Sample19 <dbl>,
## #   Sample18 <dbl>, Sample17 <dbl>, Sample16 <dbl>, Sample15 <dbl>,
## #   Sample14 <dbl>, Sample13 <dbl>, Sample12 <dbl>, Sample11 <dbl>,
## #   Sample10 <dbl>, Sample1 <dbl>

Looking good!

Since we want to get most from our data and ideally interpret the patterns, we need to import the metadata.

meta<-read_delim(file="./metadata_extended.csv", delim="\t")
head(meta)

## # A tibble: 6 x 5
##   Sample   group MouseID        timepoint mocktreat
##   <chr>    <chr> <chr>          <chr>     <chr>    
## 1 Sample18 Yp    1              Pre-Abx   A        
## 2 Sample1  Yp    8              Pre-Abx   A        
## 3 Sample2  Yc    2              Pre-Abx   B        
## 4 Sample3  Yc    5              Pre-Abx   A        
## 5 Sample4  Yc    8              Pre-Abx   B        
## 6 Sample5  Pools old feac water Pre-Abx   B

2. Basic stats

Before we start anything, let’s just check out or data a little bit. Never go blind into your analyses.

Q1: How many samples do we have in the matadata and in out data tibble

nrow(meta)

## [1] 32

# excluding the columns not referring to samples
ncol(brel%>%select(-c("taxlevel","taxid","taxa")))

## [1] 32

A1: We have 32 samples in metadata and data tibble. This means we are good to go.

Since the report file contains all taxonomic levels - this is not really easy to understand. Let’s explore how many taxa we have at each taxonomic level. To do that we will first specify the taxonomic levels we are interested in.

# number of taxa for each taxon level
taxcat<-c("S","G","F","O","C","P","D")

The we will use sapply to iterate a function over our data for each taxonomic level. To understand sapply better check out ?sapply.

Q2: How many families were detected in our dataset?

taxcount<-sapply(taxcat, function(tax){
  nrow(subset(brel, taxlevel == tax) %>% # subset only rows that correspond to tax level
        select(-c("taxid","taxlevel")) %>% # remove the columns taxid and taxlevel and only keep the rest
        filter(rowSums(select_if(., is.numeric)) != 0)) # remove all taxa that are 0 in all samples
})

# check out the number of taxa per taxonomic level
taxcount

##  S  G  F  O  C  P  D 
## 85 48 29 16 12  6  2

A2: From the output of taxcount we see that we have 29 families detected

3. Pre-process data

Now that we know a little bit about our data we can start selecting what we want to look at. Let’s extract a separate tibble for species, family and phylum level. At each tibble we remove all the taxa that are not present in any of the samples.

Let’s start with species

brel_spec<-subset(brel, taxlevel == "S") %>% 
  select(-c("taxid","taxlevel")) %>%
  filter(rowSums(select_if(., is.numeric)) != 0)
head(brel_spec)

## # A tibble: 6 x 33
##   taxa  Sample9 Sample8 Sample7 Sample6 Sample5 Sample4 Sample32 Sample31
##   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
## 1 Dunc…   12.6     9.34    3.08   20.7    12.9    14.9     26.4     19.6 
## 2 Dunc…    3.13    2.54    3.76    1.08    0.9     1.02     1.72     3.12
## 3 Dunc…   11.0     8.68    4.15   20.1    12.2    12.6     14.8      5.46
## 4 Muri…   17.7    25.1     5.38   10.4     7.53    6.81     9.27    13.5 
## 5 Muri…   23.2    22.8     6.06   16.4    23.4    15.0     25.4     11.0 
## 6 Bact…    1.51    2.83    3.27    2.99    0.34    0.43     0.17     4.48
## # … with 24 more variables: Sample30 <dbl>, Sample3 <dbl>, Sample29 <dbl>,
## #   Sample28 <dbl>, Sample27 <dbl>, Sample26 <dbl>, Sample25 <dbl>,
## #   Sample24 <dbl>, Sample23 <dbl>, Sample22 <dbl>, Sample21 <dbl>,
## #   Sample20 <dbl>, Sample2 <dbl>, Sample19 <dbl>, Sample18 <dbl>,
## #   Sample17 <dbl>, Sample16 <dbl>, Sample15 <dbl>, Sample14 <dbl>,
## #   Sample13 <dbl>, Sample12 <dbl>, Sample11 <dbl>, Sample10 <dbl>,
## #   Sample1 <dbl>

Can you come up with the command for family (F) and phylum (P) level? Tip: make sure you save it to a different variable. Otherwise you are overwriting your species data.

brel_fam<-subset(brel, taxlevel == "F") %>% 
  select(-c("taxid","taxlevel")) %>%
  filter(rowSums(select_if(., is.numeric)) != 0)

brel_phy<-subset(brel, taxlevel == "P") %>% 
  select(-c("taxid","taxlevel")) %>%
  filter(rowSums(select_if(., is.numeric)) != 0)

4. Plot relative abundances

Let’s plot and explore microbial composition and relative abundances in our samples by visualizing them.

As a first overview of coarse differences we can create a stacked bar plot of phyla. We need to mingle our data structure a bit to make it compliant with ggplot and add the metadata to get an extra level of information.

# data mingling
brel_phy_gg1 <- brel_phy %>% gather(key="Sample",value="rel_abun",-taxa)
# now join with metadata by column 'Sample'. We are using left join in case the metadata file contains additional samples not included in our dataset
brel_phy_gg1 <- left_join(brel_phy_gg1,meta,by="Sample") 
# check how it looks
head(brel_phy_gg1)

## # A tibble: 6 x 7
##   taxa            Sample  rel_abun group MouseID timepoint mocktreat
##   <chr>           <chr>      <dbl> <chr> <chr>   <chr>     <chr>    
## 1 Bacteroidetes   Sample9    72.8  O     616     Sch1      A        
## 2 Firmicutes      Sample9    24.4  O     616     Sch1      A        
## 3 Actinobacteria  Sample9     0.18 O     616     Sch1      A        
## 4 Proteobacteria  Sample9     2.38 O     616     Sch1      A        
## 5 Verrucomicrobia Sample9     0.02 O     616     Sch1      A        
## 6 Chordata        Sample9     0.21 O     616     Sch1      A

# bar plot phyla relative abundances
ggplot(brel_phy_gg1, aes(x=Sample, y=rel_abun)) +
  geom_bar(stat="identity",position="stack", aes(fill=taxa)) + # chose bar plot
  theme(axis.text.x = element_text(angle=45, hjust=1)) + # put x-axis label at 45 degree angle
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x") # produce two panels according to metatadata category 'mocktreat'

Q3: Do the phyla profiles look similar between samples? Can you spot any trends?

A3: We see that the same two Phyla dominate all samples but there is some variability between samples. From a first look there does not seem to be a major difference between the two groups A and B at this taxonomic level.

If you have time you can also visualize the other taxonomic levels (e.g. species) with the same approach. Try to come up with the code yourself. Hint: Omit legend using legend.position.

# bar plot phyla relative abundances
brel_spec_gg1 <- brel_spec %>% gather(key="Sample",value="rel_abun",-taxa)
brel_spec_gg1 <- left_join(brel_spec_gg1,meta,by="Sample")

ggplot(brel_spec_gg1, aes(x=Sample, y=rel_abun)) +
  geom_bar(stat="identity",position="stack", aes(fill=taxa)) +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        legend.position = "none") +
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x")

Q4: Does the profile change according to taxonomic level? Is the stacked bar plot helpful in all scenarios?

A4: When too many taxa are present, such as at species level, it becomes difficult to distinguish the colors.

As you might realize, when there are too many taxa it becomes very difficult to spot anything in the stacked bar plot. Another way to visualize the relative abundances is by creating a bubble plot. Let’s do that for the family composition.

# bubble plot
brel_fam_gg1 <- brel_fam %>% gather(key="Sample",value="rel_abun",-taxa)
brel_fam_gg1 <- left_join(brel_fam_gg1,meta,by="Sample")

ggplot(brel_fam_gg1, aes(x=Sample, y=taxa)) +
  geom_point(aes(size=rel_abun, color=timepoint), alpha=0.7) + # this time we use points
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  scale_size_continuous(limits = c(0.00001,max(brel_phy_gg1$rel_abun))) + # sets minimum above '0' 
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x")

## Warning: Removed 310 rows containing missing values (geom_point).

Q5: Did you notice that here we added an extra level of information? Can you spot what it is?

A5: Now we produced panels according to mocktreat and colored according to time point. Like this you can see that we only have two time points in one of the groups. This plot allows us therefore to combine multiple metadata layers.

If you have time you can play with the different information levels and see what this can tell you about your data.

When we want to look at species composition, the communities are often very complex and looking at abundance profiles is not as informative. One way to explore species-level community composition is to filter you data so you focus only on the most abundant taxa across samples. What cutoff and approach you use here of course depends on your question and data.

Let’s create a heatmap of the 20 most abundant spexies in our samples.

# sort species by abundance across samples and select top 20
brel_spec_gg1<-brel_spec[order(rowSums(select_if(brel_spec, is.numeric)),decreasing=T),] %>%
  head(20) %>%
  column_to_rownames("taxa")

# shape the metadata
meta_h <- subset(meta, Sample %in% colnames(brel_spec_gg1)) %>% 
  column_to_rownames("Sample") # shoft the 'Sample' column to rownames

# plot the heatmap
pheatmap::pheatmap(brel_spec_gg1,
                   cluster_rows = TRUE,
                   cluster_cols = TRUE,
                   annotation_col = meta_h[,c(1,3,4)],
                   annotation_names_col=TRUE)

Q6: what can you learn from the heatmap. Are there any informative clusters?

A6: We can see that 4 species dominate the communities in most samples. Adding the metadata we can also see that data do not cluster strongly according to group or time point, but there is some degree of structuring in mocktreat.

5. Beta diversity

Often we want to know whether the microbiomes are different between conditions or groups. One way to explore this is to look at the beta-diversity in an ordination. There are different distances and approaches that can be done and explored. We will perform an NMDS on bray curtis dissimilarities of the species profiles.

# To ensure reproducibility we can fix the seed here. This will ensure you always get the same result each time you run your data.
set.seed(34521)

# Data mingling
brel_spec_gg2 <- brel_spec %>% 
  column_to_rownames("taxa") %>% 
  t() # transpose

# Calculate distance matrix
brel_spec_gg2_dist <- vegdist(brel_spec_gg2, method = "bray")

# Perform NMDS on distance matrix
nmds_spec<-metaMDS(brel_spec_gg2_dist,distance = "bray",k = 2)

## Run 0 stress 0.14795 
## Run 1 stress 0.1795765 
## Run 2 stress 0.1936137 
## Run 3 stress 0.2378564 
## Run 4 stress 0.14795 
## ... Procrustes: rmse 4.309274e-05  max resid 0.0002094073 
## ... Similar to previous best
## Run 5 stress 0.1644519 
## Run 6 stress 0.169285 
## Run 7 stress 0.1835296 
## Run 8 stress 0.1920566 
## Run 9 stress 0.14795 
## ... Procrustes: rmse 4.490045e-05  max resid 0.0002129123 
## ... Similar to previous best
## Run 10 stress 0.1839323 
## Run 11 stress 0.14795 
## ... Procrustes: rmse 4.075948e-05  max resid 0.0001818523 
## ... Similar to previous best
## Run 12 stress 0.1835298 
## Run 13 stress 0.14795 
## ... New best solution
## ... Procrustes: rmse 1.20264e-05  max resid 5.395265e-05 
## ... Similar to previous best
## Run 14 stress 0.1692845 
## Run 15 stress 0.2115505 
## Run 16 stress 0.1792405 
## Run 17 stress 0.14795 
## ... Procrustes: rmse 1.608663e-05  max resid 7.697304e-05 
## ... Similar to previous best
## Run 18 stress 0.14795 
## ... Procrustes: rmse 3.475153e-05  max resid 0.0001652939 
## ... Similar to previous best
## Run 19 stress 0.1908263 
## Run 20 stress 0.1782935 
## *** Solution reached

# Check the output
nmds_spec

## 
## Call:
## metaMDS(comm = brel_spec_gg2_dist, distance = "bray", k = 2) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     brel_spec_gg2_dist 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.14795 
## Stress type 1, weak ties
## Two convergent solutions found after 20 tries
## Scaling: centring, PC rotation 
## Species: scores missing

Here you see a kind of summary of the analysis. For example, you can see that you used 2 dimensions and the stress was approx. 0.15. In general if a stress is above 0.2 then the clustering is not reliably representing the data and should be interpreted with caution. But here the stress is below 0.2, so we are okay.

Now let’s look at the ordination. To plot the data with ggplot, we need to extract the coordinaties of each point from nmds_spec$points.

# Extract and reshape the data to plot ordination as ggplot  and add the metadata
nmds_spec_gg<-as.data.frame(nmds_spec$points) %>%
  rownames_to_column("Sample") %>%
  left_join(meta, by="Sample")

Then we can greate the plot easily and color according to the metadata. We are choosing timepoint and mocktreat for the coloring respectively. But feel free to explore other parameters.

# Let's plot and color according to time point
ggplot(nmds_spec_gg, aes(x=MDS1,y=MDS2)) +
  geom_point(aes(color=timepoint), size=3, alpha=0.5) +
  ggtitle("NMDS colored according to timepoint")

# Now let's plot according to mocktreat
ggplot(nmds_spec_gg, aes(x=MDS1,y=MDS2)) +
  geom_point(aes(color=mocktreat), size=3, alpha=0.5)+
  ggtitle("NMDS colored according to mocktreat")

Q7: Can you spot clusters? Do they correspond to metadata groupings or conditions?

A7: The timepoints are not well separated. But we also have a lot more data from one of the two groups. The groups of mocktreat data appear nicely in two clusters.

If we want to test whether the variables significantly explain some of the variance we see, we can perform an Permutational analysis of variance (PERMANOVA) using the adonis2 package. Check out ?adonis2 is you want to learn more.

To test the overall effect of the timepoint we can run the following

set.seed(34521) # set seed for reproducibility

# PERMANOVA
adonis2(brel_spec_gg2_dist ~ timepoint, data = as.data.frame(nmds_spec_gg))

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = brel_spec_gg2_dist ~ timepoint, data = as.data.frame(nmds_spec_gg))
##           Df SumOfSqs      R2      F Pr(>F)  
## timepoint  1  0.12839 0.06337 2.0296  0.068 .
## Residual  30  1.89778 0.93663                
## Total     31  2.02617 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How would you run the PERMANOVA for mocktreat?

set.seed(34521)
adonis2(brel_spec_gg2_dist ~ mocktreat, data = as.data.frame(nmds_spec_gg))

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = brel_spec_gg2_dist ~ mocktreat, data = as.data.frame(nmds_spec_gg))
##           Df SumOfSqs     R2      F Pr(>F)    
## mocktreat  1  0.28407 0.1402 4.8918  0.001 ***
## Residual  30  1.74210 0.8598                  
## Total     31  2.02617 1.0000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q8: Explore both PERMANOVA outputs. Does any of the two factors timepoint and mocktreat significantly explain the variance observed in the data?

A8: The timepoints do not significantly explain any of the variance. The mocktreat does significantly explain approx. 14% of the variance.

6. Differential abundance

Differential abundance analysis is a huge discussion field in the statistical community and we cannot cover this in this course. We will give you an example of a very simple non-parametric approach. However, most of the times your data are much more complex, which will require you to use more sophisticated methods. It is important to discuss this with a statistician or someone with the statistical knowledge about these time of anaylsis. Most importantly, if you want to perform a differential abundance analysis you need to start planning your experimental set up accordingly from the beginning and include proper controls. Always make sure you understand what you are doing, to be able to judge the aproprietness of the method and interpret the results correctly.

6.1. Extract species relative abundance table

First we need to extract the species abundance table.

# create a df with all taxa and all variables
taxtab<-as.data.frame(t(brel_spec)) # transpose the data
names<-as.character(unlist(taxtab["taxa",])) # save column names to variable 
taxtab<-taxtab[!rownames(taxtab) %in% "taxa", ] # only keep rows with sample names
sampnames<-rownames(taxtab) # save rownames to variable
taxtab<-sapply(taxtab, as.character) # transform class of values to characters
taxtab<-as.data.frame(matrix(as.numeric(taxtab), ncol = ncol(taxtab))) # transform to numeric matrix
names(taxtab)<-names # add back the column names
rownames(taxtab)<-sampnames # add back the rownames

# add metadata
taxtab<-taxtab %>%
  rownames_to_column("Sample") %>%
  left_join(meta, by="Sample") %>% 
  mutate_if(is.character,as.factor) # make sure variables are factors

Now let’s pick one taxon Muribaculum intestinale and check how the data are distributed.

Q9: Are the data normally distributed?

We can quickly check this for one taxon.

# create a Q-Q plot
qqnorm((taxtab[,3]))
qqline((taxtab[,3]), col = "steelblue")

# plot the data distribution
ggplot(taxtab, aes(x=`Muribaculum intestinale`)) +
  geom_density(alpha = 0.5)

A9: The data of Muribaculum intestinale are clearly not normally distributed as can be seen from the distribution and the Q-Q plot.

Microbiome data often are not normally distributed, which means assumptions like data normality may not be appropriate. One possibility is to perform a non-parametric test which does not assume normailty of the data. For two-group comparisons this can be a Mann-Whitney-Wilcoxon test (wilcoxon) and for more than 2 groups you could use a Kruskal-Wallis test. These tests can be paired if this information (e.g. the same individual before and after treatment) is avaible but in our case we don’t have this Information.

Let’s practise and perform a wilcoxon test from the coin package on one taxon.

# run wilcoxon test
wtsingle <- coin::wilcox_test(taxtab[,"Muribaculum intestinale"] ~ mocktreat, data = taxtab, confint=TRUE)

# get the p-value
coin::pvalue(wtsingle)

## [1] 0.4168496

Q10: Is this taxon significantly different between timepoints?

A10: No, this taxon is not significantly different between timepoints (p-value > 0.05).

Why is this the case? Let’s look at the box plot to see.

Mint <- taxtab %>% 
  select(c("Muribaculum intestinale","mocktreat")) %>% 
  gather(key=taxa,value = rel_abund,-mocktreat)

ggplot(Mint, aes(x=mocktreat,y=rel_abund)) +
  geom_boxplot(aes(color=mocktreat)) +
  geom_jitter(aes(x=mocktreat,color=mocktreat))

6.2. Perform Wilcoxon test to compare two groups

First we create an empty dataframe to collect the results from the multiple tests for each species.

# save all taxa names
taxnames<-brel_spec$taxa

# empty dataframe, setting up all required columns
df_w<-data.frame(pvalue= vector(length = length(taxnames)), 
                 padjust = vector(length = length(taxnames)))
rownames(df_w)<-taxnames # set rownames to taxa

Then we run the wilcoxon test in a for-loop for each species and save the p-value to our dataframe.

for (taxa in taxnames) {
    wt <- coin::wilcox_test(taxtab[,taxa] ~ mocktreat, data = taxtab)
    df_w[taxa,"pvalue"] <- coin::pvalue(wt)
}

Since now we performed many statistical tests, we need to adjust our p-value for multiple testing. Find out more here ?p.adjust. Here we do this according to the Benjamini-Hochberg (BH) correction.

# Correct p-values for multiple testing (using BH)
df_w$padjust <- p.adjust(df_w$pvalue, method = "BH")
head(df_w)

##                                pvalue    padjust
## Duncaniella sp. C9        0.664091243 0.77325693
## Duncaniella sp. B8        0.039547249 0.15931022
## Duncaniella dubosii       0.416806869 0.56666797
## Muribaculum intestinale   0.416849571 0.56666797
## Muribaculum gordoncarteri 0.002366559 0.02235084
## Bacteroides caecimuris    0.075761641 0.21671976

You can see that we have unadjusted and adjusted p-values for each species.

Q11: How many species were differentially abundant after p-vajue correction? Which species?

# only significant taxa
df_w_sig <- subset(df_w, padjust<0.05)
head(df_w_sig)

##                                    pvalue    padjust
## Muribaculum gordoncarteri    0.0023665591 0.02235084
## Acutalibacter muris          0.0006311980 0.01660328
## Faecalibacterium prausnitzii 0.0008833756 0.01660328
## Flavonifractor plautii       0.0009766638 0.01660328
## Lachnoclostridium phocaeense 0.0045816321 0.03894387
## Blautia producta             0.0004400265 0.01660328

sigtax<-rownames(df_w_sig)
sigtax

##  [1] "Muribaculum gordoncarteri"    "Acutalibacter muris"         
##  [3] "Faecalibacterium prausnitzii" "Flavonifractor plautii"      
##  [5] "Lachnoclostridium phocaeense" "Blautia producta"            
##  [7] "Blautia sp. SC05B48"          "Enterococcus faecium"        
##  [9] "[Clostridium] innocuum"       "Adlercreutzia equolifaciens"

A11: 10 species are significantly different between the two mocktreats: Muribaculum gordoncarteri, Acutalibacter muris, Faecalibacterium prausnitzii, Flavonifractor plautii, Lachnoclostridium phocaeense, Blautia producta, Blautia sp. SC05B48, Enterococcus faecium, [Clostridium] innocuum, Adlercreutzia equolifaciens

Now plot the significantly different species as boxplot to see the shift.

sigtax<-taxtab %>%
  select(c(all_of(sigtax),mocktreat)) %>%
  gather(key=taxa,value = rel_abund,-mocktreat)

ggplot(sigtax, aes(x=mocktreat,y=rel_abund)) +
  geom_boxplot(aes(color=mocktreat)) +
  theme(axis.text.x = element_text(angle=45,hjust = 1)) +
  geom_jitter(aes(color=mocktreat), alpha=0.5, size=2) +
  ggtitle("Significantly different taxa") +
  facet_grid(. ~ taxa)

Q12: What do you see in the box plots?

A12: The boxplots show the shift in relative abundance between mocktreats. For most taxa it is obvious that despite a significant p-value the shift in relative abundance is extremely minor and it may be questionable whether it reflects biologically meaningful results. Acutalibacter muris and Muribaculum gordoncarteri instead shows a stronger effect and may be more interesting to explore further.

Explore metagenomic read profiles - solutions

Rebecca Ansorge

06/07/2021