Load libraries

library(tidyverse)

Import data

We are importing the metadata which can be found here:

/data/shared/Rfiles/metadata_sub.csv

We are also importing the two files classification_counts.csv and hostread_counts.csv that you created in yesterdays session. If you haven’t created those files, then we provided these files for you here:

/data/shared/Rfiles/classification_counts.csv

/data/shared/Rfiles/hostread_counts.csv

Make sure you copy this file to your current working directory or specify the correct path to the file when importing.

Import the metadata and check the input file using the head() command

meta<-read_delim(file="metadata_sub.csv", delim="\t")
head(meta)

Import the classification counts and host read counts as follows

classified<-read_delim(file="classification_counts.csv", delim=",")
head(classified)
hostreads<-read_delim(file="hostread_counts.csv", delim=",")
head(hostreads)

We see in the files we still have .report in the sample names. In order to be able to merge this data with the metadata we need to remove this string and just keep the sample names.

colnames(classified)<-gsub(".report", "", colnames(classified))
colnames(hostreads)<-gsub(".report", "", colnames(hostreads))

head(classified)
head(hostreads)

Preprocess data

We want to merge metadata and read counts. To do that we first need to re-arrange our data to make sure that we have the samples as column and not as column names.

classified<- classified %>% 
  select(-c(taxid,taxlevel)) %>% # this removes 2 columns we don't need
  gather(Sample, value,-taxa) %>%  # this gathers the data into columns except for column taxa
  spread(taxa, value) %>% # this spreads out again the data into the required format
  rename("classified"=root) # here we rename the column root into 'classified'
head(classified)
hostreads<- hostreads %>% 
  select(-c(taxid,taxlevel)) %>% 
  gather(Sample, value,-taxa) %>% 
  spread(taxa, value) %>% 
  rename("host"=root) %>%
  rename("nohost"=unclassified) 
head(hostreads)

Join with metadata

Now we are ready to join with the metadata by their common Sample column. We use full_join to do that. Check out the resulting data.

metainfo<-meta %>%
  full_join(classified,by="Sample") %>%
  full_join(hostreads,by="Sample")

head(metainfo)

Let’s plot the data

First we check how many host reads we had in the beginning

# gather the columns to make them suitable with ggplot
metainfo_gg1 <- metainfo %>% 
  gather(key="host",value="counts",-c(Sample, group, MouseID, timepoint, mocktreat, classified, unclassified))
head(metainfo_gg1)

Now create a stacked bar plot

ggplot(metainfo_gg1, aes(x=Sample, y=counts)) +
  geom_bar(stat="identity",position="stack", aes(fill=host)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x")

Interesting, Sample22 seems to have had much more host than the others.

Now let’s take a look how many reads of the decontaminated data were actually classified by your kraken2 run.

metainfo_gg2 <- metainfo %>% 
  gather(key="classified",value="counts",-c(Sample, group, MouseID, timepoint, mocktreat, host, nohost))
head(metainfo_gg2)

ggplot(metainfo_gg2, aes(x=Sample, y=counts)) +
  geom_bar(stat="identity",position="stack", aes(fill=classified)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x")

Let’s now plot host reads, classified fraction and unclassified fraction into the same plot.

metainfo_gg3 <- metainfo %>% 
  gather(key="reads",value="counts",-c(Sample, group, MouseID, timepoint, mocktreat, nohost))
head(metainfo_gg3)

ggplot(metainfo_gg3, aes(x=Sample, y=counts)) +
  geom_bar(stat="identity",position="stack", aes(fill=reads)) +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  facet_grid(. ~ mocktreat, scales="free_x",space = "free_x")

In general we can see that the number of reads classified by kraken2 were pretty similar across the samples.