Taxonomic profiling of whole metagenome shotgun (day 2)
Today we will use Braken to recalibrate our estimations, and a set of scripts to merge multiple samples in a single table, and see how to filter it. We will prepare a MultiQC report with our QC, Kraken2 and Bracken data.
A small test on Kraken2
After making a very small test, we can also try with the CAMI mock sample, that you might have in ~/sequences/
,
or that you can find in /data/cami/
.
Taxonomy profiling with Kraken2
We can now profile our samples, saving both the “raw” output and the report.
We should be now familiar with for
loops, so why don’t you write yours using the
following template?
Save it as ~/kraken.sh
so that we can check it!
1
2
3
4
5
6
DB={fill_this}
for READS_FOR in ~/kraken-ws/filt/*_R1*.gz;
do
READS_REV=${READS_FOR/_R1/_R2}
kraken2 {fill_this}
done
In the next paragraphs you’ll find a complete script that does Kraken and Bracken in one go, but you can see a Kraken2 solution here.
Bracken
Bracken will use as input the Kraken2 report (not the tabular output), and needs to know the read length of our library (we can check it with SeqFu stats, for example).
The general syntax is:
1
bracken -d MY_DB -i INPUT -o OUTPUT -w OUTREPORT -r READ_LEN -l LEVEL -t THRESHOLD
Where:
- MY_DB is the database, that should be the same used for Kraken2 (and adapted for Bracken)
- INPUT is the report produced by Kraken2
- OUTPUT is the tabular output, while OUTREPORT is a Kraken style report (recalibrated)
-
LEVEL is the taxonomic level (usually
S
for species) - THRESHOLD it’s the minimum number of reads required (default is 10)
Run bracken on one of the samples, and check the output files produced.
A combined script for Kraken and Bracken
To put together some of the information written in the Bash scripting notes,
we can create a classify.sh
script to process a directory of paired-end
reads and produce two output folders (in the directory where the script
was invoked):
-
kraken (containing the reports as
*.report
and raw output as*.tsv
) -
bracken (containing the reports as
*.breport
and tabular output as*.tsv
)
We hosted the source on GitHub, and you can quickly download it as:
1
wget "https://gist.githubusercontent.com/telatin/fa79d013707a293c0c3ff019abc7313d/raw/kraken.sh"
and run it as:
1
bash classify.sh [input_dir]
- See the full code
Visualization and reports
MultiQC
MultiQC is a fantastic tool that can aggregate outputs from different bioinformatics programs in a single report. We will combine our fastp and Kraken2 classifications to have a single report.
Krona plots
Krona is a flexible tool to generate interactive pie charts. We have a dedicated tutorial on how to produce an HTML interactive plot using Krona.
The procedure works both on Kraken2 and Bracken report files.
Pavian
Pavian is an R package that will generate a web server in
your local machine that you can access from (for example) http://localhost:5000
.
Pavian can import Kraken reports, but also Metaphlan and Centrifuge files. If you want to try without installing R, there is a web app.
- Link: an example Pavian report
Combining the reports
Since we have to run kraken2 and bracken on a per-sample basis, it is helpful to combine the report files into a single table containing all observations and species before we jump into R. I wrote a small script that should do the merging of bracken (or kraken2) reports for you.
To use it we need to install pandas first
1
conda install pandas
The you can merge either a kraken2 or bracken reports with a custom script, as long as you supply
an input directory with a set of *.report
files.
1
2
# Check you have the files you expect
ls -l bracken/*.report
1
python /data/shared/scripts/merge_profiling_reports.py -i bracken/ -o merged
merge_profiling_reports.py source
This produces 2 files in the same directory where the input files are (in our example ./bracken):
-
merged_rel_abund.csv
: contains table for all samples with bracken relative abundances and taxonimic assignments -
merged_read_numbers.csv
: contains table for all samples with bracken read counts and taxonimic assignments
We will use these files for the data exploration and analysis on day 3.
But feel free to sneak peak with head merged_rel_abund.csv
.
Get an overview for the number of host and unclassified reads
One information that get’s lost when running bracken, is the number of unclassified reads. However, this is a very important information when investigating (and comparing) samples. To get this data, let’s merge the kraken reports in the same way we merged the bracken reports.
1
python /data/shared/scripts/merge_profiling_reports.py -i ~/kraken-ws/kraken/ -o merged
Then we extract the read counts of unclassified and those assigned to ‘root’ from the kraken2 reports
1
head -3 ~/kraken-ws/kraken/merged_read_numbers.csv > ~/kraken-ws/kraken/classification_counts.csv
Now same thing for the host removal output from day 1
1
2
3
4
5
python /data/shared/scripts/merge_profiling_reports.py \
-i ~/kraken-ws/reads-no-host/ \
-o merged_hostremoval
head -3 ~/kraken-ws/reads-no-host/merged_hostremoval_read_numbers.csv > ~/kraken-ws/reads-no-host/hostread_counts.csv
We will explore these files tomorrow.
Optional track: Kaiju