Andrea Telatin
Andrea Telatin Senior bioinformatician at the Quadram Institute Bioscience, Norwich.

Taxonomic profiling of whole metagenome shotgun (day 2)

Taxonomic profiling of whole metagenome shotgun (day 2)

Today we will use Braken to recalibrate our estimations, and a set of scripts to merge multiple samples in a single table, and see how to filter it. We will prepare a MultiQC report with our QC, Kraken2 and Bracken data.

A small test on Kraken2

After making a very small test, we can also try with the CAMI mock sample, that you might have in ~/sequences/, or that you can find in /data/cami/.

Taxonomy profiling with Kraken2

We can now profile our samples, saving both the “raw” output and the report. We should be now familiar with for loops, so why don’t you write yours using the following template?

:warning: Save it as ~/kraken.sh so that we can check it!

1
2
3
4
5
6
DB={fill_this}
for READS_FOR in ~/kraken-ws/filt/*_R1*.gz;
do
    READS_REV=${READS_FOR/_R1/_R2}
    kraken2 {fill_this}
done

:bulb: In the next paragraphs you’ll find a complete script that does Kraken and Bracken in one go, but you can see a Kraken2 solution here.

Bracken

Bracken will use as input the Kraken2 report (not the tabular output), and needs to know the read length of our library (we can check it with SeqFu stats, for example).

The general syntax is:

1
bracken -d MY_DB -i INPUT -o OUTPUT -w OUTREPORT -r READ_LEN -l LEVEL -t THRESHOLD

Where:

  • MY_DB is the database, that should be the same used for Kraken2 (and adapted for Bracken)
  • INPUT is the report produced by Kraken2
  • OUTPUT is the tabular output, while OUTREPORT is a Kraken style report (recalibrated)
  • LEVEL is the taxonomic level (usually S for species)
  • THRESHOLD it’s the minimum number of reads required (default is 10)

Run bracken on one of the samples, and check the output files produced.

A combined script for Kraken and Bracken

To put together some of the information written in the Bash scripting notes, we can create a classify.sh script to process a directory of paired-end reads and produce two output folders (in the directory where the script was invoked):

  • kraken (containing the reports as *.report and raw output as *.tsv)
  • bracken (containing the reports as *.breport and tabular output as *.tsv)

We hosted the source on GitHub, and you can quickly download it as:

1
wget "https://gist.githubusercontent.com/telatin/fa79d013707a293c0c3ff019abc7313d/raw/kraken.sh"

and run it as:

1
bash classify.sh [input_dir]

Visualization and reports

MultiQC

MultiQC is a fantastic tool that can aggregate outputs from different bioinformatics programs in a single report. We will combine our fastp and Kraken2 classifications to have a single report.

Krona plots

Krona is a flexible tool to generate interactive pie charts. We have a dedicated tutorial on how to produce an HTML interactive plot using Krona.

The procedure works both on Kraken2 and Bracken report files.

Pavian

Pavian is an R package that will generate a web server in your local machine that you can access from (for example) http://localhost:5000.

Pavian can import Kraken reports, but also Metaphlan and Centrifuge files. If you want to try without installing R, there is a web app.

Combining the reports

Since we have to run kraken2 and bracken on a per-sample basis, it is helpful to combine the report files into a single table containing all observations and species before we jump into R. I wrote a small script that should do the merging of bracken (or kraken2) reports for you.

To use it we need to install pandas first

1
conda install pandas

The you can merge either a kraken2 or bracken reports with a custom script, as long as you supply an input directory with a set of *.report files.

1
2
# Check you have the files you expect
ls -l bracken/*.report
1
python /data/shared/scripts/merge_profiling_reports.py -i bracken/ -o merged

:package: merge_profiling_reports.py source

This produces 2 files in the same directory where the input files are (in our example ./bracken):

  • merged_rel_abund.csv: contains table for all samples with bracken relative abundances and taxonimic assignments
  • merged_read_numbers.csv: contains table for all samples with bracken read counts and taxonimic assignments

We will use these files for the data exploration and analysis on day 3.

But feel free to sneak peak with head merged_rel_abund.csv.

Get an overview for the number of host and unclassified reads

One information that get’s lost when running bracken, is the number of unclassified reads. However, this is a very important information when investigating (and comparing) samples. To get this data, let’s merge the kraken reports in the same way we merged the bracken reports.

1
python /data/shared/scripts/merge_profiling_reports.py -i ~/kraken-ws/kraken/ -o merged

Then we extract the read counts of unclassified and those assigned to ‘root’ from the kraken2 reports

1
head -3 ~/kraken-ws/kraken/merged_read_numbers.csv > ~/kraken-ws/kraken/classification_counts.csv

Now same thing for the host removal output from day 1

1
2
3
4
5
python /data/shared/scripts/merge_profiling_reports.py \
  -i ~/kraken-ws/reads-no-host/ \
  -o merged_hostremoval

head -3 ~/kraken-ws/reads-no-host/merged_hostremoval_read_numbers.csv > ~/kraken-ws/reads-no-host/hostread_counts.csv

We will explore these files tomorrow.

:bulb: Optional track: Kaiju