Taxonomic profiling of whole metagenome shotgun data (day 1)
On our first day we will cover the concepts behind taxonomic classification using Kraken2 (and Bracken), and see how to remove host reads and perform quality checks (and filtering).
Some links, before we start
This is an intermediate workshop and some knowledge of Unix is required, but we will be at different levels, so here are some useful links:
- Log in via SSH
- Using GNU screen
- A Unix CLI tutorial (basic)
- A primer on bash scripting (intermediate)
Preparing our workbench
Installing Miniconda screencast
We will use Miniconda throughout this workshop. This is generally a great tool to install packages and manage conflicting/multiple versions of libraries and tools, but in this case it’s also the required way to install Qiime2.
If we don’t have conda installed and available, we first need to install Miniconda and then mamba.
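A minimal sketch of the two steps, assuming a Linux x86_64 machine (the installer URL and prompts may differ on your platform):

# Download and run the Miniconda installer, then follow the prompts
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Install mamba (a faster drop-in replacement for conda) in the base environment
conda install -n base -c conda-forge mamba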
When we are ready, with conda (and mamba) installed, we can install some of the tools we'll use later:
mamba install -c conda-forge -c bioconda seqfu
We can create an ad hoc environment, called for example “metax” (here we split the command on multiple lines):
mamba create -n metax \
-c conda-forge -c bioconda \
seqkit kraken2 bracken entrez-direct krona fastp fastqc multiqc
To be able to use the tools, you will need to activate the environment with conda activate metax. To deactivate it, simply run conda deactivate.
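For example, a quick sanity check that the tools are available inside the environment (the exact version strings will differ):

conda activate metax
kraken2 --version
fastp --version
conda deactivate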
Problems installing conda? We are here to help!
Reclaim your home
Write your name in a file so that we know who is using each account:
echo "Name Surname" > ~/.name
Our dataset
We have a set of samples from an ongoing gut metagenomics study by Aimee Parker (Quadram Institute) and co-workers. Part of her study has been described in a pre-print; we took some samples to practice our classification skills. This is a small selection of the dataset, and we also subsampled the total reads to make the computation faster.
- Sample_3 (FastQC R1, FastQC R2)
- Sample_6 (FastQC R1, FastQC R2)
- Sample_30 (FastQC R1, FastQC R2)
- Sample_4 (FastQC R1, FastQC R2)
- Sample_22 (FastQC R1, FastQC R2)
- Sample_25 (FastQC R1, FastQC R2)
- Sample_13 (FastQC R1, FastQC R2)
- Sample_31 (FastQC R1, FastQC R2)
We have the reads in /data/workshop/reads. You can use that source directory directly or, for simplicity, make a symbolic link to that directory in your home, as shown below:
cd
mkdir kraken-ws
ln -s /data/workshop/reads/ kraken-ws
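To check that the link was created correctly (you should see reads pointing at /data/workshop/reads/):

ls -l ~/kraken-ws/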
Initial QC
A common approach to get a "snapshot" of the quality of the reads is FastQC, which produces an HTML report for each analysed file. From the command line it can be invoked as:
mkdir -p FastQC
fastqc --outdir FastQC --threads 4 reads/*_R[12]*gz
To view the HTML files you will need to copy them to your computer using scp (natively supported by macOS and Linux) or a program supporting the protocol, like WinSCP.
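For example, from a terminal on your own machine (the username and hostname here are placeholders; use the credentials you were given, and adjust the remote path to where you created the FastQC directory):

scp -r user@workshop-server:kraken-ws/FastQC .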
Sometimes it can be useful to get some data on our files from the command line:
- We can count the reads with seqfu count reads/*gz
- We can get insights on the quality scores with seqfu qual reads/*gz
- We can even inspect a single file with seqfu view reads/Sample3_R2.fq.gz | less -SR (remember to use q to quit less)
Host removal
We will use kraken2, the same program we will later use to classify our reads, to flag (and remove) the host reads.
First create a directory for your decontaminated output in your workshop directory:
mkdir ~/kraken-ws/reads-no-host
To view the settings of kraken2 you can run kraken2 -h.
Now we can try to run the host decontamination for one sample (e.g. Sample8).
Make sure you have the conda environment with your kraken2 installation activated.
kraken2 --db /data/db/kraken2/mouse_GRCm39 --threads 5 --confidence 0.5 \
  --minimum-base-quality 22 \
  --report ~/kraken-ws/reads-no-host/Sample8.report \
  --unclassified-out ~/kraken-ws/reads-no-host/Sample8#.fq \
  --paired /data/workshop/reads/Sample8_R1.fq.gz /data/workshop/reads/Sample8_R2.fq.gz > ~/kraken-ws/reads-no-host/Sample8.txt
- /data/db/kraken2/mouse_GRCm39 is the path to the kraken2 database for the host (see the optional part below to see how you can create your own custom database)
- ~/kraken-ws/reads-no-host/ is our output directory
- Sample8 is our sample name
- --report ~/kraken-ws/reads-no-host/Sample8.report is going to create the report that describes the amount of contamination
- --unclassified-out ~/kraken-ws/reads-no-host/Sample8#.fq is needed to collect the reads that don't match the host database, and hence represent contamination-free reads
- ~/kraken-ws/reads-no-host/Sample8.txt will save the read-by-read report; you will usually discard it, so you can redirect to /dev/null instead
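To see how much of the sample was flagged as host, we can peek at the report (the columns are: percentage of reads in the clade, clade read count, directly assigned read count, rank code, taxonomy ID, and name):

head ~/kraken-ws/reads-no-host/Sample8.report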
The kraken2 output FASTQ files are uncompressed and therefore take up a lot of disk space, so it's best to gzip them again before continuing.
pigz -p 6 ~/kraken-ws/reads-no-host/Sample8_*.fq
Since we have multiple samples, we need to run the command for all the reads. This can be done using a for loop. Save the following into a script called removehost.sh:
DB=/data/db/kraken2/mouse_GRCm39
DIR=/data/workshop/reads
OUT=~/kraken-ws/reads-no-host
mkdir -p "$OUT"
for R1 in "$DIR"/Sample*_R1.fq.gz;
do
  # We cut the sample name on the first _ and store it in $sample
  sample=$(basename "$R1" | cut -f1 -d_)
  # We infer the second pair replacing _R1 with _R2
  R2=${R1/_R1/_R2}
  echo "Removing mouse contamination from $sample"
  echo "Reads: $R1 / $R2"
  # The read-by-read output (stdout) is not needed, so we send it to /dev/null
  kraken2 \
    --db "$DB" \
    --threads 8 \
    --confidence 0.5 \
    --minimum-base-quality 22 \
    --report "$OUT/${sample}.report" \
    --unclassified-out "$OUT/${sample}#.fq" \
    --paired "$R1" "$R2" > /dev/null
  # Compress the unclassified (host-free) reads as we go
  pigz "$OUT"/*.fq
done
Now run the script with
bash removehost.sh
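Once the script has finished, each sample should have a report plus a pair of gzipped FASTQ files in the output directory; a quick way to list them and count the host-free reads:

ls ~/kraken-ws/reads-no-host/
seqfu count ~/kraken-ws/reads-no-host/*.fq.gz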
See also the Host removal slides.
Read filtering
fastp was designed to have "good defaults", but it's always a good idea to check which options are available to tweak the parameters.
Some notes:
- When processing paired-end reads, the input is specified with -i file_R1.fq and -I file_R2.fq.
- Similarly, the two output files are specified with -o filtered_R1.fq and -O filtered_R2.fq.
- During the process, a report can be saved in both HTML (-h report.html) and JSON (-j report.json) formats.
- All the other parameters affect which reads are retained / discarded.
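The full list of options is long, so it's easiest to browse it page by page (the 2>&1 makes sure we capture the help text even if it is printed to standard error):

fastp --help 2>&1 | less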
Running fastp over a single sample
Let's first try with a single sample, where we ask fastp to perform automatic adapter detection on the paired-end reads (it's enabled by default for single-end datasets) and require a minimum length after filtering (-l INT):
cd ~/kraken-ws/
mkdir fastp-test
fastp -i reads-no-host/Sample3_1.fq.gz -I reads-no-host/Sample3_2.fq.gz \
-o fastp-test/Sample3_R1.fq.gz -O fastp-test/Sample3_R2.fq.gz \
-l 100 --detect_adapter_for_pe
Note that Kraken will split the paired reads into files tagged as _1 and _2. We restore the more common _R1 and _R2 nomenclature with the output of fastp. We should find the output FASTQ files in the fastp-test subdirectory.
To quickly check the amount of reads before and after filtering:
seqfu stats --nice {reads,reads-no-host,fastp-test}/Sample3_*1.fq.gz
This will print a screen-friendly table. From the table we can see the read loss and also the effect on the read length:
┌───────────────────────────────┬─────────┬───────────┬───────┬─────┬─────┬─────┬───────┬─────┬─────┐
│ File │ #Seq │ Total bp │ Avg │ N50 │ N75 │ N90 │ auN │ Min │ Max │
├───────────────────────────────┼─────────┼───────────┼───────┼─────┼─────┼─────┼───────┼─────┼─────┤
│ reads/Sample3_R1.fq.gz │ 1000000 │ 150000000 │ 150.0 │ 150 │ 150 │ 150 │ 0.000 │ 150 │ 150 │
│ reads-no-host/Sample3_1.fq.gz │ 989295 │ 148394250 │ 150.0 │ 150 │ 150 │ 150 │ 0.000 │ 150 │ 150 │
│ fastp-test/Sample3_R1.fq.gz │ 915797 │ 135443988 │ 147.9 │ 150 │ 150 │ 150 │ 0.006 │ 100 │ 150 │
└───────────────────────────────┴─────────┴───────────┴───────┴─────┴─────┴─────┴───────┴─────┴─────┘
By default, fastp also saves a report (called fastp.html) in the current directory. Looking at an example report, we can see that a single report contains data on the reads both before and after the quality filtering.
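If you prefer to stay on the command line, the JSON report (saved as fastp.json by default, next to fastp.html) can be queried instead; a small sketch using jq (assuming jq is available on the system; the filtering_result field is present in recent fastp versions):

jq '.filtering_result' fastp.json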
Processing multiple samples with a script
To process multiple samples we can use a for loop, and it's wise to put it in a script instead of typing the whole command (if we make an error, we can quickly fix it, and we keep track of our commands). Save the following as filter.sh:
mkdir -p filt
mkdir -p reports
for i in reads-no-host/*_1*gz;
do
  # Sample name: everything before the first _ in the file name
  sample=$(basename "$i" | cut -f 1 -d _)
  echo "Processing sample $sample"
  fastp -w 8 -i "$i" -I "${i/_1/_2}" \
    -o filt/$(basename "${i/_1/_R1}") -O filt/$(basename "${i/_1/_R2}") \
    -h reports/${sample}.html -j reports/${sample}.json \
    --detect_adapter_for_pe \
    --length_required 100 \
    --overrepresentation_analysis
done
To run the script:
bash filter.sh
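To get a single overview of all the per-sample reports, MultiQC (which we installed in the metax environment) can aggregate both the fastp and the FastQC outputs; a minimal sketch, assuming the report directories used above:

multiqc reports/ FastQC/ --outdir multiqc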
If you arrived at this point, you deserve your coffee and an emoji-medal: 🏅
If you have spare time and want to experiment with something more: