A first experiment with Kraken2
Our goal here is to create a small and artificial set of sequences to be classified using Kraken2, to practice with its parameters and its output formats.
A known reference
You can select your favourite genome, or use the one we placed in:
1
/data/shared/genomes/GCF_000027325.1_ASM2732v1_genomic.fna.gz
This is the complete genome of Mycoplasma genitalium, that just happen to be a relative small genome,
as confirmed by seqfu stats
(we add -b
to print the filename, and not the path,
and -n
to have a screen-friendly table):
1
seqfu stats -b -n /data/shared/genomes/GCF_000027325.1_ASM2732v1_genomic.fna.gz
Making it into pieces
Sometimes it is useful to simulate a whole genome shotgun, in order to have a synthetic dataset that should resemble a real sequencing experiment. One of such tools, art can generate Illumina datasets.
To test Kraken2, however, we want to start with a systematic approach: shredding the genome into pieces of known length but without sequencing errors and sampling biases. fu-shred, from the seqfu suite, can do exactly that, specifying:
- the desired fragments length (
--length
, or-l
for short) - and their distance (
--step
, or-s
)
1
2
3
# Create a directory for this experiment
mkdir ~/myco-test
fu-shred -l 100 -s 10 /data/shared/genomes/GCF_000027325.1_ASM2732v1_genomic.fna.gz > ~/myco-test/fragments.fq
Classify our dataset with Kraken2
A curated collection of Krakent2 databases is available from BenLangmead lab.
We have some Kraken2 databases in /data/db/kraken2/
, and we are specifically going to use /data/db/kraken2/standard-16gb/
.
If you set the $KRAKEN2_DEFAULT_DB
you can omit the database in the command.
For example with:
1
export KRAKEN2_DEFAULT_DB=/data/db/kraken2/standard-16gb/
the full command to generate both a summary report and the detailed output would be:
1
2
3
kraken2 --db /data/db/kraken2/standard-16gb/ \
--report ~/myco-test/report.txt
--threads 4 ~/myco-test/fragments.fq > ~/myco-test/kraken-output.tsv
The program will print to the standard error, at the end of the execution, a summary of the operation:
1
2
3
57998 sequences (5.80 Mbp) processed in 0.971s (3582.9 Kseq/m, 358.29 Mbp/m).
55100 sequences classified (95.00%)
2898 sequences unclassified (5.00%)
It’s interesting to see that, even if we have subsequences from a reference genome with no errors, a fraction of the fragments has not been classified at all.