Build a custom host database for Kraken2
In our workshop we proivided a kraken2 database for you to use. However, most of the times, you would need to create a database for your own host. For the creation of a human database kraken to already provides pre-processed databases. But sometimes you need to build a custom database. Here we can practise the corona virus genome which is small enough to keep computation times and storage space minimal.
Custom host (example: corona virus)
If your host is not included in kraken2 databases, this is a little bit more complicated.
You need to provide your host genome in fasta format.
We downloaded the corona virus genome for you to try the database creation here git
Very important is to add the NCBI taxid for your genome to contig names like |kraken:taxid|2697049
. To do this we create a folder coronaDB
to save the modified genome.
1
mkdir ~/coronaDB
Then we add the taxid. We can use seqfu for this:
1
seqfu cat --append "|kraken:taxid|2697049" /data/shared/db-genome/NC_045512.2.fasta.gz > ~/coronaDB/NC_045512.2_taxid.fasta
Let’s check that the taxid was indeed appended:
1
grep ">" ~/coronaDB/NC_045512.2_taxid.fasta
Now you need to add your fasta file to a new database which we named coronaDB
1
kraken2-build --add-to-library ~/coronaDB/NC_045512.2_taxid.fasta --db ~/coronaDB --threads 4
Next we tell kraken2 to build the taxonomy (this will take a few minutes)
1
kraken2-build --download-taxonomy --db ~/coronaDB
And now, we build the database
1
kraken2-build --build --db ~/coronaDB
Finally, we need to clean up a little
1
kraken2-build --clean --db ~/coronaDB
Yay! Now you are ready to use this database for your host decontamination using your own database.