Build a custom host database for Kraken2
In the workshop we provided a Kraken2 database for you to use. In practice, you will usually need a database for your own host organism. For human, pre-built Kraken2 databases are already available, but for other hosts you may have to build a custom database yourself.
Here we practice with the coronavirus genome, which is small enough to keep computation time and storage space minimal.
Custom host (example: coronavirus)
Let’s create a custom database for the SARS-CoV-2 coronavirus (NCBI RefSeq: NC_045512.2).
Adding taxonomy information
Kraken2 requires NCBI taxonomy IDs in sequence headers to correctly classify reads. The format is |kraken:taxid|TAXID appended to each sequence name. For SARS-CoV-2, the taxid is 2697049.
First, create a directory for the database:
# Create directory for the custom database
mkdir -p ~/coronaDB
Then append the taxid to sequence headers using seqfu:
# Append taxid to sequence headers
seqfu cat --append "|kraken:taxid|2697049" /data/shared/db-genome/NC_045512.2.fasta.gz > ~/coronaDB/NC_045512.2_taxid.fasta
Verify the modification worked:
# Verify the taxid was added to headers
grep ">" ~/coronaDB/NC_045512.2_taxid.fasta
You should see the header now includes |kraken:taxid|2697049.
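If seqfu is not available, the same header edit can be sketched with awk. This is a minimal illustration rather than the workshop's method: the file names toy.fasta and toy_taxid.fasta and the shortened header are hypothetical stand-ins, and it assumes the tag belongs on the sequence ID (the first word of the header), with any description kept after it.

```shell
# Toy FASTA standing in for the real genome (hypothetical file and header)
printf '>NC_045512.2 Severe acute respiratory syndrome coronavirus 2\nATTAAAGGTT\n' > toy.fasta

# Append the taxid tag to the sequence ID (first word of each header line).
# Sequence lines are printed unchanged.
awk -v tag='|kraken:taxid|2697049' '
  /^>/ { $1 = $1 tag }
  { print }
' toy.fasta > toy_taxid.fasta

# Show the modified header
head -n 1 toy_taxid.fasta
# >NC_045512.2|kraken:taxid|2697049 Severe acute respiratory syndrome coronavirus 2
```

For a compressed input such as the workshop genome, the file would first need to be decompressed (for example with gzip -dc) and piped into awk.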
Building the database
Add the genome to the database library:
# Add genome to library
kraken2-build \
--add-to-library ~/coronaDB/NC_045512.2_taxid.fasta \
--db ~/coronaDB \
--threads 4
Download the NCBI taxonomy tree (required for classification):
# Download NCBI taxonomy
kraken2-build --download-taxonomy --db ~/coronaDB
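This downloads taxdump.tar.gz together with the accession-to-taxid mapping files and unpacks the taxonomy structure into ~/coronaDB/taxonomy. The mapping files are large, so the download can take a while.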
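Before building, it can be worth confirming that the taxid used in the FASTA headers is present in the downloaded taxonomy. The sketch below assumes the standard NCBI nodes.dmp layout (tab-separated fields with the taxid first) and that the file sits at ~/coronaDB/taxonomy/nodes.dmp; the helper name taxid_in_nodes is hypothetical.

```shell
# Succeed if TAXID ($2) appears as a node in the nodes.dmp file ($1).
# nodes.dmp fields are separated by "<tab>|<tab>", so with a tab as the
# field separator the first field is the taxid itself.
taxid_in_nodes() {
  awk -F '\t' -v id="$2" '
    $1 == id { found = 1; exit }
    END { exit !found }
  ' "$1"
}

# Assumed location of the downloaded taxonomy
nodes=~/coronaDB/taxonomy/nodes.dmp
if [ -f "$nodes" ]; then
  if taxid_in_nodes "$nodes" 2697049; then
    echo "taxid 2697049 found in taxonomy"
  else
    echo "taxid 2697049 missing from taxonomy"
  fi
fi
```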
Build the k-mer database (this is the most time-consuming step):
# Build the Kraken2 database
kraken2-build --build --db ~/coronaDB --threads 4
This creates the hash table and minimizer database used for classification.
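A finished Kraken2 database consists of three files: hash.k2d, opts.k2d and taxo.k2d. As a rough sanity check, the sketch below tests that all three exist and are non-empty; the helper name db_is_built is hypothetical, and the check says nothing about whether the contents are correct.

```shell
# Succeed only if all three Kraken2 index files exist and are non-empty
# in the database directory given as $1.
db_is_built() {
  for f in hash.k2d opts.k2d taxo.k2d; do
    [ -s "$1/$f" ] || { echo "missing: $1/$f" >&2; return 1; }
  done
}

if db_is_built ~/coronaDB; then
  echo "database looks complete"
fi
```

The check works both before and after the clean-up step, since these three files are the ones that are kept. For a closer look at what the database contains, kraken2-inspect --db ~/coronaDB prints a summary of its taxa.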
Clean up intermediate files to save disk space:
# Remove intermediate files to save space
kraken2-build --clean --db ~/coronaDB
This removes downloaded taxonomy files and temporary data, keeping only the final database.
Using your custom database
To classify reads and separate host from non-host sequences:
# Classify reads and remove host sequences
kraken2 --db ~/coronaDB \
--threads 4 \
--unclassified-out host_removed#.fastq \
--classified-out host_classified#.fastq \
--paired input_R1.fastq input_R2.fastq
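Kraken2 replaces the # symbol with _1 and _2 for paired-end reads. Unclassified reads (non-host) go to host_removed_1.fastq and host_removed_2.fastq, while classified reads (host) go to host_classified_1.fastq and host_classified_2.fastq. The per-read classification table is printed to standard output unless you redirect it with --output.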
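To see how many reads ended up on each side, the output files can be counted. This is a minimal sketch that assumes uncompressed FASTQ with exactly four lines per record; the helper name count_reads is hypothetical.

```shell
# Print the number of reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
  lines=$(wc -l < "$1")
  echo $(( lines / 4 ))
}

# Report counts for the forward-read files produced by the command above
for f in host_removed_1.fastq host_classified_1.fastq; do
  if [ -f "$f" ]; then
    echo "$f: $(count_reads "$f") reads"
  fi
done
```

For paired-end data the _1 and _2 files should hold the same number of reads, so counting one of each pair is enough.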
Notes
- Replace 2697049 with your organism’s NCBI taxid (find it at NCBI Taxonomy)
- For larger genomes, increase memory and threads accordingly
- The database size scales with genome complexity and k-mer diversity
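As an alternative to the NCBI Taxonomy website, a taxid can be looked up in the names.dmp file of the downloaded taxonomy. The sketch below assumes the standard NCBI names.dmp layout (taxid, name, unique name and name class, separated by tab, pipe, tab) and that the lookup runs before the clean-up step, which removes the taxonomy files. The helper name taxid_for_name and the example organism name are illustrative.

```shell
# Print the taxid whose scientific name exactly matches $2 in names.dmp ($1).
# With a tab as field separator: $1 = taxid, $3 = name, $7 = name class.
taxid_for_name() {
  awk -F '\t' -v name="$2" '
    $3 == name && $7 == "scientific name" { print $1 }
  ' "$1"
}

# Assumed location of the downloaded taxonomy
names=~/coronaDB/taxonomy/names.dmp
if [ -f "$names" ]; then
  taxid_for_name "$names" "Homo sapiens"
fi
```

The match is exact and case-sensitive, so the name must be spelled as it appears in the taxonomy.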