De novo assembly of viromes
To identify putative viral sequences without using a classifier, we can use a de novo assembly approach. We first need to perform a standard metagenome assembly, then we need to process the contigs using a viral miner, a tool aiming at identifying viral sequences in the assembly.
Assembly
MegaHit is a fast assembler that also handle metagenomic datasets. We can install it with conda, even in a dedicated environment such as:
1
mamba create -n denovo -c conda-forge -c bioconda --yes megahit seqfu quast
To assemble a single dataset, the command is like:
1
2
# Using 16 cores
megahit -1 reads_1.fastq.gz -2 reads_2.fastq.gz -o assembly -t 16
Choose a single sample and assemble it with MegaHIT.
For the records, we write here a prototype of script with a loop to assemble a set of paired-end reads, where we iterate over the forward reads and we will infer the name of the reverse reads:
1
2
3
4
5
6
7
8
9
10
11
12
# Modify this script according to your needs
FWD_TAG="_1.fq.gz" # Suffix specific to R1 reads
REV_TAG="_2.fq.gz" # Suffix specific to R2 reads
READS=/path/to/reads # Path to reads, READS=$VIR/dataset (For EBAME7)
mkdir -p assembly # Where to save the output
for R1 in $READS/*${FWD_TAG};
do
R2=${R1/$FWD_TAG/$REV_TAG} # Infer file R2
SAMPLE=$(basename ${R1/$FWD_TAG/}) # Extract sample name
echo Processing $SAMPLE
megahit -1 $R1 -2 $R2 -o assembly/$SAMPLE -t 16
done
Assembly metrics
You can quickly gather an overview of your assemblies using SeqFu stats:
1
seqfu stats -n -t assembly/*/final.contigs.fa
Or you can use Quast to get a more detailed report.
Viral mining
There are several tools that can be used to identify viral sequences in a metagenomic assembly. In this tutorial we focus on two tools: VirSorter2, a complete pipeline, and VirFinder, an R package.
VirSorter 2 (EBAME)
VirSorter 2 comes with a good documentation.
Setup
VirSorter can be easily installed from BioConda. If you didn’t do it already, create a “vs2” environment
with virstorter=2
and activate it.
The first step is to download the database:
1
2
3
4
5
6
# Install the package
mamba create -n virsorter2 -c conda-forge -c bioconda --yes virsorter=2
conda activate virsorter2
# One time database download [this may have been done for you in training servers]
virsorter setup -d $OUTPUT_DB_PATH -j 4
Running VirSorter
When the database is installed, you can run VirSorter on a single assembly:
1
virsorter run -w $OUT -d $VIR/virsorter2/ -i $CONTIGS -j 16 [--use-conda-off]
where:
-
-w OUTPUT_DIR
is the output directory path -
-d PATH_DB
is the path to the VirSorter2 database and environments -
-i CONTIGS_FASTA
is the fasta file produced by the assembler -
-j THREADS
is the number of jobs to run in parallel -
--use-conda-off
will allow to work with offline servers (requires a different installation)
(parallel) VirFinder
We can install a parallelised version of VirFinder using conda:
1
mamba install -c bioconda -c conda-forge parallel-virfinder
to use it:
1
parallel-virfinder.py -i CONTIGS -o OUTPUT -f FASTA_OUTPUT -n 16 [-s MIN_SCORE -p MAX_P_VALUE]
MIN_SCORE and MAX_P_VALUE are supplied with defaults values, but you can change them if you want.
-
-i CONTIGS
is the fasta file produced by the assembler (input) -
-f OUTPUT
is the set of viral contigs (output) -
-o OUTPUT
is the tabular output
Checking the results
1
2
3
4
5
# In the base environment
mamba install -c bioconda checkv
# Download the databases (a subdirectory will be created)
checkv download_database ~/checkv-dbs/
To run checkv:
1
checkv end_to_end -d ~/checkv-db/checkv-db-v1.4/ -t 16 input-contigs.fa output-dir
The output directory includes:
- viruses.fna
- proviruses.fna
- complete_genomes.tsv: detailed overview of putative complete genomes identified
- quality_summary.tsv: report on the program’s modules for each contig
For example, when running on the output of VirSorter2 on one sample in the dataset, CheckV discarded 30% of the contigs:
1
2
3
4
5
6
┌──────────────────────┬──────┬───────────┬──────────┬────────┬───────┬───────┬───────────┬─────┬─────────┐
│ File │ #Seq │ Total bp │ Avg │ N50 │ N75 │ N90 │ auN │ Min │ Max │
├──────────────────────┼──────┼───────────┼──────────┼────────┼───────┼───────┼───────────┼─────┼─────────┤
│ final-viral-combined │ 745 │ 2,684,355 │ 3,603.16 │ 19,811 │ 3,156 │ 1,197 │ 28,566.17 │ 228 │ 124,265 │
│ viruses │ 716 │ 1,833,389 │ 2,560.60 │ 6,124 │ 1,956 │ 911 │ 17,494.67 │ 228 │ 70,755 │
└──────────────────────┴──────┴───────────┴──────────┴────────┴───────┴───────┴───────────┴─────┴─────────┘
The programme
- EBAME-22 notes: EBAME-7 specific notes
- Gathering the reads: downloading and subsampling reads from public repositories (optional)
- Gathering the tools: we will use Miniconda to manage our dependencies
- Reads by reads profiling: using Phanta to quickly profile the bacterial and viral components of a microbial community
- De novo mining: assembly based approach, using VirSorter as an example miner
- Viral taxonomy: ab initio taxonomy profiling using vConTACT2
- MetaPhage overview: what is MetaPhage, a reads to report pipeline for viral metagenomics