Profiling viromes with Phanta
Phanta profiling requires these steps:
- Creating an environment with the required dependencies (only once)
- Cloning the Phanta repository (only once)
- Downloading the database (only once)
- Generating a “Sample Sheet” file
- Running Phanta
We already described the first two steps in the previous post, and the tool's repository documents the whole process well.
Setup
See the EBAME-specific notes and skip this section
Phanta’s installation is not automated, but all its dependencies can be installed using an environment file for Miniconda.
# Get the Phanta repository and make its environment
git clone https://github.com/bhattlab/phanta.git
mamba env create -n phanta_env --file phanta/phanta_env.yaml
conda activate phanta_env
Databases have to be downloaded from Dropbox, and stored in a convenient location.
# Download the database, keep track of the location
mkdir -p phanta_dbs/default_V1
cd phanta_dbs/default_V1
wget https://www.dropbox.com/sh/3ktsdqlcph6x95r/AACGSj0sxYV6IeUQuGAFPtk8a/database_V1.tar.gz
tar xvzf database_V1.tar.gz
How to run Phanta (the manual way)
The command looks like this:
REPO_DIR=/path/to/phanta_repo/
snakemake -s $REPO_DIR/Snakefile \
--configfile "config.yaml" --jobs 99 --cores 1 --max-threads 90
So what do we need to put in the config.yaml file? Some boilerplate (as found in the template file) and the following paths:
database: /path-to/phanta_dbs/default_V1
outdir: /path-to/phanta-out
pipeline_directory: /path-to/phanta_repo
sample_file: /path/to/metadata.txt
Some elements are ready to go:
- Database: full path to the default_V1 directory downloaded during the setup
- Outdir: the desired output directory (create it first)
- Pipeline directory: simply the full path to Phanta’s repository
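If you start from a copy of the template, the path entries can be filled in from the command line rather than by hand. The snippet below is only a sketch: `set_config_path` is a hypothetical helper (not part of Phanta), and it assumes each key sits at the start of its own line, as in the paths listed above.

```shell
# Hypothetical helper (not part of Phanta): set "key: value" in a YAML file.
# Assumes the key starts its own line, e.g. "database: ...".
set_config_path() (
  file=$1 key=$2 value=$3
  tmp=$(mktemp) || exit 1
  awk -v k="$key" -v v="$value" '
    index($0, k ":") == 1 { print k ": " v; next }  # rewrite the matching line
    { print }                                        # keep every other line
  ' "$file" > "$tmp" || exit 1
  # Write back in place so the original file keeps its permissions
  cat "$tmp" > "$file" && rm -f "$tmp"
)

# Example usage (placeholder paths, adapt to your system):
# set_config_path config.yaml database /path-to/phanta_dbs/default_V1
# set_config_path config.yaml outdir /path-to/phanta-out
```

Lines that do not start with the given key are left untouched, so the boilerplate from the template survives.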
For the sample file, you can generate one with SeqFu:
# SeqFu can be installed via:
# conda install -c bioconda -c conda-forge seqfu
seqfu metadata /path/to/reads | grep -v sample-id > /path/to/metadata.txt
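Before launching a long run, it can be worth checking the sample sheet. The sketch below assumes a tab-separated file with three columns (sample name, forward reads, reverse reads) and no header, which is what the command above is expected to produce. `check_sample_sheet` is a hypothetical helper, not part of Phanta or SeqFu.

```shell
# Hypothetical check, not part of Phanta or SeqFu.
# Assumes a tab-separated sheet: sample name, forward reads, reverse reads.
check_sample_sheet() (
  sheet=$1
  status=0
  tab=$(printf '\t')
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline
  while IFS=$tab read -r sample fwd rev extra || [ -n "$sample" ]; do
    if [ -z "$rev" ] || [ -n "$extra" ]; then
      echo "Expected 3 fields for sample: $sample" >&2
      status=1
      continue
    fi
    for f in "$fwd" "$rev"; do
      [ -f "$f" ] || { echo "Missing file for $sample: $f" >&2; status=1; }
    done
  done < "$sheet"
  exit "$status"
)

# Example usage:
# check_sample_sheet /path/to/metadata.txt && echo "Sample sheet looks fine"
```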
Autorun
A wrapper script can run the previous command for us, but it’s important to check that the environment is activated before using it.
- Get the script
mkdir -p "$HOME/bin"  # make sure the target directory exists
wget -O "$HOME/bin/runPhanta.py" "https://gist.githubusercontent.com/telatin/4f404fc7d677a73d662d3d9c80021ea4/raw/1631ad6d8b7b5d3df5a6d3ca13f427580b43e5b8/run-phanta.py"
chmod +x $HOME/bin/runPhanta.py
- Check that the script runs:
runPhanta.py -h
- Ensure you have the $PHANTA_DIR and $PHANTA_DB environment variables set:
# ":-" only prints the fallback text; ":=" would also assign it to the variable
echo "Phanta is in ${PHANTA_DIR:-NOT_INSTALLED}"
echo "Phanta database is in ${PHANTA_DB:-NOT_INSTALLED}"
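If either variable is missing, it can be exported in the current shell. In the sketch below the paths are placeholders, and `require_dir_var` is a hypothetical helper that only succeeds when the named variable is set and points to an existing directory. It also assumes that $PHANTA_DIR is the cloned repository and $PHANTA_DB the database directory, as the messages above suggest.

```shell
# Hypothetical helper: succeed only if the named variable is set
# and its value is an existing directory.
require_dir_var() (
  name=$1
  eval "value=\${$name:-}"   # read the variable whose name is stored in $name
  if [ -z "$value" ]; then
    echo "$name is not set" >&2
    exit 1
  fi
  if [ ! -d "$value" ]; then
    echo "$name=$value is not a directory" >&2
    exit 1
  fi
  echo "$name=$value"
)

# Example usage (placeholder paths, adapt to your system):
# export PHANTA_DIR=/path/to/phanta_repo
# export PHANTA_DB=/path-to/phanta_dbs/default_V1
# require_dir_var PHANTA_DIR && require_dir_var PHANTA_DB
```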
- Check your input directory (with ls) to see the tags denoting forward and reverse reads.
By default the program assumes the tags are _1 and _2, but you can change them with the -f and -r options if the reads are named differently (for example _R1 and _R2).
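With many files, reading the output of ls can be tedious. As a rough sketch, the hypothetical helper below (not part of the wrapper) looks for file names containing _R1 or _1 followed by a dot or an underscore and prints the matching pair of tags. It only covers these two naming schemes.

```shell
# Hypothetical helper: guess the forward/reverse tags used in a reads directory.
# Only the "_R1/_R2" and "_1/_2" schemes are recognised.
guess_read_tags() (
  dir=$1
  for fwd in _R1 _1; do
    for f in "$dir"/*"$fwd"[._]*; do
      # An unmatched glob stays literal, so test that the file really exists
      if [ -e "$f" ]; then
        case $fwd in
          _R1) echo "_R1 _R2" ;;
          _1)  echo "_1 _2" ;;
        esac
        exit 0
      fi
    done
  done
  echo "No _1 or _R1 tag found in $dir" >&2
  exit 1
)

# Example usage:
# guess_read_tags /path/to/reads
```

If the function prints "_R1 _R2", pass those values to the -f and -r options described above.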
- Run the program:
The basic syntax of the wrapper is:
runPhanta.py -i ${VIR}/dataset-full/ -c 16 -o ~/phanta-out --verbose
Where:
- -i is the input directory (where the reads are)
- -o is the output directory (where the results will be stored)
- -c is the number of cores to use
- --verbose will keep us informed of the progress
The output
The output is a directory with two subdirectories:
- classification: the results from the taxonomy profiling of each sample
- final_merged_outputs: the combined tables
  - counts.txt: provides the number of fragments assigned to each taxon
  - relative_read_abundance.txt: same but normalized out of the total number of reads
  - relative_taxonomic_abundance.txt: same but abundances are corrected for genome length
  - total_reads.tsv: a table with the total number of reads per sample, useful for normalization purposes
The output files can be filtered at a desired taxonomic level with scripts provided in the repository.
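For a quick look at a merged table without extra tooling, standard shell utilities are enough. The sketch below assumes a tab-separated file with a single header line and one column per sample. The column layout is an assumption here, so inspect the header first (for example with head -n 1). `top_taxa` is a hypothetical helper, not one of the repository scripts.

```shell
# Hypothetical helper: print the N rows with the highest value in one column
# of a tab-separated table that has a single header line.
top_taxa() (
  table=$1 col=$2 n=${3:-10}
  tab=$(printf '\t')
  tail -n +2 "$table" |                       # skip the header
    sort -t "$tab" -k "$col,${col}gr" |       # general numeric sort, descending
    head -n "$n" |
    cut -f "1,$col"                           # keep taxon and the chosen sample
)

# Example usage (the column number is an assumption, check the header first):
# top_taxa ~/phanta-out/final_merged_outputs/counts.txt 3 20
```

The general numeric sort (the g modifier) is used so that values in scientific notation, which may appear in the relative abundance tables, are ordered correctly.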
The programme
- EBAME-22 notes: EBAME-7 specific notes
- Gathering the reads: downloading and subsampling reads from public repositories (optional)
- Gathering the tools: we will use Miniconda to manage our dependencies
- Reads by reads profiling: using Phanta to quickly profile the bacterial and viral components of a microbial community
- De novo mining: assembly based approach, using VirSorter as an example miner
- Viral taxonomy: ab initio taxonomy profiling using vConTACT2
- MetaPhage overview: what is MetaPhage, a reads to report pipeline for viral metagenomics