Andrea Telatin Senior bioinformatician at the Quadram Institute Bioscience, Norwich.

Profiling viromes with Phanta

Phanta profiling requires these steps:

  1. Creating an environment with the required dependencies (only once)
  2. Cloning the Phanta repository (only once)
  3. Downloading the database (only once)
  4. Generating a “Sample Sheet” file
  5. Running Phanta

We already described the first two steps in the previous post, and the tool's repository also explains the whole process well.

Setup

:warning: See the EBAME-specific notes and skip this section

Phanta’s installation is not automated, but all its dependencies can be installed using an environment file for Miniconda.

# Get the Phanta repository and make its environment
git clone https://github.com/bhattlab/phanta.git
mamba env create -n phanta_env --file phanta/phanta_env.yaml
conda activate phanta_env

Databases have to be downloaded from Dropbox, and stored in a convenient location.

# Download the database, keep track of the location
mkdir -p phanta_dbs/default_V1
cd phanta_dbs/default_V1
wget https://www.dropbox.com/sh/3ktsdqlcph6x95r/AACGSj0sxYV6IeUQuGAFPtk8a/database_V1.tar.gz
tar xvzf database_V1.tar.gz

How to run Phanta (the manual way)

The command looks like this:

# Path to your clone of the Phanta repository (no trailing slash)
REPO_DIR=/path/to/phanta_repo
snakemake -s "$REPO_DIR/Snakefile" \
   --configfile "config.yaml" --jobs 99 --cores 1 --max-threads 90

So what do we need to put in the config.yaml file? Some boilerplate (as found in the template file) and the following paths:

database: /path/to/phanta_dbs/default_V1
outdir: /path/to/phanta-out
pipeline_directory: /path/to/phanta_repo
sample_file: /path/to/metadata.txt

Some elements are ready to go:

  • Database: the full path to the default_V1 directory downloaded during the setup
  • Outdir: the desired output directory (create it first)
  • Pipeline directory: the full path to Phanta’s repository
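
Before launching Snakemake, it can help to confirm that the four paths in the configuration actually exist. The following is only a sketch: config_value and check_config_paths are hypothetical helper names (not part of Phanta), and the parsing assumes plain one-line "key: value" entries like those shown above, not arbitrary YAML.

```shell
# Sketch: check that the path entries of a Phanta config.yaml exist on disk.
# Assumption: each entry is a simple one-line "key: value" pair.

# Print the value of a top-level "key: value" entry (hypothetical helper).
config_value() {
    local key="$1" file="$2"
    # Strip the key and any surrounding whitespace, keep the first match only
    sed -n "/^${key}:/{s/^${key}:[[:space:]]*//;s/[[:space:]]*\$//;p;}" "$file" | head -n 1
}

# Report each path entry; return non-zero if a key or a path is missing.
check_config_paths() {
    local file="$1" status=0 key value
    for key in database outdir pipeline_directory sample_file; do
        value="$(config_value "$key" "$file")"
        if [ -z "$value" ]; then
            echo "MISSING KEY: $key"
            status=1
        elif [ ! -e "$value" ]; then
            echo "NOT FOUND: $key -> $value"
            status=1
        else
            echo "OK: $key -> $value"
        fi
    done
    return "$status"
}
```

After sourcing these functions, running check_config_paths config.yaml prints one line per entry and returns a non-zero status if anything is missing.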

For the sample file, you can generate one with SeqFu:

# SeqFu can be installed via:
# conda install -c bioconda -c conda-forge seqfu
seqfu metadata /path/to/reads | grep -v sample-id > /path/to/metadata.txt
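
A quick sanity check of the generated sample sheet can catch path problems before a long run. This sketch assumes the sheet has no header and three tab-separated columns (sample name, forward reads, reverse reads), which is what the command above is expected to produce for paired-end data; check_sample_sheet is a hypothetical helper name.

```shell
# Sketch: validate a sample sheet before running Phanta.
# Assumption: one sample per line, as sample<TAB>forward<TAB>reverse, no header.
check_sample_sheet() {
    local sheet="$1" status=0 line=0 sample fwd rev extra f
    local tab
    tab="$(printf '\t')"
    # The "|| [ -n ... ]" part also handles a last line without a newline
    while IFS="$tab" read -r sample fwd rev extra || [ -n "$sample" ]; do
        line=$((line + 1))
        if [ -z "$sample" ] || [ -z "$fwd" ] || [ -z "$rev" ] || [ -n "$extra" ]; then
            echo "line $line: expected 3 tab-separated fields"
            status=1
            continue
        fi
        for f in "$fwd" "$rev"; do
            if [ ! -f "$f" ]; then
                echo "line $line: file not found: $f"
                status=1
            fi
        done
    done < "$sheet"
    echo "checked $line line(s)"
    return "$status"
}
```

For example, check_sample_sheet /path/to/metadata.txt reports malformed lines and missing read files, and returns a non-zero status if it finds any.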

Autorun

The previous command can be automated with a wrapper script that runs it for us, but it is still important to check that the environment is activated first.
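
One way to check the environment from the shell is to look at CONDA_DEFAULT_ENV, which conda sets on activation. This is a minimal sketch: require_conda_env is a hypothetical helper name, and phanta_env is simply the environment name used in the setup above.

```shell
# Sketch: fail early if the expected Conda environment is not active.
# CONDA_DEFAULT_ENV holds the name of the currently active environment.
require_conda_env() {
    local expected="${1:-phanta_env}"
    if [ "${CONDA_DEFAULT_ENV:-}" != "$expected" ]; then
        echo "Environment '$expected' is not active (current: ${CONDA_DEFAULT_ENV:-none})" >&2
        return 1
    fi
    echo "Environment '$expected' is active"
}
```

A typical use would be require_conda_env phanta_env || conda activate phanta_env.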

  1. Get the script:
# Make sure the destination exists (and that $HOME/bin is in your PATH)
mkdir -p "$HOME/bin"
wget -O "$HOME/bin/runPhanta.py" "https://gist.githubusercontent.com/telatin/4f404fc7d677a73d662d3d9c80021ea4/raw/1631ad6d8b7b5d3df5a6d3ca13f427580b43e5b8/run-phanta.py"
chmod +x "$HOME/bin/runPhanta.py"
  2. Check that the script runs:
runPhanta.py -h
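  3. Ensure you have the $PHANTA_DIR and $PHANTA_DB environment variables set:
# Print each variable, or NOT_INSTALLED if it is unset (:- prints the placeholder without assigning it)
echo "Phanta is in ${PHANTA_DIR:-NOT_INSTALLED}"
echo "Phanta database is in ${PHANTA_DB:-NOT_INSTALLED}"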
  4. Check your input directory (with ls) to see which tags denote the forward and reverse reads.

By default the program assumes the tags are _1 and _2. If the reads are named differently (for example _R1 and _R2), change the tags with the -f and -r options.

  5. Run the program:

The basic syntax of the wrapper is:

runPhanta.py -i ${VIR}/dataset-full/ -c 16 -o ~/phanta-out --verbose

Where:

  • -i is the input directory (where the reads are)
  • -o is the output directory (where the results will be stored)
  • -c is the number of cores to use
  • --verbose will keep us informed of the progress
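
The checks from the steps above (environment variables and read tags) can be scripted before calling the wrapper. This is a sketch under assumptions: check_dir_var and count_tag are hypothetical helper names, and the tag is assumed to be followed by a dot or an underscore in the file name (as in sample_1.fastq.gz or sample_R1_001.fastq.gz).

```shell
# Sketch: pre-flight checks before running the wrapper.

# Check that a variable's value is set and points to a directory.
# Usage: check_dir_var PHANTA_DIR "${PHANTA_DIR:-}"
check_dir_var() {
    local name="$1" value="$2"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    elif [ ! -d "$value" ]; then
        echo "$name is not a directory: $value"
        return 1
    fi
    echo "$name is $value"
}

# Count the files in a directory whose name carries a given tag (e.g. _1 or _R1).
count_tag() {
    local dir="$1" tag="$2"
    find "$dir" -maxdepth 1 -type f -name "*${tag}[._]*" | wc -l | tr -d ' '
}
```

For instance, if count_tag /path/to/reads _1 prints 0 while count_tag /path/to/reads _R1 prints the expected number of samples, the wrapper needs the -f and -r options.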

The output

The output is a directory with two subdirectories:

  • classification: the results from the taxonomy profiling of each sample
  • final_merged_outputs: the combined tables
    • counts.txt: provides the number of fragments assigned to each taxon
    • relative_read_abundance.txt: the same counts, normalized by the total number of reads
    • relative_taxonomic_abundance.txt: the same relative abundances, corrected for genome length
    • total_reads.tsv: a table with the total number of reads per sample, useful for normalization purposes

The output files can be filtered at a desired taxonomic level with scripts provided in the repository.
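
To confirm that a run completed, one option is to check that the merged tables listed above exist and are not empty. This is a minimal sketch: check_phanta_output is a hypothetical helper name, and the file names are those listed in this section.

```shell
# Sketch: confirm the merged tables exist in a Phanta output directory.
check_phanta_output() {
    local outdir="$1" status=0 name path
    for name in counts.txt relative_read_abundance.txt \
                relative_taxonomic_abundance.txt total_reads.tsv; do
        path="$outdir/final_merged_outputs/$name"
        if [ -s "$path" ]; then
            # Report the number of lines as a rough size indicator
            echo "OK: $name ($(wc -l < "$path" | tr -d ' ') lines)"
        else
            echo "MISSING OR EMPTY: $name"
            status=1
        fi
    done
    return "$status"
}
```

For example, check_phanta_output ~/phanta-out lists each table with its line count and returns a non-zero status if one is missing.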


The programme

  • :zero: EBAME-22 notes: EBAME-7 specific notes
  • :one: Gathering the reads: downloading and subsampling reads from public repositories (optional)
  • :two: Gathering the tools: we will use Miniconda to manage our dependencies
  • :three: Reads by reads profiling: using Phanta to quickly profile the bacterial and viral components of a microbial community
  • :four: De novo mining: assembly based approach, using VirSorter as an example miner
  • :five: Viral taxonomy: ab initio taxonomy profiling using vConTACT2
  • :six: MetaPhage overview: what is MetaPhage, a reads to report pipeline for viral metagenomics
