Andrea Telatin
Senior bioinformatician at the Quadram Institute Bioscience, Norwich.

Get FASTQ reads from NCBI: an automated workflow

The problem

The NCBI SRA database is a treasure trove of sequencing data, but it is not always easy to download the data you need. In this post we will see how to automate the download of FASTQ files from the SRA database using a Nextflow pipeline called getreads.

Being written in Nextflow, the pipeline can handle for us:

  • the dependencies (either via conda or docker)
  • the processes (tasks can run in parallel and even on a distributed cluster if available)
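
For example, running on a cluster only requires a small Nextflow configuration file. A minimal sketch, assuming a SLURM scheduler and a queue called compute (both the executor and the queue name are assumptions to adapt to your system):

process {
    executor = 'slurm'
    queue    = 'compute'   // hypothetical queue name: use your cluster's queue
}

Save this as nextflow.config in the launch directory, and Nextflow will submit each task as a cluster job instead of running it locally.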

The solution

The getreads pipeline is available on GitHub, but it can also be fetched and installed automatically by Nextflow.
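
For example, you can fetch (or update) the pipeline ahead of time with Nextflow’s built-in pull command:

nextflow pull telatin/getreads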

What you need is a machine with:

  • Nextflow installed (can be installed via conda)
conda install --yes -c bioconda -c conda-forge nextflow
  • Either conda or docker installed, to handle the dependencies
    • Try typing docker --version to check if you have docker installed: it’s the preferred way to run the pipeline
    • Try typing conda --version to check if you have conda installed: it’s the fallback option if you don’t have docker
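
A quick way to check both from the terminal (a plain shell one-liner, nothing pipeline-specific):

# Prints the version of whichever tool is available
docker --version 2>/dev/null || conda --version 2>/dev/null || echo "neither docker nor conda found"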

How to use the pipeline

  1. Create a text file with a list of SRA accessions (one per line), for example:
SRR19440534
SRR19440543
  2. Run the pipeline like:
nextflow run telatin/getreads -r main --list list.txt -profile docker
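
Putting the two steps together, a complete session from an empty directory could look like this (same accessions as above):

# 1. Create the list of SRA accessions, one per line
printf 'SRR19440534\nSRR19440543\n' > list.txt

# 2. Launch the pipeline (use -profile conda if Docker is not available)
nextflow run telatin/getreads -r main --list list.txt -profile docker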

The first time you run this command, Nextflow will download the pipeline and all its dependencies, and then execute it on the accessions listed in list.txt.
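
A handy Nextflow feature here: if a run is interrupted, you can relaunch it with the standard -resume flag, and the tasks that already completed will be reused from the cache:

nextflow run telatin/getreads -r main --list list.txt -profile docker -resume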

The pipeline will create a folder called getreads with the pipeline output files (this can be changed with the --outdir DirectoryName option). A subfolder called reads will contain the FASTQ files.
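
Once the run is complete you can inspect the downloaded reads; a sketch of what this could look like (the exact file names depend on the accessions and on the layout of each run, so they are shown purely as an illustration):

ls getreads/reads/
# e.g. SRR19440534_1.fastq.gz  SRR19440534_2.fastq.gz  SRR19440543_1.fastq.gz  SRR19440543_2.fastq.gz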

💡 If you don’t have Docker installed, pass -profile conda instead

Other parameters

The pipeline can be customized with the following parameters:

  • --max_cpus: maximum number of CPUs to use (default: 8)
  • --max_memory: maximum amount of memory to use (default: 16 GB)
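
For example, to cap the resources on a smaller machine and pick a custom output folder (the values are arbitrary, and the exact memory string format may differ: check the pipeline documentation):

nextflow run telatin/getreads -r main --list list.txt -profile docker \
    --max_cpus 4 --max_memory '8.GB' --outdir my_reads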

See the GitHub page for more information.