Andrea Telatin, Senior bioinformatician at the Quadram Institute Bioscience, Norwich.

A simple workflow with USEARCH

What is USEARCH

USEARCH is a popular package for metabarcoding analyses developed by Robert Edgar, and (partially) described in a set of papers.

This workflow gives you direct contact with each intermediate file. You are strongly encouraged to check, inspect and manipulate each output file.

We assume:

You downloaded the raw reads (“Mothur SOP”)
You performed a first QC and evaluated the number of reads per sample. See the day1 page for more details.

Preprocessing

USEARCH does not accept compressed files as input, so you will need to gunzip any compressed files.
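
As a minimal sketch of this step, the helper below decompresses every .gz file in a directory while keeping the compressed originals. The function name decompress_reads and the default directory reads are illustrative assumptions; gzip -dc is used instead of gunzip -k because the -k flag is missing from older gzip releases.

```shell
# Decompress every *.gz file in a directory, keeping the compressed originals.
# decompress_reads is a hypothetical helper name; "reads" is an assumed default.
decompress_reads() {
    dir="${1:-reads}"
    for gz in "$dir"/*.gz; do
        [ -e "$gz" ] || continue          # no match: the glob stays literal, skip it
        gzip -dc "$gz" > "${gz%.gz}"      # write sample.fastq next to sample.fastq.gz
    done
}
```

It would be called as decompress_reads reads before running any USEARCH command.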

Merge the paired ends

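The first step is to merge the paired ends with fastq_mergepairs. Only the forward (R1) files are listed: when -reverse is omitted, USEARCH derives each reverse file name by replacing R1 with R2. The -relabel @ option renames each merged read after its sample, taking the sample name from the file name.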

usearch -fastq_mergepairs reads/*R1* -relabel @ -fastq_maxdiffs 20 -fastqout merge.fq -threads 12

We can check the average merged read size with SeqFu:

seqfu stats --nice merge.fq

Check (for example using seqfu head and seqfu tail) that the reads have been relabeled with the sample name as a prefix.
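
A quick way to verify the relabeling on the whole file is to count reads per sample. This is a sketch that assumes the labels look like sample.number (the form produced by -relabel @) and that the FASTQ has exactly four lines per record; reads_per_sample is a hypothetical helper name.

```shell
# Count merged reads per sample in a FASTQ relabeled as "<sample>.<n>".
# Assumes 4-line FASTQ records; reads_per_sample is a hypothetical name.
reads_per_sample() {
    awk 'NR % 4 == 1 {
             label = substr($1, 2)            # drop the leading "@"
             sub(/\.[0-9]+$/, "", label)      # drop the trailing ".<read number>"
             count[label]++
         }
         END { for (s in count) print s "\t" count[s] }' "$1" | sort
}
```

Running reads_per_sample merge.fq should print one line per sample; a label that does not look like a sample name suggests the relabeling did not work as expected.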

Quality filter

To remove low-quality reads, we can use fastq_filter. Here we set a maximum number of expected errors (calculated from the quality scores) and a minimum length (on the assumption that the 16S amplicon is well conserved in length, so a large deviation is usually due to errors).

usearch -fastq_filter merge.fq -relabel filt -fastq_maxee 0.7 -fastq_minlen 200 -fastq_maxns 0 -fastaout filtered.fa -threads 12
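
To make the expected errors criterion concrete: the expected number of errors of a read is the sum, over its bases, of the error probability 10^(-Q/10), where Q is the Phred score of the base. The sketch below computes it for each read, assuming Phred+33 encoding and four-line FASTQ records. It is an illustration of the formula, not USEARCH's own code, and expected_errors is a hypothetical helper name.

```shell
# Print the expected number of errors for each record of a 4-line FASTQ file.
# Assumes Phred+33 quality encoding; expected_errors is a hypothetical helper name.
expected_errors() {
    awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 1 { name = substr($1, 2) }
         NR % 4 == 0 {
             ee = 0
             for (i = 1; i <= length($0); i++) {
                 q = ord[substr($0, i, 1)] - 33      # Phred score of base i
                 ee += 10 ^ (-q / 10)                # error probability of base i
             }
             printf "%s\t%.4f\n", name, ee
         }' "$1"
}
```

Reads whose value exceeds the threshold given to -fastq_maxee (0.7 in the command above) are the ones the filter discards.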

USEARCH always prints detailed statistics, but try comparing the number of merged reads with the number of filtered reads, for example with seqfu stats -n merge.fq filtered.fa.

Dereplication (unique)

We need to discard the duplicate reads with fastx_uniques, but we must keep track of how many duplicates each read had (-sizeout):

usearch -fastx_uniques filtered.fa -fastaout uniq.fa -sizeout

Check, for example with seqfu head ..., that the unique sequences have the “size” (i.e. how many identical sequences have been found) in the sequence name.
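
As a sanity check, the size annotations can be summed: with the command above, the total should match the number of reads in filtered.fa. The sketch assumes headers of the form >label;size=N; and uses the hypothetical helper name total_size.

```shell
# Sum the ";size=N" annotations in a dereplicated FASTA file.
# total_size is a hypothetical helper name; headers without a size annotation are ignored.
total_size() {
    awk '/^>/ && match($0, /;size=[0-9]+/) {
             total += substr($0, RSTART + 6, RLENGTH - 6)   # digits after ";size="
         }
         END { print total + 0 }' "$1"
}
```

Compare the output of total_size uniq.fa with the read count reported for filtered.fa.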

Representative sequences (ASVs)

USEARCH is best known for its OTU clustering algorithm, but recent versions also ship a denoising method called UNOISE3, so we can perform either OTU picking or ASV detection. We will try the latter.

usearch -unoise3 uniq.fa -zotus asv.fa

How many ASVs (or ZOTUs, in USEARCH jargon) have you identified?
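
One way to answer is to count the header lines of the FASTA file. A minimal sketch, where count_seqs is a hypothetical helper name:

```shell
# Count the sequences in a FASTA file by counting its header lines.
# count_seqs is a hypothetical helper name.
count_seqs() {
    grep -c '^>' "$1"
}
```

For example, count_seqs asv.fa prints the number of ZOTUs.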

OTU Table

The feature table is generated by mapping the merged reads to the representative sequences. This requires that the merged reads were relabeled with the sample ID prepended to each read name (see the merging step with -relabel @). This can be a time-consuming step, so if possible use the maximum number of threads available.

usearch -otutab merge.fq -db asv.fa -otutabout otutab_raw.tsv -threads 16
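
To inspect the result, it helps to sum the counts of each sample column. The sketch below assumes a tab-separated table whose first row is a header (a first column for the feature ID, then one column per sample); sample_totals is a hypothetical helper name.

```shell
# Print the total read count per sample from a tab-separated OTU table
# whose first row is a header (feature ID column, then one column per sample).
# sample_totals is a hypothetical helper name.
sample_totals() {
    awk -F '\t' 'NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
                 { for (i = 2; i <= n; i++) total[i] += $i }
                 END { for (i = 2; i <= n; i++) print name[i] "\t" (total[i] + 0) }' "$1"
}
```

Running sample_totals otutab_raw.tsv shows how many reads were assigned per sample, which can be compared with the per-sample counts of the merged reads.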

What to do next?

USEARCH can be used for:

Taxonomy classification (requires formatting of the database or downloading one)
Diversity analysis
and more…

01 Feb 2021
