fu-nanotags

Note

Experimental utility. Feedback for this tool is welcome.

Search for tags (one or more sequences) in long reads using Smith-Waterman alignment. The tag has to be at the beginning of the read (specifying the region to scan with --cut INT) or at the end (revserse complemented). If --cut=0 the search is in the full read.

Usage: fu-nanotags [options] -q QUERY [<fastq-file>...]

  Options:
    -q, --query TAGSEQ         Sequence string OR file with the sequence(s) to align against reads
    -s, --showaln              Show graphical alignment
    -c, --cut INT              Cut input reads at INT position [default: 300]
    -x, --disable-rev-comp     Do not scan reverse complemented reads
  
  Alignment options:
    -i, --pct-id FLOAT         Percentage of identity in the aligned region [default: 80.0]
    -m, --min-score INT        Minimum alignment score (0 for auto) [default: 0]
  
  Smith-Waterman parameters:
    -M, --weight-match INT     Match [default: 5]
    -X, --weight-mismatch INT  Mismatch penalty [default: -3]
    -G, --weight-gap INT       Gap penalty [default: -5]

  Other options:
    --pool-size INT            Number of sequences/pairs to process per thread [default: 25]
    -v, --verbose              Verbose output
    -h, --help                 Show this help

Output

The program will print to the standard output the reasd containing the tag, under the specified alignment criteria. A comment will be added to the reads specifying which tag was found (e.g. tags=tag1;tag4).

The program will print to the standard error the number of passing reads per file processed, and the grand total.

Example:

tradis/fastq_1.fq	60.00% (18/30) sequences printed, of which 8 in reverse strand.
tradis/fastq_2.fq.gz	53.75% (2150/4000) sequences printed, of which 949 in reverse strand.
Total	53.80% (2168/4030) sequences printed, of which 957 in reverse strand.

Optimisation

If the tag is 100 bp long and we expect to be at the very beginning (or end) of the read, it's advisable to reduce the --cut INT parameter accordingly to speedup the alingment step (for example, to 110, to account for a small variation).

The current version of the program is single threaded, but a multithreading application will be released.

Example

fu-nanotags  -q tag.fa fastq-reads.fq.gz > passed.fq

To inspect the parameters, add --verbose --showaln, possibly redirecting the output to less -S for a preliminary inspection:

fu-nanotags -q tag.fa reads.fq.gz --verbose --showaln 2>&1 | less -S

A fraction of the output is like the following:

# a564e10b-c82e-4e59-98a4-fdc6f1b31acb:test-tag strand=-;score=167;pctid=94.57%
 < AATGATA-TGCGACCACTGAGATCTACACCTCTCTATACACTC-TT-CCTACACGACGCTCTTCCGATCTTTCGTACGTGAGTTTAAATGTATTTGGCTAAGGTGTATGTAAACTTCCGACTTCAACTG
 < |||||||  |||||||| ||||||||||||||||||||||||| || ||||||||||||||||||||||  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 < AATGATACGGCGACCACCGAGATCTACACCTCTCTATACACTCTTTCCCTACACGACGCTCTTCCGATC--TCGTACGTGAGTTTAAATGTATTTGGCTAAGGTGTATGTAAACTTCCGACTTCAACTG
# 132518b1-1522-45ed-9a77-94a3c981ac20:test-tag strand=+;score=531;pctid=90.55%
 > AATGATACGGCGACCACCGAGATCTACACTATCCCTCTACACTCTTTCCCTACACGACGCTCTTCCGATCTACGTACGTGAGTTTAAATGT-GTTAGCTAAGGTGTATAT-AGCTTCCGACTTCAGC
 > |||||||||||||||||||||||||||||  || || |||||||||||||||||||||||||||||||||| |||||||||||||||||||  || |||||||||||| | | |||||||||||| |
 > AATGATACGGCGACCACCGAGATCTACAC-CTCTCTATACACTCTTTCCCTACACGACGCTCTTCCGATCT-CGTACGTGAGTTTAAATGTATTTGGCTAAGGTGTATGTAAACTTCCGACTTCAAC