seqfu grep

grep is one of the core subprograms of SeqFu.

It can be used to select sequences by name, by comment, or by sequence content, the latter using an IUPAC degenerate oligonucleotide as the query.

Usage: grep [options] [<inputfile> ...]

Print sequences selected if they match patterns or contain oligonucleotides

Options:
  -n, --name STRING      String required in the sequence name
  -r, --regex PATTERN    Pattern to be matched in sequence name
  -c, --comment          Also search -n and -r in the comment
  -f, --full             The string or pattern covers the whole name
                         (mainly used without -c)
  -w, --word             The string or pattern is a whole word
                         (only effective with -c, as names do not contain spaces)
  -i, --ignore-case      Ignore case when matching names (is already enabled with regexes)
  -o, --oligo IUPAC      Oligonucleotide required in the sequence,
                         using ambiguous bases and reverse complement
  -A, --append-pos       Append matching positions to the sequence comment
  --max-mismatches INT   Maximum mismatches allowed [default: 0]
  --min-matches INT      Minimum number of matches [default: oligo-length]
  -v, --verbose          Verbose output
  --help                 Show this help

Get sequences by name

In a sequence record, the name (or ID) is the string before the first whitespace character; everything after it is the comment:

>Seq_Name_Here  after the name or ID, everything else is the comment
ATTACAAACAGTCGATCGTAGCTAGCTAGCTGATC
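
The name/comment split described above can be illustrated with a short Python sketch. This is only an illustration of the definition, not SeqFu's own code; the function name split_header is hypothetical.

```python
def split_header(header):
    """Split a FASTA header line into (name, comment).

    Hypothetical helper: the name is everything before the first
    whitespace character, the comment is everything after it.
    """
    # Drop the leading '>' marker if present, then surrounding whitespace.
    text = header[1:] if header.startswith(">") else header
    parts = text.strip().split(None, 1)
    name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return name, comment


name, comment = split_header(
    ">Seq_Name_Here  after the name or ID, everything else is the comment"
)
print(name)     # Seq_Name_Here
print(comment)  # after the name or ID, everything else is the comment
```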

To extract all the sequences containing "Here" in the name:

seqfu grep -n Here file.fasta

If we also want to extend the search to comments we need to add the -c (or --comment) switch:

seqfu grep -c -n extend file.fasta

Finally, regular expressions are supported when the pattern is passed with -r (or --regex) instead of -n:

seqfu grep -r Seq_N..._ file.fasta
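
The selection modes above (plain string, optional comment search, regular expression) can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not SeqFu's implementation: the function name_matches and its parameters are hypothetical, and the case-insensitive regex behaviour is inferred from the -i help text.

```python
import re


def name_matches(name, comment, query, use_regex=False, search_comment=False):
    """Return True if query is found in the name (and optionally the comment)."""
    # With search_comment, the comment is searched in addition to the name.
    targets = [name, comment] if search_comment else [name]
    if use_regex:
        # Assumption: regex matching ignores case, as the -i help text suggests.
        pattern = re.compile(query, re.IGNORECASE)
        return any(pattern.search(t) is not None for t in targets)
    # Plain mode: simple substring search.
    return any(query in t for t in targets)


name = "Seq_Name_Here"
comment = "after the name or ID, everything else is the comment"

print(name_matches(name, comment, "Here"))                       # True
print(name_matches(name, comment, "else"))                       # False
print(name_matches(name, comment, "else", search_comment=True))  # True
print(name_matches(name, comment, "Seq_N..._", use_regex=True))  # True
```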

Matching patterns in DNA sequences

A simple text search, even with regular expressions, is rarely a practical way to identify matches in a DNA/RNA sequence.

With the -o (--oligo) parameter, each sequence is scanned for an oligonucleotide. The search supports IUPAC degenerate bases, matches on the reverse complement strand, and partial matches.

>Example
CAGATAAAA

If we scan for TTTT, the sequence matches because TTTT occurs on the reverse complement strand:

seqfu grep -o TTTT file.fasta

We can also use IUPAC bases (N for any base, B for C, G or T…):

seqfu grep -o TTTTNT file.fasta
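
To make the matching rules concrete, here is a minimal Python sketch of an IUPAC-aware oligo scan on both strands with a mismatch budget. It is an illustration only and does not reproduce SeqFu's algorithm or output format; the names find_oligo and grep_oligo are hypothetical, and the sketch assumes the sequence itself contains only A, C, G, T (or U).

```python
# IUPAC codes mapped to the set of bases each one accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTURYSWKMBDHVN", "TGCAAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_oligo(seq, oligo, max_mismatches=0):
    """Return 0-based start positions in seq where oligo matches."""
    seq = seq.upper().replace("U", "T")
    oligo = oligo.upper()
    hits = []
    for start in range(len(seq) - len(oligo) + 1):
        window = seq[start:start + len(oligo)]
        # A position mismatches when the base is not accepted by the IUPAC code.
        mismatches = sum(
            base not in IUPAC.get(code, "") for base, code in zip(window, oligo)
        )
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def grep_oligo(seq, oligo, max_mismatches=0):
    """Scan both strands; reverse hits use reverse-complement coordinates."""
    return {
        "forward": find_oligo(seq, oligo, max_mismatches),
        "reverse": find_oligo(reverse_complement(seq), oligo, max_mismatches),
    }


print(reverse_complement("CAGATAAAA"))       # TTTTATCTG
print(grep_oligo("CAGATAAAA", "TTTT"))       # {'forward': [], 'reverse': [0]}
print(grep_oligo("CAGATAAAA", "TTTTNT"))     # {'forward': [], 'reverse': [0]}
```

Raising max_mismatches in this sketch plays the role that --max-mismatches plays on the command line: windows that differ from the oligo in a few positions are still reported.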
