seqfu tofasta

Converts various sequence file formats to FASTA format.

Introduced in SeqFu 1.23.0

A versatile format converter supporting multiple bioinformatics file formats including sequence alignments, genome annotations, and assembly graphs. The tool automatically detects the input format and converts sequences to standard FASTA format.

It follows the same approach as any2fasta, but runs faster.

Usage: tofasta [options] <inputfile>...

Convert various sequence formats to FASTA format.

Options:
  -n, --replace-iupac    Replace non-IUPAC characters with 'N'
  -l, --to-lowercase     Convert sequences to lowercase
  -u, --to-uppercase     Convert sequences to uppercase
  -o, --output FILE      Write output to FILE (default: stdout)
                         Note: checks for duplicate IDs across all files
  -v, --verbose          Print progress information to stderr
  -h, --help             Show this help

Supported Formats

seqfu tofasta automatically detects and converts the following formats:

Format      Description                   Notes
FASTA       Standard FASTA format         Pass-through with optional transformations
FASTQ       Sanger, Illumina, Solexa      Quality scores are discarded
GenBank     NCBI GenBank flat file        Extracts sequences from ORIGIN section
EMBL        EMBL-Bank format              Extracts sequences from SQ section
GFF         Generic Feature Format        Extracts sequences after ##FASTA directive
GFA         Graphical Fragment Assembly   Extracts sequences from S (segment) lines
Clustal     Clustal W alignment           Preserves gaps in sequences
Stockholm   Stockholm alignment           Converts ‘.’ gaps to ‘-’

Options

-n, --replace-iupac Replaces characters outside the basic alphabet (A, T, G, C, N) with ‘N’. This includes the IUPAC ambiguity codes (R, Y, W, S, K, M, etc.), which are all replaced with ‘N’.

-l, --to-lowercase Converts all sequence characters to lowercase. Useful for soft-masking or compatibility with tools that expect lowercase input.

-u, --to-uppercase Converts all sequence characters to uppercase. Takes precedence over -l if both are specified.

-o, --output FILE Writes all sequences to a single output file instead of stdout. When using this option, the tool performs duplicate ID checking across all input files and will exit with an error if duplicate sequence IDs are found.

-v, --verbose Prints progress information to stderr.

Examples

Basic Format Conversion

Convert a GenBank file to FASTA:

seqfu tofasta genome.gbk > genome.fasta

Convert a FASTQ file (quality scores are discarded):

seqfu tofasta reads.fastq.gz > reads.fasta

Multiple Files

Process multiple files and combine into one FASTA file:

seqfu tofasta -o combined.fasta file1.gbk file2.gff file3.fastq.gz

Replace ambiguous IUPAC codes with N and convert the sequences to uppercase:

seqfu tofasta -n -u sequences.fasta > clean.fasta
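
The combined effect of -n, -u and -l on a single sequence can be pictured with a short Python sketch. This is an illustration of the behaviour described under Options, not the tool's actual code: the function name transform_sequence is hypothetical, and the sketch assumes that lowercase a, c, g, t and n are also kept. It would also turn gap characters into N, which may or may not match what the tool does with alignments.

```python
import re

# Assumption: only A, C, G, T and N (in either case) are preserved by -n.
# Everything else, including gap characters in this sketch, becomes 'N'.
_NOT_BASIC = re.compile(r"[^ACGTNacgtn]")


def transform_sequence(seq, replace_iupac=False, to_upper=False, to_lower=False):
    """Apply the -n, -u and -l transformations to one sequence string."""
    if replace_iupac:
        seq = _NOT_BASIC.sub("N", seq)
    if to_upper:
        # -u takes precedence over -l when both are given
        seq = seq.upper()
    elif to_lower:
        seq = seq.lower()
    return seq


if __name__ == "__main__":
    print(transform_sequence("acgtRYkm", replace_iupac=True, to_upper=True))
```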

Input Handling

  • Gzip Support: All input files can be gzip-compressed (.gz extension)
  • No STDIN: Currently, tofasta requires file arguments and does not support reading from standard input
  • Format Detection: File format is automatically detected from content, not file extension
  • Error Handling: The tool is strict and exits with an error on unrecognized formats or malformed input
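
Content-based detection with transparent gzip handling could look roughly like the Python sketch below. The signature table is an assumption chosen for illustration (typical first-line markers of each format) and is not taken from the SeqFu source; sniff_format is a hypothetical name.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

# Hypothetical first-line signatures; the real tool may use different rules.
SIGNATURES = [
    (">", "FASTA"),
    ("@", "FASTQ"),
    ("LOCUS", "GenBank"),
    ("ID   ", "EMBL"),
    ("##gff-version", "GFF"),
    ("CLUSTAL", "Clustal"),
    ("# STOCKHOLM", "Stockholm"),
    ("H\t", "GFA"),
    ("S\t", "GFA"),
]


def sniff_format(data):
    """Guess the format of raw file bytes from content, not from the file name."""
    if data.startswith(GZIP_MAGIC):
        data = gzip.decompress(data)
    first_line = data.decode("ascii", errors="replace").split("\n", 1)[0]
    for prefix, name in SIGNATURES:
        if first_line.startswith(prefix):
            return name
    # Strict behaviour: unknown content is an error, not a silent skip
    raise ValueError("unknown format: %r" % first_line[:20])
```

Because the decision is made on the bytes themselves, a gzip-compressed FASTQ file named reads.txt would still be recognized in this sketch.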

Duplicate ID Detection

When using -o/--output, the tool checks for duplicate sequence IDs across all input files:

# This will fail if seq1.gbk and seq2.gbk contain sequences with the same ID
seqfu tofasta -o combined.fasta seq1.gbk seq2.gbk

Error message example:

ERROR: Duplicate sequence ID found: NZ_12345
  First occurrence in a previous file
  Second occurrence in: seq2.gbk
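
A duplicate check of this kind needs little more than a set of IDs seen so far. The Python sketch below is a minimal illustration under that assumption; check_duplicates is a hypothetical helper, and unlike the documented behaviour (which is described only as a check across files) it would also flag an ID repeated within one file.

```python
def check_duplicates(ids_per_file):
    """ids_per_file: iterable of (filename, list_of_ids) pairs, in input order.

    Returns the set of all IDs, or raises ValueError on the first repeat.
    """
    seen = set()
    for filename, ids in ids_per_file:
        for seq_id in ids:
            if seq_id in seen:
                raise ValueError(
                    "Duplicate sequence ID found: %s (second occurrence in: %s)"
                    % (seq_id, filename)
                )
            seen.add(seq_id)
    return seen
```

Only the IDs are kept in memory, so the check stays cheap even when the sequences themselves are large.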

Format-Specific Behavior

GenBank/EMBL:

  • Extracts accession number as sequence ID
  • Extracts sequence from ORIGIN/SQ section
  • Handles multiple records in a single file
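
The GenBank side of these rules can be sketched in Python as follows. It is a simplified illustration, not the tool's parser: genbank_to_records is a hypothetical name, the first token after ACCESSION is assumed to be the ID, and EMBL input (ID and SQ lines) is not covered.

```python
def genbank_to_records(text):
    """Return a list of (accession, sequence) pairs from GenBank flat-file text."""
    records = []
    accession, in_origin, chunks = None, False, []
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else None
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # '//' terminates a record, so several records per file are handled
            records.append((accession, "".join(chunks)))
            accession, in_origin, chunks = None, False, []
        elif in_origin:
            # Sequence lines look like '        1 acgtacgt ...': keep letters only
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return records
```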

GFF:

  • Only processes sequences after ##FASTA directive
  • Ignores feature annotations
  • Preserves sequence IDs from FASTA headers
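
In other words, everything before the ##FASTA line is skipped and the remainder is read as ordinary FASTA. A minimal Python sketch of that idea follows; gff_fasta_section is a hypothetical name, and taking the first whitespace-delimited word of the header as the ID is an assumption.

```python
def gff_fasta_section(text):
    """Return (id, sequence) pairs found after the ##FASTA directive of a GFF file."""
    records, name, chunks, in_fasta = [], None, [], False
    for line in text.splitlines():
        if not in_fasta:
            # Feature lines and other directives are ignored
            in_fasta = line.strip() == "##FASTA"
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None and line.strip():
            chunks.append(line.strip())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

A GFF file without a ##FASTA directive yields no records in this sketch.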

Clustal/Stockholm:

  • Preserves alignment gaps in output
  • Concatenates sequence fragments if alignment spans multiple blocks
  • Stockholm ‘.’ gaps are converted to ‘-’
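
For Stockholm input, the block concatenation and gap conversion can be sketched in Python as below. This is an illustrative reading of the rules above, not the tool's implementation: stockholm_to_records is a hypothetical name, markup lines starting with '#' are simply skipped, and only the first alignment in the text is read.

```python
def stockholm_to_records(text):
    """Return (name, gapped_sequence) pairs from Stockholm alignment text."""
    seqs = {}  # dicts keep insertion order, so output order follows the file
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines, the header and #=G* markup
        if line == "//":
            break  # end of the alignment
        parts = line.split()
        if len(parts) < 2:
            continue
        name, fragment = parts[0], "".join(parts[1:])
        # Append this block's fragment and normalise '.' gaps to '-'
        seqs[name] = seqs.get(name, "") + fragment.replace(".", "-")
    return list(seqs.items())
```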

GFA:

  • Only processes S (segment) lines
  • Uses segment name as sequence ID
  • Ignores paths, links, and other graph elements
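
Since GFA is tab-delimited, segment extraction reduces to filtering lines whose first field is S. The Python sketch below shows the idea; gfa_segments is a hypothetical name, and skipping segments whose sequence is the placeholder '*' is an assumption rather than documented behaviour.

```python
def gfa_segments(text):
    """Return (segment_name, sequence) pairs from the S lines of GFA text."""
    records = []
    for line in text.splitlines():
        fields = line.split("\t")
        # S lines are: S <name> <sequence> [optional tags...]
        if len(fields) >= 3 and fields[0] == "S":
            if fields[2] == "*":
                continue  # assumption: segments without a stored sequence are skipped
            records.append((fields[1], fields[2]))
    return records
```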