seqfu tofasta
Converts various sequence file formats to FASTA format.
Introduced in SeqFu 1.23.0
A versatile format converter supporting multiple bioinformatics file formats including sequence alignments, genome annotations, and assembly graphs. The tool automatically detects the input format and converts sequences to standard FASTA format.
It replicates the approach of any2fasta, but runs faster.
Usage: tofasta [options] <inputfile>...
Convert various sequence formats to FASTA format.
Options:
-n, --replace-iupac Replace non-IUPAC characters with 'N'
-l, --to-lowercase Convert sequences to lowercase
-u, --to-uppercase Convert sequences to uppercase
-o, --output FILE Write output to FILE (default: stdout); checks for duplicate IDs across all files
-v, --verbose Print progress information to stderr
-h, --help Show this help
Supported Formats
seqfu tofasta automatically detects and converts the following formats:
| Format | Description | Notes |
|---|---|---|
| FASTA | Standard FASTA format | Pass-through with optional transformations |
| FASTQ | Sanger, Illumina, Solexa | Quality scores are discarded |
| GenBank | NCBI GenBank flat file | Extracts sequences from ORIGIN section |
| EMBL | EMBL-Bank format | Extracts sequences from SQ section |
| GFF | Generic Feature Format | Extracts sequences after ##FASTA directive |
| GFA | Graphical Fragment Assembly | Extracts sequences from S (segment) lines |
| Clustal | Clustal W alignment | Preserves gaps in sequences |
| Stockholm | Stockholm alignment | Converts ‘.’ gaps to ‘-’ |
Options
-n, --replace-iupac Replaces every character other than the unambiguous bases and N with ‘N’. A, T, G, C, and N are preserved, while IUPAC ambiguity codes (R, Y, W, S, K, M, etc.) are replaced with ‘N’.
-l, --to-lowercase Converts all sequence characters to lowercase. Useful for soft-masking or compatibility with tools that expect lowercase input.
-u, --to-uppercase Converts all sequence characters to uppercase. Takes precedence over -l if both are specified.
-o, --output FILE Writes all sequences to a single output file instead of stdout. When using this option, the tool performs duplicate ID checking across all input files and will exit with an error if duplicate sequence IDs are found.
-v, --verbose Prints progress information to stderr.
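The effect of -n, -l, and -u on a single sequence can be sketched in a few lines of Python. This is an illustration of the behavior described above, not SeqFu's implementation. The function name `transform` and its parameters are hypothetical, and the handling of lowercase input under -n (case is kept) is an assumption, as is the lack of any special treatment for gap characters.

```python
def transform(seq: str, replace_iupac: bool = False,
              to_lower: bool = False, to_upper: bool = False) -> str:
    """Apply -n, -l and -u style transformations to one sequence string."""
    if replace_iupac:
        # Assumption: A, C, G, T, N are kept in either case;
        # every other character becomes 'N'.
        keep = set("ACGTNacgtn")
        seq = "".join(c if c in keep else "N" for c in seq)
    if to_upper:
        # -u takes precedence when both -u and -l are given.
        return seq.upper()
    if to_lower:
        return seq.lower()
    return seq


print(transform("ACGTRYacgtn", replace_iupac=True))
print(transform("acgtRY", replace_iupac=True, to_lower=True, to_upper=True))
```

Checking `to_upper` before `to_lower` is what encodes the documented precedence of -u over -l.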
Examples
Basic Format Conversion
Convert a GenBank file to FASTA:
seqfu tofasta genome.gbk > genome.fasta
Convert a FASTQ file (quality scores are discarded):
seqfu tofasta reads.fastq.gz > reads.fasta
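To make "quality scores are discarded" concrete, here is a minimal Python sketch of a FASTQ to FASTA conversion. It assumes strict four-line records and keeps the full header text after the ‘@’; whether SeqFu keeps the header comment is not stated here, and `fastq_to_fasta` is a hypothetical name.

```python
def fastq_to_fasta(lines):
    """Yield (header, sequence) from 4-line FASTQ records, dropping qualities."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("truncated FASTQ: record count is not a multiple of 4")
    for i in range(0, len(rows), 4):
        header, seq, plus, _quality = rows[i:i + 4]
        # Records are located by position, because a quality line
        # may itself start with '@'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near line {i + 1}")
        yield header[1:], seq


reads = ["@r1 sample=A\n", "ACGT\n", "+\n", "IIII\n"]
for name, seq in fastq_to_fasta(reads):
    print(f">{name}\n{seq}")
```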
Multiple Files
Process multiple files and combine into one FASTA file:
seqfu tofasta -o combined.fasta file1.gbk file2.gff file3.fastq.gz
Replace ambiguous IUPAC codes with N and convert sequences to uppercase:
seqfu tofasta -n -u sequences.fasta > clean.fasta
Input Handling
- Gzip Support: All input files can be gzip-compressed (.gz extension)
- No STDIN: Currently, tofasta requires file arguments and does not support reading from standard input
- Format Detection: File format is automatically detected from content, not file extension
- Error Handling: The tool is strict and will exit with an error on unknown or malformed formats
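Content-based detection can be pictured as follows. This Python sketch is an assumption about how such detection might work, not SeqFu's actual logic: it checks the gzip magic number instead of the file extension, then matches the first line against first-line signatures commonly used for these formats. The names `open_maybe_gzip` and `sniff_format` are hypothetical.

```python
import gzip


def open_maybe_gzip(path: str):
    """Open a text file, transparently decompressing gzip input."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def sniff_format(first_line: str) -> str:
    """Guess the format from the first line (illustrative heuristics only)."""
    line = first_line.rstrip("\n")
    signatures = [
        (">", "fasta"),
        ("@", "fastq"),
        ("LOCUS", "genbank"),
        ("ID   ", "embl"),
        ("##gff", "gff"),
        ("CLUSTAL", "clustal"),
        ("# STOCKHOLM", "stockholm"),
        ("H\t", "gfa"),
        ("S\t", "gfa"),
    ]
    for prefix, name in signatures:
        if line.startswith(prefix):
            return name
    # Strict behavior: refuse anything unrecognized.
    raise ValueError(f"unknown or malformed format: {line[:30]!r}")
```

Raising on an unrecognized first line mirrors the strict error handling described in the list above.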
Duplicate ID Detection
When using -o/--output, the tool checks for duplicate sequence IDs across all input files:
# This will fail if seq1.gbk and seq2.gbk contain sequences with the same ID
seqfu tofasta -o combined.fasta seq1.gbk seq2.gbk
Error message example:
ERROR: Duplicate sequence ID found: NZ_12345
First occurrence in a previous file
Second occurrence in: seq2.gbk
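The check itself amounts to remembering where each ID was first seen. The following Python sketch shows the idea under that assumption; `check_duplicate_ids` is a hypothetical helper, and the wording of its error message is illustrative rather than SeqFu's exact output.

```python
def check_duplicate_ids(ids_by_file):
    """ids_by_file: iterable of (filename, list of sequence IDs), in input order.

    Returns the number of unique IDs, or raises ValueError on the first
    ID that appears a second time.
    """
    first_seen = {}  # sequence ID -> file where it first appeared
    for filename, ids in ids_by_file:
        for seq_id in ids:
            if seq_id in first_seen:
                raise ValueError(
                    f"Duplicate sequence ID found: {seq_id} "
                    f"(first in {first_seen[seq_id]}, again in {filename})"
                )
            first_seen[seq_id] = filename
    return len(first_seen)


print(check_duplicate_ids([("seq1.gbk", ["NZ_1", "NZ_2"]), ("seq2.gbk", ["NZ_3"])]))
```

Because the dictionary spans all inputs, a repeated ID is caught whether it occurs within one file or across two.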
Format-Specific Behavior
GenBank/EMBL:
- Extracts accession number as sequence ID
- Extracts sequence from ORIGIN/SQ section
- Handles multiple records in a single file
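A minimal Python sketch of the GenBank case is shown below. It assumes the standard flat-file layout (an ACCESSION line, an ORIGIN line, numbered sequence lines, and a `//` record terminator); it is not SeqFu's parser, and `genbank_records` is a hypothetical name.

```python
def genbank_records(lines):
    """Yield (accession, sequence) for each record in a GenBank flat file."""
    accession, in_origin, chunks = None, False, []
    for line in lines:
        if line.startswith("ACCESSION"):
            parts = line.split()
            accession = parts[1] if len(parts) > 1 else None
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit it and reset state for the next record.
            if accession is not None:
                yield accession, "".join(chunks)
            accession, in_origin, chunks = None, False, []
        elif in_origin:
            # Sequence lines look like "        1 acgtacgtac gtacgtacgt";
            # keep letters only, dropping the position counter and spaces.
            chunks.append("".join(c for c in line if c.isalpha()))


example = [
    "LOCUS       DEMO1   8 bp DNA\n",
    "ACCESSION   AB000001\n",
    "ORIGIN\n",
    "        1 acgtacgt\n",
    "//\n",
]
for acc, seq in genbank_records(example):
    print(f">{acc}\n{seq}")
```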
GFF:
- Only processes sequences after ##FASTA directive
- Ignores feature annotations
- Preserves sequence IDs from FASTA headers
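The GFF rules above can be sketched in Python as a two-state reader: skip everything until the ##FASTA directive, then parse ordinary FASTA. This is an illustrative sketch, not SeqFu's code; `gff_embedded_fasta` is a hypothetical name, and taking the first whitespace-delimited token of the header as the ID is an assumption.

```python
def gff_embedded_fasta(lines):
    """Yield (id, sequence) pairs from the ##FASTA section of a GFF file."""
    in_fasta, seq_id, chunks = False, None, []
    for line in lines:
        line = line.rstrip("\n")
        if not in_fasta:
            # Feature annotations before the directive are ignored.
            in_fasta = line.startswith("##FASTA")
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            seq_id, chunks = (fields[0] if fields else ""), []
        elif line:
            chunks.append(line.strip())
    if seq_id is not None:
        yield seq_id, "".join(chunks)
```

A GFF file without a ##FASTA section yields nothing in this sketch; how SeqFu reports that case is not covered here.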
Clustal/Stockholm:
- Preserves alignment gaps in output
- Concatenates sequence fragments if alignment spans multiple blocks
- Stockholm ‘.’ gaps are converted to ‘-’
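For Stockholm input, the three rules above (keep gaps, join fragments across blocks, turn ‘.’ into ‘-’) fit in a short Python sketch. It assumes the usual layout of `name  fragment` lines, with `#` markup lines and a closing `//`; `stockholm_sequences` is a hypothetical name and this is not SeqFu's implementation.

```python
def stockholm_sequences(lines):
    """Return {name: gapped sequence} from a Stockholm alignment."""
    seqs = {}  # dicts preserve insertion order, so output order matches input
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or line == "//":
            continue  # skip blank lines, markup lines and the terminator
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"malformed alignment line: {line!r}")
        name, fragment = parts
        fragment = fragment.replace(" ", "").replace(".", "-")
        # Interleaved alignments repeat each name once per block.
        seqs[name] = seqs.get(name, "") + fragment
    return seqs
```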
GFA:
- Only processes S (segment) lines
- Uses segment name as sequence ID
- Ignores paths, links, and other graph elements
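Segment extraction from GFA can be sketched in Python as a filter on tab-separated lines. The sketch assumes the GFA 1 layout, where an S line is `S`, the segment name, then the sequence; skipping segments whose sequence is the placeholder `*` is an assumption rather than documented SeqFu behavior, and `gfa_segments` is a hypothetical name.

```python
def gfa_segments(lines):
    """Yield (segment name, sequence) from the S lines of a GFA 1 file."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue  # headers, links, paths and other records are ignored
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # assumption: segments without a stored sequence are skipped
        yield name, seq


graph = ["H\tVN:Z:1.0\n", "S\ts1\tACGT\n", "L\ts1\t+\ts2\t-\t0M\n", "S\ts2\tGGCC\tLN:i:4\n"]
for name, seq in gfa_segments(graph):
    print(f">{name}\n{seq}")
```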