Bioinformatics file formats
Here we introduce some of the most commonly used bioinformatics file formats.
FASTA format
The FASTA format stores one or more sequences, either DNA or protein. Generic extensions include .fa and .fasta. Some databanks use more specific extensions to indicate the content, such as .fna for nucleic acid sequences or .faa for amino acid sequences.
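As an illustration, here is a minimal Python sketch of a FASTA reader. It assumes the usual layout, where each record begins with a header line starting with `>` and is followed by one or more sequence lines. The function name `read_fasta` and the example records are hypothetical, and real projects would normally rely on an established library instead.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example (not taken from a real databank).
example = [
    ">seq1 hypothetical example record",
    "ACGTACGT",
    "ACGT",
    ">seq2",
    "TTGCA",
]

records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

The same function accepts an open file handle, since iterating over a text file yields its lines.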
FASTQ format
The FASTQ format is used to store the output of sequencing machines. For each read, it stores the sequence as determined by the process of “base calling”, together with an associated Phred quality score for each base.
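The sketch below shows one way to read such records and decode the quality string in Python. It makes several assumptions not stated above: each record spans exactly four lines (header starting with `@`, sequence, a `+` separator, quality string), and qualities use the Phred+33 encoding, where the score is the ASCII code of the character minus 33. The names `read_fastq` and `error_probability` and the example read are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, phred_scores) from unwrapped 4-line FASTQ records."""
    # Drop line endings and blank lines, then consume four lines at a time.
    cleaned = (line.rstrip("\n") for line in lines if line.strip())
    for header, sequence, _separator, quality in zip(cleaned, cleaned, cleaned, cleaned):
        # Assumption: Phred+33 encoding (score = ASCII code - 33).
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


def error_probability(phred: int) -> float:
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical single-read example.
example = ["@read1", "ACGT", "+", "II5!"]

reads = list(read_fastq(example))
name, sequence, scores = reads[0]
print(name, sequence, scores)  # scores are [40, 40, 20, 0]
print(error_probability(20))   # Q20 corresponds to a 1% error probability
```

Older instruments used other offsets, so the constant 33 should be checked against the data source before reusing this sketch.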
BED format
The BED format stores the coordinates of a set of features relative to a specific reference sequence. In its simplest incarnation it is just a TSV (tab-separated values) file with these columns:
- Chromosome (sequence) name (required)
- Feature start (0-based, inclusive) (required)
- Feature end (1-based; equivalently, exclusive in 0-based terms) (required)
- Feature name
- Score
- Strand
- …
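To make the coordinate convention concrete, here is a minimal Python sketch that parses one BED line into its first six columns. With a 0-based start and a 1-based end, the feature length is simply end minus start. The names `BedFeature` and `parse_bed_line` and the example coordinates are hypothetical, and any columns beyond the sixth are ignored here.

```python
from typing import NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based
    end: int    # 1-based (exclusive in 0-based terms)
    name: Optional[str] = None
    score: Optional[str] = None
    strand: Optional[str] = None


def parse_bed_line(line: str) -> BedFeature:
    """Parse a tab-separated BED line with at least the three required columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chromosome, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Optional columns (name, score, strand) are kept as strings when present.
    return BedFeature(chrom, start, end, *fields[3:6])


# Hypothetical feature covering bases 101..200 in 1-based, inclusive terms.
feature = parse_bed_line("chr1\t100\t200\tfeatureA\t0\t+")
print(feature.name, feature.end - feature.start)  # featureA 100
```

The score is left as a string because the sketch makes no assumption about how it should be interpreted.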
GFF/GTF format
SAM format
The Sequence Alignment/Map (SAM) format is a generic format for storing
the output of sequence alignment programs. It is a tab-delimited text
format with header lines starting with @ and a body with the alignment records.
You can see a demo SAM file here.
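The header/body split described above can be sketched in a few lines of Python. The function name `split_sam` and the example lines are hypothetical. The column positions used in the example (read name first, reference name third, 1-based position fourth) follow the SAM specification, but this is only an illustration and not a substitute for a proper SAM/BAM library.

```python
from typing import Iterable, List, Tuple


def split_sam(lines: Iterable[str]) -> Tuple[List[str], List[List[str]]]:
    """Separate SAM header lines (starting with '@') from alignment records."""
    header: List[str] = []
    records: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
        else:
            # Alignment records are tab-delimited.
            records.append(line.split("\t"))
    return header, records


# Hypothetical example: two header lines and one alignment record.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

header, records = split_sam(example)
read_name, reference, position = records[0][0], records[0][2], int(records[0][3])
print(len(header), read_name, reference, position)
```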
VCF format
The Variant Call Format (VCF) is a text file format for storing the
differences of a set of sequenced genomes compared with a reference sequence.
Its header lines start with ##, and the line with the column names starts with #.
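These two prefixes are enough to separate the parts of a VCF file. The following Python sketch does so under the assumption that the file is plain (uncompressed) text with tab-separated columns. The function name `read_vcf` and the example lines are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def read_vcf(lines: Iterable[str]) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Split VCF lines into meta-information, column names and variant records."""
    meta: List[str] = []
    columns: List[str] = []
    records: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])             # header (meta-information) line
        elif line.startswith("#"):
            columns = line[1:].split("\t")    # column-name line
        else:
            # Pair each tab-separated value with its column name.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Hypothetical example with one header line and one variant.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t101\t.\tA\tG\t50\tPASS\t.",
]

meta, columns, variants = read_vcf(example)
print(columns[:2], variants[0]["REF"], variants[0]["ALT"])
```

The order of the checks matters: `##` must be tested before `#`, because every header line also starts with a single `#`.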
An example of a VCF file, with its lengthy header and only two detected variants, is shown below: