Bioinformatics file formats
Here we introduce some of the most commonly used bioinformatics file formats.
The FASTA format stores one or more sequences, either DNA or protein. Generic extensions include .fa and .fasta. Some databanks add extra detail with extensions like .fna for nucleic acid sequences or .faa for amino acid sequences.
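To make the layout concrete, here is a minimal Python sketch of a FASTA reader. It relies on the usual convention that each record begins with a header line starting with > and that the sequence may wrap over several lines. The function name parse_fasta and the two demo sequences are made up for illustration; for real work a maintained parser is a better choice.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence name to its sequence (hypothetical helper)."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: the name is the first word after ">"
            words = line[1:].split()
            name = words[0] if words else ""
            chunks[name] = []
        elif name is not None:
            # Sequence line: may be one of several wrapped lines
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


# Made-up example data, not taken from any databank
example = """>seq1 first demo sequence
ACGTACGT
ACGT
>seq2
TTGCA
"""

records = parse_fasta(example.splitlines())
for seq_name, seq in records.items():
    print(seq_name, len(seq), seq)
```

Note that the wrapped lines of seq1 are joined into a single string, so the line width of the file does not affect the result.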
The FASTQ format stores the output of sequencing machines: the sequence of each read, as determined by the process of “base calling”, together with a Phred quality score for each base.
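As an illustration of how the quality scores are stored, the sketch below reads four-line FASTQ records and decodes the quality string. It assumes the common Phred+33 encoding (each score is the character's ASCII code minus 33) and unwrapped records of exactly four lines; older files may use a different offset. The function names and the demo read are hypothetical.

```python
from typing import Iterable, List, Tuple


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality]


def parse_fastq(lines: Iterable[str]) -> List[Tuple[str, str, List[int]]]:
    """Return (name, sequence, scores) for each four-line record."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    reads = []
    for i in range(0, len(rows), 4):
        header, sequence, plus, quality = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        reads.append((header[1:], sequence, phred_scores(quality)))
    return reads


# Made-up read: one header, the bases, a "+" separator, the qualities
fastq_text = """@read1
ACGT
+
II5!
"""

reads = parse_fastq(fastq_text.splitlines())
name, sequence, scores = reads[0]
print(name, sequence, scores)

# A Phred score Q corresponds to an error probability of 10^(-Q/10)
error_probabilities = [10 ** (-q / 10) for q in scores]
print(error_probabilities)
```

With this encoding the character I decodes to a score of 40 and ! to 0, so the last base of the demo read is the least reliable one.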
The BED format stores the coordinates of a set of features relative to a specific reference sequence. In its simplest incarnation it is just a TSV (tab-separated values) file with these columns:
- Chromosome (sequence) name (required)
- Feature start (0-based) (required)
- Feature end (1-based, i.e. the 0-based position just past the last base) (required)
- Feature name
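The columns listed above can be read with a few lines of Python. This is only a sketch: parse_bed and the demo intervals are invented for illustration, and the code assumes that lines starting with #, track or browser are annotations rather than features. Because the start is 0-based and the end is 1-based, the length of a feature is simply end minus start.

```python
from typing import Iterable, List, Optional, Tuple

Feature = Tuple[str, int, int, Optional[str]]


def parse_bed(lines: Iterable[str]) -> List[Feature]:
    """Return (chromosome, start, end, name) tuples (hypothetical helper)."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and annotation lines (assumed not to be features)
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else None  # optional column
        features.append((chrom, start, end, name))
    return features


# Made-up intervals; the second one has no name column
bed_text = "chr1\t100\t200\tgeneA\nchr2\t0\t50\n"

features = parse_bed(bed_text.splitlines())

# Length of each feature: end - start, with no "+ 1" needed
lengths = [end - start for _, start, end, _ in features]
print(lengths)

# The same intervals as 1-based, fully inclusive coordinates
one_based = [(chrom, start + 1, end) for chrom, start, end, _ in features]
print(one_based)
```

In this convention the first 50 bases of chr2 are written as 0 to 50, which corresponds to positions 1 to 50 in 1-based coordinates.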
The Sequence Alignment/Map (SAM) format is a generic format for storing the output of sequence alignment programs. It is a tab-delimited text format with a header, whose lines start with @, and a body containing the alignment records.
You can see a demo SAM file here.
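The split between header and body can be sketched in Python as follows. The helper split_sam and the single demo alignment are made up for illustration; the field names follow the eleven mandatory columns of a SAM record. Real files are better handled with a dedicated library, so treat this only as a way to see the structure.

```python
from typing import Dict, Iterable, List, Tuple

# Names of the eleven mandatory columns of an alignment record
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def split_sam(lines: Iterable[str]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Separate header lines from alignment records (hypothetical helper)."""
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)  # header line such as @HD or @SQ
        else:
            fields = line.split("\t")
            # Keep only the mandatory columns; optional tags are ignored here
            records.append(dict(zip(SAM_FIELDS, fields)))
    return header, records


# Made-up SAM content with two header lines and one alignment
sam_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)

header, records = split_sam(sam_text.splitlines())
print(len(header), "header lines")
first = records[0]
print(first["QNAME"], first["RNAME"], int(first["POS"]), first["CIGAR"])
```

Unlike BED, the POS column of a SAM record is 1-based, so the demo read starts at the 101st base of chr1.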
The Variant Call Format (VCF) is a text file format for storing the
differences of a set of sequenced genomes compared with a reference sequence.
Its header lines start with ##, and the line holding the column names starts with a single #.
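A small Python sketch shows how these three kinds of lines can be told apart: meta-information lines beginning with ##, the column-name line beginning with a single #, and the variant records. The helper read_vcf and the tiny demo file are invented for illustration and cover only the fixed columns, not per-sample genotypes.

```python
from typing import Dict, Iterable, List, Tuple


def read_vcf(
    lines: Iterable[str],
) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Return (meta lines, column names, variants) (hypothetical helper)."""
    meta: List[str] = []
    columns: List[str] = []
    variants: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)  # meta-information line
        elif line.startswith("#"):
            columns = line[1:].split("\t")  # column names, "#" removed
        else:
            # Variant record: pair each value with its column name
            variants.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, variants


# Made-up VCF content with one meta line and one variant
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"
)

meta, columns, variants = read_vcf(vcf_text.splitlines())
variant = variants[0]
print(columns)
print(variant["CHROM"], variant["POS"], variant["REF"], ">", variant["ALT"])
```

The order of the checks matters: testing for ## before # keeps the column-name line from being mistaken for a meta-information line.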
An example of a VCF file, with its lengthy header and only two detected variants, is shown below: