Andrea Telatin, Senior bioinformatician at the Quadram Institute Bioscience, Norwich.

Bioinformatics file formats

Here we introduce some of the most widely used bioinformatics file formats.

FASTA format

The FASTA format stores one or more DNA or protein sequences. Generic extensions include .fa and .fasta. Some databanks use more specific extensions, such as .fna for nucleic acid sequences or .faa for amino acid sequences.
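As a rough illustration of how such a file can be read, here is a minimal Python sketch. It assumes the usual FASTA layout, where each record begins with a header line starting with > and the sequence follows on one or more lines. The function name parse_fasta and the demo sequences are made up for this example.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a {sequence name: sequence} dictionary from FASTA lines."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the name
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: a sequence may span several lines
            records[name] += line
    return records


# Hypothetical two-sequence FASTA content
example = """>seq1 first demo sequence
ACGTACGT
ACGT
>seq2
TTGACA
""".splitlines()

records = parse_fasta(example)
for seq_name, seq in records.items():
    print(seq_name, len(seq))
```

For real work, an established parser (for example the one in Biopython) is usually a safer choice than a hand-written one.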

FASTQ format

The FASTQ format is used to store the output of sequencing machines: each record holds the sequence as determined by the process of “base calling”, together with a Phred quality score for each base.
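To make the quality scores more concrete, the sketch below decodes them in Python. It rests on two assumptions not stated above: that each record takes exactly four lines (header starting with @, sequence, a + separator, quality string), and that qualities use the common Phred+33 encoding, where a score Q is stored as the character with ASCII code Q + 33. The function names and the demo read are hypothetical.

```python
from typing import Iterable, List, Tuple


def read_fastq(lines: Iterable[str], offset: int = 33) -> List[Tuple[str, str, List[int]]]:
    """Return (name, sequence, Phred scores) tuples from four-line FASTQ records."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    records = []
    for i in range(0, len(rows) - 3, 4):
        header, seq, _plus, qual = rows[i:i + 4]
        # Assumed Phred+33 encoding: score = ASCII code - 33
        scores = [ord(char) - offset for char in qual]
        records.append((header[1:], seq, scores))
    return records


def error_probability(q: int) -> float:
    """Convert a Phred score Q to the base-calling error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical single-read FASTQ content
example = ["@read1", "ACGT", "+", "II5!"]

reads = read_fastq(example)
for read_name, seq, scores in reads:
    print(read_name, seq, scores)
```

With this encoding, a score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.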

BED format

The BED format stores the coordinates of a set of features relative to a specific reference sequence. In its simplest incarnation it is just a TSV (tab-separated values) file with these columns:

  1. Chromosome (sequence) name (required)
  2. Feature start (0-based) (required)
  3. Feature end (1-based; equivalently, 0-based and exclusive) (required)
  4. Feature name
  5. Score
  6. Strand
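The column layout above can be sketched in Python as follows. This is only an illustration: the BedFeature class, the parse_bed_line function and the demo line are invented for this example, and the optional columns are kept as plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BedFeature:
    chrom: str
    start: int                  # 0-based start
    end: int                    # end position, not included in the 0-based interval
    name: Optional[str] = None
    score: Optional[str] = None
    strand: Optional[str] = None

    @property
    def length(self) -> int:
        # With a 0-based start and an exclusive end, the length is a simple difference
        return self.end - self.start


def parse_bed_line(line: str) -> BedFeature:
    """Parse one tab-separated BED line with 3 to 6 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chromosome, start and end")
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), *fields[3:6])


# Hypothetical feature on chr1
feature = parse_bed_line("chr1\t99\t200\tgeneA\t0\t+")
print(feature.chrom, feature.start, feature.end, feature.length)

# The same feature in 1-based, inclusive coordinates
one_based = (feature.start + 1, feature.end)
print(one_based)
```

In this example the feature covers bases 100 to 200 when counting from 1, so its length is 101 bases, which is exactly end minus start in BED coordinates.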

GFF/GTF format

SAM format

The Sequence Alignment/Map (SAM) format is a generic format for storing the output of sequence alignment programs. It is a tab-delimited text format with header lines starting with @, followed by a body of alignment records, one per line.
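The header/body split can be sketched in a few lines of Python. The example assumes the standard order of the first mandatory SAM columns (read name, flag, reference name, position, and so on); the function name split_sam and the demo records are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_sam(lines: Iterable[str]) -> Tuple[List[str], List[List[str]]]:
    """Separate SAM header lines from alignment records (split into fields)."""
    header: List[str] = []
    records: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)               # header line, e.g. @HD or @SQ
        else:
            records.append(line.split("\t"))  # alignment record
    return header, records


# Hypothetical SAM content: two header lines and one aligned read
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

header, records = split_sam(example)
for fields in records:
    read_name, flag, ref_name, position = fields[0], int(fields[1]), fields[2], int(fields[3])
    print(read_name, ref_name, position)
```

In practice, SAM files and their binary BAM counterparts are usually handled with dedicated tools such as samtools rather than parsed by hand.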

You can see a demo SAM file here.

VCF format

The Variant Call Format (VCF) is a text file format for storing the differences between a set of sequenced genomes and a reference sequence. Its metadata header lines start with ##, and the line with the column names starts with a single #.
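The three kinds of lines can be told apart with a short Python sketch like the one below. It assumes tab-separated columns, and the function name read_vcf and the tiny demo content are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def read_vcf(lines: Iterable[str]) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Return (metadata lines, column names, variant records) from VCF lines."""
    meta: List[str] = []
    columns: List[str] = []
    variants: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)                   # metadata header line
        elif line.startswith("#"):
            columns = line[1:].split("\t")      # column names, leading '#' removed
        else:
            # Variant line: pair each value with its column name
            variants.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, variants


# Hypothetical minimal VCF content with a single variant
example = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=1000>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t150\t.\tA\tG\t50\tPASS\t.",
]

meta, columns, variants = read_vcf(example)
for variant in variants:
    print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"])
```

Note that the order of the checks matters: lines starting with ## must be tested before lines starting with a single #.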

An example of a VCF file, with its lengthy header and only two detected variants, is shown below: