A Simple Introduction to XML and JSON

Two popular formats to store structured data are XML and JSON.
They are used everywhere in computer science and appear more and more often in bioinformatics analyses as well.

Bioinformatics and text files

Most bioinformatics file formats are simple text files. A famous example is the FASTA format for sequences.

Historically, file formats have often been designed ad-hoc to solve one specific problem, which led to a fragmented landscape of formats:

FASTA / FASTQ – nucleotide or protein sequences
SAM / BAM – read alignments
VCF – sequence variants against a reference genome
GFF / BED – genomic features such as genes, enhancers, binding sites…

These are excellent for their specific task, but they are not meant to be generic containers for arbitrary structured data.

General-purpose structured formats: XML and JSON

Two general-purpose formats that are designed for structured data are:

XML – eXtensible Markup Language
JSON – JavaScript Object Notation

XML became very popular in the early 2000s, while JSON gained momentum later and is now the de-facto standard for most web APIs.

Both formats can represent complex, hierarchical data.

XML is more formal and comes with tools for enforcing a strict schema.

JSON is simpler and lighter, which helped its adoption (for example, the BIOM 1.0 format is built on JSON).

The goal of this post is to give a visual, intuitive introduction to both.

JSON in a nutshell

A JSON document is typically an object, that is, a collection of key–value pairs:

Keys are strings.
Values can be:
- numbers,
- strings,
- booleans,
- null,
- arrays (lists),
- or other JSON objects.

A simple JSON example

Imagine introducing yourself with your name, surname and a list of hobbies.

{
  "id": 192,
  "name": "Andrea",
  "surname": "Telatin",
  "hobbies": ["bioinformatics", "reading", "coffee"]
}
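As a minimal sketch, this is how the object above could be read in Python with the standard json module. The variable names (text, person) are only illustrative.

```python
import json

# The same JSON document as above, stored in a Python string
text = """
{
  "id": 192,
  "name": "Andrea",
  "surname": "Telatin",
  "hobbies": ["bioinformatics", "reading", "coffee"]
}
"""

# json.loads turns a JSON object into a dict and a JSON array into a list
person = json.loads(text)

print(person["name"])               # a string value
print(person["hobbies"][0])         # first item of the list
print(type(person["id"]).__name__)  # JSON numbers without decimals become int
```

Once parsed, the data is accessed with ordinary dictionary keys and list indexes.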

General rules:

The whole object is wrapped in { ... }.
Each item is "key": value and items are separated by commas.
Lists (arrays) are wrapped in [ ... ].

Whitespace and line breaks are only for humans. A computer is equally happy with this “minified” version:

{"id":192,"name":"Andrea","surname":"Telatin","hobbies":["bioinformatics","reading","coffee"]}

The structure is hierarchical: values can be objects or lists containing other objects and lists. The order of keys is usually not important.
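A short sketch, again using Python's standard json module, can illustrate that neither whitespace nor key order changes the parsed data. The reordered string is a made-up variant of the example above.

```python
import json

pretty = """{
  "id": 192,
  "name": "Andrea",
  "surname": "Telatin",
  "hobbies": ["bioinformatics", "reading", "coffee"]
}"""

minified = '{"id":192,"name":"Andrea","surname":"Telatin","hobbies":["bioinformatics","reading","coffee"]}'

# Same keys and values, written in a different order (illustrative variant)
reordered = '{"surname":"Telatin","hobbies":["bioinformatics","reading","coffee"],"name":"Andrea","id":192}'

# All three strings describe the same object
same_data = json.loads(pretty) == json.loads(minified) == json.loads(reordered)
print(same_data)

# Compact separators reproduce the minified form; indent=2 gives a readable one
reminified = json.dumps(json.loads(pretty), separators=(",", ":"))
print(reminified)
print(json.dumps(json.loads(minified), indent=2))
```

Note that the order of items inside a list (such as hobbies) does matter, unlike the order of keys in an object.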

XML in a nutshell

XML represents data with tags.

A simple XML encoding of the same person could be:

<Person>
  <id>192</id>
  <name>Andrea</name>
  <surname>Telatin</surname>
  <hobbies>
    <hobby>bioinformatics</hobby>
    <hobby>reading</hobby>
    <hobby>coffee</hobby>
  </hobbies>
</Person>

Each piece of data is enclosed between an opening and a closing tag:

<id>192</id>

Tags can be nested to express hierarchy, so an XML document is effectively a tree.
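To make the tree idea concrete, here is a minimal sketch that parses the Person document with xml.etree.ElementTree from the Python standard library. Other XML libraries exist, and this is just one possible approach.

```python
import xml.etree.ElementTree as ET

xml_text = """<Person>
  <id>192</id>
  <name>Andrea</name>
  <surname>Telatin</surname>
  <hobbies>
    <hobby>bioinformatics</hobby>
    <hobby>reading</hobby>
    <hobby>coffee</hobby>
  </hobbies>
</Person>"""

# The root element is the top of the tree
root = ET.fromstring(xml_text)
print(root.tag)

# Direct children are found by tag name; their content is in .text
name = root.find("name").text

# XML content is always text, so numbers must be converted explicitly
person_id = int(root.find("id").text)

# Nested elements are reached with a slash-separated path
hobbies = [hobby.text for hobby in root.findall("hobbies/hobby")]
print(name, person_id, hobbies)
```

Unlike JSON, XML does not distinguish numbers from strings by itself, which is why the id is converted with int().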

Lists in XML

There is no single canonical way to represent a list in XML. In the example above, hobbies is a parent element with repeated hobby children.

You can also store extra information in attributes. For example, if you want to keep the original order of hobbies:

<hobbies>
  <hobby index="1">bioinformatics</hobby>
  <hobby index="2">reading</hobby>
  <hobby index="3">coffee</hobby>
</hobbies>
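Attributes can be read back when parsing. The sketch below (standard library only) deliberately shuffles the hobby elements, which is an assumption made for illustration, and uses the index attribute to restore the intended order.

```python
import xml.etree.ElementTree as ET

# Same hobbies as above, but written in a shuffled order on purpose
xml_text = """<hobbies>
  <hobby index="2">reading</hobby>
  <hobby index="3">coffee</hobby>
  <hobby index="1">bioinformatics</hobby>
</hobbies>"""

root = ET.fromstring(xml_text)

# Element.get() returns the attribute value as a string
ordered = sorted(root.findall("hobby"), key=lambda hobby: int(hobby.get("index")))
names = [hobby.text for hobby in ordered]
print(names)
```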

Like JSON, XML can be written on a single line or nicely indented. Pretty-printing tools can reformat XML or JSON to make them easier to read.

Fun fact: modern Microsoft Office file formats (.docx, .xlsx, …) are basically ZIP archives containing XML documents and other resources such as images.

Why not just use tables?

For many tasks, we still use tabular files such as CSV or TSV. They are compact and easy to work with on the command line.

id,name,surname,hobbies
192,Andrea,Telatin,"bioinformatics,reading,coffee"

This table works, but notice that the list of hobbies is encoded as a comma-separated string inside a single column. To interpret it correctly you need to know in advance that the fourth column is itself a list. Nested or more complex structures become very hard (or impossible) to represent cleanly in a purely tabular format.
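The limitation can be seen when reading the table with Python's csv module. In this sketch the CSV reader returns the hobbies as one plain string, and splitting it into a list is an extra step that relies on knowledge not stored in the file.

```python
import csv
import io

table = (
    "id,name,surname,hobbies\n"
    '192,Andrea,Telatin,"bioinformatics,reading,coffee"\n'
)

# DictReader uses the header line as keys; every value is a string
rows = list(csv.DictReader(io.StringIO(table)))
row = rows[0]
print(repr(row["hobbies"]))

# Nothing in the file says this column is a list: we have to know it
hobbies = row["hobbies"].split(",")
print(hobbies)
```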

XML and JSON shine when we need self-describing, hierarchical data — for example, metadata or complex configuration.

XML and JSON in bioinformatics

In NGS-oriented bioinformatics, XML and JSON are less common than FASTA, FASTQ, SAM/BAM, etc., for storing the raw data. One criticism of XML is that repeating the same tag thousands of times wastes space, although compression and alternative encodings can mitigate this. JSON is more compact but still heavier than a plain TSV.
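As a rough illustration of the size difference, the sketch below encodes the small person record from the earlier examples in the three formats and compares their lengths. It is a toy comparison on a single record, not a benchmark, and the TSV layout (hobbies joined by commas in one column) is an assumption.

```python
import json
import xml.etree.ElementTree as ET

person = {
    "id": 192,
    "name": "Andrea",
    "surname": "Telatin",
    "hobbies": ["bioinformatics", "reading", "coffee"],
}

# JSON without any optional whitespace
as_json = json.dumps(person, separators=(",", ":"))

# XML with one element per field and one <hobby> per list item
root = ET.Element("Person")
for key in ("id", "name", "surname"):
    ET.SubElement(root, key).text = str(person[key])
hobbies = ET.SubElement(root, "hobbies")
for hobby in person["hobbies"]:
    ET.SubElement(hobbies, "hobby").text = hobby
as_xml = ET.tostring(root, encoding="unicode")

# TSV with a header line and the list flattened into a single column
header = "\t".join(person)
row = "\t".join(
    [str(person["id"]), person["name"], person["surname"], ",".join(person["hobbies"])]
)
as_tsv = header + "\n" + row + "\n"

sizes = {"XML": len(as_xml), "JSON": len(as_json), "TSV": len(as_tsv)}
print(sizes)
```

With many records the gap tends to grow, because the TSV header is written once while XML tags and JSON keys are repeated for every record.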

Where these formats really excel is in metadata and web APIs: they are widely used as the response format when we query web services (REST APIs, databases, repositories…).

Example: XML from PubMed

PubMed records can be retrieved as XML. For a given PubMed ID, you can:

View the HTML record in a browser:

https://www.ncbi.nlm.nih.gov/pubmed/29079838

Add ?report=XML to obtain the XML version programmatically:

https://www.ncbi.nlm.nih.gov/pubmed/29079838?report=XML

Inside that XML you’ll find a hierarchy of tags such as:

<PubmedArticle>
  <MedlineCitation>
    <Article>
      <Journal>
        <Title>Genome Announcements</Title>
      </Journal>
      <ArticleTitle>...</ArticleTitle>
      <!-- more fields here -->
    </Article>
  </MedlineCitation>
</PubmedArticle>

The path to the journal title in this tree is:

PubmedArticle → MedlineCitation → Article → Journal → Title
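Such a path translates almost literally into code. The sketch below runs on the trimmed fragment shown above, not on a real download: an actual PubMed record contains many more elements and may be wrapped in additional parent elements, so the path might need adjusting.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment from the example above (not a complete PubMed record)
xml_text = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <Journal>
        <Title>Genome Announcements</Title>
      </Journal>
      <ArticleTitle>...</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

# The root is PubmedArticle, so the path starts from its children
root = ET.fromstring(xml_text)
journal = root.findtext("MedlineCitation/Article/Journal/Title")
print(journal)

# findtext returns None when the path does not exist in the tree
missing = root.findtext("MedlineCitation/Article/Abstract")
print(missing)
```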

Visual tree viewers (for example, online XML viewers with a Tree View mode) make it easy to explore and identify these paths.

XML data from public repositories

Major sequence repositories expose rich metadata via XML (and sometimes JSON):

NCBI SRA
ENA
EBI Metagenomics

This allows you to:

query experiments and sequencing runs,
automatically download metadata in structured form,
and integrate it into your own analysis pipelines.

These aspects deserve their own tutorial, but the key takeaway is:

Being comfortable reading and navigating XML/JSON documents makes it much easier to automate large-scale metadata retrieval and processing.

Take-home messages

XML and JSON are generic, hierarchical formats for structured data.
JSON is lighter and very common for web APIs; XML is more formal and often comes with schemas.
Tabular formats (CSV/TSV) are great for simple, flat data, but struggle with nested lists and complex structures.
In bioinformatics, XML and JSON are especially useful for metadata and for interacting with online repositories and services.

Once you get used to reading their tree-like structure, both formats become powerful allies in your day-to-day bioinformatics work.

27 Nov 2018


Microbiome binfies