Simulated BAMs

A script to generate simulated BAM files for a single-chromosome genome ships with the repository, to measure the effects of read length and coverage on the performance of the program.

Using the script

usage: simulate-long-bam.py [-h] -o OUTPUT [-l LENGTH] [-c COVERAGE]
                            [-m MIN_LEN] [-M MAX_LEN] [-s SEED] [-r]
                            [--progress PROGRESS]

Simulate a BAM file with long reads mapped against a hypothetical genome

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output BAM file
  -l LENGTH, --length LENGTH
                        Length of the genome [default: 10M]
  -c COVERAGE, --coverage COVERAGE
                        Approximate coverage [default: 50]
  -m MIN_LEN, --min-len MIN_LEN
                        Minimum read length [default: 1000]
  -M MAX_LEN, --max-len MAX_LEN
                        Minimum read length [default: 10000]
  -s SEED, --seed SEED  Random seed [default: 42]
  -r, --randomcigar     Generate a random CIGAR
  --progress PROGRESS   Print progress every INT reads [default: 10000]

In normal mode all reads have no INDELs, while in --randomcigar mode they have random insertion and deletions. Some program will check CIGAR operations (BamToCov does not).

Example datasets

The basic dataset is a 10 Mbp genome, 10X coverage:

# Long reads (genome size: 10 Mbp) with perfect matches and random CIGARs
simulate-long-bam.py -m 1000 -M 10000  -o test/sim/short_match.bam 
simulate-long-bam.py -m 1000 -M 10000 -r -o test/sim/short_cigar.bam 

# Short reads (genome size: 10 Mbp) with perfect matches and random CIGARs
simulate-long-bam.py -m 100 -M 300 -o test/sim/short_match.bam 
simulate-long-bam.py -m 100 -M 300 -r  -o test/sim/short_match.bam 

Effect of coverage and genome size have been checked varying the appropriate parameters.

Results

Memory

Peak memory usage in synthetic BAM files with a single chromosome 10 Mbp long and 10X coverage (long and short refers to read lengths)

Tool long match long INDELs short match short INDELs
bamtocov 2,172 2,176 2,176 2,236
covtobed 3,784 3,768 3,856 3,840
megadepth 45,516 45,488 45,612 45,552
mosdepth 82,268 82,148 82,272 82,284
mosdepth fast 82,180 82,204 82,468 82,244

Speed

Long reads (1000 bp to 10,000 bp):

Command Mean (s) Min (s) Max (s) Relative
megadepth --coverage long_100M_match.bam 1.954 ± 0.169 1.838 2.148 1.21 ± 0.13
bamtocov long_100M_match.bam 1.617 ± 0.106 1.510 1.722 1.00
covtobed long_100M_match.bam 4.703 ± 0.066 4.628 4.748 2.91 ± 0.19
mosdepth -x prefix long_100M_match.bam 3.873 ± 0.437 3.568 4.373 2.40 ± 0.31

Short reads (100 bp to 300 bp):

Command Mean (s) Min (s) Max (s) Relative
megadepth --coverage short_100M_match.bam 5.552 ± 0.334 5.288 5.928 1.00
bamtocov short_100M_match.bam 21.147 ± 0.362 20.742 21.439 3.81 ± 0.24
covtobed short_100M_match.bam 25.036 ± 0.788 24.579 25.946 4.51 ± 0.31
mosdepth -x prefix short_100M_match.bam 10.778 ± 0.894 9.949 11.725 1.94 ± 0.20

Large files

simulate-long-bam.py -l 400M -c 100 -m 1000 -M 10000 -x 3 -r --progress 100 -o local/400Mb_100X3_1000-10000.bam
simulate-long-bam.py -l 400M -c 100 -m 100 -M 300    -x 3 -r --progress 100 -o local/400Mb_100X3_100-300.bam