Revamping the N50 utility with new metrics
My first Perl module was an N50 calculator, which to me is a sort of Hello World, considering I spent most of my PhD trying to improve one (lovely) assembly.
I tried to make its bundled utility, called simply n50, useful both in interactive sessions and in scripts, and the utility itself was my first tool packaged in Bioconda.
But… does a script that does a relatively small set of things really need to be packaged with a separate module for its calculations?
A new metric added in no time
On April 8th, Heng Li posted a comment on a “new” metric to evaluate assemblies, that he called auN, interpreted as the area under the Nx curve.
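To make the definition concrete, here is a minimal Python sketch of both metrics. It is not the code used in Proch::N50 (which is written in Perl); the function names `n50` and `aun` and the toy contig lengths are made up for illustration. It assumes the usual definitions: Nx is the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches x% of the total, and auN is the sum of squared lengths divided by the total length.

```python
def n50(lengths, fraction=0.5):
    """Return the Nx value for a list of contig lengths (N50 by default).

    Hypothetical helper, not the Proch::N50 implementation.
    """
    total = sum(lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating their lengths
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    raise ValueError("no contig lengths given")


def aun(lengths):
    """Return auN, the area under the Nx curve: sum(L^2) / sum(L)."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("no contig lengths given")
    return sum(length * length for length in lengths) / total


if __name__ == "__main__":
    contigs = [10, 20, 30, 40]  # toy lengths, not real data
    print("N50:", n50(contigs))        # N50: 30
    print("N75:", n50(contigs, 0.75))  # N75: 20
    print("auN:", aun(contigs))        # auN: 30.0
```

Unlike N50, which jumps from one contig length to another, auN changes a little whenever any contig changes length, which is why it comes out as a decimal number in the table further down.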
On April 9th the new metric had already been added to the Proch::N50 module and to its n50 utility, and the day after it was (automatically) available on Bioconda, so that a simple
conda update -c bioconda n50
can bring this glorious new output:
.-------------------------------------------------------------------------------------------------.
| File | Seqs | Total bp | N50 | min | max | N75 | N90 | auN |
+-----------------------+------+-----------+--------+-----+---------+--------+--------+-----------+
| PID_0175_1_ctgs.fasta | 149 | 2,178,533 | 46,875 | 107 | 137,557 | 20,673 | 10,425 | 51,415.06 |
| PID_0175_5_ctgs.fasta | 138 | 2,173,292 | 46,425 | 107 | 133,990 | 20,547 | 10,933 | 51,060.62 |
| PID_0175_2_ctgs.fasta | 154 | 2,169,751 | 44,475 | 107 | 121,712 | 20,547 | 10,730 | 49,174.07 |
| PID_0175_3_ctgs.fasta | 148 | 2,170,823 | 44,409 | 107 | 104,016 | 22,741 | 10,730 | 46,608.66 |
| PID_0175_4_ctgs.fasta | 153 | 2,170,404 | 43,522 | 107 | 104,016 | 17,386 | 10,399 | 45,262.04 |
| PID_0175_6_ctgs.fasta | 147 | 1,976,499 | 39,640 | 104 | 137,087 | 19,763 | 10,509 | 48,284.94 |
'-----------------------+------+-----------+--------+-----+---------+--------+--------+-----------'
(plus the usual JSON, CSV, TSV outputs, the ability to sort by different columns etc…)
All of this was possible with limited edits thanks to a better organization of the code. I used to believe that this only paid off in larger projects, where multiple developers need to coordinate their work.
I heard this so many times when I started, and it’s nice to see that it can also be true for smaller projects.