High-Impact Optimizations (Recommended)

1. Consolidate Depth Array Iterations ⭐⭐⭐

Current problem: When both variance and trimmed mean are enabled, we make several separate passes over the depths array.

Current code (lines 506-513):

 if task.opts.doVariance:
   for depth in depths:
     agg.totalDepth += int64(depth)
     agg.sumSquaredDepth += int64(depth * depth)
 else:
   for depth in depths:
     agg.totalDepth += int64(depth)

Impact: ~30-50% faster for combined metrics
Effort: Very low (10 minutes)
Fix: Compute both metrics in a single loop:

  for depth in depths:
    agg.totalDepth += int64(depth)
    if task.opts.doVariance:
      agg.sumSquaredDepth += int64(depth * depth)
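
Since both sums are maintained in one pass, the variance can be recovered afterwards without revisiting the array. A minimal sketch, assuming the same field names as above plus a hypothetical `agg.variance` field:

```nim
# Population variance from the two running sums: Var(x) = E[x^2] - E[x]^2
let n = float(depths.len)
if n > 0:
  let mean = float(agg.totalDepth) / n
  agg.variance = float(agg.sumSquaredDepth) / n - mean * mean
```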

2. Skip Depth Tracking for Zero/Low Coverage References ⭐⭐⭐

Current problem: We allocate large arrays and do full processing even when count is 0.

Impact: 50-90% faster for sparse datasets (many empty references)
Effort: Low (20 minutes)
Fix: Early exit after counting:

  # After the read counting loop
  if agg.count == 0 and task.opts.trackDepths:
    # Skip depth tracking for zero coverage
    agg.totalDepth = 0
    agg.trimmedMean = 0.0
    # Don't allocate the depths array at all


3. Avoid Full Sort for Trimmed Mean ⭐⭐⭐

Current problem: Line 519 copies the entire depths array, then sorts all of its values. For a 10 Mbp reference, this sorts 10 million integers.

Current code:

  var sortedDepths = depths      # COPY!
  algorithm.sort(sortedDepths)   # O(n log n)

Impact: 3-10x faster for large references (>1Mbp)
Effort: Medium (1 hour)
Options:

Option A - Partial Sort (fastest for typical trim ranges):

  • Only the percentile boundaries are needed, not a fully sorted array
  • Use partial sorting or an nth_element equivalent
  • For the 5th and 95th percentiles, this is much faster

Option B - In-place Sort (easy win, no copy):

  if task.opts.doTrimmedMean:
    algorithm.sort(depths)  # Sort in-place, we don't need original order
    let minIdx = …          # Calculate trimmed mean

Option C - Histogram-based (for integer depths < 10,000):

Build a histogram of depth values:

  var histogram: array[10000, int]
  for depth in depths:
    if depth < 10000:
      inc(histogram[depth])

Then find the percentiles from the histogram (O(n) instead of O(n log n)).
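
Extracting a percentile boundary from the histogram is a single linear walk. A sketch under the names used above (`percentileFromHistogram` is a hypothetical helper):

```nim
proc percentileFromHistogram(histogram: openArray[int];
                             total, pct: int): int =
  ## Depth value at the pct-th percentile of `total` observations.
  let target = total * pct div 100
  var seen = 0
  for depth in 0 ..< histogram.len:
    seen += histogram[depth]
    if seen >= target:
      return depth
  histogram.len - 1

# The trimmed mean then sums only depths between the two boundaries:
# let lo = percentileFromHistogram(histogram, depths.len, 5)
# let hi = percentileFromHistogram(histogram, depths.len, 95)
```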


4. Combine Breadth and Depth Tracking Loops ⭐⭐

Current problem: Lines 488-498 have two separate loops through alignment positions.

Current code:

  # Track covered positions for breadth calculation
  if task.opts.trackBreadth:
    for pos in aln.start ..< aln.stop:
      # ...

  # Track per-base depths for variance and/or trimmed mean calculation
  if task.opts.trackDepths:
    for pos in aln.start ..< aln.stop:
      # ...

Impact: ~30-40% faster when both are enabled
Effort: Low (15 minutes)
Fix:

  if task.opts.trackBreadth or task.opts.trackDepths:
    for pos in aln.start ..< aln.stop:
      if pos < agg.length:
        if task.opts.trackBreadth and not covered[pos]:
          covered[pos] = true
          inc(agg.coveredBases)
        if task.opts.trackDepths:
          inc(depths[pos])


Medium-Impact Optimizations

5. Lazy Depth Array Allocation ⭐⭐

Current problem: Lines 476-477 allocate arrays for every reference, even if most have zero coverage.

Impact: 20-40% memory savings, 10-20% faster for sparse datasets
Effort: Medium (30 minutes)
Fix: Only allocate when the first read is encountered, or use a sparse representation.
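
A minimal sketch of the lazy pattern, assuming a hypothetical `ensureDepths` helper called from the read loop:

```nim
var depths: seq[int32] = @[]   # stays empty for zero-coverage references

proc ensureDepths(depths: var seq[int32]; refLength: int) =
  # Allocate once, only when the first overlapping read is seen
  if depths.len == 0:
    depths = newSeq[int32](refLength)
```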


6. Reuse Depth Arrays Across References ⭐⭐

Current problem: New allocation for each reference (line 477).

Impact: 10-20% faster (reduced allocation overhead)
Effort: Medium (45 minutes)
Fix: Maintain a reusable buffer in the worker thread; just clear/resize as needed.
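
A sketch of the reuse pattern, with a hypothetical worker-local `buffer` that is resized and zeroed between references instead of reallocated:

```nim
var buffer: seq[int32]   # lives for the lifetime of the worker thread

proc resetBuffer(buffer: var seq[int32]; refLength: int) =
  buffer.setLen(refLength)      # reuses existing capacity when possible
  for i in 0 ..< buffer.len:
    buffer[i] = 0               # clear counts from the previous reference
```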


7. Use Sampling for Very Large References ⭐⭐

Current problem: For huge references (>5Mbp), trimmed mean sorts millions of values.

Impact: 5-50x faster for very large references
Effort: Medium (1 hour)
Fix: For references >5Mbp, sample every Nth position instead of using all positions. A trimmed mean estimated from 100k samples is statistically equivalent to one computed from all 10M positions.
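
A sketch of strided sampling, where the 5 Mbp cutoff and 100k target are the illustrative values from above:

```nim
const
  SampleCutoff = 5_000_000
  SampleTarget = 100_000

proc sampledDepths(depths: seq[int32]): seq[int32] =
  # Below the cutoff, use every position; above it, take every Nth
  if depths.len <= SampleCutoff:
    return depths
  let stride = max(1, depths.len div SampleTarget)
  var i = 0
  while i < depths.len:
    result.add(depths[i])
    i += stride
```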


Performance Benchmark Estimates

For a typical metagenome with:

  • 1000 references
  • 100 with high coverage (>10x)
  • 900 with zero/low coverage (<1x)
  • Using --all-metrics

Current performance: ~1 minute
With optimizations 1-4: ~25-30 seconds (50-60% faster)
With all optimizations: ~15-20 seconds (65-75% faster)


Recommended Implementation Order

  1. #2 - Skip zero coverage (easiest, biggest win for sparse data) DONE

  2. #1 - Consolidate depth iterations (trivial change, always beneficial) DONE

  3. #4 - Combine breadth/depth loops (easy, good for combined metrics)

  4. #3 - Avoid full sort (best for large references)

  5. #5 - Lazy allocation (polish)

Would you like me to implement any of these optimizations?


After comparing the optimized bamcountrefs.nim with the current bamtocounts.nim, I can see significant opportunities for improvement. Let me provide a comprehensive set of suggestions:

Suggested Improvements for bamtocounts.nim

  1. Code Organization & Structure

Add named constants for clarity:

  const
    DEFAULT_EXCLUDE_FLAGS = 1796   # unmapped + secondary + QC fail + duplicate
    READS_PER_MILLION = 1_000_000
    BASES_PER_KILOBASE = 1000

Create structured data types:

  • Define a FeatureMetrics object similar to ReferenceMetrics in bamcountrefs.nim:331-338
  • Separate concerns between worker options and task execution
  • Use proper types instead of scattered variables
  2. Multi-Threading Support

Enable parallel BAM processing:

  • Add --workers option to process multiple BAM files in parallel (bamcountrefs.nim:656-662)
  • Implement worker threads using the pattern from bamcountrefs.nim:440-570
  • Use thread-safe data structures (sequences instead of tables for worker results)

This is the biggest performance win - processing multiple samples concurrently.
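
The per-file fan-out can be sketched with Nim's standard threadpool, independently of the hts-nim worker details (compile with --threads:on; `countsForBam` here is a placeholder, not the real routine):

```nim
import std/threadpool

proc countsForBam(path: string): int =
  # stands in for the real per-file counting routine
  path.len

proc processAll(paths: seq[string]): seq[int] =
  var pending: seq[FlowVar[int]]
  for path in paths:
    pending.add(spawn countsForBam(path))   # one task per BAM file
  for fv in pending:
    result.add(^fv)                         # blocks until that task finishes
```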

  3. Performance Optimizations

Zero-coverage detection:

  • Use BAM index to detect features with zero coverage before expensive iteration (bamcountrefs.nim:477-488)
  • Skip processing entirely for zero-coverage features

Eliminate code duplication:

  • Merge makeCountsTable and alignments_count functions (bamtocounts.nim:74-110) - they’re nearly identical
  • Use a single, optimized counting function

Add fast path for simple cases:

  • When filters are minimal (mapq=0, eflag=4), use index statistics directly (bamcountrefs.nim:466-493)

Optimize per-base operations:

  • Combine breadth and depth tracking in a single pass if both requested (bamcountrefs.nim:522-529)
  • Use consolidated loops for metric calculations (bamcountrefs.nim:536-542)
  4. Additional Metrics

Currently bamtocounts only supports counts, RPKM, and counts/length. Add:

  • TPM (Transcripts Per Million) - more appropriate for RNA-seq (bamcountrefs.nim:108-143)
  • Mean coverage - average depth across the feature
  • Trimmed mean - robust mean excluding outliers (bamcountrefs.nim:545-568)
  • Variance - coverage variability (bamcountrefs.nim:169-189)
  • Covered bases - number of positions with coverage > 0
  • Covered ratio - breadth of coverage as fraction (bamcountrefs.nim:157-167)
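
For reference, TPM divides each count by feature length in kilobases, then rescales the per-kb rates so they sum to one million. A self-contained sketch (input names are illustrative, not from bamtocounts.nim):

```nim
proc tpm(counts, lengths: seq[int]): seq[float] =
  # Per-kilobase rates first, then scale so the rates sum to 1e6
  var rates = newSeq[float](counts.len)
  var total = 0.0
  for i in 0 ..< counts.len:
    rates[i] = counts[i].float / (lengths[i].float / 1000.0)
    total += rates[i]
  result = newSeq[float](counts.len)
  if total > 0:
    for i in 0 ..< counts.len:
      result[i] = rates[i] / total * 1e6
```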
  5. Output Improvements

Multi-file output mode:

  • Add --output option to generate separate files per metric (bamcountrefs.nim:302-438)
  • Example: basename_counts.tsv, basename_rpkm.tsv, basename_tpm.tsv
  • Keeps stdout output for backward compatibility

Add --all-metrics flag:

  • Enable all metrics at once (bamcountrefs.nim:718-727)
  6. Better Filtering

Fix --paired implementation:

  • The current implementation at bamtocounts.nim:100 checks the read2 flag, which means only R1 is counted
  • Add a --proper-pairs flag like bamcountrefs.nim:510-513 for stricter paired filtering
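
Counting each fragment exactly once can be expressed on the raw SAM FLAG bits (0x1 paired, 0x2 proper pair, 0x40 first in pair); this is a sketch of the logic, not the hts-nim API:

```nim
proc countsAsFragment(flag: uint16; properOnly: bool): bool =
  if (flag and 0x1) == 0:
    return true                 # unpaired read: always counted
  if properOnly and (flag and 0x2) == 0:
    return false                # --proper-pairs: require a proper pair
  (flag and 0x40) != 0          # count only the first mate (R1)
```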

Strand handling improvements:

  • Current strand implementation is basic
  • Consider optimization opportunities for stranded counting
  7. Memory Efficiency

Conditional tracking:

  • Only allocate per-base arrays when advanced metrics are requested (bamcountrefs.nim:496-501)
  • Most users only need counts - don’t waste memory on depth tracking by default
  8. Error Handling & Validation

Add BAM index requirement:

  if bam.idx == nil:
    stderr.writeLine("ERROR: BAM file requires index: ", bamPath)
    quit(1)

Better input validation:

  • Check that target file contains features before processing
  • Validate that BAM references match target chromosomes
  • Clear error messages when files are missing or malformed
  9. Code Quality

Remove dead code:

  • Remove commented legacy code (bamtocounts.nim:42-49)
  • Clean up unused imports

Consistent naming:

  • Use _t suffix for all types (currently inconsistent)
  • Follow Nim conventions throughout

Add debug output:

  • Add diagnostic messages controlled by a --debug flag (like bamcountrefs.nim)
  • Help users understand what’s happening during processing
  10. Feature-Specific Optimizations

Handle multi-exon features efficiently:

  • The current add() function (bamtocounts.nim:112-117) concatenates coordinates with semicolons
  • This is handled but could be optimized for intersection tests

Better GFF/GTF parsing:

  • Current implementation works but could benefit from validation
  • Add warnings for missing features

Priority Order

High Priority (biggest impact):

  1. Multi-threading support (10-100x speedup for multiple files)
  2. Zero-coverage detection optimization
  3. Eliminate code duplication in counting functions
  4. Fix --paired filtering bug

Medium Priority:

  5. Add TPM and other metrics
  6. Multi-file output mode
  7. Conditional memory allocation

Low Priority:

  8. Code cleanup
  9. Better error messages
  10. Enhanced debug output

Would you like me to start implementing any of these improvements? I’d suggest starting with multi-threading support and performance optimizations as they provide the most significant benefits.