High-Impact Optimizations (Recommended)

1. Consolidate Depth Array Iterations ⭐⭐⭐

Current problem: When both variance and trimmed mean are enabled, we make several separate passes over the depths array.

Current code (lines 506-513):

 if task.opts.doVariance:
   for depth in depths:
     agg.totalDepth += int64(depth)
     agg.sumSquaredDepth += int64(depth * depth)
 else:
   for depth in depths:
     agg.totalDepth += int64(depth)

Impact: ~30-50% faster for combined metrics
Effort: Very low (10 minutes)
Fix: Compute both metrics in a single loop:

  for depth in depths:
    agg.totalDepth += int64(depth)
    if task.opts.doVariance:
      agg.sumSquaredDepth += int64(depth * depth)
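
Since both sums are maintained in one pass, the variance can be recovered afterwards without revisiting the array. A minimal sketch, assuming the same field names as above plus a hypothetical `agg.variance` field:

```nim
# Population variance from the two running sums: Var(x) = E[x^2] - E[x]^2
let n = float(depths.len)
if n > 0:
  let mean = float(agg.totalDepth) / n
  agg.variance = float(agg.sumSquaredDepth) / n - mean * mean
```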

2. Skip Depth Tracking for Zero/Low Coverage References ⭐⭐⭐

Current problem: We allocate large arrays and do full processing even when count is 0.

Impact: 50-90% faster for sparse datasets (many empty references)
Effort: Low (20 minutes)
Fix: Early exit after counting:

  # After the read counting loop
  if agg.count == 0 and task.opts.trackDepths:
    # Skip depth tracking for zero coverage
    agg.totalDepth = 0
    agg.trimmedMean = 0.0
    # Don't allocate the depths array at all


3. Avoid Full Sort for Trimmed Mean ⭐⭐⭐

Current problem: Line 519 copies the entire depths array, then sorts all of its values. For a 10 Mbp reference, this sorts 10 million integers.

Current code:

  var sortedDepths = depths      # COPY!
  algorithm.sort(sortedDepths)   # O(n log n)

Impact: 3-10x faster for large references (>1Mbp)
Effort: Medium (1 hour)
Options:

Option A - Partial Sort (fastest for typical trim ranges):

  • Only the percentile boundaries are needed, not a fully sorted array
  • Use partial sorting or an nth_element equivalent
  • For the 5th and 95th percentiles, this is much faster

Option B - In-place Sort (easy win, no copy):

  if task.opts.doTrimmedMean:
    algorithm.sort(depths)  # Sort in-place, we don't need original order
    let minIdx = …          # Calculate trimmed mean

Option C - Histogram-based (for integer depths < 10,000):

Build a histogram of depth values:

  var histogram: array[10000, int]
  for depth in depths:
    if depth < 10000:
      inc(histogram[depth])

Then find the percentiles from the histogram (O(n) instead of O(n log n)).
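
Extracting a percentile boundary from the histogram is a single linear walk. A sketch under the names used above (`percentileFromHistogram` is a hypothetical helper):

```nim
proc percentileFromHistogram(histogram: openArray[int];
                             total, pct: int): int =
  ## Depth value at the pct-th percentile of `total` observations.
  let target = total * pct div 100
  var seen = 0
  for depth in 0 ..< histogram.len:
    seen += histogram[depth]
    if seen >= target:
      return depth
  histogram.len - 1

# The trimmed mean then sums only depths between the two boundaries:
# let lo = percentileFromHistogram(histogram, depths.len, 5)
# let hi = percentileFromHistogram(histogram, depths.len, 95)
```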


4. Combine Breadth and Depth Tracking Loops ⭐⭐

Current problem: Lines 488-498 have two separate loops through alignment positions.

Current code:

  # Track covered positions for breadth calculation
  if task.opts.trackBreadth:
    for pos in aln.start ..< aln.stop:
      # ...

  # Track per-base depths for variance and/or trimmed mean calculation
  if task.opts.trackDepths:
    for pos in aln.start ..< aln.stop:
      # ...

Impact: ~30-40% faster when both are enabled
Effort: Low (15 minutes)
Fix:

  if task.opts.trackBreadth or task.opts.trackDepths:
    for pos in aln.start ..< aln.stop:
      if pos < agg.length:
        if task.opts.trackBreadth and not covered[pos]:
          covered[pos] = true
          inc(agg.coveredBases)
        if task.opts.trackDepths:
          inc(depths[pos])


Medium-Impact Optimizations

5. Lazy Depth Array Allocation ⭐⭐

Current problem: Lines 476-477 allocate arrays for every reference, even if most have zero coverage.

Impact: 20-40% memory savings, 10-20% faster for sparse datasets
Effort: Medium (30 minutes)
Fix: Only allocate when the first read is encountered, or use a sparse representation.
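
A minimal sketch of the lazy pattern, assuming a hypothetical `ensureDepths` helper called from the read loop:

```nim
var depths: seq[int32] = @[]   # stays empty for zero-coverage references

proc ensureDepths(depths: var seq[int32]; refLength: int) =
  # Allocate once, only when the first overlapping read is seen
  if depths.len == 0:
    depths = newSeq[int32](refLength)
```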


6. Reuse Depth Arrays Across References ⭐⭐

Current problem: New allocation for each reference (line 477).

Impact: 10-20% faster (reduced allocation overhead)
Effort: Medium (45 minutes)
Fix: Maintain a reusable buffer in the worker thread; just clear/resize as needed.
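
A sketch of the reuse pattern, with a hypothetical worker-local `buffer` that is resized and zeroed between references instead of reallocated:

```nim
var buffer: seq[int32]   # lives for the lifetime of the worker thread

proc resetBuffer(buffer: var seq[int32]; refLength: int) =
  buffer.setLen(refLength)      # reuses existing capacity when possible
  for i in 0 ..< buffer.len:
    buffer[i] = 0               # clear counts from the previous reference
```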


7. Use Sampling for Very Large References ⭐⭐

Current problem: For huge references (>5Mbp), trimmed mean sorts millions of values.

Impact: 5-50x faster for very large references
Effort: Medium (1 hour)
Fix: For references >5Mbp, sample every Nth position instead of using all positions. A trimmed mean estimated from 100k samples is statistically equivalent to one computed from all 10M positions.
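
A sketch of strided sampling, where the 5 Mbp cutoff and 100k target are the illustrative values from above:

```nim
const
  SampleCutoff = 5_000_000
  SampleTarget = 100_000

proc sampledDepths(depths: seq[int32]): seq[int32] =
  # Below the cutoff, use every position; above it, take every Nth
  if depths.len <= SampleCutoff:
    return depths
  let stride = max(1, depths.len div SampleTarget)
  var i = 0
  while i < depths.len:
    result.add(depths[i])
    i += stride
```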


Performance Benchmark Estimates

For a typical metagenome with:

  • 1000 references
  • 100 with high coverage (>10x)
  • 900 with zero/low coverage (<1x)
  • Using --all-metrics

Current performance: ~1 minute
With optimizations 1-4: ~25-30 seconds (50-60% faster)
With all optimizations: ~15-20 seconds (65-75% faster)


Recommended Implementation Order

  1. #2 - Skip zero coverage (easiest, biggest win for sparse data) DONE

  2. #1 - Consolidate depth iterations (trivial change, always beneficial) DONE

  3. #4 - Combine breadth/depth loops (easy, good for combined metrics)

  4. #3 - Avoid full sort (best for large references)

  5. #5 - Lazy allocation (polish)

Would you like me to implement any of these optimizations?


After comparing the optimized bamcountrefs.nim with the current bamtocounts.nim, I can see significant opportunities for improvement. Let me provide a comprehensive set of suggestions:

Suggested Improvements for bamtocounts.nim

  1. Code Organization & Structure

Add named constants for clarity:

  const
    DEFAULT_EXCLUDE_FLAGS = 1796   # unmapped + secondary + QC fail + duplicate
    READS_PER_MILLION = 1_000_000
    BASES_PER_KILOBASE = 1000

Create structured data types:

  • Define a FeatureMetrics object similar to ReferenceMetrics in bamcountrefs.nim:331-338
  • Separate concerns between worker options and task execution
  • Use proper types instead of scattered variables
  2. Multi-Threading Support

Enable parallel BAM processing:

  • Add --workers option to process multiple BAM files in parallel (bamcountrefs.nim:656-662)
  • Implement worker threads using the pattern from bamcountrefs.nim:440-570
  • Use thread-safe data structures (sequences instead of tables for worker results)

This is the biggest performance win - processing multiple samples concurrently.
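
The per-file fan-out can be sketched with Nim's standard threadpool, independently of the hts-nim worker details (compile with --threads:on; `countsForBam` here is a placeholder, not the real routine):

```nim
import std/threadpool

proc countsForBam(path: string): int =
  # stands in for the real per-file counting routine
  path.len

proc processAll(paths: seq[string]): seq[int] =
  var pending: seq[FlowVar[int]]
  for path in paths:
    pending.add(spawn countsForBam(path))   # one task per BAM file
  for fv in pending:
    result.add(^fv)                         # blocks until that task finishes
```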

  3. Performance Optimizations

Zero-coverage detection:

  • Use BAM index to detect features with zero coverage before expensive iteration (bamcountrefs.nim:477-488)
  • Skip processing entirely for zero-coverage features

Eliminate code duplication:

  • Merge makeCountsTable and alignments_count functions (bamtocounts.nim:74-110) - they’re nearly identical
  • Use a single, optimized counting function

Add fast path for simple cases:

  • When filters are minimal (mapq=0, eflag=4), use index statistics directly (bamcountrefs.nim:466-493)

Optimize per-base operations:

  • Combine breadth and depth tracking in a single pass if both requested (bamcountrefs.nim:522-529)
  • Use consolidated loops for metric calculations (bamcountrefs.nim:536-542)
  4. Additional Metrics

Currently bamtocounts only supports counts, RPKM, and counts/length. Add:

  • TPM (Transcripts Per Million) - more appropriate for RNA-seq (bamcountrefs.nim:108-143)
  • Mean coverage - average depth across the feature
  • Trimmed mean - robust mean excluding outliers (bamcountrefs.nim:545-568)
  • Variance - coverage variability (bamcountrefs.nim:169-189)
  • Covered bases - number of positions with coverage > 0
  • Covered ratio - breadth of coverage as fraction (bamcountrefs.nim:157-167)
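
For reference, TPM divides each count by feature length in kilobases, then rescales the per-kb rates so they sum to one million. A self-contained sketch (input names are illustrative, not from bamtocounts.nim):

```nim
proc tpm(counts, lengths: seq[int]): seq[float] =
  # Per-kilobase rates first, then scale so the rates sum to 1e6
  var rates = newSeq[float](counts.len)
  var total = 0.0
  for i in 0 ..< counts.len:
    rates[i] = counts[i].float / (lengths[i].float / 1000.0)
    total += rates[i]
  result = newSeq[float](counts.len)
  if total > 0:
    for i in 0 ..< counts.len:
      result[i] = rates[i] / total * 1e6
```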
  5. Output Improvements

Multi-file output mode:

  • Add --output option to generate separate files per metric (bamcountrefs.nim:302-438)
  • Example: basename_counts.tsv, basename_rpkm.tsv, basename_tpm.tsv
  • Keeps stdout output for backward compatibility

Add --all-metrics flag:

  • Enable all metrics at once (bamcountrefs.nim:718-727)
  6. Better Filtering

Fix --paired implementation:

  • The current implementation at bamtocounts.nim:100 checks the read2 flag, which means only R1 is counted
  • Add a --proper-pairs flag like bamcountrefs.nim:510-513 for stricter paired filtering
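
Counting each fragment exactly once can be expressed on the raw SAM FLAG bits (0x1 paired, 0x2 proper pair, 0x40 first in pair); this is a sketch of the logic, not the hts-nim API:

```nim
proc countsAsFragment(flag: uint16; properOnly: bool): bool =
  if (flag and 0x1) == 0:
    return true                 # unpaired read: always counted
  if properOnly and (flag and 0x2) == 0:
    return false                # --proper-pairs: require a proper pair
  (flag and 0x40) != 0          # count only the first mate (R1)
```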

Strand handling improvements:

  • Current strand implementation is basic
  • Consider optimization opportunities for stranded counting
  7. Memory Efficiency

Conditional tracking:

  • Only allocate per-base arrays when advanced metrics are requested (bamcountrefs.nim:496-501)
  • Most users only need counts - don’t waste memory on depth tracking by default
  8. Error Handling & Validation

Add BAM index requirement:

  if bam.idx == nil:
    stderr.writeLine("ERROR: BAM file requires index: ", bamPath)
    quit(1)

Better input validation:

  • Check that target file contains features before processing
  • Validate that BAM references match target chromosomes
  • Clear error messages when files are missing or malformed
  9. Code Quality

Remove dead code:

  • Remove commented legacy code (bamtocounts.nim:42-49)
  • Clean up unused imports

Consistent naming:

  • Use _t suffix for all types (currently inconsistent)
  • Follow Nim conventions throughout

Add debug output:

  • Add diagnostic messages controlled by a --debug flag (like bamcountrefs.nim)
  • Help users understand what’s happening during processing
  10. Feature-Specific Optimizations

Handle multi-exon features efficiently:

  • The current add() function (bamtocounts.nim:112-117) concatenates coordinates with semicolons
  • This is handled but could be optimized for intersection tests

Better GFF/GTF parsing:

  • Current implementation works but could benefit from validation
  • Add warnings for missing features

Priority Order

High Priority (biggest impact):

  1. Multi-threading support (10-100x speedup for multiple files)
  2. Zero-coverage detection optimization
  3. Eliminate code duplication in counting functions
  4. Fix --paired filtering bug

Medium Priority:

  5. Add TPM and other metrics
  6. Multi-file output mode
  7. Conditional memory allocation

Low Priority:

  8. Code cleanup
  9. Better error messages
  10. Enhanced debug output

Would you like me to start implementing any of these improvements? I’d suggest starting with multi-threading support and performance optimizations as they provide the most significant benefits.