High-Impact Optimizations (Recommended)
1. Consolidate Depth Array Iterations ⭐⭐⭐
Current problem: When both variance and trimmed mean are enabled, the depths array is traversed in separate loops.
Current code (lines 506-513):
```nim
if task.opts.doVariance:
  for depth in depths:
    agg.totalDepth += int64(depth)
    agg.sumSquaredDepth += int64(depth * depth)
else:
  for depth in depths:
    agg.totalDepth += int64(depth)
```
Impact: ~30-50% faster for combined metrics
Effort: Very low (10 minutes)
Fix: Always calculate both in a single loop:
```nim
for depth in depths:
  agg.totalDepth += int64(depth)
  if task.opts.doVariance:
    agg.sumSquaredDepth += int64(depth * depth)
```
2. Skip Depth Tracking for Zero/Low Coverage References ⭐⭐⭐
Current problem: We allocate large arrays and do full processing even when count is 0.
Impact: 50-90% faster for sparse datasets (many empty references)
Effort: Low (20 minutes)
Fix: Early exit after counting:
```nim
# After the read counting loop
if agg.count == 0 and task.opts.trackDepths:
  # Skip depth tracking for zero coverage
  agg.totalDepth = 0
  agg.trimmedMean = 0.0
  # Don't allocate the depths array at all
```
3. Avoid Full Sort for Trimmed Mean ⭐⭐⭐
Current problem: Line 519 copies entire depths array, then sorts all values. For a 10Mbp reference, this sorts 10 million integers.
Current code:
```nim
var sortedDepths = depths       # COPY!
algorithm.sort(sortedDepths)    # O(n log n)
```
Impact: 3-10x faster for large references (>1Mbp)
Effort: Medium (1 hour)
Options:
Option A - Partial Sort (fastest for typical trim ranges):
- Only the percentile boundaries are needed, not a full sort
- Use partial sorting or an nth_element equivalent
- For the 5th and 95th percentiles, this is much faster
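For illustration, a minimal Python sketch of the partial-selection idea (the Nim version would use a quickselect/nth_element routine; `trimmed_mean_partial` and the 5% default trim are assumptions for the example, not the tool's actual code):

```python
import heapq

def trimmed_mean_partial(depths, trim=0.05):
    """Trimmed mean via partial selection: find the trim boundary values
    with two O(n log k) selections instead of an O(n log n) full sort."""
    n = len(depths)
    if n == 0:
        return 0.0
    k = int(n * trim)
    if k == 0:
        return sum(depths) / n
    # (k+1)-th smallest / largest values mark the trim boundaries
    lo = heapq.nsmallest(k + 1, depths)[-1]
    hi = heapq.nlargest(k + 1, depths)[-1]
    kept = [d for d in depths if lo <= d <= hi]
    return sum(kept) / len(kept)
```

Note that when many values tie at a boundary this keeps all ties, so the result can differ slightly from a strict sorted-slice trim.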
Option B - In-place Sort (easy win, no copy):
```nim
if task.opts.doTrimmedMean:
  algorithm.sort(depths)  # Sort in-place, we don't need original order
  let minIdx = ...
  # Calculate trimmed mean
```
Option C - Histogram-based (for integer depths < 10,000):
Build a histogram of depth values:
```nim
var histogram: array[10000, int]
for depth in depths:
  if depth < 10000:
    inc(histogram[depth])
```
Find percentiles from the histogram (O(n) instead of O(n log n))
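A Python sketch of the full histogram approach (names are illustrative): one counting pass, then a walk over at most 10,000 buckets, keeping only the ranks inside the trim window.

```python
def trimmed_mean_from_histogram(depths, max_depth=10_000, trim=0.05):
    """O(n + max_depth) trimmed mean: bucket-count the depths, then walk
    the histogram keeping only ranks in [n*trim, n - n*trim)."""
    hist = [0] * max_depth
    for d in depths:
        hist[min(d, max_depth - 1)] += 1  # clamp outliers into last bucket
    n = len(depths)
    lo_rank, hi_rank = int(n * trim), n - int(n * trim)
    total = kept = seen = 0
    for value, count in enumerate(hist):
        if count:
            # overlap of this value's rank span with the kept window
            take = min(seen + count, hi_rank) - max(seen, lo_rank)
            if take > 0:
                total += value * take
                kept += take
            seen += count
    return total / kept if kept else 0.0
```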
4. Combine Breadth and Depth Tracking Loops ⭐⭐
Current problem: Lines 488-498 have two separate loops through alignment positions.
Current code:
```nim
# Track covered positions for breadth calculation
if task.opts.trackBreadth:
  for pos in aln.start ..< aln.stop:
    # ...

# Track per-base depths for variance and/or trimmed mean calculation
if task.opts.trackDepths:
  for pos in aln.start ..< aln.stop:
    # ...
```
Impact: ~30-40% faster when both are enabled
Effort: Low (15 minutes)
Fix:
```nim
if task.opts.trackBreadth or task.opts.trackDepths:
  for pos in aln.start ..< aln.stop:
    if pos < agg.length:
      if task.opts.trackBreadth and not covered[pos]:
        covered[pos] = true
        inc(agg.coveredBases)
      if task.opts.trackDepths:
        inc(depths[pos])
```
Medium-Impact Optimizations
5. Lazy Depth Array Allocation ⭐⭐
Current problem: Lines 476-477 allocate arrays for every reference, even if most have zero coverage.
Impact: 20-40% memory savings, 10-20% faster for sparse datasets
Effort: Medium (30 minutes)
Fix: Only allocate when the first read is encountered, or use a sparse representation.
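A sketch of the lazy-allocation pattern (Python for brevity; `accumulate_depths` and the tuple alignments are illustrative, not the tool's API):

```python
def accumulate_depths(alignments, ref_length):
    """Lazy per-base depth array: nothing is allocated until the first
    overlapping read, so zero-coverage references cost no memory."""
    depths = None
    for start, stop in alignments:
        if depths is None:
            depths = [0] * ref_length  # first read seen: allocate now
        for pos in range(max(start, 0), min(stop, ref_length)):
            depths[pos] += 1
    return depths  # None signals zero coverage
```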
6. Reuse Depth Arrays Across References ⭐⭐
Current problem: New allocation for each reference (line 477).
Impact: 10-20% faster (reduced allocation overhead)
Effort: Medium (45 minutes)
Fix: Maintain a reusable buffer in the worker thread; just clear/resize as needed.
7. Use Sampling for Very Large References ⭐⭐
Current problem: For huge references (>5Mbp), trimmed mean sorts millions of values.
Impact: 5-50x faster for very large references
Effort: Medium (1 hour)
Fix: For references >5Mbp, sample every Nth position instead of using all positions. A trimmed mean from 100k samples is statistically very close to one from 10M samples.
Performance Benchmark Estimates
For a typical metagenome with:
- 1000 references
- 100 with high coverage (>10x)
- 900 with zero/low coverage (<1x)
- Using --all-metrics
Current performance: ~1 minute
With optimizations 1-4: ~25-30 seconds (50-60% faster)
With all optimizations: ~15-20 seconds (65-75% faster)
Recommended Implementation Order
1. #2 - Skip zero coverage (easiest, biggest win for sparse data) DONE
2. #1 - Consolidate depth iterations (trivial change, always beneficial) DONE
3. #4 - Combine breadth/depth loops (easy, good for combined metrics)
4. #3 - Avoid full sort (best for large references)
5. #5 - Lazy allocation (polish)
Would you like me to implement any of these optimizations?
● After comparing the optimized bamcountrefs.nim with the current bamtocounts.nim, I can see significant opportunities for improvement. Let me provide a comprehensive set of suggestions:
Suggested Improvements for bamtocounts.nim
1. Code Organization & Structure
Add named constants for clarity:
```nim
const
  DEFAULT_EXCLUDE_FLAGS = 1796
  READS_PER_MILLION = 1_000_000
  BASES_PER_KILOBASE = 1000
```
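For reference, a sketch of how such constants enter the RPKM normalization (Python; the `rpkm` helper is illustrative, not the tool's actual function):

```python
READS_PER_MILLION = 1_000_000
BASES_PER_KILOBASE = 1000

def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads."""
    per_million = total_mapped_reads / READS_PER_MILLION
    per_kb = feature_length_bp / BASES_PER_KILOBASE
    return read_count / per_million / per_kb
```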
Create structured data types:
- Define a FeatureMetrics object similar to ReferenceMetrics in bamcountrefs.nim:331-338
- Separate concerns between worker options and task execution
- Use proper types instead of scattered variables
2. Multi-Threading Support
Enable parallel BAM processing:
- Add --workers option to process multiple BAM files in parallel (bamcountrefs.nim:656-662)
- Implement worker threads using the pattern from bamcountrefs.nim:440-570
- Use thread-safe data structures (sequences instead of tables for worker results)
This is the biggest performance win - processing multiple samples concurrently.
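The orchestration is simple once per-file work is independent; a Python sketch of the idea (`count_one` is a placeholder for the per-file counting routine; the Nim version would spawn worker threads using the pattern referenced above):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(bam_paths, count_one, workers=4):
    """Each worker owns one whole BAM file, so no locking is needed;
    results come back in input order as a plain list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_one, bam_paths))
```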
3. Performance Optimizations
Zero-coverage detection:
- Use BAM index to detect features with zero coverage before expensive iteration (bamcountrefs.nim:477-488)
- Skip processing entirely for zero-coverage features
Eliminate code duplication:
- Merge makeCountsTable and alignments_count functions (bamtocounts.nim:74-110) - they’re nearly identical
- Use a single, optimized counting function
Add fast path for simple cases:
- When filters are minimal (mapq=0, eflag=4), use index statistics directly (bamcountrefs.nim:466-493)
Optimize per-base operations:
- Combine breadth and depth tracking in a single pass if both requested (bamcountrefs.nim:522-529)
- Use consolidated loops for metric calculations (bamcountrefs.nim:536-542)
4. Additional Metrics
Currently bamtocounts only supports counts, RPKM, and counts/length. Add:
- TPM (Transcripts Per Million) - more appropriate for RNA-seq (bamcountrefs.nim:108-143)
- Mean coverage - average depth across the feature
- Trimmed mean - robust mean excluding outliers (bamcountrefs.nim:545-568)
- Variance - coverage variability (bamcountrefs.nim:169-189)
- Covered bases - number of positions with coverage > 0
- Covered ratio - breadth of coverage as fraction (bamcountrefs.nim:157-167)
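The TPM calculation mentioned above is a two-step normalization (length first, then library size); a Python sketch (function name illustrative):

```python
def tpm(counts, lengths_bp):
    """Transcripts Per Million: divide each count by feature length in kb,
    then scale so the values sum to one million across all features."""
    rpk = [c / (l / 1000) for c, l in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk] if scale else [0.0] * len(rpk)
```

Unlike RPKM, TPM values are comparable across samples because they always sum to the same total.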
5. Output Improvements
Multi-file output mode:
- Add --output option to generate separate files per metric (bamcountrefs.nim:302-438)
- Example: basename_counts.tsv, basename_rpkm.tsv, basename_tpm.tsv
- Keeps stdout output for backward compatibility
Add --all-metrics flag:
- Enable all metrics at once (bamcountrefs.nim:718-727)
6. Better Filtering
Fix --paired implementation:
- The current implementation at bamtocounts.nim:100 checks the read2 flag, which means it only counts R1
- Add --proper-pairs flag like bamcountrefs.nim:510-513 for stricter paired filtering
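The pairing logic comes down to standard SAM flag bits (0x80 = second read in pair, 0x2 = mapped in proper pair); a Python sketch of the intended filter (`should_count` is an illustrative name):

```python
FLAG_PROPER_PAIR = 0x2   # SAM flag: read mapped in proper pair
FLAG_READ2 = 0x80        # SAM flag: second read of the pair

def should_count(flag, paired=False, proper_pairs=False):
    """Count each fragment once: with paired mode, skip R2 so pairs are
    not double-counted; with proper_pairs, also require the 0x2 bit."""
    if paired and (flag & FLAG_READ2):
        return False
    if proper_pairs and not (flag & FLAG_PROPER_PAIR):
        return False
    return True
```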
Strand handling improvements:
- Current strand implementation is basic
- Consider optimization opportunities for stranded counting
7. Memory Efficiency
Conditional tracking:
- Only allocate per-base arrays when advanced metrics are requested (bamcountrefs.nim:496-501)
- Most users only need counts - don’t waste memory on depth tracking by default
8. Error Handling & Validation
Add a BAM index requirement:
```nim
if bam.idx == nil:
  stderr.writeLine("ERROR: BAM file requires index: ", bamPath)
  quit(1)
```
Better input validation:
- Check that target file contains features before processing
- Validate that BAM references match target chromosomes
- Clear error messages when files are missing or malformed
9. Code Quality
Remove dead code:
- Remove commented legacy code (bamtocounts.nim:42-49)
- Clean up unused imports
Consistent naming:
- Use _t suffix for all types (currently inconsistent)
- Follow Nim conventions throughout
Add debug output:
- Add diagnostic messages controlled by a --debug flag (like bamcountrefs.nim)
- Help users understand what’s happening during processing
10. Feature-Specific Optimizations
Handle multi-exon features efficiently:
- The current add() function (bamtocounts.nim:112-117) concatenates coordinates with semicolons
- This is handled but could be optimized for intersection tests
Better GFF/GTF parsing:
- Current implementation works but could benefit from validation
- Add warnings for missing features
Priority Order
High Priority (biggest impact):
- Multi-threading support (10-100x speedup for multiple files)
- Zero-coverage detection optimization
- Eliminate code duplication in counting functions
- Fix --paired filtering bug
Medium Priority:
5. Add TPM and other metrics
6. Multi-file output mode
7. Conditional memory allocation
Low Priority:
8. Code cleanup
9. Better error messages
10. Enhanced debug output
Would you like me to start implementing any of these improvements? I’d suggest starting with multi-threading support and performance optimizations as they provide the most significant benefits.