Sequencing file size estimator

Estimate FASTQ (raw + gzip) or BAM file size from read count, read length and layout. BAM is a rough approximation, since real sizes vary with data.

How it works

Formula

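FASTQ raw bytes ≈ reads × (2 × read length + 50): one sequence character plus one quality character per base, plus about 50 bytes per read for the header, the "+" line and newlines.

FASTQ gzip bytes ≈ raw × 0.25.

BAM bytes ≈ reads × (1.5 × read length + 64) × 0.55: 4-bit-packed bases (0.5 byte per base) plus 1 byte of quality per base, plus about 64 bytes of record overhead, then BGZF compression to roughly 0.55 of the uncompressed size.

Paired-end layouts count both mates, so reads = 2 × pairs.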
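As a minimal sketch, the formula can be written as a few Python functions. The function and constant names here are hypothetical, chosen for illustration; the numeric constants are the round figures from the formula, not measured values.

```python
# Round-figure constants taken from the formula; all are approximations.
FASTQ_OVERHEAD_PER_READ = 50   # header, "+" line and newlines
GZIP_RATIO = 0.25              # gzip size as a fraction of raw FASTQ
BAM_BYTES_PER_BASE = 1.5       # 0.5 byte packed base + 1 byte quality
BAM_OVERHEAD_PER_READ = 64     # per-record overhead
BGZF_RATIO = 0.55              # BGZF size as a fraction of uncompressed BAM


def total_reads(fragments: int, paired: bool) -> int:
    """Paired-end layouts contribute two reads (mates) per fragment."""
    return fragments * 2 if paired else fragments


def fastq_raw_bytes(fragments: int, read_length: int, paired: bool = False) -> int:
    """Estimated uncompressed FASTQ size in bytes."""
    reads = total_reads(fragments, paired)
    return reads * (2 * read_length + FASTQ_OVERHEAD_PER_READ)


def fastq_gzip_bytes(fragments: int, read_length: int, paired: bool = False) -> float:
    """Estimated gzipped FASTQ size in bytes."""
    return fastq_raw_bytes(fragments, read_length, paired) * GZIP_RATIO


def bam_bytes(fragments: int, read_length: int, paired: bool = False) -> float:
    """Estimated BAM size in bytes (rough approximation)."""
    reads = total_reads(fragments, paired)
    uncompressed = reads * (BAM_BYTES_PER_BASE * read_length + BAM_OVERHEAD_PER_READ)
    return uncompressed * BGZF_RATIO


if __name__ == "__main__":
    # 1,000,000 single-end 150 bp reads
    print(fastq_raw_bytes(1_000_000, 150))
    print(fastq_gzip_bytes(1_000_000, 150))
    print(round(bam_bytes(1_000_000, 150)))
```

Keeping the ratios as named constants makes it easy to substitute a ratio measured on your own data, which is likely to differ from these round figures.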

Worked example

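FASTQ, 1,000,000 single-end 150 bp reads: 1,000,000 × (2 × 150 + 50) = 350,000,000 bytes ≈ 334 MiB raw, and 350,000,000 × 0.25 = 87,500,000 bytes ≈ 83 MiB gzipped. The same reads as BAM: 1,000,000 × (1.5 × 150 + 64) × 0.55 = 158,950,000 bytes ≈ 152 MiB. Here 1 MiB = 1,048,576 bytes; in decimal units the BAM figure is about 159 MB.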

When to use it

Use it to budget disk space and transfer time before a run or download: how much space FASTQ versus BAM will need, and roughly how small each gets after compression.

Sensible defaults

Defaults estimate a large paired-end run (400M pairs × 150 bp) as FASTQ. The compression ratios are documented round figures, not vendor numbers; your real ratio depends on data complexity.
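For a sense of scale, the default run can be worked through in a few lines of Python. This is only a sketch: `human_size` is a hypothetical helper, binary units (1 GiB = 1024³ bytes) are an assumption about how sizes are displayed, and the 50-byte overhead and 0.25 gzip ratio are the same round figures used in the formula.

```python
def human_size(num_bytes: float) -> str:
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


pairs = 400_000_000          # default: 400M read pairs
read_length = 150            # default: 150 bp
reads = pairs * 2            # paired-end counts both mates

raw = reads * (2 * read_length + 50)   # uncompressed FASTQ bytes
gz = raw * 0.25                        # gzipped FASTQ bytes (round-figure ratio)

print("FASTQ raw: ", human_size(raw))
print("FASTQ gzip:", human_size(gz))
```

Under these assumptions the default run comes to 280,000,000,000 bytes of raw FASTQ, which is why even a rough estimate is worth making before starting a transfer.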

FAQ

How accurate is the BAM estimate?

It is an approximation only. Real BAM size depends heavily on read-name length, auxiliary tags, mapping, duplicate marking and data complexity. Treat it as a ballpark, not a guarantee.

Why is BAM smaller than raw FASTQ?

BAM packs each base into 4 bits and is BGZF-compressed, so the sequence takes much less space than in FASTQ. Quality strings, stored at one byte per base, dominate the record and compress poorly, which is why BAM is not as small as you might expect.
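To illustrate the 4-bit packing, here is a minimal Python sketch that packs two bases per byte using the base codes from the SAM/BAM specification (`=ACMGRSVTWYHKDBN` mapped to 0 through 15, first base in the high nibble). The function name `pack_bases` is hypothetical, and padding an odd-length read with a zero nibble is an assumption of this sketch. It is not a BAM writer, only a demonstration of why sequence costs about 0.5 byte per base.

```python
# 4-bit base codes from the SAM/BAM specification: index in the string = code.
BAM_BASE_CODES = "=ACMGRSVTWYHKDBN"


def pack_bases(seq: str) -> bytes:
    """Pack a nucleotide string into 4 bits per base, high nibble first."""
    codes = []
    for base in seq.upper():
        code = BAM_BASE_CODES.find(base)
        codes.append(code if code >= 0 else 15)  # unknown characters map to N
    if len(codes) % 2:
        codes.append(0)  # assumption: pad the unused low nibble with 0
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


if __name__ == "__main__":
    packed = pack_bases("ACGT")
    print(packed.hex())              # two bytes for four bases
    print(len(pack_bases("A" * 150)))  # bytes needed for a 150 bp read
```

A 150 bp read therefore needs 75 bytes for its bases but 150 bytes for its qualities, which matches the 1.5 bytes per base used in the formula.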