Estimating gigabytes from gigabases

colinbrislawn · September 2, 2021, 6:41pm

Illumina publishes its sequencing depths in terms of gigabases (1 billion bases == 1e9 bp), which makes sense.

But I also care about file sizes.

Does anyone have a heuristic for estimating gigabytes of .fastq data from gigabases of Illumina reads?

(This must depend on the size and complexity of the library in question, but there has got to be a 'typical' range along with some theoretical upper and lower bounds.)

I'm also interested in compressed fastq sizes, but that's a follow-up question.

Nicholas_Bokulich · September 3, 2021, 2:12pm

Hi @colinbrislawn,
interesting question! It depends on a few things!

Depends on the character and file encoding.

Depends on length/content of the header lines as well!

But based on gigabases alone and assuming:

that sequences and quality scores are the majority of content in a fastq file, so disregarding header lines for the moment;
that the file is UTF-8 or ASCII, so each of these characters takes 1 byte;

that would mean that 1 gigabase == 1e9 nt × 2 characters per base (nt and quality score) × 1 byte per character == 2e9 bytes == 2 GB
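To fold the header lines back into that estimate, here is a minimal Python sketch of the same arithmetic. It assumes strict four-line fastq records, one byte per character, Unix newlines, and a bare "+" separator line. The function names, the 150 nt read length, and the 60-character header are illustrative assumptions, not measurements.

```python
def fastq_bytes_per_base(read_length, header_length):
    """Uncompressed bytes per base for one 4-line fastq record.

    Assumes 1 byte per character, Unix newlines, and a bare '+' line.
    header_length counts the whole '@...' line, without its newline.
    """
    if read_length <= 0 or header_length < 1:
        raise ValueError("read_length must be > 0 and header_length >= 1")
    record_bytes = (
        header_length + 1   # '@' header line plus newline
        + read_length + 1   # sequence line plus newline
        + 1 + 1             # '+' separator line plus newline
        + read_length + 1   # quality line plus newline
    )
    return record_bytes / read_length


def estimate_gigabytes(gigabases, read_length, header_length):
    """Estimated uncompressed fastq size in GB (1 GB = 1e9 bytes)."""
    return gigabases * fastq_bytes_per_base(read_length, header_length)


# Hypothetical run: 150 nt reads with 60-character header lines.
print(round(fastq_bytes_per_base(150, 60), 2))
print(round(estimate_gigabytes(1.0, 150, 60), 2))
```

Under these assumptions the per-read overhead is the header plus five bytes, so the ratio approaches the 2 bytes per base lower bound as reads get longer and grows as reads get shorter or headers get longer.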

I don't have an exact answer, so this is a back-of-the-envelope calculation. I don't have heaps of fastq data lying around to confirm, but a quick look at the character count (with and without header lines) and file size of the Greengenes rep seqs (fasta, close enough) confirms as much, e.g., just to grab the largest file:

# number of bytes
wc -c 99_otus.fasta
 293189424 99_otus.fasta
# number of characters
wc -m 99_otus.fasta 
 293189424 99_otus.fasta
# number of characters, excluding header lines
grep -v '^>' 99_otus.fasta | wc -m
 291498321

The character count and byte count match!

So for fastq this will probably be 1 gigabase == 2 gigabytes!

Might be a little different since Q scores can include some special characters, but I expect the difference is not all that big...

colinbrislawn · September 3, 2021, 2:40pm

Thanks for the discussion, Nick!

I like how you approached this starting from the file formats themselves.

I ignored the theory and looked up the last few Illumina runs we did.

These are basepair counts and fastq.gz file sizes after bcl2fastq demultiplexing, which uses gzip compression level 4 (the equivalent of gzip -4).

(Colors are the 'Time' each run started, which differentiates the runs.)

lm(GigaByte ~ GigaBase)

term	estimate	std.error	statistic	p.value
GigaBase	0.5538953	0.0035846	154.5227	0

1 billion basepairs take up a bit over half a gigabyte, about 0.55 GB (compressed at gzip level 4).

gzip does between 2x and 5x compression on simple text, which would put the uncompressed size at roughly 1.1 to 2.8 bytes per base, so this is pretty close to 1 base = 2 bytes without compression.
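To replace the 2x to 5x rule of thumb with a measured ratio, a rough Python sketch using the standard-library gzip module at level 4 might look like this. The function names and the demo record are assumptions made for illustration; only the 0.5538953 slope comes from the regression above.

```python
import gzip


def gzip_ratio(raw_bytes, level=4):
    """Return uncompressed size divided by gzip-compressed size."""
    if not raw_bytes:
        raise ValueError("need at least one byte of input")
    compressed = gzip.compress(raw_bytes, compresslevel=level)
    return len(raw_bytes) / len(compressed)


def uncompressed_bytes_per_base(gz_gigabytes_per_gigabase, ratio):
    """Back out uncompressed bytes per base from a compressed slope."""
    return gz_gigabytes_per_gigabase * ratio


# Made-up, highly repetitive record: it compresses far better than real
# reads would, so this shows the call rather than a realistic ratio.
demo = b"@read1\nACGTACGT\n+\nIIIIIIII\n" * 1000
print(gzip_ratio(demo, level=4))

# Regression slope from above, with 2x and 5x as assumed bounds.
for ratio in (2, 5):
    print(ratio, round(uncompressed_bytes_per_base(0.5538953, ratio), 2))
```

Running gzip_ratio on a sample of real reads would give a ratio that could be plugged into uncompressed_bytes_per_base in place of the assumed bounds.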

Code on GitHub: colinbrislawn/gigabase-gigabyte