Interesting question! It depends on a few things:
- the character and file encoding;
- the length and content of the header lines.
But based on gigabases alone and assuming:
- that sequences and quality scores are the majority of content in a fastq file, so disregarding header lines for the moment;
- UTF-8 or ASCII
that would mean that 1 gigabase == 1e9 nt X 2 characters per base (nt and quality score) X 1 byte/character == 2e9 bytes == 2 GB
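To see how much the header lines shift that figure, here is a rough sketch of the same arithmetic in shell. The function name `estimate_fastq_bytes`, the 150 nt read length, and the 50-character header length are made-up example values, not measurements; it also assumes uncompressed, four-line fastq records with one byte per character.

```shell
# Rough estimate of uncompressed fastq size in bytes.
# Arguments (hypothetical example values, not measurements):
#   $1 = total bases, $2 = read length, $3 = header line length in characters
estimate_fastq_bytes() {
    bases=$1; read_len=$2; header_len=$3
    reads=$(( bases / read_len ))
    # per record: header + sequence + "+" line + quality, plus 4 newlines
    per_record=$(( header_len + read_len + 1 + read_len + 4 ))
    echo $(( reads * per_record ))
}

# Sequence and quality characters only: prints 2000000000
echo $(( 1000000000 * 2 ))

# Same gigabase as 150 nt reads with 50-character headers: prints 2366666430
estimate_fastq_bytes 1000000000 150 50
```

With those assumed values the headers, separators, and newlines add somewhat under 20% on top of the 2 bytes per base; shorter reads or longer headers would push that up.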
I don't have an exact answer, so treat this as a back-of-the-envelope calculation. I don't have heaps of fastq data lying around to confirm, but a quick look at the character count (with and without header lines) and file size of the Greengenes rep seqs (fasta, close enough) confirms as much. For example, just to grab the largest file:
# number of bytes
wc -c 99_otus.fasta
# number of characters
wc -m 99_otus.fasta
# number of characters, excluding header lines
grep -v '^>' 99_otus.fasta | wc -m
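The character count and byte count match, so every character in the file takes exactly one byte! (This check is only meaningful in a UTF-8 locale; in the C locale wc -m simply counts bytes.)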
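So for fastq this will probably be about 1 gigabase == 2 gigabytes, before counting header lines!
Q scores are encoded as printable ASCII characters (codes 33 to 126), which take one byte each in both ASCII and UTF-8, so the special characters should not change this. The header lines, the "+" separator lines, and the newlines will add a bit on top, but I expect the difference is not all that big...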
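If you do have a fastq file at hand, you can measure the ratio directly instead of estimating it. A minimal sketch, assuming an uncompressed file with exactly four lines per record; the two-read file built here is made up purely for illustration, so swap in your own path for `$fq`:

```shell
# Build a tiny made-up fastq file (two 8 nt reads) purely for illustration
fq=$(mktemp)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGGCCAA\n+\n!!!!####\n' > "$fq"

# Sequence lines are every 4th line, starting at line 2
bases=$(awk 'NR % 4 == 2 { n += length($0) } END { print n }' "$fq")
# Total file size in bytes (tr strips the padding some wc versions add)
bytes=$(wc -c < "$fq" | tr -d ' ')

echo "bases: $bases"
echo "bytes: $bytes"
# Bytes per base, including headers, "+" lines, and newlines
awk -v b="$bytes" -v n="$bases" 'BEGIN { printf "bytes per base: %.2f\n", b / n }'

rm -f "$fq"
```

The toy reads are so short that the headers dominate and the ratio comes out well above 2; with realistic read lengths it should sit much closer to 2.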