In recent experiments with 16S sequencing I have encountered uneven coverage among the processed samples, which shows up first in the FastQC reports for R1 and R2 and also in the merged R1/R2 sequences. Specifically, I found, for example, 500,000 reads for one sample, 30,000 for another, and 6,000 for another. I would appreciate it if you could suggest a protocol, both laboratory and bioinformatic, to deal with these situations:
In general I would exclude outliers if I can classify them as such, e.g. if only one sample has 6,000 reads or 500,000. However, if the variability also affects the vast majority of samples, I think I should estimate alpha diversity with rarefaction, using as the maximal depth that of the sample with the fewest sequences. But when calculating OTU representation as raw counts or relative abundance of taxa, what would you suggest? Should I classify a rarefied OTU table, or instead use the complete sequence files with uneven coverage?
If you could suggest the commands to obtain rarefied OTU tables and taxa from rarefied tables, I would be grateful; I am including the command I would like to use if I had even coverage among samples.
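To make the rarefaction plan above concrete, the alpha-diversity step I picture would look something like this (sketched in QIIME 2 syntax on the assumption that it applies; file names are placeholders, and 6,000 is the depth of my shallowest sample):

```
# Sketch only; QIIME 2 assumed, artifact names hypothetical.
# Rarefaction curves up to the depth of the shallowest sample (6,000 reads).
qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --p-max-depth 6000 \
  --o-visualization alpha-rarefaction.qzv
```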
Uneven sequencing is common on the Illumina platform. There are a few things you can do upstream, in the sample preparation phase, that should help. (Because this is a sequencing problem, it's best to fix it before sequencing instead of trying to deal with it downstream during analysis.)
The big thing my wet-lab colleagues did was normalize the amount of amplicon PCR product added to the Illumina sequencing run.
This is the basic workflow:
extract nucleotides -> PCR -> amplified libraries -> Illumina sequencing
They added a measurement and normalization step:
extract nucleotides -> PCR -> amplified libraries -> measure DNA concentration (Qubit, NanoDrop, etc.) -> calculate the volume needed to add a consistent mass of DNA -> Illumina sequencing
Because the mass of PCR product was more even, the reads per sample were more even.
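The volume math is just target mass divided by measured concentration. As an illustration, a throwaway script over a hypothetical table of Qubit readings (sample ID and ng/µL per line) might look like:

```
# Hypothetical input: concentrations.tsv with <sample_id> <ng_per_uL> per line.
# Prints the volume of each library to pool for a fixed 10 ng of DNA.
awk -v target_ng=10 '{ printf "%s\t%.2f uL\n", $1, target_ng / $2 }' concentrations.tsv
```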
Let me know if this makes sense.
Are you and your team already doing something like this? If so, what DNA concentrations (ng/mL) are you measuring?
I will be back after discussing this with the genomics laboratory.
Meanwhile, if you could suggest a way of subsampling that avoids introducing sampling bias, I would appreciate it very much.
My idea would be simply to subsample the original files (after removing clear outliers in terms of the number of sequences).
For example, I would use a command like the one below to obtain, say, 10,000 sequences from the original R1 and R2 FASTQ files, and then proceed.
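Something along these lines, with seqtk (file names are placeholders); the fixed random seed makes the draw reproducible, and using the identical seed for R1 and R2 keeps the mates paired:

```
# Hypothetical file names; seqtk assumed to be installed.
# The identical seed (-s100) keeps the R1 and R2 draws in sync.
seqtk sample -s100 sample01_R1.fastq 10000 > sample01_sub10k_R1.fastq
seqtk sample -s100 sample01_R2.fastq 10000 > sample01_sub10k_R2.fastq
```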
You can do this, but there is going to be a tradeoff between keeping more samples and keeping more data in each sample.
The issue is that samples with few reads have lower resolution. (Just like a photo with fewer pixels has a lower resolution.)
One common method of normalization involves subsampling, e.g. to 10k reads per sample as you mentioned. But what do you do with samples that have fewer than 10k reads? There is no way to create resolution you do not have, so many normalization pipelines simply drop these samples from the normalized output.
This is the tradeoff:
keep all samples, removing resolution from the deeply sequenced ones so that all are comparable, or
keep only the deeply sequenced samples, dropping those with fewer reads
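If you go the subsampling route on the OTU table itself, and assuming your pipeline is QIIME 2 (artifact names below are placeholders), a rarefied table can be produced like this; samples below the chosen depth are dropped from the output:

```
# Sketch assuming a QIIME 2 pipeline; artifact names are hypothetical.
# Samples with fewer than 10,000 reads are dropped from the output table.
qiime feature-table rarefy \
  --i-table table.qza \
  --p-sampling-depth 10000 \
  --o-rarefied-table table-rarefied10k.qza
```

Note that in QIIME 2 taxonomy is assigned to the representative sequences rather than to the table, so classification itself is not affected by rarefying the table.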
This is good! This is the wet-lab normalization method I was suggesting in my first post.
It sounds like they did their best. Sometimes samples come out uneven anyway.
Unfortunately, I think you are correct. Without another sequencing run, there's not much to be done for these samples...
Make sure you double-check the methods when running statistical tests: some expect normalized data, while others perform their own normalization and work best on raw counts (for example, differential-abundance tools such as DESeq2 expect raw counts and normalize internally). Sometimes, omitting outliers is all you need!