In recent experiments with 16S sequencing I have encountered uneven coverage among the processed samples, which shows up first in the FastQC reports for R1 and R2 and also in the merged R1/R2 sequences. Specifically, I found, for example, 500,000 reads for one sample, 30,000 for another, and 6,000 for another. I would appreciate it if you could suggest a protocol, both laboratory and bioinformatic, to deal with these situations:
In general I would exclude outliers if I can classify them as such, e.g. if only one sample has 6,000 reads or 500,000. However, if the variability also affects the vast majority of samples, I think I should estimate alpha diversity with rarefaction, using as the maximal depth that of the sample with the fewest sequences. But when calculating OTU representation as raw counts or relative abundance of taxa, what would you suggest? Should I classify a rarefied OTU table, or instead use the complete sequence files with uneven coverage?
If you could suggest the commands to obtain rarefied OTU tables and taxa from rarefied tables, I would be grateful; I am including the command I would like to use if I had even coverage among samples.
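To make the rarefaction plan above concrete, the alpha-diversity step I picture would look something like this (sketched in QIIME 2 syntax on the assumption that it applies; file names are placeholders, and 6,000 is the depth of my shallowest sample):

```
# Sketch only; QIIME 2 assumed, artifact names hypothetical.
# Rarefaction curves up to the depth of the shallowest sample (6,000 reads).
qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --p-max-depth 6000 \
  --o-visualization alpha-rarefaction.qzv
```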
Uneven sequencing is common on the Illumina platform. There are a few things you can do upstream, in the sample preparation phase, that should help. (Because this is a sequencing problem, it's best to fix it before sequencing instead of trying to deal with it downstream during analysis.)
The big thing my wet-lab colleagues did was normalize the amount of amplicon PCR product added to the Illumina sequencing run.
This is the basic workflow:
extract nucleotides -> PCR -> amplified libraries -> Illumina sequencing
They added a measurement and normalization step:
extract nucleotides -> PCR -> amplified libraries -> measure DNA concentration (Qubit, NanoDrop, etc.) -> calculate the volume needed to add a consistent mass of DNA -> Illumina sequencing
Because the mass of PCR product was more even, the reads per sample were more even.
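The volume math is just target mass divided by measured concentration. As an illustration, a throwaway script over a hypothetical table of Qubit readings (sample ID and ng/µL per line) might look like:

```
# Hypothetical input: concentrations.tsv with <sample_id> <ng_per_uL> per line.
# Prints the volume of each library to pool for a fixed 10 ng of DNA.
awk -v target_ng=10 '{ printf "%s\t%.2f uL\n", $1, target_ng / $2 }' concentrations.tsv
```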
Let me know if this makes sense.
Are you and your team already doing something like this? If so, what DNA concentrations (ng/mL) are you measuring?
I will be back after discussing this with the genomics laboratory.
Meanwhile, if you could suggest a way of subsampling that avoids introducing sampling bias, I would appreciate it very much.
My idea would be simply to subsample the original files (after removing clear outliers in terms of the number of sequences).
For example, I would use a command like the one below to obtain, say, 10,000 sequences from the original R1 and R2 FASTQ files, and then proceed.
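Something along these lines, with seqtk (file names are placeholders); the fixed random seed makes the draw reproducible, and using the identical seed for R1 and R2 keeps the mates paired:

```
# Hypothetical file names; seqtk assumed to be installed.
# The identical seed (-s100) keeps the R1 and R2 draws in sync.
seqtk sample -s100 sample01_R1.fastq 10000 > sample01_sub10k_R1.fastq
seqtk sample -s100 sample01_R2.fastq 10000 > sample01_sub10k_R2.fastq
```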
You can do this, but there is going to be a tradeoff between keeping more samples and keeping more data in each sample.
The issue is that samples with few reads have lower resolution. (Just like a photo with fewer pixels has a lower resolution.)
One common method of normalization involves subsampling, e.g. to 10k reads per sample as you mentioned. But what do you do with samples that have fewer than 10k reads? There is no way to create resolution you do not have, so many normalization pipelines simply drop these samples from the normalized output.
This is the tradeoff:
keep all samples, removing resolution from the deeply sequenced ones so that all are comparable, or
keep only the deeply sequenced samples, dropping those with fewer reads
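If you go the subsampling route on the OTU table itself, and assuming your pipeline is QIIME 2 (artifact names below are placeholders), a rarefied table can be produced like this; samples below the chosen depth are dropped from the output:

```
# Sketch assuming a QIIME 2 pipeline; artifact names are hypothetical.
# Samples with fewer than 10,000 reads are dropped from the output table.
qiime feature-table rarefy \
  --i-table table.qza \
  --p-sampling-depth 10000 \
  --o-rarefied-table table-rarefied10k.qza
```

Note that in QIIME 2 taxonomy is assigned to the representative sequences rather than to the table, so classification itself is not affected by rarefying the table.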
This is good! This is the wet-lab normalization method I was suggesting in my first post.
It sounds like they did their best. Sometimes samples come out uneven anyway.
Unfortunately, I think you are correct. Without another sequencing run, there's not much to be done for these samples...
Make sure you double-check the methods when running statistical tests: some expect normalized data, while others perform their own normalization and work best on raw counts (for example, differential-abundance tools such as DESeq2 expect raw counts and normalize internally). Sometimes, omitting outliers is all you need!