problems of classification with low biomass samples

gregcaporaso · July 20, 2023, 10:38pm

Hi @MichelaRiba,
It's true that this will be slower than clustering, but the difference shouldn't be very large (presuming that you still run quality control with DADA2 in both cases, which I highly recommend). Do you have a question at this stage, and can you clarify if so? Just want to make sure I'm not missing something.

Good luck!
Greg

MichelaRiba · July 21, 2023, 1:02pm

Hi,
Yes I do have a question:

I again have classification problems when working with low biomass samples. I do not get good classification starting from raw data with DADA2 and sklearn. I imagine DADA2 also performs clustering to obtain representative sequences; is this correct? If so, I think my problem is still there: clustering is a problem in a situation where my sequences come from samples of uneven composition.
Please let me know if I have made a mistake or misunderstood something.

MichelaRiba · July 24, 2023, 10:38am

Hi,

I am writing to confirm that the procedure with no clustering before taxonomy assignment performs better than the DADA2 pipeline. As in my previous message, I suppose that with messy low-abundance samples the DADA2 pipeline, which at a certain point needs to extract reference sequences from something like clusters, also suffers from problems that in turn lead to problems with classification, with many results left as unspecified "bacteria" or "OD1". So I would conclude that a guide or protocol for how to treat low-abundance samples, or maybe a QC checklist for excluding samples that would cause classification difficulties, would be very important. Thanks a lot!
Concerning the no-clustering option, I would say it is OK, but with a large number of samples and sequences I would expect scaling problems. Thanks a lot for keeping the discussion and guidance open!
Michela

gregcaporaso · July 26, 2023, 10:31pm

Hi @MichelaRiba, Apologies for missing your earlier question. DADA2 is not clustering in the traditional OTU clustering sense - rather it's performing quality control, and then defining the highest resolution "OTUs" that are possible from the data type. This is a good paper on the topic.

I'm glad to hear that you're now getting better results. I do highly recommend doing some sort of quality control before taxonomy assignment (DADA2 is still my recommended approach - I'm not sure if you've done that in this latest iteration). If these are low biomass host-associated samples, it may also help to apply host read filtering which you can do with qiime quality-control exclude-seqs - there is a tutorial on this here. If you have host reads showing up in your data, that could be what is not being assigned detailed taxonomic information.

MichelaRiba · July 28, 2023, 8:59am

Hi,

thanks a lot for the advice. I would just like to clarify that I do run QC on the sequences before OTU clustering and classification: a quality filter is applied during the fastq merge of R1 and R2, and in addition, during import of the joined sequences, I also run the following:
qiime quality-filter q-score

My point can be summarized as follows:
With messy samples (uneven sequencing coverage), whether I use DADA2 or cluster with vsearch, I end up with classification problems if I use all the samples without excluding clear outliers. Both procedures perform well once the worst samples are curated out. This problem does not occur if I use the vsearch procedure without clustering.
In this light, what do you suggest to avoid the time-consuming identification of outliers to exclude from the computation? As things stand, I would go for no clustering, since it lets me do without deep outlier curation, even though I suspect it will have scaling problems with the large number of samples coming (>500).
The alternative, as I see it, would be a method to highlight outliers, so that I could proceed with the DADA2 ASV pipeline or my current one (joining and filtering, quality-filtered import, vsearch with a clustering step).
I receive samples from the facility with very different coverage, ranging from 1,000,000 to 6,000 sequences. In the end this is a problem, since I cannot, for example, downsample to 6,000. Thanks, I hope this is sufficiently clear.
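
For illustration only, here is a minimal sketch of what a coverage-based way to highlight outliers could look like, assuming the per-sample sequence counts are already available as a simple mapping. The function name `split_by_depth`, the sample IDs, and the 10,000 threshold are all hypothetical and are not part of QIIME 2.

```python
def split_by_depth(read_counts, min_depth):
    """Split samples into (kept, flagged) by total sequence count.

    read_counts: dict mapping sample ID -> number of sequences.
    min_depth:   samples with fewer sequences than this are flagged.
    """
    kept = {s: n for s, n in read_counts.items() if n >= min_depth}
    flagged = {s: n for s, n in read_counts.items() if n < min_depth}
    return kept, flagged


# Hypothetical counts spanning the range mentioned above (1,000,000 to 6,000).
read_counts = {"S1": 1_000_000, "S2": 45_000, "S3": 6_000}

# The threshold is an assumption chosen only for this example.
kept, flagged = split_by_depth(read_counts, min_depth=10_000)

print(sorted(kept))     # ['S1', 'S2']
print(sorted(flagged))  # ['S3']
```

A fixed threshold like this only flags samples by depth; it says nothing about whether a flagged sample is also a compositional outlier.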

gregcaporaso · August 1, 2023, 4:52pm

That sounds good, thanks for clarifying.

No clustering is a better approach than clustering, if that works for you. Scaling may or may not be an issue - it likely depends on how much sequencing noise is making its way through the filters (there will always be some). I recommend starting with this, and reassessing if you have issues with scaling. 500 samples is large, but we have run much larger data sets through QIIME 2, so this is well within range of what QIIME 2 can handle.

How are you defining outliers here? Is it based on the coverage?

MichelaRiba · August 31, 2023, 1:18pm

Hi, thanks for the kind replies and support. Sorry for answering only now; I took a break (holiday).
Indeed, we have a problem with highly uneven coverage among samples. If I take no step to address it, I run into trouble with alpha diversity, for example: I have noticed that analysing together samples sequenced at very different depths affects how they group, because in some situations more sequences means more chances to sample biological entities. This affects both alpha and beta diversity, and also the grouping of samples based on metadata (e.g. control vs. diseased). For this reason I decided to add a subsampling step to 30,000 sequences. This means excluding samples with fewer sequences (e.g. the sample with 6,000), because at the moment it is not easy to get a solution for this uneven coverage from the sequencing center. Do you think it would be better to also pursue a resolution of the problem on the sequencing side? We are certainly losing information, because to arrive at a common baseline of 30,000 I have to discard the remainder...
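
The subsampling step described above can be sketched roughly as follows. This is only an illustration in plain Python, not the QIIME 2 implementation: `rarefy_sample` and `rarefy_table` are hypothetical names, the table is assumed to be a nested dict of per-sample feature counts, and the toy data uses a depth of 30 in place of 30,000.

```python
import random
from collections import Counter


def rarefy_sample(feature_counts, depth, rng):
    """Draw `depth` sequences without replacement from one sample.

    Returns None if the sample has fewer than `depth` sequences,
    which is how shallow samples end up excluded.
    """
    if sum(feature_counts.values()) < depth:
        return None
    # Expand counts to one entry per sequence (fine for a toy example,
    # memory-hungry for real data), then subsample.
    reads = [f for f, n in feature_counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(reads, depth)))


def rarefy_table(table, depth, seed=0):
    """Subsample every sample to the same depth, dropping shallow ones."""
    rng = random.Random(seed)
    rarefied = {}
    for sample_id, counts in table.items():
        sub = rarefy_sample(counts, depth, rng)
        if sub is not None:
            rarefied[sample_id] = sub
    return rarefied


# Toy table: sample ID -> {feature ID -> count}; all values are made up.
table = {
    "deep": {"A": 600, "B": 300, "C": 100},
    "mid": {"A": 20, "B": 15},
    "low": {"A": 4, "B": 2},  # fewer than 30 sequences, so it is dropped
}
rarefied = rarefy_table(table, depth=30)
print(sorted(rarefied))  # ['deep', 'mid']
```

The sketch makes the trade-off visible: every retained sample ends up with exactly the chosen depth, the "low" sample is lost entirely, and most of the sequences in the "deep" sample are discarded.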

gregcaporaso · September 5, 2023, 9:04pm

Hi @MichelaRiba, I'm not aware of a way that this could be addressed on the sequencing side, but that is outside of my expertise (which is primarily on the informatics side).

Does anyone know if there are approaches that could be taken during sample preparation or sequencing that would help with making sequencing depth across samples more even?
