Determining reasonable amount of ASVs/OTUs in analysis

szymanski · March 7, 2023, 8:23pm

In doing microbiome analysis, one step that has caused me much confusion is understanding how reasonable and "real" the number of ASVs or OTUs in an OTU/ASV table is, and how to really think through these differences.
Of course there are many factors that would influence the answer to this question. Off the top of my head the main ones that come to mind are:

OTU vs ASV: Given that OTUs typically collapse multiple varied-yet-close sequences, I would expect more ASVs than OTUs, all else being equal. Nested within this is also the threshold for OTU collapse (often 97%).
Amplicon of Interest: ITS vs 16S vs 18S would, I imagine, influence how many variants you expect to observe.
Sampling Design: Extraction from humans vs soil vs plant roots vs plant leaves would all influence how many features you expect.
Sequencing Platform: Illumina SE vs PE, 150 vs 250 vs 300bp, long read sequencing, etc.

So, with this in mind, how does one reasonably determine during their analysis, after getting sequencing data back, if their observed count of ASVs or OTUs is excessive or even informative? To what extent does it even matter? One of my largest worries is that, in downstream statistics, an excessive ASV or OTU count could hinder identifying true differences in diversity, population structure, etc.

While this is a general discussion question, it is still ultimately driven by seeking help as well. The circumstances that spurred me to ask are as follows: In examining ITS amplicons from various above-ground tissues of a plant, I obtained over 9000 ASVs (dada2-denoise-single) and 3000 OTUs (uparse) from 300bp single-end Illumina reads. In a previous experiment that was published, I had only ~1200 OTUs (I did not try ASVs for that one), so getting 3000 OTUs was astonishing, and while I knew I should expect more ASVs than OTUs, 9000 seemed excessive.

If these counts are excessive, what recourse is appropriate? I know I can alter various parameters such as the truncation length of the reads used to generate the features, acceptable error rate, chimera detection, etc., but when entering these areas I get exceedingly cautious, as I don't want to alter the experimental output too severely through arbitrary decision making.

Nicholas_Bokulich · March 8, 2023, 8:41am

Hi @szymanski ,

This is a good question. There is not a good answer.

If you are asking about the number of ASVs in a complete dataset (as opposed to a single sample), this then depends on so many more factors, like how many different sample types are included therein, the number of samples, etc. So it becomes a more complex and variable question, and for this I would generalize by saying that the total number probably does not matter (much), or rather that it is much harder to estimate what is a correct value.

The number of OTUs/ASVs per sample is going to be more important and should be less variable (assuming that you are comparing samples of the same type). For this, I would add to your list of main factors: rarefaction depth is a very significant factor for measuring alpha diversity in a single sample. This alone could explain differences between studies. QC and filtering settings are other obvious technical factors that will impact this significantly between studies. So are differences in sample preparation, DNA extraction, primer choice, library preparation, etc. For these reasons, it is quite difficult to compare between studies without closely controlling all technical parameters.
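
To make the per-sample comparison concrete, here is a minimal sketch of counting observed features per sample before and after rarefying to an even depth. The toy table, the sample and feature names, the depth of 200, and the helper names `rarefy` and `observed_features` are all hypothetical and only for illustration; in practice you would use the rarefaction and alpha diversity tools of your analysis platform.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample one sample's reads to `depth` without replacement.

    Returns None when the sample has fewer than `depth` reads.
    """
    # Expand the counts into one entry per read, then draw `depth` of them.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    drawn = random.Random(seed).sample(reads, depth)
    rarefied = {}
    for feature in drawn:
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied

def observed_features(counts):
    """Number of features with at least one read in the sample."""
    return sum(1 for n in counts.values() if n > 0)

# Hypothetical toy table: sample -> {feature: read count}
table = {
    "leaf_1": {"ASV1": 500, "ASV2": 300, "ASV3": 5, "ASV4": 1},
    "leaf_2": {"ASV1": 90, "ASV2": 60, "ASV5": 50},
    "stem_1": {"ASV2": 20, "ASV6": 10},
}
DEPTH = 200  # assumed rarefaction depth, chosen only for this example

for sample, counts in table.items():
    rarefied = rarefy(counts, DEPTH)
    if rarefied is None:
        print(sample, "dropped: fewer than", DEPTH, "reads")
    else:
        print(sample, "raw:", observed_features(counts),
              "rarefied:", observed_features(rarefied))
```

Deeply sequenced samples can lose their rarest features at a shallow depth, which is one reason per-sample counts from studies rarefied to different depths are hard to compare directly.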

It is also difficult to know what is a reasonable number, as these values (e.g., in the literature) are basically always empirically derived anyway when studying natural ecosystems. We know from experiments with mock communities that inappropriate data handling can inflate OTU counts by 1-2 orders of magnitude, but it can be difficult or impossible to derive "true" values from natural ecosystems. Mock communities can be used as an in-run positive control, but this can also be a mixed blessing (as errors and contaminants will occur, and can be difficult to assess).

In your case, not knowing too many details: ~1000 OTUs sounds more reasonable than 3000 OTUs, and 9000 ASVs sounds really unlikely for plant tissue. It could be technical variation between your studies, as mentioned above. But it could also be noisy data in your current study. I would carefully check the trimming, filtering, and other QC settings; ensure that primers and barcodes are trimmed before denoising; and consider filtering very rare ASVs/checking their taxonomic affiliations (to rule out possible [cross-]contaminants).
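
As a rough illustration of filtering very rare ASVs, the sketch below keeps only features that reach a minimum total read count and appear in a minimum number of samples. The thresholds, the toy table, and the helper name `filter_rare_features` are hypothetical; sensible cutoffs depend on your sequencing depth and study design, and the equivalent step would normally be run with your platform's own feature-table filtering tools.

```python
def filter_rare_features(table, min_total=10, min_samples=2):
    """Drop features below a total-count or prevalence threshold.

    `table` maps sample -> {feature: read count}. Returns the filtered
    table and the set of feature IDs that were kept.
    """
    totals, prevalence = {}, {}
    for counts in table.values():
        for feature, n in counts.items():
            if n > 0:
                totals[feature] = totals.get(feature, 0) + n
                prevalence[feature] = prevalence.get(feature, 0) + 1
    keep = {
        feature for feature in totals
        if totals[feature] >= min_total and prevalence[feature] >= min_samples
    }
    filtered = {
        sample: {f: n for f, n in counts.items() if f in keep}
        for sample, counts in table.items()
    }
    return filtered, keep

# Hypothetical toy table: sample -> {feature: read count}
toy_table = {
    "leaf_1": {"ASV1": 500, "ASV2": 300, "ASV3": 5, "ASV4": 1},
    "leaf_2": {"ASV1": 90, "ASV2": 60, "ASV5": 50},
    "stem_1": {"ASV2": 20, "ASV6": 10},
}

filtered, kept = filter_rare_features(toy_table, min_total=10, min_samples=2)
print("features before:", len({f for c in toy_table.values() for f in c}))
print("features after:", len(kept))
```

Comparing the feature count before and after such a filter, and checking the taxonomy of what was removed, gives a quick sense of how much of the total is driven by rare, possibly spurious features.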