filtering ASVs by abundance

hsapers · June 27, 2020, 3:17am

Hello,
I've really been enjoying reading through the extensive collection of resources here. I'm re-analyzing my sequencing data using ASVs rather than OTUs. My data contain a large number of low-abundance features. When working with OTUs, I would filter so that I retained only OTUs comprising at least 0.1% of total observed reads (rather than just removing singletons).

My understanding is that DADA2 has a built-in error model that infers true biological sequences and filters out spurious features, so the singleton removal and threshold removal needed with OTU-based clustering are no longer required. I also understand that it can be run sample by sample (least computationally expensive), on all samples pooled together (increased probability of resolving low-abundance ASVs), or with a 'pseudo-pooling' method. Currently it seems that the QIIME 2 plugin only allows the per-sample implementation of DADA2. Does this mean that I should interpret all resulting ASVs as 'true' biological sequences - even those that make up < 0.1% of the data?

I'm also working with deep biosphere samples, so DNA yield is low and many features are poorly represented in reference databases. I have many ASVs classified only to the domain or phylum level - I'm assuming that taxonomy collapse would not be that useful here, since rep-seqs with the same taxonomic classification could be quite divergent?

I also want to make sure that I'm reading the feature table summary statistics correctly: Number of features: 7427, frequency: 2079818. This means that I have 7427 ASVs observed a total of 2079818 times across all of my samples. If I wanted ASVs that comprise at least 1% of my total observed reads, I could filter to ASVs observed at least 20798 times.
This would likely remove a large share of the ASVs: in the 'frequency per feature' table, the minimum ASV frequency is 1 and the mean frequency is 280, so the average ASV comprises only about 0.01% of all observed reads, an order of magnitude below the 0.1% threshold I used to filter out spurious OTUs. I'm hoping this makes some sense and I'm not just talking in circles.
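For anyone wanting to check the arithmetic above, here is a minimal Python sketch. It uses only the two figures quoted from the feature table summary (7427 features, total frequency 2079818); the variable names are made up for illustration.

```python
# Figures quoted from the feature table summary in the post.
n_features = 7427
total_frequency = 2_079_818

# Read count corresponding to 1% of all observed reads.
threshold_1pct = total_frequency * 0.01
print(f"1% of total reads: {threshold_1pct:.2f}")  # 20798.18

# Mean frequency per feature, and the share of all reads that represents.
mean_frequency = total_frequency / n_features
mean_share_pct = 100 * mean_frequency / total_frequency
print(f"mean frequency per ASV: {mean_frequency:.2f}")      # 280.03
print(f"mean share of total reads: {mean_share_pct:.4f}%")  # 0.0135%
```

So the mean ASV sits at roughly 0.01% of all reads, which matches the estimate in the post.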
Thank you!

Since all of this filtering is based on relative abundance, I'm assuming that any thresholding should be done only on experimental samples, after assessing for contamination and filtering out the control samples. This is where I over-think again - if ASVs are inferred on a per-sample basis, removing samples at this stage is permissible. How would this change if I inferred ASVs by pooling samples - could I still remove samples without fundamentally affecting the compositional nature of the data?

yanxianl · June 27, 2020, 4:24pm

Hi,

I used to have the same question. Here's my two cents:

Yes. Abundance-based filtering ("threshold removal"), which was implemented in QIIME 1 to remove spurious OTUs, is generally considered unnecessary for feature tables generated by sequence denoisers such as DADA2 and Deblur. Singletons detected in each sample are still removed when DADA2 processes the sequences independently for each sample. In some cases, such as differential abundance testing, filtering low-abundance and/or low-prevalence features is still warranted to reduce the burden of multiple hypothesis testing. See more discussions here about the abundance based filtering.
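To make the abundance/prevalence filtering idea concrete, here is a rough sketch in plain Python. It is only an illustration of the logic, not how QIIME 2 implements it: the function name `filter_features`, the toy table, and the thresholds are all hypothetical.

```python
def filter_features(table, min_frequency=0, min_samples=0):
    """Keep features whose total count and prevalence meet both thresholds.

    table: dict mapping feature ID -> dict mapping sample ID -> read count.
    """
    kept = {}
    for feature, counts in table.items():
        total = sum(counts.values())                           # total reads for this feature
        prevalence = sum(1 for c in counts.values() if c > 0)  # samples it appears in
        if total >= min_frequency and prevalence >= min_samples:
            kept[feature] = counts
    return kept


# Hypothetical toy table (made-up counts, three samples).
toy_table = {
    "asv_a": {"s1": 120, "s2": 95, "s3": 110},
    "asv_b": {"s1": 0, "s2": 3, "s3": 0},
    "asv_c": {"s1": 40, "s2": 0, "s3": 0},
}

filtered = filter_features(toy_table, min_frequency=10, min_samples=2)
print(sorted(filtered))  # ['asv_a']
```

Here `asv_b` is dropped for low total abundance and `asv_c` for appearing in only one sample. In practice the thresholds depend on the downstream test, so treat the values above as placeholders.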

Yes. At the moment, DADA2 in QIIME 2 processes samples independently, which removes the singletons detected in each sample. But "pseudo-pooling", which allows for the detection of rare features, will be available in the coming release of QIIME 2 this month.

Yes. The inferred ASVs are still reliable even if they account for < 0.1% of the total reads.

Yes, that's correct.

"Data that are naturally described as proportions or probabilities, or with a constant or irrelevant sum, are referred to as compositional data (Gloor et al., 2017)." The compositional nature of microbiome data refers to the practice of normalizing sequencing depth by total sum scaling, i.e., relative abundance. When we talk about the relative abundance of a feature, we usually mean the proportion of reads assigned to that feature within a sample, not across all the samples. Removing samples from your project, whether they were processed by DADA2 independently or jointly, is not related to compositional data analysis.
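A small sketch of the per-sample point, assuming total sum scaling as described above. The sample names and counts are invented for illustration; `relative_abundance` is just a helper defined here.

```python
def relative_abundance(sample_counts):
    """Total sum scaling: each feature's share of the reads in one sample."""
    total = sum(sample_counts.values())
    if total == 0:
        return {feature: 0.0 for feature in sample_counts}
    return {feature: count / total for feature, count in sample_counts.items()}


# Hypothetical counts: two experimental samples and one control ("blank").
toy = {
    "sample_1": {"asv_a": 50, "asv_b": 50},
    "sample_2": {"asv_a": 90, "asv_b": 10},
    "blank": {"asv_a": 1, "asv_b": 9},
}

before = {s: relative_abundance(c) for s, c in toy.items()}

# Drop the control sample, then recompute the proportions.
without_blank = {s: c for s, c in toy.items() if s != "blank"}
after = {s: relative_abundance(c) for s, c in without_blank.items()}

# Proportions within the remaining samples are the same as before.
unchanged = all(before[s] == after[s] for s in after)
print(unchanged)  # True
```

Because each sample is scaled by its own total, dropping the control leaves the proportions in the other samples untouched. A threshold defined on reads summed across all samples would shift, though, since that total changes when samples are removed.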

hsapers · June 29, 2020, 12:58am

Thanks! Your explanations really confirmed my suspicions.