I’ve really been enjoying reading through the extensive collection of resources here. I’m re-analyzing my sequencing data using ASVs rather than OTUs. My data contain a large number of low-abundance features. When working with OTUs, I would retain only OTUs that comprised at least 0.1% of total observed reads (rather than just removing singletons).

My understanding is that DADA2 has a built-in error model that infers true biological sequences and filters out spurious features, so the singleton removal and abundance thresholding required with OTU-based clustering are no longer necessary. My understanding is also that inference can be run sample by sample (least computationally expensive), on all samples pooled together (increased probability of resolving low-abundance ASVs), or using a ‘pseudo-pooling’ method. Currently it seems that the QIIME 2 plugin only allows the per-sample implementation of DADA2. Does this mean that I should interpret all resulting ASVs as ‘true’ biological sequences, even those that account for < 0.1% of the data?

I’m also working with deep biosphere samples, so I have low DNA yield and features that are poorly represented in reference databases. I have many ASVs classified only to the domain or phylum level. I’m assuming that taxonomy collapse would not be that useful here, since rep-seqs with the same taxonomic classification could be quite divergent?

I also want to make sure that I’m understanding the feature table summary statistics correctly: Number of features: 7427, frequency: 2079818. This means that I have 7427 ASVs observed a total of 2079818 times across all of my samples. If I wanted ASVs that comprise at least 1% of my total observed reads, I could filter to ASVs observed at least 20798 times. This would likely remove a significant number of ASVs: in the ‘frequency per feature’ table, the min ASV frequency is 1 and the mean frequency is 280, meaning that an ASV at the mean frequency comprises only about 0.01% of all observed reads, an order of magnitude below the 0.1% threshold I used to filter out OTUs as spurious. I’m hoping this makes some sense and I’m not just talking in circles.
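To make the arithmetic concrete, here is a minimal Python sketch of how I am deriving these numbers from the summary. The function name `min_count_for_fraction` is just something I made up for illustration, not part of QIIME 2 or DADA2, and I am rounding down to match the 20798 figure (rounding up would give 20799).

```python
# Values taken from the feature table summary quoted above.
n_features = 7427
total_frequency = 2079818


def min_count_for_fraction(total: int, fraction: float) -> int:
    """Hypothetical helper: count cutoff for a given share of all reads.

    Truncates toward zero, so the cutoff is rounded down.
    """
    return int(total * fraction)


# Cutoffs for the two thresholds discussed in the post.
cutoff_1_percent = min_count_for_fraction(total_frequency, 0.01)
cutoff_0_1_percent = min_count_for_fraction(total_frequency, 0.001)

# Mean reads per ASV, and the share of all reads that mean represents.
mean_frequency = total_frequency / n_features
mean_share_percent = 100 * mean_frequency / total_frequency

print(cutoff_1_percent, cutoff_0_1_percent)
print(round(mean_frequency), round(mean_share_percent, 4))
```

If this is right, the 0.1% cutoff I used for OTUs would correspond to roughly 2079 reads here, which is still well above the mean ASV frequency.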
Since all of this filtering is based on relative abundance, I’m assuming that any thresholding should be done only on experimental samples, after assessing for contamination and filtering out control samples. This is where I over-think again: if ASVs are inferred on a per-sample basis, removing samples at this stage is permissible. How would this change if I constructed ASVs by pooling samples? Could I still remove samples without fundamentally affecting the compositional nature of the data?
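As a toy illustration of why I think the order of operations matters for the thresholding step, here is a small Python sketch. The table, the sample names, and the helper functions are all invented for this example. It only shows how the relative-abundance cutoff shifts when control samples are dropped first, and says nothing about how DADA2 itself infers ASVs.

```python
# Hypothetical toy feature table: sample -> {ASV id -> read count}.
table = {
    "sample_A": {"asv1": 900, "asv2": 90, "asv3": 10},
    "sample_B": {"asv1": 500, "asv2": 0, "asv3": 5},
    "blank_control": {"asv1": 0, "asv2": 0, "asv3": 495},
}


def feature_fractions(table, keep_samples):
    """Share of all retained reads assigned to each ASV."""
    totals = {}
    for sample in keep_samples:
        for asv, count in table[sample].items():
            totals[asv] = totals.get(asv, 0) + count
    grand_total = sum(totals.values())
    return {asv: count / grand_total for asv, count in totals.items()}


def features_above(fractions, min_fraction):
    """ASVs whose share of reads meets the threshold."""
    return {asv for asv, frac in fractions.items() if frac >= min_fraction}


# Threshold applied with the control sample still in the table.
kept_all = features_above(feature_fractions(table, list(table)), 0.01)

# Threshold applied after removing the control sample.
experimental = ["sample_A", "sample_B"]
kept_experimental = features_above(feature_fractions(table, experimental), 0.01)

print(sorted(kept_all))
print(sorted(kept_experimental))
```

In this made-up case, asv3 passes a 1% threshold only because of its reads in the blank, and falls below it once the control is removed.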