For quality control purposes, it’s often useful to track the number of reads at each step of a sequence quality control pipeline using dada2 or deblur. Additionally, we sometimes get amplicons that deviate a lot from the expected size, and these should be excluded from the FeatureTable. Is it possible to add this information to the FeatureTable summary? I’d definitely love to see these new features in future releases of qiime2.
This type of DADA2 reporting was discussed in this forum topic. We’ll follow up here when the feature is available in a qiime2 release!
@wasade does q2-deblur log this kind of filtering information?
@Nicholas_Bokulich, would this type of filtering make sense in the quality-control plugin you’re working on?
I agree tracking this type of filtering information would be useful, though I’m not sure it’d be possible to display this information in the feature-table summarize visualization. The reason is that QIIME 2 is decentralized, such that there is no set of sequential steps in an analysis like we had in QIIME 1. QIIME 2 is more of a “choose-your-own-adventure”, where you could, for example, choose to denoise your data with DADA2, or perform quality-score based filtering with q2-quality-filter and then denoise the data with Deblur. You can even avoid denoising algorithms altogether and cluster your sequences into OTUs, similar to QIIME 1.
Each of these steps could perform filtering in different ways, and track that information differently too (it’s up to the plugin developer how they want to do that). Thus, by the time a user generates a feature table summary with feature-table summarize, we don’t have access to any of those “upstream” quality-filtering steps; all we have is a feature table that tells us how many features we have in our data set, and how abundant those features are. An artifact’s provenance tells us what actions were executed to create the feature table, but we don’t know how many sequences were filtered out in “upstream” analyses.
Besides the DADA2 filtering reporting I linked to above, you can accomplish what you’re looking for by using demux summarize to inspect how many sequences each sample has prior to quality-filtering / denoising. You can then apply whatever denoising or clustering analyses you’d like, and use feature-table summarize to see how many sequences remain in your feature table (that info is listed as Total frequency under the Table summary heading).
So while it would be difficult to display all of the information you’re describing in feature-table summarize, it’s possible to find that info by comparing demux summarize to feature-table summarize.
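To make the comparison concrete, here is a minimal sketch in plain Python. It assumes you have copied the per-sample sequence counts out of the demux summarize and feature-table summarize visualizations by hand; the sample IDs, the counts, and the `retention_report` helper are all hypothetical and not part of any QIIME 2 plugin.

```python
def retention_report(input_counts, table_counts):
    """Compare per-sample read counts before and after denoising.

    input_counts: {sample_id: reads reported by demux summarize}
    table_counts: {sample_id: frequency reported by feature-table summarize}
    """
    report = {}
    for sample, n_in in input_counts.items():
        # A sample missing from the table lost all of its reads upstream.
        n_out = table_counts.get(sample, 0)
        pct = 100.0 * n_out / n_in if n_in else 0.0
        report[sample] = {
            "input": n_in,
            "retained": n_out,
            "percent_retained": round(pct, 1),
        }
    return report


# Hypothetical numbers for illustration only.
demux_counts = {"S1": 10000, "S2": 8000, "S3": 500}
table_counts = {"S1": 7500, "S2": 6000}

report = retention_report(demux_counts, table_counts)
for sample, row in report.items():
    print(sample, row)
```

This only gives the overall loss between the two endpoints; it cannot say which upstream step removed the reads.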
@yanxianl where are these amplicons coming from? improper alignment of paired-end reads?
@jairideout amplicon length is not really something we would explicitly address with the filtering methods I had planned. I agree with @yanxianl that it could be useful to have this information in feature table summaries, but then summarize would require an input seqs file. Instead, perhaps tabulate-seqs should include size distributions? To complement this, it might make sense to add a length filter and/or trimming step to filter-seqs to address any issues with sequence length, e.g., misaligned paired ends leading to extra-long seqs.
I read the post you shared and it basically answered my question. It’s often useful to keep track of the read number after quality filtering, merging and chimera removal, so that troubleshooting is easier when the DADA2 or Deblur pipeline doesn’t work properly. Great to know that this kind of function will be available in future qiime2 releases.
The alternative way you suggested to get the percentage of reads passing the quality control pipeline in qiime2 is straightforward. But I think tracking the read number at each step of the quality control pipeline is probably more useful to users, as suggested for the dada2 pipeline in the post you highlighted.
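As a rough illustration of what per-step tracking buys you, the sketch below takes read counts recorded after each step and reports the percentage kept relative to the previous step. The step names follow the ones mentioned above (quality filtering, merging, chimera removal); the counts and the `step_losses` helper are made up for illustration and are not output from DADA2, Deblur, or qiime2.

```python
def step_losses(counts_by_step):
    """Return (step, reads, percent kept vs. previous step) for each step.

    counts_by_step: ordered list of (step_name, reads_remaining).
    The first step has no previous step, so its percentage is None.
    """
    rows = []
    prev = None
    for step, n in counts_by_step:
        pct_prev = None if prev in (None, 0) else round(100.0 * n / prev, 1)
        rows.append((step, n, pct_prev))
        prev = n
    return rows


# Hypothetical counts for a single sample.
tracked = [
    ("input", 10000),
    ("filtered", 9000),
    ("merged", 8100),
    ("non-chimeric", 7290),
]

rows = step_losses(tracked)
for step, n, pct in rows:
    print(step, n, pct)
```

A sharp drop at one step (for example at merging) points to where the troubleshooting should start, which the endpoint-only comparison cannot show.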
I analyzed a small 16S rRNA dataset, generated with the Earth Microbiome Project primers targeting the 515f/806r region, using both qiime2 and R. To get a better understanding of the dada2 pipeline, I first read the original paper, visited the dada2 GitHub repository and went through the tutorial provided there, which included read number tracking and an amplicon size summary after the feature table was generated. The normal range of V4 amplicon size was said to be 250-256, and I did find dozens of sequence variants falling outside that range. Unfortunately, I did not explore the reason (probably non-specific binding of primers?) and just filtered out these SVs.
This experience made me wonder whether such amplicon size filtering is also included in the qiime2 pipeline when dada2 is used for sequence quality control. I think a “tabulate-seqs” function that includes the size distribution would be very useful and necessary for sequence quality control.
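For anyone who wants to do this kind of size check by hand in the meantime, here is a minimal sketch in plain Python. It assumes the representative sequences have already been loaded into a dict of feature ID to sequence string (how you export and parse them is up to you); the 250-256 window is the V4 range quoted above, and the feature IDs, toy sequences, and both helper functions are hypothetical, not part of qiime2 or dada2.

```python
from collections import Counter


def length_distribution(seqs):
    """Map sequence length -> number of features with that length."""
    counts = Counter(len(s) for s in seqs.values())
    return dict(sorted(counts.items()))


def filter_by_length(seqs, min_len=250, max_len=256):
    """Split features into those inside the length window and those outside."""
    kept = {fid: s for fid, s in seqs.items() if min_len <= len(s) <= max_len}
    dropped = sorted(set(seqs) - set(kept))
    return kept, dropped


# Toy sequences standing in for real sequence variants.
rep_seqs = {
    "sv1": "A" * 253,
    "sv2": "C" * 253,
    "sv3": "G" * 254,
    "sv4": "T" * 310,  # unexpectedly long
    "sv5": "A" * 120,  # unexpectedly short
}

dist = length_distribution(rep_seqs)
kept, dropped = filter_by_length(rep_seqs)
print(dist)
print("dropped:", dropped)
```

The dropped feature IDs could then be used to remove the same features from the feature table, so the table and the sequences stay in sync.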
The q2-deblur --p-sample-stats option tracks per-sample information such as the number of dereplicated reads, deblur clusters, and pre/post-filtering counts. In writing this reply, I realized I couldn’t find an example to readily link to, although you can see a screenshot of the stats output here. I opened an issue about adding an example of the stats output to the primary tutorial so that it can be linked to in the future.