Filter features?

stangedal · October 13, 2017, 4:31pm

Hi, I am just starting to use QIIME2 after having been a QIIME1 user for some time. I now have a question about filtering feature tables the way it was done in QIIME1 with filter_otus_from_otu_table.py:

--min_count_fraction: Fraction of the total observation (sequence) count to apply as the minimum total observation count of an OTU for that OTU to be retained. This is a fraction, not percent, so if you want to filter to 1%, you specify 0.01. [default: 0]

Bokulich addressed low-abundance filtering in 2013, recommending: "For datasets where a mock community is not included for calibration, we recommend the conservative threshold of (c = 0.005%)."
Here c is the OTU abundance threshold.
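To make the arithmetic concrete, here is a minimal Python sketch of what a fraction-based abundance filter does. It is not the QIIME 1 implementation: the function name, the toy table, and the choice of "greater than or equal to" as the retention rule are all assumptions made for illustration. The one fixed point is the unit conversion, since c = 0.005% corresponds to a fraction of 0.00005.

```python
# Toy feature table: feature ID -> per-sample counts (made-up numbers).
table = {
    "feat_A": [60000, 50000, 40000],
    "feat_B": [20000, 15000, 14990],
    "feat_C": [3, 2, 0],
    "feat_D": [0, 5, 0],
}


def filter_by_min_count_fraction(table, min_count_fraction):
    """Keep features whose total count is at least
    min_count_fraction * (sum of all counts in the table).

    Hypothetical helper; the >= comparison is an assumption.
    """
    grand_total = sum(sum(counts) for counts in table.values())
    min_count = min_count_fraction * grand_total
    return {
        fid: counts
        for fid, counts in table.items()
        if sum(counts) >= min_count
    }


# The threshold is quoted as a percent, the option expects a fraction.
c_percent = 0.005
c_fraction = c_percent / 100  # 0.00005

filtered = filter_by_min_count_fraction(table, c_fraction)
print(sorted(filtered))
```

With 200,000 total counts in the toy table, the cutoff works out to about 10 counts, so the two features with 5 total counts each are dropped. In QIIME 2, `qiime feature-table filter-features` offers `--p-min-frequency`, which takes an absolute count rather than a fraction, so the fraction would need to be converted to a count first.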

After DADA2 quality filtering on my samples - no mock community involved unfortunately - I am wondering if a similar filtering is still recommended, and in that case how do I go about it?

Solveig

Nicholas_Bokulich · October 13, 2017, 4:49pm

@stangedal excellent question.

No, if you are using DADA2 no subsequent abundance filtering needs to be applied. The abundance filtering recommendation was specific to the OTU picking pipelines in QIIME1 (but would probably still apply if you use the OTU picking pipelines in QIIME2).

yanxianl · October 13, 2017, 9:42pm

Hi @Nicholas_Bokulich, why is subsequent abundance filtering not necessary when using dada2? Does that mean the sequence variants (SVs) generated this way are free of spurious SVs? What about filtering the FeatureTable according to sequence prevalence in the biological samples investigated? Some recommend excluding OTUs present in only one sample, whereas the authors of the dada2 paper actually suggested excluding taxa that are unclassified at the phylum level or not present in at least a certain percentage of the total number of samples, with that percentage depending on the samples.

In our studies, we included mocks as positive controls. Does this assure that we can use any feature table filtering method as long as it produces accurate results for the mock samples?

What are your recommendations for feature table filtering before proceeding to alpha- and beta-diversity analysis in qiime2? No filtering, or filtering taxa according to their prevalence?
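For readers unfamiliar with prevalence filtering, the two rules mentioned above (drop features seen in only one sample, or drop features present in fewer than some percentage of samples) can be sketched in Python as follows. The function names, the toy table, and the example cutoffs are hypothetical and chosen only for illustration; nothing here is taken from the dada2 paper or from QIIME 2 itself.

```python
# Toy feature table: feature ID -> per-sample counts (made-up numbers).
table = {
    "feat_A": [10, 12, 8, 9],   # present in 4 of 4 samples
    "feat_B": [0, 30, 0, 0],    # present in 1 of 4 samples
    "feat_C": [5, 0, 7, 0],     # present in 2 of 4 samples
    "feat_D": [0, 0, 1, 1],     # present in 2 of 4 samples
}


def prevalence(counts):
    """Number of samples in which the feature has a non-zero count."""
    return sum(1 for c in counts if c > 0)


def filter_by_min_samples(table, min_samples):
    """Keep features observed in at least min_samples samples."""
    return {
        fid: counts
        for fid, counts in table.items()
        if prevalence(counts) >= min_samples
    }


def filter_by_sample_fraction(table, min_fraction):
    """Keep features observed in at least min_fraction of all samples."""
    return {
        fid: counts
        for fid, counts in table.items()
        if prevalence(counts) / len(counts) >= min_fraction
    }


# Rule 1: exclude features found in only one sample.
not_single_sample = filter_by_min_samples(table, min_samples=2)

# Rule 2: require presence in at least 75% of samples (arbitrary cutoff).
widespread = filter_by_sample_fraction(table, min_fraction=0.75)

print(sorted(not_single_sample))
print(sorted(widespread))
```

Note that prevalence ignores abundance entirely: `feat_B` has the largest single count in the toy table but is removed by the first rule because it occurs in only one sample. In QIIME 2, `qiime feature-table filter-features` exposes a comparable `--p-min-samples` option.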

Nicholas_Bokulich · October 16, 2017, 5:10pm

@yanxianl I am simply stating that the feature abundance filtering protocols recommended in the 2013 Nature Methods paper that @stangedal mentioned have not been tested in conjunction with dada2 or deblur, and are most likely unnecessary (based on the results reported in the original papers for dada2 and deblur) or even conflicting. That is not to say that dada2 or any method is perfect: some level of abundance filtering (e.g., to remove singletons and other low-abundance features) may still be useful in some circumstances, but this has not been benchmarked, so I cannot recommend it.

Excellent — so in your runs you can use mock communities to tune this. In an upcoming QIIME 2 release we will release some new quality control methods that utilize mock communities. Stay tuned for more details.

I would not say that this "assures" the quality of results. Mock communities are not perfect and are prone to human error — but I would say that within reason the method/parameter combinations that maximize mock community accuracy will be best for that individual sequencing run. Those methods may not generalize to other sequencing runs or bioinformatics methods — large-scale benchmarking studies are required to assess general recommendations.
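One way to picture "tuning against a mock community" is the small Python sketch below. It tries several candidate minimum-count cutoffs on a single mock sample and scores each filtered result against the expected composition. Everything in it is an assumption for illustration: the taxon names and counts are invented, the candidate cutoffs are arbitrary, and Bray-Curtis dissimilarity is just one possible accuracy measure, not necessarily what any QIIME 2 action computes.

```python
# Expected relative abundances in a hypothetical even mock community.
expected = {"taxon_1": 0.25, "taxon_2": 0.25, "taxon_3": 0.25, "taxon_4": 0.25}

# Observed counts for that mock sample (made-up numbers), including
# two features that are not part of the mock community.
observed_counts = {
    "taxon_1": 2600,
    "taxon_2": 2400,
    "taxon_3": 2500,
    "taxon_4": 2300,
    "spurious_1": 150,
    "spurious_2": 50,
}


def filter_min_count(counts, min_count):
    """Keep features with at least min_count observations."""
    return {k: v for k, v in counts.items() if v >= min_count}


def relative_abundance(counts):
    """Convert counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    keys = set(p) | set(q)
    numerator = sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)
    denominator = sum(p.get(k, 0) + q.get(k, 0) for k in keys)
    return numerator / denominator


# Arbitrary candidate cutoffs, from no filtering to overly aggressive.
candidate_min_counts = [0, 100, 1000, 2450]

scores = {}
for min_count in candidate_min_counts:
    kept = filter_min_count(observed_counts, min_count)
    scores[min_count] = bray_curtis(relative_abundance(kept), expected)

# Lower dissimilarity means closer to the expected composition.
best = min(scores, key=scores.get)
print(scores)
print("best cutoff for this mock sample:", best)
```

In this toy case the moderate cutoff removes the spurious features and scores best, while the most aggressive cutoff also removes real community members and scores worst. As noted above, a cutoff chosen this way applies to that individual sequencing run and may not generalize.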

Well this all depends on the upstream processing methods that you used and the biological question. Most beta diversity metrics are quite insensitive to low-abundance taxa (this is described in the 2013 Nature Methods paper mentioned above — but is certainly not without exception), particularly as beta diversity calculations are performed on rarefied feature tables in QIIME 2. In general, though, I'd say don't worry too much about the impacts on beta diversity if using UniFrac methods.

For alpha diversity this is much more difficult to assess and I cannot make any absolute recommendations here. I can say that dada2 performs quite well without additional abundance filtering (as shown in the original paper but also in my own experience).

yanxianl · October 16, 2017, 5:40pm

Dear @Nicholas_Bokulich,

Thanks a lot for your comments! I've made the transition to qiime2 for microbiome data analysis and have really enjoyed learning all the new qiime2 features that keep popping up. Your advice is really helpful for the analysis work I have at hand.

Have a lovely day!
yanxian

stangedal · October 16, 2017, 6:49pm

This was most helpful - and interesting! Thank you for all your information and shared thoughts in this post!
S

Nicholas_Bokulich · December 1, 2017, 7:07pm

Just to follow up, the mock community assessment methods mentioned in this thread are now available as the new actions evaluate-composition and evaluate-seqs as of the 2017.11 release. These are designed with mock communities in mind, but could also be useful for testing simulated communities or other sample types with an “expected” composition/sequences.

I hope these help!