Dear all,
After a bit of a hiatus, I am coming back to try to understand a couple of issues I have been having with my data. One problem is sparsity: I'm studying relatively low-abundance samples (lung and bronchoalveolar lavage), which, using DADA2, I have found are often contaminated/enriched with human/eukaryotic DNA.
I have finished two different runs, one using QIIME1 and one using QIIME2. In QIIME1, I used closed-reference OTU picking against the 97% sequence identity Greengenes database and found significant differences in beta diversity. In QIIME2, I used DADA2 for denoising, classified against the 99% sequence identity reference, AND ran a quality-control step (the 99% reference at 0.97 identity; see below). I lose all the differences I found in QIIME1 when I run the QIIME2 workflow. I am assuming that some of the feature assignment and quality-control steps I am running are filtering out taxa that may otherwise drive the differences seen in QIIME1.
My questions are as follows:
- When we use the feature classifier, we should use a classifier trained on one of the Greengenes (or other) databases, but what about the cutoff? I used the provided 99% classifier, but what happens to those reads that do not match at the 99% cutoff? Should I retrain on the 99% Greengenes reference, or use a lower-identity reference for DADA2 output (e.g., 90%? 80%?)
qiime feature-classifier classify-sklearn \
  --i-classifier /ifs/home/wub02/gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_5_merge_filter/Smoke.Mouse.new.filter/smoke-no-hits-filtered-merge.seqs.qza \
  --o-classification /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_8_taxonomy/Smoke.Mouse.new.filter/taxonomy.qza
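For what it's worth, reads that the classifier cannot confidently place are not dropped; they are labeled "Unassigned" or assigned only at a shallow taxonomic level. If retraining on a different reference turns out to be the answer, a minimal sketch of the two-step retraining workflow is below. File names are placeholders, and the primers shown are 515F/806R (the region implied by the gg-13-8-99-515-806 classifier above); substitute your own reference and primers.

```shell
# Sketch: retraining a naive Bayes classifier on a chosen Greengenes
# reference (paths and file names are placeholders).

# 1. Trim the reference sequences to the amplified region (515F/806R here)
qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --o-reads ref-seqs-515-806.qza

# 2. Fit the classifier on the extracted reads and the matching taxonomy
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-515-806.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier gg-13-8-99-515-806-retrained.qza
```

The same two steps work with any of the Greengenes identity levels (99%, 97%, 85%, ...) by swapping in the corresponding reference sequence and taxonomy artifacts.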
- Previously, due to a high number of "unassigned" sequences, I also ran a quality-control exclude-seqs step upstream in my dataset, using 99_otus.qza as the reference with a percent identity of 0.97 and percent query-aligned of 0.97. Someone mentioned that this was too strict and that I might use the 85% identity Greengenes reference instead.
qiime quality-control exclude-seqs \
  --i-query-sequences /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_4_DADA2/no_revcomp_180518_MSQ80/rep-seqs.qza \
  --i-reference-sequences /ifs/home/wub02/Projects/Training.feature.classifiers/gg_13_8_otus/import.rep.set/99_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.97 \
  --p-threads 4 \
  --o-sequence-hits /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_4_DADA2/no_revcomp_180518_MSQ80/filter.new/hits.qza \
  --o-sequence-misses /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_4_DADA2/no_revcomp_180518_MSQ80/filter.new/misses.qza
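In case it helps to see the full pattern: exclude-seqs only partitions the representative sequences, so the retained hits still need to be applied to the feature table before diversity analyses. A minimal sketch, with placeholder paths, assuming the hits.qza produced above is used as feature metadata:

```shell
# Sketch: keep only the features whose sequences hit the reference
# (table.qza and output path are placeholders).
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file hits.qza \
  --o-filtered-table table-filtered.qza
```

If the exclude-seqs thresholds are relaxed (e.g., the 85% reference), this downstream filter is where the recovered low-abundance taxa would re-enter the beta-diversity comparison.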
- Sorry this was a confusing question - removed.
Throughout all of this, I am aware of the differences between a reference-free denoising method such as DADA2 and a closed-reference picking method such as UCLUST in QIIME1. I am also aware that false positives may have driven the differences I saw previously. I was wondering how I should tune these steps to optimize recovery of low-abundance taxa (e.g., NOT stool). Sorry for the complicated question. Let me know how I should proceed. Thank you!
Thank you, Ben