Hi,
I was wondering if you could help explain the differences we are seeing between the representative sequence outputs of QIIME1 and QIIME2.
We are using the standard workflow described on the QIIME website for the analysis of fungal ITS data (Illumina). Here are the results we get with the same dataset:
In QIIME1 using a 97% cutoff we get 1570 representative sequences.
In QIIME2 we get 132 representative sequences.
Now, I know that the clustering algorithms are completely different between the two versions of QIIME and that QIIME1 inflated the number of OTUs. That’s fine and explains the different numbers of rep seqs.
The problem we have with this output is that we seem to be losing whole taxa in QIIME2. For example, if I BLAST the representative sequences using the ITS sequence of a specific fungal species as the query, the matches are as follows:
For QIIME1:
Hit e-value percent_identity
New.CleanUp.ReferenceOTU729 0.0 98.574
New.ReferenceOTU91 0.0 95.382
New.CleanUp.ReferenceOTU2422 0.0 92.323
New.CleanUp.ReferenceOTU994 0.0 91.242
New.CleanUp.ReferenceOTU2364 1.95e-166 95.616
All these sequences, when blasted against the NR GenBank database, are matched to the species that was searched for, or relatives from the same genus.
For QIIME2:
Hit e-value percent_identity
760c2f565f24f093c8d8b433d5946594 1.78e-62 94.040
760c2f565f24f093c8d8b433d5946594 1.90e-17 95.918
3ec608156d46f6e10b772b0f5142f6a4 6.46e-57 90.123
3ec608156d46f6e10b772b0f5142f6a4 8.84e-16 93.878
a0b877aa80232376ae6e0cd4b86e69bb 6.46e-57 90.123
The e-values here are evidently much higher, i.e. the hits are much weaker. When searched against GenBank, these hits match species that are not only different from the one used in the initial query (against the representative sequences), but belong to a different fungal subphylum.
The taxonomy output of QIIME is consistent with this: taxa that were clearly present in QIIME1 are completely gone in QIIME2.
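In case it helps with comparing the two hit tables above, here is a minimal Python sketch that groups the rows by representative sequence and reports, for each hit, the number of rows, the best (lowest) e-value and the highest percent identity. It assumes the rows are whitespace-separated exactly as pasted above; the function name `summarise_hits` and the two string constants are just illustrative names, not part of QIIME or BLAST.

```python
from collections import defaultdict

# Rows copied from the tables above: hit ID, e-value, percent identity.
QIIME1_HITS = """\
New.CleanUp.ReferenceOTU729 0.0 98.574
New.ReferenceOTU91 0.0 95.382
New.CleanUp.ReferenceOTU2422 0.0 92.323
New.CleanUp.ReferenceOTU994 0.0 91.242
New.CleanUp.ReferenceOTU2364 1.95e-166 95.616
"""

QIIME2_HITS = """\
760c2f565f24f093c8d8b433d5946594 1.78e-62 94.040
760c2f565f24f093c8d8b433d5946594 1.90e-17 95.918
3ec608156d46f6e10b772b0f5142f6a4 6.46e-57 90.123
3ec608156d46f6e10b772b0f5142f6a4 8.84e-16 93.878
a0b877aa80232376ae6e0cd4b86e69bb 6.46e-57 90.123
"""


def summarise_hits(table):
    """Return {hit_id: (row_count, best_evalue, max_identity)} for a pasted hit table."""
    rows = defaultdict(list)
    for line in table.strip().splitlines():
        hit, evalue, identity = line.split()
        rows[hit].append((float(evalue), float(identity)))
    return {
        hit: (
            len(values),
            min(e for e, _ in values),  # lowest e-value = strongest match
            max(i for _, i in values),  # highest percent identity among the rows
        )
        for hit, values in rows.items()
    }


if __name__ == "__main__":
    for label, table in (("QIIME1", QIIME1_HITS), ("QIIME2", QIIME2_HITS)):
        print(label)
        for hit, (count, best_evalue, max_identity) in summarise_hits(table).items():
            print(f"  {hit}\trows={count}\tbest_evalue={best_evalue:g}\tmax_identity={max_identity}")
```

Summarised this way, the five QIIME2 rows collapse to three distinct representative sequences, two of which appear twice (presumably as separate local alignments to the same sequence), whereas the five QIIME1 rows are five distinct OTUs.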
Are we making some obvious mistake here?