Differences in representative sequences between QIIME1 and QIIME2

Nicholas_Bokulich · November 14, 2018, 4:40pm

The issue with deblur appears to be that many of the sequences are unique and/or do not resemble the reference sequences you input (e.g., are non-fungal). See the fraction-artifact-with-minsize column in the stats visualization. You can adjust the min-size parameter to correct for this, if unique seqs are to blame (as I suspect) — however the deblur developers have warned against this, e.g., see here (that user has a similar issue with deblur).

Aha, yes, that could be the issue: if these sequences are being filtered out as chimeras on single-end, they are probably noisy reads that have trouble passing the quality filter or merging in the paired-end run. QIIME 1 does not use chimera filtering by default, so that could explain part of the discrepancy.

Personally, I think I would proceed with the single-end data. It is probably better to have slightly shorter reads rather than proceed with longer joined reads but potentially introduce an amplicon length bias (which is essentially what is occurring, and clearly impacting some samples more than others).
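
To make the length bias concrete, here is a minimal Python sketch of the overlap arithmetic. It assumes 2×300 bp reads, amplicon lengths counted after primer removal, and hypothetical truncation lengths of 280 and 220 nt; 12 nt is DADA2's default minimum overlap for merging. The amplicon lengths in the loop are illustrative, not measured from these samples.

```python
def merge_overlap(amplicon_len, trunc_len_f, trunc_len_r):
    """Nucleotides shared by the forward and reverse read after truncation."""
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(amplicon_len, trunc_len_f, trunc_len_r, min_overlap=12):
    """True if the truncated reads still overlap enough to be joined."""
    return merge_overlap(amplicon_len, trunc_len_f, trunc_len_r) >= min_overlap


# Hypothetical truncation lengths; ITS amplicon length varies between taxa.
TRUNC_F, TRUNC_R = 280, 220

for amplicon_len in (400, 460, 480, 500):
    overlap = merge_overlap(amplicon_len, TRUNC_F, TRUNC_R)
    status = "merges" if can_merge(amplicon_len, TRUNC_F, TRUNC_R) else "lost"
    print(f"{amplicon_len} bp amplicon: overlap {overlap} nt -> {status}")
```

Under these assumed settings, any amplicon longer than 488 bp cannot be joined and drops out of the paired-end result, while forward reads alone are unaffected by amplicon length.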

But I recognize that this is not satisfying when QIIME 1 seems to yield better results (though chimeric seqs passing through and masquerading as real data may undermine that assumption, depending on whether you used a chimera filter with the QIIME 1 results). You could also use q2-vsearch to perform OTU clustering and see if that performs better for your data. I linked to the OTU clustering tutorial above. Your workflow would look like this:

1. use q2-quality-filter to trim/filter sequences
2. use vsearch dereplicate to dereplicate seqs
3. use q2-vsearch to cluster
4. use q2-vsearch to filter chimeras
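
The four steps above can be sketched as QIIME 2 command lines. The Python snippet below only builds and prints them; it does not run anything. All file names and the 97% identity threshold are placeholders I chose for illustration, and whether `dereplicate-sequences` accepts the quality-filtered output directly depends on your QIIME 2 release, so check `--help` for each command before using it.

```python
import shlex


def vsearch_otu_commands(demux="demux.qza", perc_identity=0.97):
    """Return the QIIME 2 CLI calls for the four steps, in order.

    File names are hypothetical placeholders, not paths from this thread.
    """
    return [
        # 1. Quality-filter the demultiplexed reads.
        ["qiime", "quality-filter", "q-score",
         "--i-demux", demux,
         "--o-filtered-sequences", "filtered.qza",
         "--o-filter-stats", "filter-stats.qza"],
        # 2. Dereplicate the filtered sequences into a table plus sequences.
        ["qiime", "vsearch", "dereplicate-sequences",
         "--i-sequences", "filtered.qza",
         "--o-dereplicated-table", "table.qza",
         "--o-dereplicated-sequences", "rep-seqs.qza"],
        # 3. Cluster de novo into OTUs at the chosen identity.
        ["qiime", "vsearch", "cluster-features-de-novo",
         "--i-table", "table.qza",
         "--i-sequences", "rep-seqs.qza",
         "--p-perc-identity", str(perc_identity),
         "--o-clustered-table", "table-otu.qza",
         "--o-clustered-sequences", "rep-seqs-otu.qza"],
        # 4. Identify chimeras among the clustered sequences.
        ["qiime", "vsearch", "uchime-denovo",
         "--i-table", "table-otu.qza",
         "--i-sequences", "rep-seqs-otu.qza",
         "--output-dir", "uchime-out"],
    ]


for argv in vsearch_otu_commands():
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Note that `uchime-denovo` only identifies chimeras; as far as I recall from the chimera filtering tutorial, dropping them from the table is a separate `qiime feature-table filter-features` call.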

The chimera filter seems to suggest that it is a problem with the data — the sort of problem that can be fixed. Use single-end or try again with q2-vsearch and compare against (chimera-filtered) QIIME 1 results to see how they square up. Let us know what you find!

cdeai · November 15, 2018, 1:55pm

Thanks!

I checked back to my logs and I did actually attempt to do the chimera filtering step in QIIME1, but it removed no sequences.

So, having read about a similar problem (Many chimeric reads after dada2, but only in some samples - #7 by Mehrbod_Estaki), I understand that all of the above means that a large portion of the chimeric sequences are "real" (i.e. they arose during amplicon preparation, possibly because of the low amount of template DNA we were working with) rather than a problem with the sequencing or analysis. In other words, it is safe to assume that the sequences being removed should be removed, and that removing them improves the result rather than introducing bias. The reasonable solution in this case is therefore to use the single-end dada2 protocol.

In any case most samples look much better now, so single-end seems to be the way to go. Am I right?

Nicholas_Bokulich · November 15, 2018, 2:05pm

I agree, I think that is probably the best course of action and the correct interpretation of these results.

The read yields are looking better for those samples that had read joining issues with dada2 denoise-paired, so yes I think this is the safer route (since you are not systematically biasing samples that have longer amplicons). But you have not mentioned anything about how the results look, e.g., regarding the false positives/negatives that you mentioned seeing with dada2 denoise-paired at the start of this topic thread. I would be curious to hear what you find!
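
For anyone comparing read yields between runs, here is a minimal Python sketch of the per-sample bookkeeping. The column names (`input`, `filtered`, `denoised`, `non-chimeric`) are my assumption of what the single-end dada2 denoising stats contain, and the counts are made up for two hypothetical samples; they are not data from this thread.

```python
STAGES = ("filtered", "denoised", "non-chimeric")


def retained_fractions(row):
    """Fraction of input reads surviving each stage for one sample."""
    total = row["input"]
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    return {stage: row[stage] / total for stage in STAGES}


def chimera_fraction(row):
    """Share of denoised reads that were removed as chimeras."""
    if row["denoised"] == 0:
        return 0.0
    return 1 - row["non-chimeric"] / row["denoised"]


# Made-up counts for illustration only.
stats = {
    "sample-A": {"input": 10000, "filtered": 9000,
                 "denoised": 9000, "non-chimeric": 8100},
    "sample-B": {"input": 10000, "filtered": 9000,
                 "denoised": 9000, "non-chimeric": 3000},
}

for sample, row in stats.items():
    kept = retained_fractions(row)["non-chimeric"]
    print(f"{sample}: {kept:.0%} of input kept, "
          f"{chimera_fraction(row):.0%} of denoised reads flagged as chimeric")
```

Looking at the chimera fraction separately from the overall yield helps distinguish samples that lose reads at filtering or merging from samples that lose them at chimera removal.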

cdeai · November 15, 2018, 3:16pm

True, I forgot to mention that. In single-end data the sequences are there, as expected. As a bonus, the number of slightly different representative sequences almost identical to the same taxon is much lower than in QIIME1, so the sequence list looks less inflated (as expected from QIIME2).

In short, the single-end dada2 workflow produces the result that is, from what I can tell, the most credible given the biological background of the samples, and it discards the fewest sequences (and, as you say, avoids introducing the amplicon length bias).

cdeai · November 15, 2018, 3:20pm

Finally - many thanks for all the help in this long discussion!
.
.
.
And as a reference for anyone who might have a similar problem, a short summary of conclusions:

- Fungal ITS 460 bp amplicon, paired-end 2×300 bp Illumina sequencing.
- In some samples over 90% of reads failed to merge with paired-end dada2.
- The results of QIIME1 looked noisy/inflated, but they also contained sequences almost perfectly matching known taxa, which disappeared in paired-end dada2.
- The deblur workflow performed even worse than paired-end dada2.
- Single-end dada2 (on just the forward reads, after removing the primers with qiime cutadapt trim-single) performed best.
- In some samples, even with single-end dada2, many (up to two thirds) of the sequences were removed as chimeras. This is most likely not a sequencing/analysis problem but an indication of real chimeric sequences in the data (i.e. errors introduced during amplicon preparation, possibly due to the very low amount of template in some samples).
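
The single-end route that worked best can be sketched as two QIIME 2 command lines. As a hedged illustration, the Python snippet below only assembles and prints them. The primer placeholder, the truncation length and all file names are hypothetical (the thread does not state the actual values), and the input is assumed to be the forward reads imported as single-end data.

```python
import shlex


def single_end_commands(forward_primer, trunc_len):
    """Build the primer-trimming and denoising calls for forward reads only.

    forward_primer and trunc_len must be supplied by the user; file names
    are placeholders, not paths from this thread.
    """
    return [
        # Remove the forward primer from the 5' end of each read.
        ["qiime", "cutadapt", "trim-single",
         "--i-demultiplexed-sequences", "demux-forward.qza",
         "--p-front", forward_primer,
         "--o-trimmed-sequences", "trimmed.qza"],
        # Denoise the trimmed forward reads with dada2.
        ["qiime", "dada2", "denoise-single",
         "--i-demultiplexed-seqs", "trimmed.qza",
         "--p-trunc-len", str(trunc_len),
         "--o-table", "table.qza",
         "--o-representative-sequences", "rep-seqs.qza",
         "--o-denoising-stats", "denoising-stats.qza"],
    ]


# "<FORWARD_PRIMER>" is a placeholder; a trunc_len of 0 disables truncation.
for argv in single_end_commands("<FORWARD_PRIMER>", trunc_len=0):
    print(" ".join(shlex.quote(arg) for arg in argv))
```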
