Feature-Classifier and training/percent alignment

Thanks for the reply. I am using data from several different runs:

  1. Each run is denoised with DADA2 separately
  2. I am also quality filtering each run separately
  3. I am then combining the runs and doing the downstream analysis on the larger group

I’m currently re-running the quality control steps with 88% sequencing targets.

Are the differences you see in QIIME 1 associated with these separate batches? (sorry — I expect you’re already controlling for batch effects but just want to make sure)

Not for the samples in question; they are all from one run. We combined several runs together, but these in particular were only in one run because they were all lung samples. We try not to combine contaminated/heavy bacterial burden samples with less contaminated/low bacterial burden samples.

Hi all, thanks for all the help. I pulled the samples and compared the QIIME 1 and QIIME 2 counts. As you probably suspected, they are very different:

180819_qiime1vsqiime2dadacomp.upload.txt (4.0 KB)

After DADA2 denoising and chimera checking, I still end up with approximately 50% more reads than I do with QIIME 1 (e.g., QIIME 1 20,000 vs. QIIME 2 30,000 reads). These counts are from before running the quality-check code:

qiime quality-control exclude-seqs \
  --i-query-sequences ~/QIIME2_4_DADA2/180215_MSQ73/rep-seqs.qza \
  --i-reference-sequences ~/Training.feature.classifiers/gg_13_8_otus/import.rep.set/99_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.97 \
  --p-threads 4 \
  --o-sequence-hits ~/QIIME2_4_DADA2/180215_MSQ73/filter.new/hits.qza \
  --o-sequence-misses ~/QIIME2_4_DADA2/180215_MSQ73/filter.new/misses.qza

qiime feature-table filter-features \
  --i-table ~/QIIME2_4_DADA2/180215_MSQ73/table.qza \
  --m-metadata-file ~/QIIME2_4_DADA2/180215_MSQ73/filter.new/misses.qza \
  --o-filtered-table ~/QIIME2_4_DADA2/180215_MSQ73/no-hits-filtered-table.qza

Can I pull the read counts from the filtered table, no-hits-filtered-table.qza (is it as simple as converting to qzv)? Also, any other ideas on how I should go about analyzing the data? Could this be an artifact of the trim length or trunc length? Please let me know!


  1. I should add that the counts from QIIME 1.9.1 are the output of the summarize command on a closed-reference-picked OTU BIOM file.
  2. The QIIME 2 output is from version 2018.4, and it comes from the summarize command after the final DADA2 denoise command.
  3. The QIIME 2 counts are from before the filter step described above.
  4. One more comment: these are considered low bacterial burden samples (e.g., lungs/BAL, as stated above). Thus, as described in the Salter et al. 2014 paper, a lot of background contamination is evident and difficult to control.


This is most likely because closed reference OTU picking at 97% will result in losing quite a few sequences (anything < 97% similar to at least one reference sequence!). That may or may not be desirable depending on your experimental designs. If you used de novo OTU picking in qiime 1 you would probably end up with as many or more OTUs than dada2…

yes, use feature-table summarize
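
A minimal sketch of that step, assuming the filtered table sits at the path posted above (adjust to your own files). The guard around the call is only there so the snippet invokes QIIME 2 when it is installed and the table actually exists:

```shell
# Path taken from the filtering command earlier in this thread; adjust as needed.
TABLE="$HOME/QIIME2_4_DADA2/180215_MSQ73/no-hits-filtered-table.qza"
# Name the visualization after the table, swapping the extension.
VIZ="${TABLE%.qza}.qzv"

if command -v qiime >/dev/null 2>&1 && [ -f "$TABLE" ]; then
  # Summarize per-sample and per-feature counts into a .qzv visualization.
  qiime feature-table summarize \
    --i-table "$TABLE" \
    --o-visualization "$VIZ"
else
  echo "QIIME 2 or $TABLE not found; would have written $VIZ"
fi
```

The resulting .qzv can be opened with qiime tools view or at view.qiime2.org, and it lists the read count for each sample.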

I feel your pain! I expect dada2 and OTU picking are not going to handle this type of contamination differently (i.e., it will impact both similarly).

We have an ongoing discussion about contamination control, if you have any thoughts or maybe just want to check out some of the ideas folks have been posting: Discussion: methods for removing contaminants and cross-talk

Thanks, I actually ran the quality control steps using different percent-identity databases, and I will post the results in the thread. I agree with you; I think it’s the nature of these samples and how DADA2 and OTU picking handle them. Let me show you the work I’ve done, and maybe there’s something else we can do.

Again, thank you for the help.


180819_qiime1vsqiime2dadacomp.upload.txt (7.2 KB)

Ok, so I summarized the reads per sample after each quality control step using a classifier for the 99_otus and 88_otus databases.

On the whole, the DADA2 samples had 500-3000 more reads than the QIIME 1.9.1 closed-reference OTU picking samples (with a 99_otus quality check). This represented anywhere from 5% to 15% of the reads per sample.

With a less stringent quality filter, the DADA2 method retained significantly more additional reads per sample (likely a lot of it is poor-quality or eukaryotic DNA).

Given the low signal-to-noise ratio, I can see 5-15% of the reads in a sample driving the beta-diversity differences I observed between QIIME 1 and QIIME 2 (closed-reference picking vs. DADA2). I think in essence this may be the inherent issue with low-abundance samples, and lung/BAL samples may benefit from the precision of closed-reference picking over de novo methods.

Any other ideas? I think I may try closed OTU picking with these samples. I have to carefully review the decontamination in the other thread and consider applying it to our samples. Thank you for your help.


Are these differences based on beta diversity of sequence features (OTUs, ASVs) or are these based on taxonomy?

Using quality-control exclude-seqs at 97% will essentially perform the same sort of filtering that closed-ref OTU picking will, so this is not a matter of "precision" or any kind of stringency. What is different (and probably the main driver of these differences if your analysis is based on features and not taxonomy) is that OTU picking is clustering these similar sequences, essentially aggregating ASVs (and noise) into larger groups (larger differences, larger effect size, more power).

To test that hypothesis (and determine whether noise may be responsible for the statistical differences you see with OTU clustering alone), you can cluster your denoised sequences — use vsearch cluster-features-closed-reference to cluster the output from dada2.

(those steps are also how you would perform OTU picking in QIIME 2, btw)
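
As a rough sketch of that clustering step, assuming the DADA2 table and representative sequences live in the run directory posted above and that the Greengenes reference has already been imported as a sequence artifact. The output file names and the 0.97 threshold are illustrative choices, not settings confirmed in this thread:

```shell
# Input locations follow the paths posted earlier in the thread.
RUN_DIR="$HOME/QIIME2_4_DADA2/180215_MSQ73"
REF="$HOME/Training.feature.classifiers/gg_13_8_otus/import.rep.set/99_otus.qza"
PERC_IDENTITY=0.97              # assumed threshold, matching the 97% discussed above
OUT_PREFIX="$RUN_DIR/cr-97"     # hypothetical output prefix

if command -v qiime >/dev/null 2>&1 && [ -f "$RUN_DIR/table.qza" ]; then
  # Cluster the denoised features against the reference; unmatched features are written separately.
  qiime vsearch cluster-features-closed-reference \
    --i-table "$RUN_DIR/table.qza" \
    --i-sequences "$RUN_DIR/rep-seqs.qza" \
    --i-reference-sequences "$REF" \
    --p-perc-identity "$PERC_IDENTITY" \
    --o-clustered-table "${OUT_PREFIX}-table.qza" \
    --o-clustered-sequences "${OUT_PREFIX}-rep-seqs.qza" \
    --o-unmatched-sequences "${OUT_PREFIX}-unmatched.qza"
else
  echo "QIIME 2 or input artifacts not found; nothing was run"
fi
```

Repeating the beta-diversity comparison on the clustered table would then show whether clustering alone reproduces the QIIME 1 result.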

I hope that helps!

Apologies, these are based on taxonomy. What I did was export the BIOM, taxonomy, tree, and metadata files into R and create a phyloseq object using R and vegan. I clustered the lung samples according to exposure (e.g., exposure1, exposure2, control, exposure1+exposure2) and plotted them using PCoA on weighted UniFrac.

When I run the same samples using the BIOM file, tree, and metadata from QIIME 1.9.1, they show significant differential clustering. However, when I run them using the QIIME 2 DADA2-denoised and 99% quality-filtered data, these differences are less pronounced.

That’s what I’m thinking is happening. Can’t this also be considered overfitting, in the sense that not every ASV is supposed to be a unique microbe? Maybe I’m understanding the concept incorrectly, but can’t an OTU be represented by several ASVs?

Thanks, is this per run? Then do a merge step? Thank you again for the input. Ben

So it is based on sequences, not on taxonomy.

I don’t think that can be considered overfitting but I think I see where you are going with that.

Every ASV is not necessarily a unique microbe, but they are unique sequences — and do a better job of removing noisy sequences than OTU picking. ASVs will theoretically capture distinct species, strains, even multi-copy heterogeneity within a single cell, provided that they have distinct sequences. So it is ambiguous exactly what a unique ASV represents, but that distinction can still be important. In your case I think we are seeing the opposite and maybe that’s okay.


OTU picking does not need to be done on a per-run basis. I’d recommend merging and then clustering.
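
A sketch of the merge step, assuming two of the run directories named in this thread each hold a DADA2 table.qza and rep-seqs.qza and that sample IDs do not repeat across runs. The merged output directory is a hypothetical name:

```shell
BASE="$HOME/QIIME2_4_DADA2"
RUN_A="$BASE/180209_MSQ72"
RUN_B="$BASE/180215_MSQ73"
MERGED="$BASE/merged"           # hypothetical output directory

if command -v qiime >/dev/null 2>&1 && [ -f "$RUN_A/table.qza" ] && [ -f "$RUN_B/table.qza" ]; then
  mkdir -p "$MERGED"
  # Combine the per-run feature tables into one table.
  qiime feature-table merge \
    --i-tables "$RUN_A/table.qza" \
    --i-tables "$RUN_B/table.qza" \
    --o-merged-table "$MERGED/table.qza"
  # Combine the per-run representative sequences to match the merged table.
  qiime feature-table merge-seqs \
    --i-data "$RUN_A/rep-seqs.qza" \
    --i-data "$RUN_B/rep-seqs.qza" \
    --o-merged-data "$MERGED/rep-seqs.qza"
else
  echo "QIIME 2 or per-run artifacts not found; nothing was merged"
fi
```

The merged table and sequences can then go into cluster-features-closed-reference in a single pass.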

Good luck!

Great! Thank you very much. Ben


Quick question: given that our DADA2-denoised output is not from joined paired reads, do I have to start higher up in the pipeline to join the reads, denoise them, and then dereplicate? Just wondering if there’s a tutorial that I missed to get this done.

Could I also pair the reads from the out-table/sequences from the DADA2-denoise run? Ben

dada2 will join paired-end reads for you after denoising… it’s all built in to the pipeline.

also fyi: please always make new posts instead of editing existing posts to ask new questions or change the information given. Only a new post automatically registers as queued, so it would be very easy to miss your edited question.

Plugin error from vsearch:

Argument to parameter 'sequences' is not a subtype of SampleData[JoinedSequencesWithQuality] | SampleData[SequencesWithQuality] | SampleData[Sequences].

Debug info has been saved to /tmp/qiime2-q2cli-err-o3wen8o6.log

Thanks, do you know why I’m getting this error after trying to run this code off of a DADA2 run?

qiime vsearch dereplicate-sequences \
  --i-sequences ~/QIIME2_4_DADA2/180209_MSQ72/rep-seqs.qza \
  --o-dereplicated-table ~/QIIME2_4_DADA2/180209_MSQ72/dereplicated-table-MSQ72.qza \
  --o-dereplicated-sequences ~/QIIME2_4_DADA2/180209_MSQ72/dereplicated-sequences-MSQ72.qza

Wouldn’t the DADA2 output fit SampleData[JoinedSequencesWithQuality]?

sorry — don’t use dereplicate-sequences because the dada2 output is already dereplicated. Just skip straight to cluster-features-closed-reference.

I have edited my prior answer above to reflect this in case others are following the advice in this thread…
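
For anyone following along, one way to see why the type check failed is to inspect the artifact. This is a small sketch assuming the same rep-seqs.qza path as in the failing command. As far as I know, DADA2's representative sequences are a FeatureData[Sequence] artifact, which is not one of the SampleData[...] types that dereplicate-sequences accepts:

```shell
# Same artifact that was passed to dereplicate-sequences above.
REP_SEQS="$HOME/QIIME2_4_DADA2/180209_MSQ72/rep-seqs.qza"

if command -v qiime >/dev/null 2>&1 && [ -f "$REP_SEQS" ]; then
  # Prints the artifact's UUID, semantic type, and data format.
  qiime tools peek "$REP_SEQS"
else
  echo "QIIME 2 or $REP_SEQS not found; nothing to inspect"
fi
```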

Thanks, I figured there was something that was whooshing over my head. Thanks again. Ben

