Feature-Classifier and training/percent alignment

ben · August 15, 2018, 1:34pm

Dear all,

After a bit of a hiatus, I am coming back to understand a couple of issues I have been having with my data. One problem is sparsity in my data: I'm studying relatively low-abundance samples (lung and bronchoalveolar lavage), which, using DADA2, I have found are often contaminated/enriched with human/eukaryotic DNA.

I have finished two different runs, one using QIIME1 and one using QIIME2. In QIIME1 I used a closed picking method (against the 97% sequence identity greengenes database) and found significant differences in beta-diversity. In QIIME2, on the other hand, I used DADA2 denoising (99% sequence identity) AND a quality-control step (99% sequence identity reference with a 0.97 identity threshold, see below). I lose all the differences I find in QIIME1 when I run the QIIME2 method. I am assuming that some of the feature assignment and quality control I am running is filtering out taxa that may otherwise drive the differences in QIIME1.

My questions are as follows:

When we use the feature classifier, we should be using a classifier "trained" on one of the greengenes (or other) databases, but what about the cutoff? I used the 99% one that had been provided, but what happens to those samples that do not match the 99% cutoff? Should I retrain on the 99% greengenes or use a lower cutoff when using DADA2 (e.g., 90%? 80%?)

qiime feature-classifier classify-sklearn \
  --i-classifier /ifs/home/wub02/gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_5_merge_filter/Smoke.Mouse.new.filter/smoke-no-hits-filtered-merge.seqs.qza \
  --o-classification /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_8_taxonomy/Smoke.Mouse.new.filter/taxonomy.qza

Previously, because of a high number of "unassigned" sequences, I had also run a "quality-control exclude-seqs" step upstream, using 99_otus.qza as the reference with a percent identity of 97% and a percent query aligned of 97%. Someone mentioned that this was too strict; I may use an 85% identity greengenes reference instead.

qiime quality-control exclude-seqs \
  --i-query-sequences /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_4_DADA2/no_revcomp_180518_MSQ80/rep-seqs.qza \
  --i-reference-sequences /ifs/home/wub02/Projects/Training.feature.classifiers/gg_13_8_otus/import.rep.set/99_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.97 \
  --p-threads 4 \
  --o-sequence-hits /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_4_DADA2/no_revcomp_180518_MSQ80/filter.new/hits.qza \
  --o-sequence-misses /ifs/home/wub02/Projects/Mur.VC1.VC2.Smoke.Cluster/QIIME2_4_DADA2/no_revcomp_180518_MSQ80/filter.new/misses.qza

Sorry this was a confusing question - removed.

With all of this, I am aware of the differences between a de novo process such as DADA2 and a closed picking method such as UCLUST in QIIME1. I am also aware that false positives may be driving the differences I have seen previously. I was wondering how I should tune these steps to optimize recovery of low-abundance taxa (e.g., NOT stool). Sorry for the complicated question. Let me know how I should proceed. Thank you!

Thank you, Ben

Nicholas_Bokulich · August 15, 2018, 6:41pm

It is possible — you should review the number of reads in each sample with each method, and review the dada2 run stats to determine where you are losing reads (e.g., at read joining?)
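One way to do that comparison, as a minimal sketch in Python: it assumes you have already pulled per-sample read totals out of each pipeline's table summary into plain dictionaries. The sample IDs and counts here are hypothetical, and `compare_read_counts` is just an illustrative helper, not part of QIIME.

```python
def compare_read_counts(counts_a, counts_b):
    """Compare per-sample read totals from two pipelines.

    counts_a, counts_b: dicts mapping sample ID -> total reads.
    Only samples present in both dicts are compared.
    """
    report = {}
    for sample in sorted(set(counts_a) & set(counts_b)):
        a, b = counts_a[sample], counts_b[sample]
        extra = b - a
        # Percent change relative to pipeline A; None if A has no reads.
        pct = round(100.0 * extra / a, 1) if a else None
        report[sample] = {"a": a, "b": b, "extra": extra, "pct_change": pct}
    return report


# Hypothetical counts, for illustration only.
qiime1_counts = {"lung_01": 20000, "lung_02": 18000}
dada2_counts = {"lung_01": 30000, "lung_02": 19800}

for sample, row in compare_read_counts(qiime1_counts, dada2_counts).items():
    print(sample, row)
```

Samples where the two methods diverge the most are the ones worth tracing back through the denoising stats.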

The other possibility, though, is that you are detecting some false-positive difference with QIIME 1 — I'm not pointing fingers or claiming any one method is universally superior, just posing an alternative hypothesis.

I would expect dada2 to be more sensitive to differences, so this is sort of the opposite of what I'd expect. I think your hypothesis about issues with run parameters is most likely.

That cutoff is just specifying the percent similarity used for clustering sequences in that database. Higher will be more specific, i.e., have better taxonomic labels and preserve more unique seq information, and should almost always be used (the only reason to go lower is for memory issues). This has no relation to OTU picking/denoising methods or parameters. Use 99%.

Yes, probably too strict. The point here is just to remove things that are obviously non-target DNA (coincidentally, that's also what's going to happen with 97% closed-reference OTU picking!). So I'd recommend a lower threshold unless you intend to filter out anything that does not closely resemble the reference database.
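To make the effect of the two thresholds concrete, here is a simplified Python sketch of the hit/miss decision for a single alignment. It is a toy model under stated assumptions, not vsearch's actual implementation: identity is taken as matching columns over alignment columns, and query coverage as alignment columns over query length. The function name and the example numbers are hypothetical.

```python
def passes_filter(matches, alignment_length, query_length,
                  perc_identity, perc_query_aligned):
    """Simplified hit/miss decision for one query-vs-reference alignment.

    Assumptions (not vsearch's exact definitions):
      identity      = matching columns / alignment columns
      query aligned = alignment columns / query length (no gaps in the query)
    """
    identity = matches / alignment_length
    query_aligned = alignment_length / query_length
    return identity >= perc_identity and query_aligned >= perc_query_aligned


# A hypothetical 250-nt read aligned end to end with 230 matching positions
# (92% identity): a miss at 0.97/0.97, a hit at 0.85/0.85.
print(passes_filter(230, 250, 250, 0.97, 0.97))
print(passes_filter(230, 250, 250, 0.85, 0.85))
```

Under this model, a read has to clear both thresholds to count as a hit, so a divergent but genuinely bacterial sequence is discarded at 0.97 just as host DNA would be.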

q2-feature-classifier's classify-sklearn with the pre-trained classifiers uses a naive Bayes classifier similar to that used by RDP (this is NOT the same thing as RDP classifier though). The QIIME 1 default is uclust, which uses uclust for alignment to reference database sequences. Yes, classifier choice will impact classifications, but that probably is not the issue here. You could use one of the alignment-based classifiers in QIIME 2 (e.g., classify-consensus-vsearch) or use RDP classifier in QIIME 1 if you want to make the classification methods more similar between your two workflows.

(Better yet, reduce the number of variables you are comparing here! You can do closed-ref OTU picking in QIIME 2 or just import your QIIME 1 data and classify in QIIME 2 with the same method.)

I doubt classification is the issue — I think you should be looking further back in your workflow (at the denoising step) if you suspect a false-negative error with dada2.

You should take a close look at the features that are driving the differences in QIIME 1.

What species do they belong to? Could these be contaminants?
If possible, examine the read quality or the raw seqs that those sequences came from. (I suspect these could be noisy sequences that dada2 is filtering out. Not sure why they would only be in one of your sample groups.)

Are you comparing samples from multiple sequencing runs by any chance?

ben · August 15, 2018, 7:02pm

Thanks for the reply, I am using data from several different runs:

Each run is denoised with DADA2 separately
I am also quality filtering each run separately
I am then combining the runs together and doing the downstream analysis on the larger group

I'm currently re-running the quality control steps with the 88% reference sequences (88_otus).

Nicholas_Bokulich · August 15, 2018, 7:05pm

Are the differences you see in QIIME 1 associated with these separate batches? (sorry — I expect you're already controlling for batch effects but just want to make sure)

ben · August 15, 2018, 7:08pm

Not for the samples in question, they are all one run. We combined several runs together, but these in particular were only in one run because they were all lung samples. We try not to combine contaminated/heavy bacterial burden samples with less contaminated/low bacterial burden samples.

ben · August 20, 2018, 4:29am

Hi all, thanks for all the help. I pulled the samples and compared the QIIME1 and QIIME2 counts. As you probably suspected, they are very different:

180819_qiime1vsqiime2dadacomp.upload.txt (4.0 KB)

After DADA2 denoising and chimera checking, I still end up with approximately 50% more reads than I do with QIIME1 (e.g., QIIME1 20,000 vs. QIIME2 30,000 reads). These counts are from before the following quality-check commands are run:

qiime quality-control exclude-seqs \
  --i-query-sequences ~/QIIME2_4_DADA2/180215_MSQ73/rep-seqs.qza \
  --i-reference-sequences ~/Training.feature.classifiers/gg_13_8_otus/import.rep.set/99_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.97 \
  --p-threads 4 \
  --o-sequence-hits ~/QIIME2_4_DADA2/180215_MSQ73/filter.new/hits.qza \
  --o-sequence-misses ~/QIIME2_4_DADA2/180215_MSQ73/filter.new/misses.qza

qiime feature-table filter-features \
  --i-table ~/QIIME2_4_DADA2/180215_MSQ73/table.qza \
  --m-metadata-file ~/QIIME2_4_DADA2/180215_MSQ73/filter.new/misses.qza \
  --o-filtered-table ~/QIIME2_4_DADA2/180215_MSQ73/no-hits-filtered-table.qza \
  --p-exclude-ids

Can I pull the read counts from the filtered table, no-hits-filtered-table.qza (is it as simple as converting to qzv)? Also, do you have any other ideas on how I should go about analyzing the data? Could this be an artifact of the trim length or trunc length? Please let me know!

Edit:

I should add that the counts from qiime1.9.1 are the output of the summarize command on a closed-picked OTU biom file.
The QIIME2 output is from the 2018.04 version, and it comes from the summarize command after the final DADA2/denoise command.
The QIIME2 counts are from before the filter step described above.
One more comment: these are considered low bacterial burden samples (e.g., lungs/BAL, as stated above). Thus, as described in the Salter et al. 2014 paper, a lot of background contamination is evident and difficult to control.

Ben

Nicholas_Bokulich · August 20, 2018, 4:09pm

This is most likely because closed reference OTU picking at 97% will result in losing quite a few sequences (anything that is not at least 97% similar to some reference sequence!). That may or may not be desirable depending on your experimental design. If you used de novo OTU picking in qiime 1 you would probably end up with as many or more OTUs than dada2...

yes, use feature-table summarize

I feel your pain! I expect dada2 and OTU picking are not going to handle this type of contamination differently (i.e., it will impact both similarly).

We have an ongoing discussion about contamination control, if you have any thoughts or maybe just want to check out some of the ideas folks have been posting: Discussion: methods for removing contaminants and cross-talk

ben · August 20, 2018, 4:35pm

Thanks, I've actually done quality control steps using databases at different % identities, and I will post the results in the thread. I agree with you, I think it's the nature of these samples and how dada2 and OTU picking are handling them. Let me show you the work I've done and maybe there's something else we can do.

Again, thank you for the help.

Ben

ben · August 21, 2018, 6:12am

180819_qiime1vsqiime2dadacomp.upload.txt (7.2 KB)

Ok, so I summarized the reads per sample after each quality control step, using the 99_otus and 88_otus databases as references.

On the whole, the DADA2 samples had 500-3000 more reads than the qiime1.9.1 closed OTU picking samples (w/ a 99_otus quality check). This represented anywhere from 5% to 15% of the reads per sample.

With a less stringent quality filter, the DADA2 method retained significantly more reads per sample (likely a lot of it is poor-quality/eukaryotic DNA).

Given the low signal-to-noise ratio, I can see 5-15% of the reads in a sample accounting for the beta-diversity differences I observed between qiime1 and qiime2 (closed vs. DADA2 picking). I think in essence this may be the inherent issue w/ low-abundance samples, and it may be that lung/BAL samples benefit from the precision of closed picking vs. de novo methods.

Any other ideas? I think I may try closed OTU picking with these samples. I have to carefully review the decontamination in the other thread and consider applying it to our samples. Thank you for your help.

Ben

Nicholas_Bokulich · August 21, 2018, 3:51pm

Are these differences based on beta diversity of sequence features (OTUs, ASVs) or are these based on taxonomy?

using quality-control exclude-seqs at 97% will essentially perform the same sort of filtering that closed-ref OTU picking will, so this is not a matter of "precision" or any kind of stringency. What is different (and probably the main driver of these differences if your analysis is based on features and not taxonomy) is that OTU picking is clustering these similar sequences, essentially aggregating ASVs (and noise) into larger groups (larger differences, larger effect size, more power).

To test that hypothesis (and determine whether noise may be responsible for the statistical differences you see with OTU clustering alone), you can cluster your denoised sequences — use vsearch cluster-features-closed-reference to cluster the output from dada2.

(those steps are also how you would perform OTU picking in QIIME 2, btw)
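The aggregation effect can be illustrated with a small Python sketch. It is not what vsearch does internally; it assumes you already have a mapping from each ASV to the reference OTU it matched (hypothetical IDs and counts below) and simply sums ASV counts per OTU, dropping ASVs with no match as closed-reference clustering would.

```python
from collections import defaultdict


def collapse_to_otus(asv_table, asv_to_otu):
    """Sum ASV counts into OTU counts, per sample.

    asv_table:  {sample_id: {asv_id: count}}
    asv_to_otu: {asv_id: otu_id}; ASVs missing from this mapping are treated
                as unmatched and dropped.
    """
    otu_table = {}
    for sample, counts in asv_table.items():
        collapsed = defaultdict(int)
        for asv, count in counts.items():
            otu = asv_to_otu.get(asv)
            if otu is not None:
                collapsed[otu] += count
        otu_table[sample] = dict(collapsed)
    return otu_table


# Hypothetical toy data: two similar ASVs map to one OTU, one ASV has no match.
asv_table = {
    "control": {"asv1": 40, "asv2": 35, "asv3": 5},
    "exposed": {"asv1": 10, "asv2": 12, "asv3": 8},
}
asv_to_otu = {"asv1": "otuA", "asv2": "otuA"}

print(collapse_to_otus(asv_table, asv_to_otu))
```

In this toy table, the two ASVs each differ moderately between groups, while the single collapsed OTU carries the combined difference, which is the "larger groups, larger effect size" point above.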

I hope that helps!

ben · August 21, 2018, 3:59pm

Apologies, these are based on taxonomy. I exported the biom, taxonomy, tree, and metadata files into R and created a phyloseq object, also using vegan. I clustered the lung samples according to exposure (e.g., exposure1, exposure2, control, exposure1+exposure2) and plotted them using PCoA on weighted UniFrac.

The same samples, when I run them using the biom file, tree, and metadata from QIIME 1.9.1, showed significant differential clustering. However, when I run them using the QIIME2 DADA2-denoised and 99% quality-filtered data, these differences are less pronounced.

That's what I'm thinking is happening. Can't this also be considered overfitting, in the sense that not every ASV is supposed to be a unique microbe? Maybe I'm understanding the concept incorrectly, but can an OTU be represented by several ASVs?

Thanks, is this per run? Then do a merge step? Thank you again for the input. Ben

Nicholas_Bokulich · August 21, 2018, 5:46pm

So it is based on sequences, not on taxonomy.

I don't think that can be considered overfitting but I think I see where you are going with that.

Not every ASV is necessarily a unique microbe, but each is a unique sequence, and denoising does a better job of removing noisy sequences than OTU picking. ASVs will theoretically capture distinct species, strains, even multi-copy heterogeneity within a single cell, provided that they have distinct sequences. So it is ambiguous exactly what a unique ASV represents, but that distinction can still be important. In your case I think we are seeing the opposite and maybe that's okay.

Correct

OTU picking does not need to be done on a per-run basis. I'd recommend merging and then clustering.
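As a rough illustration of the merge-first order, this Python sketch (a hypothetical helper with toy data, not QIIME code) combines per-run tables so that a single clustering step afterwards sees every feature from every run. It assumes sample IDs are unique across runs and refuses to merge otherwise.

```python
def merge_tables(*tables):
    """Merge per-run feature tables of the form {sample_id: {feature_id: count}}.

    Raises ValueError if the same sample ID appears in more than one run.
    """
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"sample {sample!r} appears in more than one run")
            merged[sample] = dict(counts)
    return merged


def all_features(table):
    """Return the sorted set of feature IDs present anywhere in the table."""
    return sorted({feature for counts in table.values() for feature in counts})


# Hypothetical per-run tables.
run1 = {"lung_01": {"asv1": 12, "asv2": 3}}
run2 = {"lung_02": {"asv2": 7, "asv3": 9}}

merged = merge_tables(run1, run2)
print(all_features(merged))  # features a single clustering pass would see
```

Clustering the merged table once means every run is matched against the same reference in the same pass.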

Good luck!

ben · August 21, 2018, 5:50pm

Great! Thank you very much. Ben

edit:

Quick question: given that our DADA2 denoising was not run on paired reads, do I have to start from higher up in the pipeline to join the reads, denoise them, and then dereplicate? Just wondering if there's a tutorial that I missed to get this done.

Could I also pair the reads from the output table/sequences of the DADA2-denoise run? Ben

Nicholas_Bokulich · August 21, 2018, 9:36pm

dada2 will join paired-end reads for you after denoising... it's all built in to the pipeline.

also fyi: please always make new posts instead of editing existing posts to ask new questions or change the information given. Only a new post automatically registers as queued, so it would be very easy to miss your edited question.