Too many unassigned or only at kingdom level features

Mehrbod_Estaki · February 7, 2018, 8:08pm

Hi folks, I'm looking for some help in troubleshooting the source of error in a recent analysis I did that is turning up with large numbers of unassigned features and even more assignments only at the Kingdom level (as seen in the barplots here)

Summary: 16s, mouse colon, 2x300 paired-end, MiSeq run using V3-V4 primers (341F & 805R). q2 version 2017.12.

I denoised using DADA2 with somewhat lenient truncating parameters based on the QC plots which can be found in the provenances. It ran successfully with decent enough coverage. I noticed immediately that there are quite a few features that are much shorter than others (rep-seqs here) and from blasting a few of them they look to me like host genes. With our primers we expect an amplicon size of ~450.
I trained the classifier with gg_13_8_99 using our primer sets below but without truncating parameters as was advised elsewhere for PE reads:

--p-f-primer CCTACGGGNGGCWGCAG \
--p-r-primer GACTACHVGGGTATCTAATCC

As you can see the taxonomy seems to largely suffer from unassigned or poorly assigned features and I'm looking for help to either a) figure out where the potential source of error comes from, or
b) I should just focus on finding a way to remove those reads and reanalyze with the remaining reads.

Thanks in advance!

colinbrislawn · February 8, 2018, 12:34am

Good afternoon,

The DADA2 developer @Nicholas_Bokulich, will be able to comment more on that error correction algorithm. I wanted to mention the core issues that could be causing poor taxonomy assignment.

Data with lots of errors
If errors are introduced into real amplicons, it will be hard to accurately classify these into their true taxonomy. Based on the low quality near the ends of your reads and the short area of overlap, I'm guessing errors are introduced. Nick can suggest how to fix this...
Incomplete taxonomy in database
If your taxonomy database does not include matching microbes, precise classification will be impossible. I think the mouse gut microbiome should be relatively well covered, so I'm not sure this is the issue for your data set.
Poor performance of taxonomy assignment algorithm
If something went wrong with the database indexing or the settings for LCA are too conservative (or too permissive!), taxonomy annotation can we worse than expected.* The qiime defaults should be reasonable and methods like search + LCA are standard in the field, so I'm not sure if that's the issue here either.
non-amplicon sequences
PCR primers may amplify sequences outside of your target region or other reads may be introduced by the sequencer. As Nick suggested, these off-target effects can lead to unassignable reads because these reads are not 16S genes at all!

TLDR: I bet this is an issue with DADA2 denoising, and once low quality data is removed, taxonomy will improve dramatically.

I hope this helps,
Colin

*Ironically, taxonomic assignments can also be overly precise, e.g. claiming to identify specific species, when only family level classification is reasonable.

Nicholas_Bokulich · February 8, 2018, 3:41pm

Hi @Mehrbod_Estaki and @colinbrislawn,

First to clear my name: I am not the dada2 developer — that would be @benjjneb — but I am very flattered.

Nevertheless, I think I can solve your problem, @Mehrbod_Estaki. (and thank you @colinbrislawn for helping troubleshoot!)

Both of you have raised a number of good possibilities, but based on the taxonomic profiles and length heterogeneity issue that you've shared it looks like this is almost certainly an issue with non-target DNA, not an issue with the reference taxonomy or classifier parameters (both of which are common issues when low assignment specificity is observed, but usually assignments are poor across the board, not just a segment of sequences). Your other classifications look good, so there are just a few bad apples in there.

I would go for option B. It's possible that errors are creeping in based on the slightly permissive dada2 trimming settings that you've used, but it sounds like (if seqs are BLASTing to host genes) that this is non-target DNA, not seq errors. You have three different options and at least the first two are easy:

Just filter out all features/seqs with taxonomy assignments to "Unassigned" or "Bacteria;__;..." and don't think another thought about it.
Filter out all features/sequences that do not align well to the 16S reference seqs, following this tutorial. I do not have a good threshold in mind for this, but let us know if you find one that works. After filtering those features from your sequences and feature table, I believe the old taxonomy assignments should still work (i.e., no need to re-classify the filtered seqs). Alternatively, you can filter out seqs that align to the mouse genome or some other target within a certain threshold using the same method.
The length heterogeneity is another good indicator that non-target DNA is slipping in from somewhere. You could instead filter out all sequences that are below a certain length. However, there might be non-target seqs that happen to have the right length and will pass through, so this is by far my least favorite option.

I don't think this is the case — looks like @Mehrbod_Estaki is using the default settings with standard databases, so these have already been well tuned.

Thanks for bringing this up, @colinbrislawn. We tested this very scenario on the QIIME2 (and other) classifiers in this preprint. In fact, all classifiers (QIIME2, RDP, others) perform fairly poorly in this regard and if an investigator expects to have many "novel" taxa in their samples (and are disturbed by the prospect of overclassification), it is best to set entirely different (i.e., very cautious) parameter settings, such as the specific recommendations that we make in the preprint. Not something to worry about in the mouse gut!

Thanks for the discussion guys! I hope this helps @Mehrbod_Estaki!

colinbrislawn · February 8, 2018, 7:05pm

Thanks for the update Nick. I've added 4) PCR off-target effects to the list things that can cause this.

Not sure why I thought you were a dada2 dev. I feel like I should have known you were a member of the Caporaso Lab.

Colin

Mehrbod_Estaki · February 8, 2018, 8:06pm

Thanks @Nicholas_Bokulich and @colinbrislawn!

This sounds very temping...but considering the combination of unassigned and Bacteria; make up in some samples over 95% of all the assignments I'm worried I would ultimately have to drop the samples entirely. Though since we think these are real host sequences that have made it through perhaps those samples are doomed anyways...

I'll play around with all all the suggestions and post the progress.
As always, thanks and this was very informative!

colinbrislawn · February 8, 2018, 11:44pm

Hello again Mehrbod,

Thanks for keeping us posted on your findings. 95% is huge! What a disappointment...

If 4) off-target effects is to blame, you can verify that by BLASTing some of your most abundant unidentified reads against the mouse genome on the NCBI website. Or maybe they are from the human genome, and you can search for that too.

At one point, a collaborator was having trouble with their MiSeq and lot's of Phix was ending up in their samples. Illumina spikes in the Phix bacteriophage as a positive control, so maybe these are the source of your reads. You could search your reads against the Phix genome to see if this is happening.

It's a science mystery!

Colin

Mehrbod_Estaki · February 9, 2018, 12:26am

I love a good mystery.
There are about 950 unique ASVs that are assigned only at Bacteria level. Randomly blasting a few of them results in things like this:

Mus musculus targeted non-conditional, lacZ-tagged mutant allele F10:tm1e(EUCOMM)Hmgu; transgenic
Mus musculus 10 BAC RP23-287L13 (Roswell Park Cancer Institute (C57BL/6J Female) Mouse BAC Library) complete sequence
Mus musculus chromosome 12, clone RP23-68M8, complete sequence

All with 99-100% coverage. So it is likely the source of problem..

I had considered the PhiX as well, since I'm sure our facility spiked some to increase heterogeneity in the run for better yield, but these seem like host genes. Also, I can't really think how the PhiX would get introduced to specific samples (with barcodes) when it is added after that step.
Another point I just remembered is that a different subset of samples from this same run (for a different project) had a similar issue when I used DADA2 paired-ends. Back then I didn't dive into it and just used Deblur (forward reads only) instead and that seemed to eradicated all the unassigned and Bacteria;-only-assignments issue. Perhaps this makes sense since the pre-packaged error-model of deblur is specific to 16s and so would drop the host variants, whereas dada2 would do this without source discrimination. I think I'll add that to list of things to try, use forward reads with deblur, compare forwards with dada2 as well. Will keep you posted.
Thanks for letting me think out loud!

Nicholas_Bokulich · February 9, 2018, 2:06am

I believe dada2 does an initial filter to remove phiX reads (but I may be wrong)

deblur denoise-16S uses an initial positive filter to keep only reads that align to the GG reference seqs with at least some minimum % ID (somewhere in the ballpark of 85% if I recall correctly). So essentially option #2 that I proposed above is built into deblur, explaining why that solved your issue on that older dataset.

Not sure how so many non-target genes are getting into your run, but sounds like the mystery is solved. With that in mind, I think that either option #1 or (if you are worried about blindly removing unclassified reads), option #2 would be very appropriate for resolving this.

Good luck @Mehrbod_Estaki!

Mehrbod_Estaki · February 13, 2018, 2:44am

Update!

As per @Nicholas_Bokulich's recommendation I chose to do option 2.
My reasons were mainly because I wanted to retain as much resolution as I could with paired-end DADA2 method, instead of deblur's forward only (which was my own subpar suggestion) . It's possible that option 1 and 2 would have given pretty similar results (identical even?) but once option 2 worked well I got too lazy to bother testing option 1.

This is something I've been struggling with myself the last little while and I'm starting to think this is actually happening across board in all studies that are analyzing mouse colon but unfortunately is either ignored or not properly reported. To properly extract colonic microbes the tissue needs to be thoroughly disrupted and when that happens there is a very high amount of host DNA. Unless that is selectively removed, its going to enter the amplicon PCR. Depending on the primer sets, this may lead to some downstream issues, as seems to be the case with our V3-V4 primer set.
Anyways, return from tangent.

What I did for others in the same situation:
Downloaded the 99 rep-seqs from greengenes gg_13_8 to use as my reference database (first qiime import it as 99_otus.qza).
Then I filtered my dada2 derived rep-seqs against this reference database.

qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs-dada2.qza \
  --i-reference-sequences 99_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.95 \
  --p-threads 4 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza \
  --verbose

This takes a very long time as a heads up. Using more lenient parameters likely will reduce this significantly.
I dug into deblurs codes a bit and found that it actually uses the 88_otus version of greengenes for this step which seemed odd to me to not use the 99, but perhaps its a computation issue? I wonder if this was benchmarked and found to be good enough? Curious to know...

Anyways, then simply remove the the misses.qza features from your original feature table

qiime feature-table filter-features \
  --i-table table-dada2.qza \
  --m-metadata-file misses.qza \
  --o-filtered-table no-miss-table-dada2.qza \
  --p-exclude-ids

I tried to filter the missed seqs from my original rep-seq.qza too but that subtype isn't supported yet. Everything still works fine but just takes a bit longer with the non-filtered rep-sets.

Anyways, as expected I did lose a few samples after this because they were in fact dominated with host contaminants but it did resolve the issue completely and I still was able to carry on with the remaining samples. Most importantly, I learned a bit more about under the hood

Thanks again @Nicholas_Bokulich and @colinbrislawn!

colinbrislawn · February 13, 2018, 4:56pm

Thanks for wrapping up the story with your detailed solution.

I'm blown away that so many mouse genes ended up on your run. While I use the V4 only primers, I still would not think that the V3-V4 primers would suffer from such strong off-target effects, even if they are swimming in host DNA.

I think the search performed by quality-control exclude-seqs will always be a little intensive. By choosing a smaller database to search (say 88_otus.fasta) and a lower identity cutoff (say --p-perc-identity 0.90), you will play into the strengths of the vsearch algorithm, gaining speed while losing little precision.

I'm glad you figured this out! Mystery solved!

Thank you @Mehrbod_Estaki
Colin

system · March 16, 2018, 10:56pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.