Too many unassigned or only at kingdom level features

Update!

As per @Nicholas_Bokulich's recommendation, I went with option 2.
My main reason was that I wanted to retain as much resolution as I could with the paired-end DADA2 method, rather than running deblur on forward reads only (which was my own subpar suggestion). It's possible that options 1 and 2 would have given very similar (even identical?) results, but once option 2 worked well I got too lazy to bother testing option 1.

This is something I've been struggling with myself for the last little while, and I'm starting to think it is actually happening across the board in studies analyzing mouse colon, but is unfortunately either ignored or not properly reported. To properly extract colonic microbes the tissue needs to be thoroughly disrupted, and when that happens a very large amount of host DNA comes along with it. Unless that host DNA is selectively removed, it's going to enter the amplicon PCR. Depending on the primer set, this may lead to downstream issues, as seems to be the case with our V3-V4 primer set.
Anyways, return from tangent.

What I did, for others in the same situation:
I downloaded the 99% OTU representative sequences from Greengenes gg_13_8 to use as my reference database (first import them into QIIME 2 as 99_otus.qza).
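The import step could look roughly like the sketch below. The FASTA path is an assumption based on how the gg_13_8 archive usually unpacks, so adjust it to wherever your copy lives. The guard just skips the import if qiime or the file is missing.

```shell
# Assumed location of the 99% rep set after unpacking the Greengenes 13_8 archive.
REF_FASTA="gg_13_8_otus/rep_set/99_otus.fasta"
# Artifact name used in the rest of this post.
REF_QZA="99_otus.qza"

if command -v qiime >/dev/null 2>&1 && [ -f "$REF_FASTA" ]; then
  # Import the plain FASTA file as a FeatureData[Sequence] artifact.
  qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path "$REF_FASTA" \
    --output-path "$REF_QZA"
else
  echo "Skipping import: qiime or $REF_FASTA not found" >&2
fi
```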
Then I filtered my DADA2-derived rep-seqs against this reference database:

qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs-dada2.qza \
  --i-reference-sequences 99_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --p-perc-query-aligned 0.95 \
  --p-threads 4 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza \
  --verbose

Heads up: this takes a very long time. More lenient parameters will likely reduce the run time significantly.
I dug into deblur's code a bit and found that it actually uses the 88_otus version of Greengenes for this step. It seemed odd to me not to use the 99, but perhaps it's a computational issue? I wonder if this was benchmarked and found to be good enough? Curious to know...

Anyways, then simply remove the features listed in misses.qza from your original feature table:

qiime feature-table filter-features \
  --i-table table-dada2.qza \
  --m-metadata-file misses.qza \
  --o-filtered-table no-miss-table-dada2.qza \
  --p-exclude-ids
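
After filtering, it can be worth checking how many reads each sample has left, since samples dominated by host sequences will end up with very few. A minimal sketch, assuming the filtered table from the command above. The output name is just a placeholder I picked.

```shell
FILTERED_TABLE="no-miss-table-dada2.qza"
# Hypothetical output name for the summary visualization.
SUMMARY_QZV="no-miss-table-dada2.qzv"

if command -v qiime >/dev/null 2>&1 && [ -f "$FILTERED_TABLE" ]; then
  # Summarize per-sample and per-feature counts of the filtered table.
  qiime feature-table summarize \
    --i-table "$FILTERED_TABLE" \
    --o-visualization "$SUMMARY_QZV"
else
  echo "Skipping summary: qiime or $FILTERED_TABLE not found" >&2
fi
```

The resulting .qzv can be opened with qiime tools view or at view.qiime2.org.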

I tried to filter the missed seqs out of my original rep-seqs artifact too, but that subtype isn't supported yet. Everything still works fine; it just takes a bit longer with the non-filtered rep-seqs.
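
For anyone reading this on a QIIME 2 release that does ship qiime feature-table filter-seqs (an assumption, check qiime feature-table --help on your install), something like the sketch below should trim the rep-seqs down to the features still present in the filtered table. The output file name is a placeholder.

```shell
FILTERED_TABLE="no-miss-table-dada2.qza"
REP_SEQS="rep-seqs-dada2.qza"
# Hypothetical output name for the filtered representative sequences.
FILTERED_SEQS="no-miss-rep-seqs-dada2.qza"

if command -v qiime >/dev/null 2>&1 && [ -f "$FILTERED_TABLE" ] && [ -f "$REP_SEQS" ]; then
  # Keep only the sequences whose feature IDs remain in the filtered table.
  qiime feature-table filter-seqs \
    --i-data "$REP_SEQS" \
    --i-table "$FILTERED_TABLE" \
    --o-filtered-data "$FILTERED_SEQS"
else
  echo "Skipping filter-seqs: qiime or input artifacts not found" >&2
fi
```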

Anyways, as expected I did lose a few samples after this because they were in fact dominated by host contaminants, but it resolved the issue completely and I was still able to carry on with the remaining samples. Most importantly, I learned a bit more about what goes on under the hood :stuck_out_tongue:

Thanks again @Nicholas_Bokulich and @colinbrislawn!
