I had a similar issue with the previous ones who posted on the topic I mentioned above. I denoised with DADA2 , truncated at 160 based on the QC plots still my taxonomy suffered with either unassigned or poorly assigned at kingdom level only. I just want to clarify some details before running the scripts in Qiime2 2019.7. I wanted to resolve this issue, is removing poorly assigned ASVs, the way to go? I suspect these are the host DNA since homology is the same with 16s V4-V5 regions.
I would recommend using NCBI BLAST to confirm that these are host DNA, but yes I think removing these is probably the way to go.
The main indicator that this is a problem with the samples and not the classifier is that the classifier appears to work just fine on the mock community (which I am assuming was amplified and sequenced along with your velvet worm samples)
Yes that workflow would accomplish what you want
Yes, the inputs would be your table and sequences output from dada2 (or whatever denoising method you used) and the taxonomy would be the taxonomy classifications you already have — no need to reclassify anything since you are just removing features not adding or changing anything.
Thanks, Nick! I randomly blasted one and it turned out to be what I suspected, the host. Is there an efficient way to check all sequences using NCBI? Like the file is in .qza, do I need to convert it to FASTA file or you don’t have to?
You can convert the sequences to fasta (by exporting with qiime tools export), then upload the fasta to NCBI BLAST and wait for a long time
What I usually do is just BLAST a handful; I use qiime metadata tabulate to merge both taxonomy and sequences, use the search bar to restrict to the taxa of interest (e.g., unclassified, only classified at kingdom level) and then grab a few at random to BLAST them.
You know what, now that I think about this exclude-seqs may just be the best way to go. Presumably your host DNA sequences are different enough that you can clearly rule them out based on alignment to a 16S reference database. My initial hesitation was that in some cases the host DNA can be things like mitochondrial 16S rRNA gene sequences which are similar enough to bacterial 16S that you might not be able to distinguish clearly with exclude-seqs. Anyway, worth testing both way to see what works best!
I did both , downloaded fasta file from the QIIME 2 VIEW so I got the sequences in raw fasta file. It came back with this from NCBI,
“CPU usage limit was exceeded. You may need to change your search strategy. Helpful changes include reducing the number of queries, choosing a smaller database and setting an organism limit. You may also need to adjust the Algorithm parameters in the bottom section of the form. Choose a smaller number of target sequences, set a smaller Expect cut-off, use a larger word-size and turn on species specific repeats.”
And then I tried the “exclude seqs” but I was able to run it because it came back with ,
“Invalid value for “–i-reference-sequences”: Expected an artifact of at
least type FeatureData[Sequence]. An artifact of type FeatureData[Taxonomy] was provided.”
Use a set of reference sequences instead. If the non-target reads are velvet worm mitochondria you could just make a fasta file with velvet worm mitochondria sequences, import to QIIME 2, and use that (in which case the “misses.qza” will be file containing the sequences you want to keep.
Otherwise use something like the greengenes 88% OTUs here (the reason to use 88% OTUs is to make this fast, since you just need to find sequences that look remotely like these), in which case you want to keep the “hits.qza” sequences.
Either way, you may need to adjust the percent identity if you get unexpected results (e.g., because 0.5 is too lenient).
for the taxonomy you still want to use the taxonomy classification that you presumably already obtained for these sequences (e.g., using classify-sklearn), do NOT use the hits.qza which is the wrong data type (it contains sequences not taxonomy classifications).
Thanks,Nick, that was helpful. I did use the result from the sklearn for my taxonomy which was a result from hard trimming at 160 bp. The taxa marplot still returned with many unassigned taxa and others at Kingdom level still.
This is confusing, there’s a lot going on there and the results seem a bit inconsistent. It looks like you are using exclude-ids with the misses and all of the output features are unclassified, which is the opposite of what we’d expect! Is that what you are using? Maybe try including the hits instead of excluding the misses? Or exclude the misses just to see what happens?
Nick, how do you include hits instead of excluding the misses? Ho w do you exclude the misses?
I went to read https://docs.qiime2.org/2019.7/plugins/available/feature-table/filter-features/ and found this --p-no-exclude-ids instead (didn’t improve anything). I am confused too it worked for the others but I can’t, maybe there are details I might have missed here. This was not discussed in the training you gave a year ago so I couldn’t find in my notes.
Could you share v2nov20160positonlyparkdada2_rep_set.qza with me? At this stage it may just be easiest for me to test directly on this file to see what’s going on.
with --p-no-exclude-ids and input the 70hits.qza output from exclude-seqs
Something weird is going on — I don’t think it is something that you are doing wrong, and this command indeed works for others all the time, so I think it may be something specific to your sequences instead. I wonder if the velvet worm sequences may just be quite similar to the 16S sequences, maybe I was wrong in this assumption:
So the alternative is to fall back on my original advice: use NCBI BLAST on a handful of unclassified sequences just to be sure they are host DNA, then filter them all out if they are.
Thank you for sharing your data. It looks like lots of the unassigned sequences are not being filtered because they are host 18S sequences and still share some degree of homology with the reference sequences (this could also be complicated by even just a few of the reference sequences being dubious). Still, the fact that these are being retained at high % id thresholds is not looking right and I need to dig in some more, you may have uncovered a bug.
For now, I recommend following my initial advice to filter based on the classification. I have examined your unclassified reads and each one I checked matches velvet worm. So just use qiime taxa filter-table to remove any features that are not classified to at least phylum level (using the example in the filtering tutorial on qiime2.org). You should do the same to your sequences and rebuild your tree if using phylogenetic diversity metrics, e.g., UniFrac, in your analysis.
The 515F/926R primer set does amplify 18S quite well which means you do get these mixed 16S/18S amplicons. To deal with this problem, we’ve used bbsplit to partition 16S and 18S based off of curated 16S and 18S sequences (derived from silva). See here for code and databases etc:
In our case, we actually want to keep the 18S because for ocean samples it might include things we’re interested in like eukaryotic phytoplankton, but in your case you can just discard the 18S if you’re not interested in any of them (though you might check to see what kind of non-host eukaryotes are also present).