Clarification on resolving too many unassigned or at kingdom level

Hi User Support,

I had a similar issue with the previous ones who posted on the topic I mentioned above. I denoised with DADA2 , truncated at 160 based on the QC plots still my taxonomy suffered with either unassigned or poorly assigned at kingdom level only. I just want to clarify some details before running the scripts in Qiime2 2019.7. I wanted to resolve this issue, is removing poorly assigned ASVs, the way to go? I suspect these are the host DNA since homology is the same with 16s V4-V5 regions.

Summary: I followed Parkinson's Mouse Tutorial in QIIME2 in analyzing, 16s, velvet worm gut, 2 x 300 pair-ends MISEQ runs on V4-V5 (515F-926R) regions with positive control (mock community). I used trained V4-V5 (http://kronos.pharmacology.dal.ca/public_files/taxa_classifiers/qiime2-2019.7_classifiers/).

-Outputs: v2nov20positonly_parkdemux_seqs.qzv (288.2 KB) v2nov20160positonlyparkdada2_table.qzv (435.9 KB) v2nov201603kpositonlyparktaxa_barplot.qzv (359.1 KB)

  • Here is what I picked up from the forum discussion:

qiime quality-control exclude-seqs
–i-query-sequences /data/p281301/test/repseqssingle.qza
–i-reference-sequences /data/p281301/test/SILVA_132_97_16S-v4-v6-ref-seqs.qza
–p-method vsearch
–p-perc-identity 0.50
–p-perc-query-aligned 0.50
–p-threads 6
–o-sequence-hits /data/p281301/test/hits1.qza
–o-sequence-misses /data/p281301/test/misses1.qza
–verbose

qiime feature-table filter-features
–i-table /data/p281301/test/tabledada2single.qza
–m-metadata-file /data/p281301/test/misses1.qza
–o-filtered-table /data/p281301/test/no-miss-table-dada2.qza
–p-exclude-ids

qiime taxa barplot
–i-table /data/p281301/test/no-miss-table-dada2.qza
–i-taxonomy /data/p281301/test/taxonomyhits1.qza
–m-metadata-file /data/p281301/test/metadatatest.tsv
–o-visualization /data/p281301/test/taxa-barplot1-nomiss.qzv

  • Clarification: input files are from running DADA2 denoising step output , right? and that the taxonomy here is the trained classifier V4-V5 taxonomy?

Thank you in advance!

I would recommend using NCBI BLAST to confirm that these are host DNA, but yes I think removing these is probably the way to go.

The main indicator that this is a problem with the samples and not the classifier is that the classifier appears to work just fine on the mock community (which I am assuming was amplified and sequenced along with your velvet worm samples)

Yes that workflow would accomplish what you want

Yes, the inputs would be your table and sequences output from dada2 (or whatever denoising method you used) and the taxonomy would be the taxonomy classifications you already have — no need to reclassify anything since you are just removing features not adding or changing anything.

I hope that helps!

Thanks, Nick! I randomly blasted one and it turned out to be what I suspected, the host. Is there an efficient way to check all sequences using NCBI? Like the file is in .qza, do I need to convert it to FASTA file or you don't have to?

You can convert the sequences to fasta (by exporting with qiime tools export), then upload the fasta to NCBI BLAST and wait for a long time :timer_clock: :sleeping:

What I usually do is just BLAST a handful; I use qiime metadata tabulate to merge both taxonomy and sequences, use the search bar to restrict to the taxa of interest (e.g., unclassified, only classified at kingdom level) and then grab a few at random to BLAST them.

You know what, now that I think about this exclude-seqs may just be the best way to go. Presumably your host DNA sequences are different enough that you can clearly rule them out based on alignment to a 16S reference database. My initial hesitation was that in some cases the host DNA can be things like mitochondrial 16S rRNA gene sequences which are similar enough to bacterial 16S that you might not be able to distinguish clearly with exclude-seqs. Anyway, worth testing both way to see what works best!

Hi Nick,
I did both , downloaded fasta file from the QIIME 2 VIEW so I got the sequences in raw fasta file. It came back with this from NCBI,

“CPU usage limit was exceeded. You may need to change your search strategy. Helpful changes include reducing the number of queries, choosing a smaller database and setting an organism limit. You may also need to adjust the Algorithm parameters in the bottom section of the form. Choose a smaller number of target sequences, set a smaller Expect cut-off, use a larger word-size and turn on species specific repeats.”

And then I tried the “exclude seqs” but I was able to run it because it came back with ,

“Invalid value for “–i-reference-sequences”: Expected an artifact of at
least type FeatureData[Sequence]. An artifact of type FeatureData[Taxonomy] was provided.”

This is the script I tried to run:

qiime quality-control exclude-seqs
–i-query-sequences v2nov20160positonlyparkdada2_rep_set.qza
–i-reference-sequences 515926v2160taxonomy.qza
–p-method vsearch
–p-perc-identity 0.50
–p-perc-query-aligned 0.50
–o-sequence-hits hits.qza
–o-sequence-misses misses.qza
–p-threads 10

Thanks for the help,
Imee19

Use a set of reference sequences instead. If the non-target reads are velvet worm mitochondria you could just make a fasta file with velvet worm mitochondria sequences, import to QIIME 2, and use that (in which case the "misses.qza" will be file containing the sequences you want to keep.

Otherwise use something like the greengenes 88% OTUs here (the reason to use 88% OTUs is to make this fast, since you just need to find sequences that look remotely like these), in which case you want to keep the "hits.qza" sequences.

Either way, you may need to adjust the percent identity if you get unexpected results (e.g., because 0.5 is too lenient).

Hi Nick,
As a follow up to my previous query, I used the 99.otus.qza instead and it worked up until the filter-features command. When I tried the taxa marplot: it returned with this,

Invalid value for “–i-taxonomy”: Expected an artifact of at least type FeatureData[Taxonomy]. An artifact of type FeatureData[Sequence] was provided.

And this were the commands I ran in Qiime:

qiime quality-control exclude-seqs
–i-query-sequences v2nov20160positonlyparkdada2_rep_set.qza
–i-reference-sequences 99_otus.qza
–p-method vsearch
–p-perc-identity 0.50
–p-perc-query-aligned 0.50
–o-sequence-hits hits.qza
–o-sequence-misses misses.qza
–p-threads 10

qiime feature-table filter-features
–i-table v2nov20160positonlyparkdada2_table.qza
–m-metadata-file misses.qza
–o-filtered-table no-miss-table-dada2.qza
–p-exclude-ids

qiime taxa barplot
–i-table no-miss-table-dada2.qza
–i-taxonomy hits.qza
–m-metadata-file v2.tsv
–o-visualization taxa-barplot1-nomiss.qzv

Thanks in advance for the help,
Imee19

for the taxonomy you still want to use the taxonomy classification that you presumably already obtained for these sequences (e.g., using classify-sklearn), do NOT use the hits.qza which is the wrong data type (it contains sequences not taxonomy classifications).

Thanks,Nick, that was helpful. I did use the result from the sklearn for my taxonomy which was a result from hard trimming at 160 bp. The taxa marplot still returned with many unassigned taxa and others at Kingdom level still.

taxa-barplot1-nomiss.qzv (366.1 KB) .

Best,
Imee19

This may be too lenient. Maybe try 0.7 or 0.8 for each of these

Hi Nick,
I tried running it from 70-80-97 and still a lot unassigned or kingdom level. How do I resolve this?
I have attached the results,
taxa-barplot1-70nomiss.qzv (366.8 KB)
taxa-barplot1-80nomiss.qzv (366.1 KB)
taxa-barplot1-nomiss.qzv (366.1 KB)
taxa-barplot1-95nomiss.qzv (371.6 KB)

Thanks fo the all the help extended,
Imee19

Hi @Imee19,
This is confusing, there’s a lot going on there and the results seem a bit inconsistent. It looks like you are using exclude-ids with the misses and all of the output features are unclassified, which is the opposite of what we’d expect! Is that what you are using? Maybe try including the hits instead of excluding the misses? Or exclude the misses just to see what happens?

Hi Nick,
I have followed the script which was previously used by the people who the same issue. Yes, it included excuse-ids. And this is the script:

qiime quality-control exclude-seqs
–i-query-sequences v2nov20160positonlyparkdada2_rep_set.qza
–i-reference-sequences 99_otus.qza
–p-method vsearch
–p-perc-identity 0.70
–p-perc-query-aligned 0.70
–o-sequence-hits 70hits.qza
–o-sequence-misses 70misses.qza
–p-threads 10

qiime feature-table filter-features
–i-table v2nov20160positonlyparkdada2_table.qza
–m-metadata-file 70misses.qza
–o-filtered-table no-70miss-table-dada2.qza
–p-no-exclude-ids

qiime taxa barplot
–i-table no-70miss-table-dada2.qza
–i-taxonomy 515926v2160taxonomy.qza
–m-metadata-file v2.tsv
–o-visualization taxa-barplot1-70nomiss.qzv

Nick, how do you include hits instead of excluding the misses? Ho w do you exclude the misses?
I went to read https://docs.qiime2.org/2019.7/plugins/available/feature-table/filter-features/ and found this --p-no-exclude-ids instead (didn’t improve anything). I am confused too it worked for the others but I can’t, maybe there are details I might have missed here. This was not discussed in the training you gave a year ago so I couldn’t find in my notes.
Thank you,
Imee19

Hi @Imee19,
Could you share v2nov20160positonlyparkdada2_rep_set.qza with me? At this stage it may just be easiest for me to test directly on this file to see what's going on.

with --p-no-exclude-ids and input the 70hits.qza output from exclude-seqs

Something weird is going on — I don't think it is something that you are doing wrong, and this command indeed works for others all the time, so I think it may be something specific to your sequences instead. I wonder if the velvet worm sequences may just be quite similar to the 16S sequences, maybe I was wrong in this assumption:

So the alternative is to fall back on my original advice: use NCBI BLAST on a handful of unclassified sequences just to be sure they are host DNA, then filter them all out if they are.

Hi Nick,
I will look into your suggestions now. I have also attached the file you were asking.

Thanks a lot,
Imee19

Thanks. I suppose I will need these as well to recreate the barplots (you can send to me in a DM if you do not want to share publicly):

@Imee19,
Thank you for sharing your data. It looks like lots of the unassigned sequences are not being filtered because they are host 18S sequences and still share some degree of homology with the reference sequences (this could also be complicated by even just a few of the reference sequences being dubious). Still, the fact that these are being retained at high % id thresholds is not looking right and I need to dig in some more, you may have uncovered a bug.

For now, I recommend following my initial advice to filter based on the classification. I have examined your unclassified reads and each one I checked matches velvet worm. So just use qiime taxa filter-table to remove any features that are not classified to at least phylum level (using the example in the filtering tutorial on qiime2.org). You should do the same to your sequences and rebuild your tree if using phylogenetic diversity metrics, e.g., UniFrac, in your analysis.

Thanks!

Hi Nicholas and lmee19,

The 515F/926R primer set does amplify 18S quite well which means you do get these mixed 16S/18S amplicons. To deal with this problem, we’ve used bbsplit to partition 16S and 18S based off of curated 16S and 18S sequences (derived from silva). See here for code and databases etc:

In our case, we actually want to keep the 18S because for ocean samples it might include things we’re interested in like eukaryotic phytoplankton, but in your case you can just discard the 18S if you’re not interested in any of them (though you might check to see what kind of non-host eukaryotes are also present).

hth,

Jesse

1 Like

Hi Nick and @jcmcnch,
Thank you for all your suggestions and I am currently reading through it . I will let you know if I made progress.
Best,
Imee19

2 Likes

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.