Does the SILVA 138 99% OTUs full-length sequences pre-trained classifier work with 18S sequences?

Hi all,

I had great luck analyzing my 16S samples with QIIME 2 but am having great difficulty with my 18S samples. I originally used the vsearch option, and it came back 100% unclassified when I created the taxa barplot. However, when I ran

qiime feature-table tabulate-seqs \
  --i-data rep-seqs-18s-trunc.qza \
  --o-visualization rep-seqs-18s-viz.qzv

I was able to click on each feature and BLAST it, and I got 100% identity matches that made sense. What gives? I thought maybe I picked the wrong database (I used SILVA 138).

So I'd like to try sklearn as well, but I am really having trouble figuring out which classifiers contain 18S sequences (non-fungal). So, does the SILVA 138 99% OTUs full-length sequences pre-trained classifier work with 18S sequences?

Thank you! Let me know if you need more info.

Hi @sabitondo,

Yes, the SILVA SSU database contains both 16S and 18S rRNA gene sequences.

What was the orientation of the query sequences in the BLAST results? I ask because your query sequences must be in the same orientation as the reference sequences for classify-sklearn to work. If they are not, the returned taxonomy assignments will likely be either spurious or unclassified. One good test would be to use classify-consensus-vsearch, for which the orientation of your query sequences won't matter. However, orientation will still be an issue when trying to construct a phylogeny.
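To make the orientation point concrete, here is a minimal Python sketch of what "same orientation" means. It is not part of QIIME 2; the function names, the toy sequences, and the k-mer heuristic are all assumptions made for illustration. It counts the k-mers a query shares with a reference in both the forward and the reverse-complement direction.

```python
# Illustrative sketch only: none of these names come from QIIME 2.
# Complement table for the unambiguous DNA bases plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def shared_kmers(a: str, b: str, k: int = 8) -> int:
    """Count the distinct k-mers that occur in both sequences."""
    def kmers(s: str) -> set:
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


def guess_orientation(query: str, reference: str, k: int = 8) -> str:
    """Guess whether the query matches the reference as given or reverse-complemented."""
    forward = shared_kmers(query, reference, k)
    reverse = shared_kmers(reverse_complement(query), reference, k)
    if forward == reverse:
        return "unknown"
    return "forward" if forward > reverse else "reverse"


# Toy sequences, made up for this example.
reference = "ATGCGTACGTTAGCCGATAGGCTAACGTTAGC"
query = reference[5:25]
print(guess_orientation(query, reference))                      # forward
print(guess_orientation(reverse_complement(query), reference))  # reverse
```

If most of your representative sequences come out as "reverse" against a handful of reference sequences, they would need to be reverse-complemented before using classify-sklearn.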

Also, you can try your hand at curating the SILVA database for your own needs via the RESCRIPt plugin, as outlined here, to make sure the curation approach outlined there did not unduly remove any important reference sequences.

Hi there,

Sorry, I did not provide enough information in my original post and was very unclear! I originally used classify-consensus-vsearch and did not get good results, so I thought I would switch to the sklearn option and was wondering which database to use. My original search was:

qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs-18s-trunc.qza \
  --i-reference-reads silva-138-99-seqs.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classification taxonomy-18s.qza \
  --p-perc-identity 0.99 \
  --p-threads 8

When I created my barplot, it said 100% unclassified for each sample:

qiime taxa barplot \
  --i-table table-18s-trunc.qza \
  --i-taxonomy taxonomy-18s.qza \
  --m-metadata-file CG_metadata_just_18s.tsv \
  --o-visualization taxa-barplot-18s.qzv

However, when I ran

qiime feature-table tabulate-seqs \
  --i-data rep-seqs-18s-trunc.qza \
  --o-visualization rep-seqs-18s-viz.qzv

and then loaded the result into QIIME 2 View, clicked on each feature, and clicked "view report" on NCBI, I got 100% alignment matches. So I am not sure why there is a discrepancy. I thought maybe I used the incorrect SILVA database, since NCBI was able to assign taxonomy with no issue.

Thanks,
Stephanie

That is because you are using a very stringent setting:

--p-perc-identity 0.99

Leave this parameter at the default setting and re-run.

You are comparing two very different approaches. classify-consensus-vsearch takes all of the equivalent top hits and returns a consensus taxonomy, essentially the lowest common ancestor (LCA) of the top-hit taxonomy strings. Getting any hits at all is going to be much harder given your very stringent --p-perc-identity setting. This is not the same as the manual BLAST results you are viewing.
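As a rough illustration of the consensus idea, here is a minimal Python sketch of a strict LCA over taxonomy strings. It is a simplification, not the plugin's implementation: the real classify-consensus-vsearch accepts a rank when a fraction of hits agree (its --p-min-consensus parameter) rather than requiring all of them to agree. The function name and the example taxonomy strings are made up for illustration.

```python
# Illustrative sketch only: a strict LCA, not the plugin's actual algorithm.
def lowest_common_ancestor(taxonomies):
    """Return the rank prefix shared by all semicolon-delimited taxonomy strings."""
    split = [[rank.strip() for rank in t.split(";")] for t in taxonomies]
    shared = []
    # Walk the ranks from the top (domain) down and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    # With no hits at all (or no agreement even at the top rank) nothing is assigned.
    return "; ".join(shared) if shared else "Unassigned"


# Hypothetical top hits for one query sequence.
hits = [
    "d__Eukaryota; p__Ciliophora; c__Intramacronucleata",
    "d__Eukaryota; p__Ciliophora; c__Postciliodesmatophora",
]
print(lowest_common_ancestor(hits))  # d__Eukaryota; p__Ciliophora
print(lowest_common_ancestor([]))    # Unassigned
```

The second call shows the situation in this thread: if the identity threshold is so strict that no reference passes it, there are no hits to build a consensus from.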

Tip: always consider running some of the QIIME 2 commands with default settings, then alter the settings if something looks awry.

Oh, great! That makes a lot of sense. I will rerun with the default parameter and report back. Thanks!

Unfortunately, even this command

# classify taxonomy
qiime feature-classifier classify-consensus-vsearch \
  --i-query uchime-dn-out-18s-trunc/rep-seqs-nonchimeric.qza \
  --i-reference-reads silva-138-99-seqs.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classification taxonomy-18s-trunc-nochim.qza \
  --p-threads 8

leaves me with almost 100% unclassified even at level 1, except for some bacteria.

Hi @sabitondo,

Very strange. When you manually BLAST the sequences, what gene is reported in the result? Sometimes off-target genes...

Would you be willing to share your rep-seqs-nonchimeric.qza file with me, through Dropbox or something similar, via private message?

Hi,

I am running BLAST using NCBI's SSU_eukaryote_rRNA database. I get a ton of hits per feature (mostly algae, flagellates, and ciliates). Sure, I can send you my file through private message.

Thanks,
Stephanie

Hi @sabitondo, thank you for sharing your sequences with me. I also manually ran BLAST on these sequences. Although many of your sequences are returning close to 100% identity via BLAST, they are also only presenting ~50% query coverage. That means you have 100% identity over only about half of each query sequence when it is aligned to a given reference sequence in GenBank. For example, many of your ~326-base sequences only match from position ~82 through ~242.
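To make the coverage figure concrete, here is a small Python sketch of the arithmetic, using the approximate numbers above (a 326-base query aligned from position 82 through 242). The function name is made up for illustration, and positions are assumed to be 1-based and inclusive, as in BLAST reports.

```python
# Illustrative sketch only: the helper name is hypothetical.
def query_coverage(query_length: int, align_start: int, align_end: int) -> float:
    """Fraction of the query covered by one alignment (1-based, inclusive positions)."""
    if not 1 <= align_start <= align_end <= query_length:
        raise ValueError("alignment positions must lie within the query")
    return (align_end - align_start + 1) / query_length


coverage = query_coverage(query_length=326, align_start=82, align_end=242)
print(f"{coverage:.1%}")  # 49.4%
```

So 161 of the 326 bases align, and roughly half of each sequence has no match in the reference even though the aligned part is nearly identical.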

My first guess would be that the primer sequences were not trimmed, or that some other quality control step was skipped, prior to denoising and taxonomic classification. As a result, only about half of each of your sequences can align to any given reference sequence. This leads to too many mismatches over the full length of the query, hence the lack of robust classification.

Have you used cutadapt to remove primer sequences? What primers did you use to amplify your sequences?

-Mike
