I had great luck analyzing my 16S samples with QIIME 2 but am having great difficulty with my 18S samples. I originally used the vsearch option and it came back 100% unclassified when I created the taxa barplot. However, when I ran
qiime feature-table tabulate-seqs
I was able to click on each feature and BLAST it, and I got 100% identity matches that made sense. What gives? I thought maybe I picked the wrong database (I used SILVA 138).
So I'd like to try sk-learn as well, but I am really having trouble figuring out which classifiers contain 18S sequences (non-fungal). So, does the Silva 138 99% OTUs full-length sequences pre-trained classifier work with 18S sequences?
Yes, the SILVA SSU database contains both 16S and 18S rRNA gene sequences.
What was the orientation of the query sequence in the BLAST results? I ask because your query sequences must be in the same orientation as the reference sequences for classify-sklearn to work. If they are not, you'll likely observe that the returned taxonomy assignments are either spurious or unclassified. One good test would be to use classify-consensus-vsearch, for which the orientation of your query sequences won't matter. However, orientation will be an issue when trying to construct a phylogeny.
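To make the orientation point concrete, here is a minimal Python sketch. It is not how classify-sklearn or vsearch work internally; `guess_orientation` is a hypothetical helper, the sequences are made up, and it only does exact substring matching. It just shows that a read on the opposite strand shares nothing with the reference until it is reverse-complemented.

```python
# Complement table for DNA bases (N is left as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def guess_orientation(query: str, reference: str) -> str:
    """Hypothetical helper: say which strand of `query` occurs in `reference`.

    Exact substring matching only, so this is an illustration,
    not a replacement for a real aligner.
    """
    if query in reference:
        return "forward"
    if reverse_complement(query) in reference:
        return "reverse"
    return "unknown"


reference = "ATGGCCTTAGCAGTCA"            # made-up reference fragment
read_fwd = "CCTTAGCA"                     # same strand as the reference
read_rev = reverse_complement(read_fwd)   # same read, opposite strand

print(guess_orientation(read_fwd, reference))
print(guess_orientation(read_rev, reference))
```

If a quick check like this (or the strand reported by BLAST) says your reads are on the minus strand, that would explain poor classify-sklearn results.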
Also, you can try your hand at curating the SILVA database for your own needs via the RESCRIPt plugin, as outlined here, to make sure the curation approach outlined there did not unduly remove any important reference sequences.
Sorry, I did not provide enough information in my original post and was very unclear! I originally used classify-consensus-vsearch and did not get good results, so I thought I would switch to the sk-learn option and was wondering which database to use. My original search was:
and when I created my barplot, it said 100% unclassified for each sample
qiime taxa barplot
however when I ran
qiime feature-table tabulate-seqs
after loading the output into QIIME 2 View, clicking on each feature, and clicking "view report" on NCBI, I got 100% alignment matches. So I am not sure why there is a discrepancy. I thought maybe I used the incorrect SILVA database, since NCBI was able to assign taxonomy with no issue.
That is because you are using a very stringent setting:
Leave this parameter at the default setting and re-run.
You are comparing two very different approaches. classify-consensus-vsearch takes all the equivalent top hits and returns a consensus taxonomy, that is, essentially the lowest common ancestor (LCA) of the top-hit taxonomy strings. This is going to be much harder given your very stringent --p-perc-identity setting. This is not the same as the manual BLAST results you are viewing.
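As a rough illustration of the LCA idea, here is a simplified Python sketch. It requires strict agreement among all hits, whereas the real classifier, as far as I know, applies a consensus threshold instead, so treat it as a toy. The function name, the taxonomy strings, and the "Unassigned" label used here are illustrative assumptions.

```python
def lowest_common_ancestor(taxonomies):
    """Return the deepest taxonomy prefix shared by every hit.

    `taxonomies` is a list of semicolon-delimited taxonomy strings.
    An empty list (no hit passed the identity threshold) yields "Unassigned".
    """
    split = [[rank.strip() for rank in t.split(";")] for t in taxonomies]
    shared = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return "; ".join(shared) if shared else "Unassigned"


# Made-up top hits that agree down to phylum but differ at class.
hits = [
    "d__Eukaryota; p__Ciliophora; c__Spirotrichea",
    "d__Eukaryota; p__Ciliophora; c__Oligohymenophorea",
]
print(lowest_common_ancestor(hits))

# With a very stringent identity cutoff, no hits may survive at all.
print(lowest_common_ancestor([]))
```

The second call is the situation to watch for: if the identity cutoff filters out every hit, there is nothing left to build a consensus from.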
Tip: always consider running some of the commands with default settings, then alter the settings if something looks awry.
Hi @sabitondo, thank you for sharing your sequences with me. I also manually ran BLAST on these sequences. Although many of your sequences return ~100% identity via BLAST, they show only ~50% query coverage, meaning you have 100% identity over only about half of your query sequence when it is aligned to a given reference sequence in GenBank. For example, many of your ~326-base sequences only match from position ~82 through ~242.
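The arithmetic behind that coverage figure, as a small Python sketch using the approximate numbers above (the function name is just for illustration, and positions are assumed to be 1-based and inclusive, as BLAST reports them):

```python
def query_coverage(q_start: int, q_end: int, query_length: int) -> float:
    """Fraction of the query covered by an alignment.

    Positions are 1-based and inclusive.
    """
    aligned = q_end - q_start + 1
    return aligned / query_length


# ~326-base query aligned from position ~82 through ~242.
cov = query_coverage(82, 242, 326)
print(f"{cov:.1%}")  # 49.4%
```

So roughly 161 of the 326 bases align, and the remaining half of each read contributes nothing to the match.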
My first guess would be that the primer sequences were not trimmed, or that some other quality-control step was skipped, prior to denoising and taxonomic classification. That is, only about half of each of your sequences can match any given reference sequence. This leads to too many mismatches, hence the lack of robust classification.
Have you used cutadapt to remove primer sequences? What primers did you use to amplify your sequences?