I have done taxonomic assignment bt training the classifier for a particular region of 16S rRNA gene.
My output shows assignment till bacterial domain for 15 samples.
I took the those OTUs sequences , and classify using silva web server, many of them could be classified till lower taxonomic levels. How is it possible?
Is there any problem with my classifier?
I even used the entire database for classification, then it classified those OTUs (15 samples) to Eukaryota.
Yes, I have used my own classifier. updated SILVA db (138.1)compatible with qiime2 using RESCRIPT.
Im using V3 region.
what identity percent it considers while annotating an OTU to a taxonomic level? Is it 100% ?
If so, then all sequences are remaining unclassified with the silva DB web server. If I lower the identity percent, most of them getting annotation.
Thanks for your suggestion, I will go through it.
If you made the reference database yourself, and did not cluster the reference sequences, then percent identity isn't really considered (i.e. the reference db can be used for 100-99% and downward). That is, you can use the reference database regardless of your clustering level. For some further thoughts in this see this post.
The default approach, does not cluster the reference sequence data, it is based on the SILVA NR99 database. This is due to the reasons outlined here. However, you can change the options for rescript get-silva-data to download the full reference database. You could certainly cluster your reference sequences yourself by modifying the rescript dereplicate parameters, say at 97%. Your reference sequence would then "consider" 97% similarity.
Since few samples were assigned only to bacterial kingdom/domain level, I used "classify-consensus-vsearch" method (sequence orientation issue suspected), which has sequence identity and query coverage 80% by default.
so what I was asking is that what % of identity and coverage is needed for taxonomy assignment at a particular level?
If I increase it to 95/97/99 then the number of "Unassigned" taxa are increasing.
So how should I decide the value?
Thank you for clarifying.
This depends on the variable region understudy and the organisms you expect to see. I can't really provide an answer here. But be wary of over-classification. That is, just because you can enrich for a classification at lower taxonomic ranks, e.g. family or genus level, does not mean that what you are observing is "correct".
Earlier in this thread I had mentioned that you check for mixed-orientation reads. Did you look into this? It appears you have, given your suspicion and use of vsearch. But did you look at the sequences directly to confirm?
Tip: You had mentioned earlier that you made use of the the SILVA web tool. This tool can reverse-complement the sequence if needed to find a "hit". Note the output of the Alignment Result Table, particularly the last column entitled Turn. If you see the phrase reversed and complemented, then this means your sequence was in a reversed orientation. Reverse complementing does not occur with the naive-bayes search, but as you already know, vsearch and blast can handle this for taxonomy assignment. However, you'd want to orient your sequences anyway if you plan to do phylogenetic analyses, otherwise your sequence alignments, and resulting phylogenies will be flawed. You can use orient-seqs from RESCRIPt to do this.
That makes sense as you are increasing the stringency of the search. This is why the default is 80%. I'd likely leave it at this level, and make use of the --p-maxaccepts and --p-min-consensus of vsearch. Also, consensus taxonomies are just that, a consensus of all the acceptable hits. Thus, you may obtain many hits to different genera but the same family... so your best assignment will be at the family-level. But you can adjust how the consensus taxonomy is determined by altering these two parameters in addition to the ones you are already altering. Perhaps try doubling the --p-maxaccepts and lowering the --p-min-consensus?
1- the classify-consensus-vsearch method to classify sequences, as this can classify reads in either orientation.
------I could classify my reads using this approach, but most of species are UNCULTURED. and results are not matching with previous classification done using classify-sklean
2- RESCRIPt has an orient-seqs method that will orient sequences relative to a reference database.
------I also used this and 350+ sequences were unmatched. I eliminated those. and used classify-sklearn to classify oriented sequences. and the result was again the same. ~15 samples were assigned till kindom/ domain bacteria. So i couldnt resolve the issue.
The different approaches will vary in their output. But they are usually conistant, at least at higher taxonomic levels.
These might be off-target sequences. What are these sample from? Soil? Host organism? If from a host organism these might be non-rRNA / off-target sequences. For example, some 16S primers will amplify 12S rRNA as off-targets. Which means they will not be assigned properly. Perhaps manually BLAST a few of these poorly assigned sequences and see what gene they are hitting?
Finally, would you be willing to share the QZA files for both the feature table and the representative sequences? You can send them to me via private DM.
Based on the Sample IDs, it appears you are trying to re-analyze data from this BioProject? After looking at the amplicon samples within the BioRpoject, and the rep-seqs file you shared, I noticed that the V3 primers were not trimmed prior to denoising with DADA2.
That is, I am assuming these (or highly similar) primers (Huse et al. 2008) were used:
If so, run cutadapt trim-paired, with the above primer sequences. I'd also make sure that the following options are enabled --p-discard-untrimmed and --p-match-adapter-wildcards. You can search the forum for related posts that explain why. Especially, in reference, as to why the primers should be removed.
Also, it appears that you may have to alter the orient-seqs parameters --p-perc-identity and possibly --p-query-cov . I tried setting --p-perc-identity 0.95 and that seemed to work well, while discarding slightly more sequences. If you want to be a bit more restrictive and truly do not care about Eukaryotes, then you can consider downloading and importing one of the GreenGenes reference databases from the Data resources page, and use that as the reference for orient-seqs.
However, I'd try the following approach first...
I do find it very odd that ~15 samples appear to be comprised mostly of mixed oriented reads. That is they are in fact not Eukaryotes, but other microbial taxa. Again, they are being mistakenly classified as Eukaryotes via sklearn as it is searching the for a read that is not in the same orientation of the database. Perhaps the forward and reverse reads where swapped by accident upon upload to GenBank?
One easy solution to this is to modify your Manifest file by swapping the file-paths for the forward and reverse reads for these samples. This should result in the reads to being imported in the correct orientation. After importing in this way you should be able to proceed with cutadapt, denoising, and then taxonomy assignment.