I have a couple of questions regarding the two databases.
I used the pre-trained classifiers from the tutorial and got different taxonomic results from the two databases. So far I have only checked the phylum level, and the differences already show up there. As far as I know, SILVA is generally considered more reliable because it is updated more frequently than Greengenes, but at the phylum level I would expect the two outputs to be very close. Instead, the results diverge. More interestingly, the Greengenes result looks more plausible for my data than the SILVA result.
Also, with the SILVA classifier I get a huge number of unclassified taxa, while with Greengenes the number is far lower. I would like to know the reason!
I intend to train my own classifiers for both databases. In the latest SILVA release, the taxonomy directory contains several files, such as "all levels" and "7-levels". Which of them should I use, and why?
Greengenes does not seem to have different taxonomy files the way SILVA does. For example, SILVA provides files at different similarity levels, such as 99%, 97% and so forth, but I did not see different similarity levels on the Greengenes website. I would also like to know where you found the 85% similarity version of the Greengenes database.
Finally, the Greengenes website posted a modification of the latest version a few days ago. I do not know whether these changes address the limitations of Greengenes relative to SILVA. What is your opinion? Also, no unaligned sequences are provided there, only gg_13_5_ssualign.fasta.gz.
Are both classifiers trained on the correct region? E.g., did you use the full-length greengenes classifier vs. the V4 SILVA classifier?
There is a "notes" file in the SILVA releases that explains this — I recommend checking that document. Short answer: a taxonomy with even numbers of levels allows the use of taxonomic classification methods that rely on confidence or consensus scores to determine the most likely taxonomic lineage. If the levels are uneven, such methods can become confused because the taxonomic lineages for related organisms look very different if they have uneven levels.
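To illustrate what "even levels" means in practice, here is a minimal Python sketch that pads every lineage to the same depth. It is only an illustration and not how the SILVA 7-level files were actually built. The function name `pad_lineage`, the `;` separator, and the `unclassified` filler are assumptions made for this example.

```python
def pad_lineage(lineage, n_levels=7, sep=";", filler="unclassified"):
    """Return a lineage string with exactly n_levels ranks.

    Hypothetical helper: shorter lineages are padded with `filler` so that
    position i always refers to the same rank in every lineage.
    """
    levels = [part.strip() for part in lineage.split(sep) if part.strip()]
    if len(levels) > n_levels:
        # Deeper lineages need a decision about which ranks to keep,
        # which this sketch deliberately does not make.
        raise ValueError(f"lineage has more than {n_levels} levels: {lineage!r}")
    return sep.join(levels + [filler] * (n_levels - len(levels)))


padded = pad_lineage("Bacteria;Firmicutes")
```

With every lineage at the same depth, a consensus method can compare related organisms rank by rank instead of comparing, say, a family name in one lineage against a genus name in another.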
Download the QIIME-compatible releases here. Those contain sequence and taxonomy files clustered at different % similarities.
You should contact secondgenome to ask about that modification. It seems to be the same 13_5 release, not a new release.
As I understood it, step three of the training is optional. I did not train the classifier on a specific region of the 16S rRNA gene, so I skipped it.
I am using full length for both databases.
I even tried the 97% similarity taxonomy instead of the 99% one. The new result is similar to the one I got with 99%: the taxonomy barplots still do not match. Also, unassigned taxa are abundant in the barplot I got from the SILVA database. Desperately weird!
This is for Greengenes. At level 7, the Greengenes results show these as unassigned bacteria, whereas in the SILVA results they are mostly just "unassigned" when I set the taxonomy to level 7.
It looks like you are getting the same bad result with both, but what is "unassigned" with SILVA is for some reason receiving kingdom-level classification (and no more) with greengenes.
I would recommend removing any sequences that do not have at least phylum-level classification, but first I recommend spot-checking a few: grab the sequence and use NCBI BLAST to try and determine if this is non-target DNA, junk, or if it really is bacterial DNA that should be receiving a classification.
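As a rough sketch of that check, the Python below separates features that have a phylum-level label from those that do not, so the second group can be spot-checked with BLAST before it is removed. It assumes the taxonomy has already been exported to a plain dictionary of feature ID to lineage string. The names `has_phylum` and `split_by_phylum` are hypothetical, and the parsing only assumes `;`-separated ranks with optional `prefix__` labels.

```python
def has_phylum(taxon, sep=";"):
    """True if the lineage has a non-empty second rank (phylum)."""
    levels = [part.strip() for part in taxon.split(sep)]
    if len(levels) < 2:
        # e.g. "Unassigned" or a kingdom-only lineage such as "k__Bacteria"
        return False
    # Drop a rank prefix such as "p__"; a bare "p__" leaves an empty name.
    name = levels[1].split("__", 1)[-1]
    return bool(name)


def split_by_phylum(taxonomy):
    """Split {feature_id: lineage} into (keep, spot_check) dictionaries."""
    keep = {fid: t for fid, t in taxonomy.items() if has_phylum(t)}
    spot_check = {fid: t for fid, t in taxonomy.items() if fid not in keep}
    return keep, spot_check


# Made-up example data, for illustration only.
example = {
    "f1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "f2": "k__Bacteria; p__",
    "f3": "Unassigned",
}
keep, spot_check = split_by_phylum(example)
```

The IDs in `spot_check` are the ones worth pulling a few sequences from for BLAST.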
NCBI BLAST will confirm for us. This could be a technical error: the sklearn classifier does not handle sequences that are in mixed orientations (i.e., a mixture of forward and reverse reads relative to the orientation in which they appear in the reference database). In that case I recommend the classify-consensus-vsearch method, which can handle these.
However, given that the unassigned reads are mostly in a few samples, I suspect this is not a mixed-orientation issue (which would more or less impact all samples evenly), so I suspect this is just non-target or bad reads, in which case you should filter these out.
I checked some sequences at random (there are a lot of reads) and BLASTed them. Many of the hits are uncultured bacteria, as you can see in the photo. As an example, I attached part of the BLAST output.
I think this method worked well, but I am puzzled, because I am analyzing only the forward reads, not the reverse ones (there are no mixed reads in my forward file). The reverse reads are in a different file that I did not use. Anyway, this method improved my result remarkably, but I still have a fraction of unassigned taxa, especially in two samples where it is high. For your information, I used only the Greengenes database with this method. Do you have any suggestions for solving this? Or do you know what they are? Please share your ideas.
I appreciated your suggestion to use this method. I never thought I would be able to solve this issue, although I still have the problem to some extent. You made my day!
You should exclude uncultured organisms from your BLAST search.
What I mean is the orientation of the reads relative to the reference, not the read direction (e.g., forward and reverse paired-end reads). Some sequencing protocols result in mixed-orientation reads.
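To make "orientation relative to the reference" concrete, here is a small Python sketch that reverse-complements a read and guesses its orientation by counting k-mers shared with a reference sequence. This is a toy illustration and not how vsearch or QIIME 2 detect orientation. The function names, the k-mer size, and the sequences are assumptions made for the example.

```python
# Translation table for the four unambiguous bases; other characters
# (e.g. N) are left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k=8):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(read, reference, k=8):
    """Guess 'forward', 'reverse', or 'unknown' from shared k-mer counts."""
    ref_kmers = kmers(reference.upper(), k)
    fwd = len(kmers(read.upper(), k) & ref_kmers)
    rev = len(kmers(reverse_complement(read).upper(), k) & ref_kmers)
    if fwd == rev:
        return "unknown"
    return "forward" if fwd > rev else "reverse"


# Made-up sequences, for illustration only.
reference = "ATGGCGTACCTGAAGTCCGATTGACCGTAGGA"
read = reference[5:25]
flipped = reverse_complement(read)
```

Here `read` and `flipped` carry the same biological information, but only `read` matches the reference as written. A classifier that does not try both strands would treat `flipped` as an unrelated sequence.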
These are probably junk DNA or non-target DNA, e.g., host. This is not an uncommon problem and I recommend blasting a few of those reads to confirm, then if they appear to be non-target just filter them all out as described in the tutorials.
Two questions on this topic have been bothering me, which is why I am reviving it!
The vsearch classifier worked for me in the end, but I have unassigned taxa in the treated samples, while all untreated samples had very few unranked taxa (negligible). First, I am curious to know the reason. What possible causes would you suggest?
Next, I used the pre-trained Greengenes classifier provided in the QIIME 2 tutorial. Its result showed very few unassigned taxa for both treated and untreated samples. I mean that it worked better than the vsearch classifier on the treated samples (for the untreated samples the differences were not remarkable).
I do not know the reason but draw two important conclusions:
the effect is minor and I would not worry about it (see steps below)
it is not a technical error (i.e., something QIIME 2 is doing wrong). It is probably not an error at all, but follow the steps below to find out.
Now what do we do about it?
use NCBI BLAST to spot check a few of the unclassified sequences, as I described above
my guess is these should be junk or host DNA. The reason they are more abundant in the treatment? Depends on what the treatment is and the biomass of those samples. E.g., a treatment like broad-spectrum antibiotic use will lead to lower biomass followed by potentially greater detection of background "noise".
just remove all unassigned taxa after seeing what the cause might be, unless they turn out not to be junk.
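Before filtering, it can help to quantify how the unassigned reads are distributed across samples. The Python sketch below computes the unassigned fraction per sample. It assumes the feature table and taxonomy have already been exported to plain dictionaries. The function name, the data layout, and the `Unassigned` label are assumptions made for this example.

```python
def unassigned_fraction(counts, taxonomy, label="Unassigned"):
    """Return {sample_id: fraction of reads whose taxon equals `label`}.

    counts:   {sample_id: {feature_id: read_count}}
    taxonomy: {feature_id: lineage string}
    Features missing from `taxonomy` are counted as unassigned.
    """
    fractions = {}
    for sample, features in counts.items():
        total = sum(features.values())
        unassigned = sum(
            n for fid, n in features.items()
            if taxonomy.get(fid, label) == label
        )
        fractions[sample] = unassigned / total if total else 0.0
    return fractions


# Made-up counts, for illustration only.
counts = {
    "treated": {"f1": 60, "f2": 40},
    "untreated": {"f1": 99, "f2": 1},
}
taxonomy = {"f1": "k__Bacteria; p__Firmicutes", "f2": "Unassigned"}
fractions = unassigned_fraction(counts, taxonomy)
```

If the high fractions line up with low-biomass samples such as the treated group, that is consistent with the background "noise" explanation above.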