My data were generated on an Illumina MiSeq (PE 250). I imported the demultiplexed paired-end sequences, with barcodes and low-quality reads already removed, and used DADA2 to join and denoise them.
I then imported the Greengenes 13_5 database and trimmed the reference sequences with my own primers. Then I trained the classifier on the trimmed reference sequences and ran the classification. I obtained an unexpected result, so I am wondering whether I did anything wrong in the feature classification. What I did was:
I have several questions about the above procedure:
I trimmed the unaligned reference sequences and ran the classification on my own unaligned representative sequences. Is that correct? Greengenes also provides aligned reference sequences; should I trim those instead and classify aligned representative sequences?
If I prefer to use “classify-consensus-vsearch”, do I still need to train on the trimmed reference sequences with “fit-classifier-naive-bayes”? And again, should I use the aligned sequences throughout?
I used mothur before and obtained a “satisfying” microbial composition, meaning the result was close to what is described in the references in my field.
Please forgive my inappropriate choice of the word “unexpected”; I just wonder whether I am doing everything the right way.
Actually, I ran PICRUSt on my QIIME 2 output. Before that, I applied vsearch closed-reference clustering but obtained a very low match rate, so I think I must have gone wrong somewhere, possibly even in the classification step. Here's what I did:
You did that correctly. Use the unaligned sequences, not the aligned ones.
No, you do not train a classifier. You just use the imported greengenes taxonomy and reference sequences (same inputs as in fit-classifier-naive-bayes) as input to classify-consensus-vsearch.
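To make the difference between the two routes concrete, here is a rough sketch that only assembles and prints the commands (it does not run QIIME 2). All .qza file names and the primer arguments are hypothetical placeholders, and the flags should be checked against `--help` for your release; some newer releases may require additional outputs for classify-consensus-vsearch.

```python
import shlex

# Hypothetical artifact names; substitute your own files.
REF_TAX = "gg-ref-taxonomy.qza"        # imported Greengenes taxonomy
REF_TRIMMED = "ref-seqs-trimmed.qza"   # unaligned reference reads trimmed to the primers
REP_SEQS = "rep-seqs.qza"              # unaligned representative sequences from DADA2


def naive_bayes_route(ref_seqs, f_primer, r_primer):
    """Trim the reference, train a classifier on it, then classify."""
    return [
        ["qiime", "feature-classifier", "extract-reads",
         "--i-sequences", ref_seqs,
         "--p-f-primer", f_primer,
         "--p-r-primer", r_primer,
         "--o-reads", REF_TRIMMED],
        ["qiime", "feature-classifier", "fit-classifier-naive-bayes",
         "--i-reference-reads", REF_TRIMMED,
         "--i-reference-taxonomy", REF_TAX,
         "--o-classifier", "classifier.qza"],
        ["qiime", "feature-classifier", "classify-sklearn",
         "--i-classifier", "classifier.qza",
         "--i-reads", REP_SEQS,
         "--o-classification", "taxonomy-sklearn.qza"],
    ]


def vsearch_route():
    """No training step: reference reads and taxonomy are passed directly."""
    return [
        ["qiime", "feature-classifier", "classify-consensus-vsearch",
         "--i-query", REP_SEQS,
         "--i-reference-reads", REF_TRIMMED,
         "--i-reference-taxonomy", REF_TAX,
         "--o-classification", "taxonomy-vsearch.qza"],
    ]


if __name__ == "__main__":
    # Primer strings below are placeholders, not real primer sequences.
    for cmd in naive_bayes_route("gg-ref-seqs.qza", "FWD_PRIMER", "REV_PRIMER"):
        print(shlex.join(cmd))
    for cmd in vsearch_route():
        print(shlex.join(cmd))
```

The vsearch route takes the same reference reads and taxonomy that would otherwise go into fit-classifier-naive-bayes, which is why no classifier artifact appears in it.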
Many of the references in your field probably used similar methodology, so a similar result is not surprising, but a satisfying result is not necessarily the "correct" result. The sklearn classifier here is a bit more cautious and less likely to overclassify, but you can modify the parameters as described here to make it less cautious (akin to the RDP classifier used by mothur, which is actually a somewhat similar algorithm to the sklearn naive Bayes classifier you are using in QIIME 2).
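As a toy illustration of what "less cautious" means, the sketch below truncates a lineage at the first level whose confidence falls below a threshold. This is a simplified model and not the actual classifier code; the lineage and confidence values are made up. In QIIME 2 the corresponding knob is the `--p-confidence` parameter of classify-sklearn.

```python
def truncate_by_confidence(lineage, confidences, threshold=0.7):
    """Keep taxonomic levels, from the top down, while confidence >= threshold."""
    kept = []
    for level, conf in zip(lineage, confidences):
        if conf < threshold:
            break  # stop at the first level we are not confident about
        kept.append(level)
    return kept or ["Unassigned"]


# Made-up example values, for illustration only.
lineage = ["k__Bacteria", "p__Firmicutes", "c__Clostridia",
           "o__Clostridiales", "f__Lachnospiraceae", "g__Blautia"]
confidences = [1.0, 0.98, 0.95, 0.90, 0.65, 0.40]

cautious = truncate_by_confidence(lineage, confidences, threshold=0.7)
relaxed = truncate_by_confidence(lineage, confidences, threshold=0.5)
print("; ".join(cautious))
print("; ".join(relaxed))
```

With the lower threshold the same sequence is labeled one level deeper, which is the trade-off between caution and overclassification described above.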
That said, I really do not expect a wildly different result between the RDP and sklearn classifiers — so it is possible that something went wrong. Can you share a barplot or other results and be a bit more specific about what is unsatisfying? If there are many unclassified sequences, there are other posts on the forum that explain common errors related to this...
What sample type are you using? If you would expect a higher match rate to the reference database, then either you are using the wrong reference database or there is something wrong with your sequences, e.g., a high number of non-target (e.g., contaminant) sequences.
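To put a number on "match rate", one simple approach is to compare per-sample read totals before and after closed-reference clustering. A minimal sketch, with hypothetical sample names and counts (in practice the totals would come from the feature table summaries):

```python
def match_rates(reads_before, reads_after):
    """Fraction of each sample's reads that survived closed-reference clustering."""
    rates = {}
    for sample, total in reads_before.items():
        if total <= 0:
            raise ValueError(f"sample {sample!r} has no reads before clustering")
        # Samples missing from the clustered table matched nothing.
        rates[sample] = reads_after.get(sample, 0) / total
    return rates


# Hypothetical counts, for illustration only.
before = {"fish-gut-1": 20000, "fish-gut-2": 16000}
after = {"fish-gut-1": 5000, "fish-gut-2": 12000}

rates = match_rates(before, after)
for sample, rate in rates.items():
    print(f"{sample}: {rate:.1%} of reads matched the reference")
```

A low fraction across all samples points toward the database or non-target sequences; a low fraction in only a few samples points toward those samples.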
Thanks for your suggestion. I think I can post the barplot later.
My sample type is fish gut content. I amplified the V3-V4 region of my extracted DNA, and I tried both the SILVA and Greengenes databases. Do you have any other suggestions?
Another consideration: I am going to use PICRUSt for my downstream analysis. As far as I know, the latest reference database that PICRUSt supports is Greengenes 13_5. Do you think I should keep the database consistent and use it for the taxonomic classification as well? I am worried about how it will be judged if I apply different databases to the same data for different analyses.
No, that sounds fine. I am not too familiar with fish guts, but I would not expect so many unclassified sequences (with the feature classifier) or sequences unmatched to the reference (vsearch closed-reference OTU picking prior to PICRUSt), so there may be another issue with the data, e.g., large amounts of host or food DNA or other non-target sequences. A barplot of the taxonomy results and more information on your library prep/sequencing protocol could help diagnose this.
I recommend completely decoupling your PICRUSt analysis from your other analyses. PICRUSt has its own needs (such as closed-reference OTU picking and the GG 13_5 database), but these needn't be applied to your other analyses, which can be much more flexible and nuanced in their approach. I recommend:
Use DADA2 or Deblur to denoise your sequence data.
Use the denoised sequences/table for downstream analyses to study microbiome composition and diversity, not OTU-picked data (unless that really is your preference). Use whatever reference database is best suited to your data; don't use GG 13_5 just because that's what PICRUSt does (you will not be comparing those data sets directly, anyway).
Perform closed-reference OTU picking on the denoised sequences output from DADA2 or Deblur for the downstream PICRUSt analysis.
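The last step could look roughly like the sketch below, which only assembles and prints the command (it does not run QIIME 2). The file names are hypothetical placeholders, and the 0.97 identity is an assumption that presumes you cluster against the Greengenes 13_5 97% OTU reference; confirm the flags with `qiime vsearch cluster-features-closed-reference --help`.

```python
import shlex


def closed_reference_cmd(table, rep_seqs, reference, perc_identity=0.97):
    """Build the closed-reference clustering command for the PICRUSt branch only."""
    if not 0.0 < perc_identity <= 1.0:
        raise ValueError("perc_identity must be in (0, 1]")
    return [
        "qiime", "vsearch", "cluster-features-closed-reference",
        "--i-table", table,                  # denoised feature table
        "--i-sequences", rep_seqs,           # denoised representative sequences
        "--i-reference-sequences", reference,
        "--p-perc-identity", str(perc_identity),
        "--o-clustered-table", "table-cr.qza",
        "--o-clustered-sequences", "rep-seqs-cr.qza",
        "--o-unmatched-sequences", "unmatched.qza",
    ]


# Hypothetical artifact names, for illustration only.
cmd = closed_reference_cmd("table.qza", "rep-seqs.qza", "gg-13-5-97-otus.qza")
print(shlex.join(cmd))
```

The composition and diversity analyses keep using the original denoised table and sequences; only the clustered outputs feed into PICRUSt. The unmatched-sequences output is also a handy place to look when the match rate is low.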