Confused about QIIME 2 feature-classifier

Lennon_Lee · May 21, 2018, 6:41am

Dear all,

I am new to 16S amplicon studies.

My data were generated on an Illumina MiSeq (PE 250). I imported the demultiplexed paired-end sequences, with barcodes and low-quality reads already removed, and used DADA2 to join and denoise them.

I then imported the Greengenes 13_5 database and trimmed the reference sequences with my own primers. Next I trained the classifier on the trimmed reference sequences and ran the classification. I obtained an unexpected result, so I am wondering whether I did anything wrong in the feature classification. Here is what I did:

> qiime feature-classifier extract-reads \
>   --i-sequences gg_13_5_99_otus.qza \
>   --p-f-primer CCTAYGGGRBGCASCAG \
>   --p-r-primer GGACTACNNGGGTATCTAAT \
>   --o-reads trimmed_gg_13_5_99_otus.qza

> qiime feature-classifier fit-classifier-naive-bayes \
>   --i-reference-reads trimmed_gg_13_5_99_otus.qza \
>   --i-reference-taxonomy gg_13_5_99_otu_taxonomy.qza \
>   --o-classifier trimmed_gg_13_5_99_classifier.qza

> qiime feature-classifier classify-sklearn \
>   --i-classifier trimmed_gg_13_5_99_classifier.qza \
>   --i-reads rep-seqs.qza \
>   --o-classification trimmed_classify-sklearn_gg_13_5_99_taxonomy.qza

I have several questions about the above procedure:

I trimmed the unaligned reference sequences and ran the classification on my own unaligned representative sequences. Is that correct? Greengenes also provides aligned reference sequences; should I trim those instead and classify my aligned representative sequences?
If I prefer to use "classify-consensus-vsearch", do I still need to train a classifier on the trimmed reference sequences with "fit-classifier-naive-bayes"? And again, should I use aligned sequences throughout?

Any suggestions or help would be greatly appreciated!

Jaroslaw_Grzadziel · May 21, 2018, 1:46pm

What exactly do you mean?

Lennon_Lee · May 21, 2018, 2:16pm

Hello!

I used mothur before and obtained a “satisfying” microbial composition, meaning the result was close to what is described in the references of my field.

Please forgive my poor choice of the word “unexpected”; I just wonder whether I did everything the right way.

I am actually running PICRUSt on my QIIME 2 output. Before that, I ran vsearch closed-reference clustering but obtained a very low match rate, so I think I must have made a mistake somewhere, perhaps even in the classification step. Here is what I did:

> qiime vsearch cluster-features-closed-reference \
>   --i-table my-table.qza \
>   --i-sequences my-rep-seq.qza \
>   --i-reference-sequences gg_13_5_97_otus.qza \
>   --p-perc-identity 0.97 \
>   --p-threads 0 \
>   --o-clustered-table table-cr-97.qza \
>   --o-clustered-sequences 97_clustered-seq.qza \
>   --o-unmatched-sequences unmatched.qza

I think I should input the aligned representative sequences and the aligned reference sequences. Is that right?

Thanks for your help.

Nicholas_Bokulich · May 21, 2018, 4:52pm

You did that correctly. Use the unaligned sequences, not the aligned ones.

No, you do not train a classifier. You just use the imported greengenes taxonomy and reference sequences (same inputs as in fit-classifier-naive-bayes) as input to classify-consensus-vsearch.
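As a rough sketch of what that could look like, reusing the file names from the original post: the output name `vsearch_gg_13_5_99_taxonomy.qza` is hypothetical, and some later QIIME 2 releases also expect a second output for the raw search results, so check `qiime feature-classifier classify-consensus-vsearch --help` for your version. The snippet only prints the command unless `RUN=1` is set.

```shell
# Sketch only: input names follow the thread, the output name is hypothetical.
# No trained classifier is involved; the reference reads and taxonomy are
# passed in directly.
set -- qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads gg_13_5_99_otus.qza \
  --i-reference-taxonomy gg_13_5_99_otu_taxonomy.qza \
  --o-classification vsearch_gg_13_5_99_taxonomy.qza

echo "$*"                                   # show the command for review
if [ "${RUN:-0}" = "1" ]; then "$@"; fi     # run it only when RUN=1
```

Running with `RUN=1` requires QIIME 2 and the input artifacts in the working directory.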

Many of the references in your field probably used similar methodology, so a similar result is not surprising, but a satisfying result does not necessarily indicate the "correct" result. The sklearn classifier here is a bit more cautious and less likely to overclassify, but you can modify the parameters as described here to make the classifier less cautious (akin to the RDP classifier used by mothur, which is actually a somewhat similar algorithm to the sklearn naive Bayes classifier you are using in QIIME 2).
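One parameter that controls this caution is `--p-confidence` on `classify-sklearn`. The sketch below lowers it to 0.5, which is an illustrative value rather than a recommendation (the default is, as far as I know, 0.7); the output file name is hypothetical. The snippet only prints the command unless `RUN=1` is set.

```shell
# Sketch only: 0.5 is an arbitrary example threshold, and the output name
# is hypothetical. A lower confidence lets the classifier assign deeper
# taxonomic levels at the cost of more potential overclassification.
set -- qiime feature-classifier classify-sklearn \
  --i-classifier trimmed_gg_13_5_99_classifier.qza \
  --i-reads rep-seqs.qza \
  --p-confidence 0.5 \
  --o-classification taxonomy_gg_13_5_99_conf_0.5.qza

echo "$*"                                   # show the command for review
if [ "${RUN:-0}" = "1" ]; then "$@"; fi     # run it only when RUN=1
```

Comparing the resulting taxonomy against the default run shows how much of the difference from mothur comes down to this threshold alone.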

That said, I really do not expect a wildly different result between the RDP and sklearn classifiers — so it is possible that something went wrong. Can you share a barplot or other results and be a bit more specific about what is unsatisfying? If there are many unclassified sequences, there are other posts on the forum that explain common errors related to this...
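A barplot can be generated roughly as follows. This is a sketch: the table and taxonomy names are taken from earlier in the thread, while `sample-metadata.tsv` and `taxa-bar-plots.qzv` are hypothetical names. The snippet only prints the command unless `RUN=1` is set.

```shell
# Sketch only: sample-metadata.tsv and taxa-bar-plots.qzv are hypothetical
# file names; the table and taxonomy artifacts are the ones from the thread.
set -- qiime taxa barplot \
  --i-table my-table.qza \
  --i-taxonomy trimmed_classify-sklearn_gg_13_5_99_taxonomy.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv

echo "$*"                                   # show the command for review
if [ "${RUN:-0}" = "1" ]; then "$@"; fi     # run it only when RUN=1
```

The resulting `.qzv` file can be opened with `qiime tools view` or at view.qiime2.org.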

What sample type are you using? If you would expect a higher match rate to the reference database, you are either using the wrong reference database or there is something wrong with your sequences, e.g., a high number of non-target (e.g., contaminant) sequences.

No, use unaligned sequences.

I hope that helps!

Lennon_Lee · May 22, 2018, 12:45am

Thanks for your kind reply!

Thanks for your suggestion; I think I can post the barplot later.

My sample type is fish gut content. I amplified the V3-V4 region from my extracted DNA. I have tried the SILVA and Greengenes databases; do you have any other suggestions?

Thank you very much.

Lennon_Lee · May 22, 2018, 12:51am

Another consideration of mine is that I am going to use PICRUSt for my downstream analysis. As far as I know, PICRUSt uses Greengenes 13_5 as its latest reference database. Do you think I should keep the database consistent when I do the taxonomic classification? I am worried about how it will be judged if I apply different databases to my data for different analyses.

Nicholas_Bokulich · May 22, 2018, 1:27pm

No, that sounds fine. I am not too familiar with fish guts, but would not expect so many unclassified (with feature classifier) or unmatched to reference (vsearch closed-reference OTU picking prior to PICRUSt) sequences, so there may be another issue with the data, e.g., large amounts of host or food DNA or other non-target sequences in the data. A barplot of the taxonomy results and more information on your library prep/sequencing protocol could help diagnose.
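One way to look for host or food DNA is to inspect the sequences that failed closed-reference clustering. The sketch below assumes the `unmatched.qza` artifact from the clustering command earlier in the thread; the output name `unmatched.qzv` is hypothetical. To my knowledge, the resulting visualization lists each sequence with a link to an NCBI BLAST search, which makes spot checks quick. The snippet only prints the command unless `RUN=1` is set.

```shell
# Sketch only: unmatched.qza comes from the closed-reference step in the
# thread; unmatched.qzv is a hypothetical output name.
set -- qiime feature-table tabulate-seqs \
  --i-data unmatched.qza \
  --o-visualization unmatched.qzv

echo "$*"                                   # show the command for review
if [ "${RUN:-0}" = "1" ]; then "$@"; fi     # run it only when RUN=1
```

If many of the unmatched sequences turn out to be fish mitochondrial or dietary sequences, that would point to non-target amplification rather than a classification error.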

I recommend completely decoupling your PICRUSt analysis from your other analyses. PICRUSt has its own needs (such as closed-reference OTU picking and the GG 13_5 database), but these needn't be applied to your other analyses, which can be much more flexible and nuanced in their approach. I recommend:

Use dada2 or deblur to denoise your sequence data.
Use the denoised sequences/table, not OTU-picked data (unless that really is your preference), for downstream analyses to study microbiome composition and diversity. Use whatever reference database is best suited to your data; don't use GG 13_5 just because that's what PICRUSt does (you will not be comparing those data sets directly, anyway).
Perform closed-reference OTU picking on the denoised sequences output from dada2 or deblur for downstream PICRUSt analysis.

I hope that helps!

Nicholas_Bokulich · May 22, 2018, 6:36pm

An off-topic reply has been split into a new topic: Do any existing qiime2 tools allow classification against the NCBI 16S DB?

Please keep replies on-topic in the future.

Lennon_Lee · May 23, 2018, 2:04am

Thank you very much for your suggestions!!
