Sklearn-classifier bug for V3-V4 amplicons

SoMe · February 8, 2019, 1:05pm

Hi every one,

I experience a problem with classification of my sequences with the plugin feature-classifier classify-sklearn.
Indeed, I have amplicon data targeting the 16S V3-V4 region. First, raw data were treated with Dada2 :
qiime dada2 denoise-paired
--i-demultiplexed-seqs Metabarcoding_16S/QIIME/1-Import/paired-end-demux.qza
--p-trim-left-f 17
--p-trunc-len-f 280
--p-trim-left-r 20
--p-trunc-len-r 260
--p-max-ee 2
--p-trunc-q 2
--p-chimera-method consensus
--p-min-fold-parent-over-abundance 1.0
--p-n-threads 18
--p-n-reads-learn 1000000
--p-hashed-feature-ids
--o-representative-sequences Metabarcoding_16S/QIIME/2-Dada2/rep-seqs-dada2.qza
--o-table Metabarcoding_16S/QIIME/2-Dada2/table-dada2.qza
--o-denoising-stats Metabarcoding_16S/QIIME/2-Dada2/stats-dada2.qza

Then I performed taxonomic classification using this command lines :

1- Pre-training of the classifier for the V3-V4 region

qiime feature-classifier extract-reads
--i-sequences Databases/SILVA_132_QIIME_release/rep_set/rep_set_all/99/silva132_99.qza
--p-f-primer CCTACGGGNGGCWGCAG
--p-r-primer GACTACHVGGGTATCTAATCC
--o-reads Databases/SILVA_132_QIIME_release/rep_set/rep_set_all/99/silva132_99_V3-V4.qza

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads Databases/SILVA_132_QIIME_release/rep_set/rep_set_all/99/silva132_99_V3-V4.qza
--i-reference-taxonomy Databases/SILVA_132_QIIME_release/taxonomy/taxonomy_all/99/taxonomy_7_levels.qza
--o-classifier Databases/SILVA_132_QIIME_release/silva-132-99-nb-classifier_V3-V4.qza

2- Taxonomic classification with the V3-V4 pre-trained classifier
qiime feature-classifier classify-sklearn
--i-classifier Databases/SILVA_132_QIIME_release/silva-132-99-nb-classifier_V3-V4.qza
--i-reads Metabarcoding_16S/QIIME/2-Dada2/rep-seqs-dada2.qza
--o-classification Metabarcoding_16S/QIIME/3-Classification/taxonomy_db_V3-V4.qza
--p-n-jobs 10
--p-reads-per-batch 100
--p-confidence 0.7
--verbose

I also tried with the pre-trained classifier with full-length sequences that I downloaded from the QIIME2 website.
qiime feature-classifier classify-sklearn
--i-classifier /Databases/SILVA_132_QIIME_release/silva-132-99-nb-classifier.qza
--i-reads Metabarcoding_16S/QIIME/2-Dada2/rep-seqs-dada2.qza
--o-classification Metabarcoding_16S/QIIME/3-Classification/taxonomy_db_full-length.qza
--p-n-jobs 10
--p-reads-per-batch 100
--p-confidence 0.7
--verbose

Then I plot taxonomic classification with the plugin :
qiime taxa barplot
--i-table Metabarcoding_16S/QIIME/2-Dada2/table-dada2.qza
--i-taxonomy Metabarcoding_16S/QIIME/3-Classification/taxonomy_db_{V3-V4 or full-lenth}.qza
--m-metadata-file QIIME/APAZ1to27_metadata.csv
--o-visualization Metabarcoding_16S/QIIME/3-Classification/taxa-bar-plots.qzv

For each mode of taxonomic classification (V3-V4 vs full-length databases) I get 2 taxa barplots very different.
With the V3-V4

With full-lenth

I was not expecting so much eukaryota or archaea in the 2 datasets and I decided to pick sequences assigned as Eukaryota with the full-length db and Archaea with the V3-V4 db. Finally in each case sequences are not at all Eukaryota or Archaea but Bacteria (checked by BLASTn vs NR database and SILVA sequence search).

Furthermore when I classify my sequences with qiime feature-classifier classify-consensus-vsearch, everything seems good.
qiime feature-classifier classify-consensus-vsearch
--i-query Metabarcoding_16S/QIIME/2-Dada2/rep-seqs-dada2.qza
--i-reference-reads Databases/SILVA_132_QIIME_release/99-otus.qza
--i-reference-taxonomyDatabases/SILVA_132_QIIME_release/7_level_taxonomy.qza
--p-maxaccepts 0
--p-perc-identity 0.8
--p-query-cov 0.8
--p-strand both
--p-min-consensus 0.51
--p-unassignable-label Unassigned
--p-threads 40
--o-classification Metabarcoding_16S/QIIME/3-Classification/taxonomy_db_full-length_vsearch_classif.qza

I tested this pipeline with qiime2-2018.8 and qiime2-2019.1 with the same results.

How can these results be explained? What is going wrong with my pipeline?

Thank you so much for your replies.

Nicholas_Bokulich · February 8, 2019, 1:20pm

Hi @SoMe,
By any chance are your reads in mixed orientations? The naive Bayes classifier implemented in q2-feature-classifier can really only handle reads that are in a uniform orientation: since the classification is based on kmer profiles, the classifier gets confused when it encounters reads in a reverse orientation — in your case the kmer profiles of reverse reads evidently look more like the kmer profiles of Archaea or Eukaryota than to bacteria. The vsearch+LCA classifier implemented in q2-feature-classifier is not sensitive to this (since it is based on alignment and alignment is checked in both orientations).

The vsearch+LCA classifier is pretty good too — so I recommend just sticking with those results. If you want to use the naive Bayes classifier you will need to figure out a way to reorient the reads that are in reverse orientation... currently we do not have a way to do this in QIIME 2 (also note that mixed read orientation will inflate diversity estimates as well, so are not just a taxonomy problem).

I hope that helps!

SoMe · February 8, 2019, 3:39pm

Thank you very much @Nicholas_Bokulich, it sounds to be a very good explanation!
I have no background with this situation, how can we explain mixed orientation of reads? Do you know a way to get my reads in the same orientation?
I used to use the naive Bayes classifier and I would like to keep it in my pipeline. By the way, I am a little bit confused with the parameters to set for the vsearch+LCA plugin, as the method is very sensitive to them. But perhaps you can help me to set them as best?

Thank you once again!

Nicholas_Bokulich · February 8, 2019, 3:55pm

You can just grab a few reads at random and do an alignment to the reference database sequences to see if they align in the complement or reverse complement orientation.

Not in QIIME 2, but this looks promising: OrientNucleotides: Orient Nucleotide Sequences in DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

This issue crops up every now and then, so I will add a method for read orientation in a future QIIME 2 release. That will make it easier to implement in your pipeline, but for now your best bet is to export and use a tool like decipher, then re-import to QIIME 2 for classification.

See the link to the paper I provided above, which describes the parameters and their effect on classification accuracy. In general, use defaults unless if you are attempting to benchmark, e.g., on a new method or primer set, etc

I hope that helps!

SoMe · February 8, 2019, 4:31pm

Thanks a lot!
I gonna have a look at DECIPHER.