Sequence extraction and Classifier training for Ion Torrent data

I am working with Ion Torrent data with @Jen_S and am trying to create a DADA2 workflow using QIIME2 to compare with our previous QIIME1 pipeline. Some of our previous workflow issues have been resolved already in this forum (thanks @Nicholas_Bokulich !) but we are getting stuck at the point where we want to train our classifier. :exploding_head:

We have Ion Torrent datasets with mixed-orientation reads and single-end sequencing amplicons targeting two different V regions (4 primers total). The data include ATCC mock sequences that were run alongside other samples: 2 mocks (1 even, 1 staggered) per run, with a total of 14 runs (28 mock samples).

Here is the workflow:
• Import data into QIIME2 as .qza
• Run cutadapt for 2 forward and 2 reverse regions on demultiplexed files
• Run DADA2 for each run and each primer
Note: we just found out about qiime dada2 denoise-pyro here, which we plan to use instead of what we were previously using (qiime dada2 denoise-single)
• Merge all of our forward and reverse reads for each V region using qiime feature-table merge with the --p-overlap-method sum flag, to get merged_v2_table.qza and merged_v2_rep_seqs.qza
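For reference, the steps above might look roughly like this. This is only a sketch: the manifest/file names, run names, and primer sequences are placeholders, not our actual values.

```shell
# Import single-end demultiplexed Ion Torrent reads (manifest format assumed)
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format SingleEndFastqManifestPhred33V2 \
  --output-path demux.qza

# Trim one primer per V region/orientation (primer sequence is a placeholder)
qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux.qza \
  --p-front XXXXXXXXXX \
  --o-trimmed-sequences trimmed_v2_fwd.qza

# Denoise each run separately with the Ion Torrent/pyrosequencing method
qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs trimmed_v2_fwd.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-table run1_v2_table.qza \
  --o-representative-sequences run1_v2_rep_seqs.qza \
  --o-denoising-stats run1_v2_stats.qza

# Merge feature tables and rep seqs across runs/orientations per V region
qiime feature-table merge \
  --i-tables run1_v2_table.qza run2_v2_table.qza \
  --p-overlap-method sum \
  --o-merged-table merged_v2_table.qza
qiime feature-table merge-seqs \
  --i-data run1_v2_rep_seqs.qza run2_v2_rep_seqs.qza \
  --o-merged-data merged_v2_rep_seqs.qza
```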

Next we want to train our classifier—we are just running into a couple of issues/questions:

We read that sklearn does not perform well on mixed-orientation reads, so we were thinking about using qiime feature-classifier classify-consensus-vsearch.
In the feature classifier tutorial, there is a sequence extraction step from Greengenes (which we plan on using). We were able to import the OTU FASTA as 99_otus.qza along with the reference taxonomy, but are stuck at the "extract reads" step below:

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer XXXXXXXXXX \
  --p-r-primer XXXXXXXXXX \
  --p-trunc-len 250 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads v2_ref-seqs.qza

The primers, truncation length, and output name above are the parts we would adjust based on our data. Can we input both our forward and reverse primers since we are using the merged table, even though we aren't working with paired-end sequences? In DADA2 our trunc-len was 250, so we want to keep that consistent.

Then we would train our classifier using classify-consensus-vsearch. Do you recommend that we start with the defaults, like the tutorial does? I have put the classifier training/testing commands below, with my understanding of the data we need in parentheses under each argument.

qiime feature-classifier classify-consensus-vsearch
--i-query ARTIFACT FeatureData[Sequence]: sequences to classify taxonomically.
( --i-query merged_v2_rep_seqs.qza \ # output from DADA2 after table merge)
--i-reference-reads ARTIFACT FeatureData[Sequence]: reference sequences.
( --i-reference-reads v2_ref-seqs.qza \ # from extract-reads command)
--i-reference-taxonomy ARTIFACT FeatureData[Taxonomy]: reference taxonomy labels.
( --i-reference-taxonomy ref-taxonomy.qza \ # imported from Greengenes)
--o-classification ARTIFACT FeatureData[Taxonomy]
( --o-classifier v2_classifier.qza )

qiime feature-classifier classify-sklearn \
  --i-classifier v2_classifier.qza \ # from above
  --i-reads v2_rep_seqs.qza \ # output from DADA2 after table merge
  --o-classification v2_taxonomy.qza

qiime metadata tabulate \
  --m-input-file v2_taxonomy.qza \
  --o-visualization v2_taxonomy.qzv

Do these commands look correct?

Finally, while researching the above we came across a new pipeline, qiime feature-classifier classify-hybrid-vsearch-sklearn.

What is the difference between this and the standard vsearch classifier? Can we use it on our mixed-orientation reads? If I understand correctly, this pipeline combines the "train the classifier" and "test the classifier" steps, since a pre-trained sklearn classifier from the data resources is used, correct?

qiime feature-classifier classify-hybrid-vsearch-sklearn
--i-query ARTIFACT FeatureData[Sequence]: sequences to classify taxonomically.
( --i-query merged_v2_rep_seqs.qza \ # output from DADA2 after table merge)
--i-reference-reads ARTIFACT FeatureData[Sequence]: reference sequences.
( --i-reference-reads v2_ref-seqs.qza \ # from extract-reads command)
--i-reference-taxonomy ARTIFACT FeatureData[Taxonomy]: reference taxonomy labels.
( --i-reference-taxonomy ref-taxonomy.qza \ # imported from Greengenes)
--i-classifier ARTIFACT TaxonomicClassifier: pre-trained sklearn taxonomic classifier for classifying the reads.
( --i-classifier gg-13-8-99-nb-classifier.qza # from QIIME2 data resources)
--o-classification ARTIFACT FeatureData[Taxonomy]: the resulting taxonomy.
( --o-classification v2_taxonomy.qza )

qiime metadata tabulate \
  --m-input-file v2_taxonomy.qza \
  --o-visualization v2_taxonomy.qzv

Thanks so much for your help, I realize this is a long and multi-part question! :innocent:


Welcome to the forum @KMaki!

That’s a good call! This is what I’ve advised elsewhere for Ion Torrent data.

Note that that tutorial is for training classifiers for use with classify-sklearn. There is no training step if you plan to use classify-consensus-vsearch since it just aligns against a reference sequence database. More on this below.

That said, extracting reads is perfectly okay: it will speed up the alignment step if you know the V region you are classifying. But extracting reads probably will not impact accuracy too much (it's possible, but not too likely).

Yes that’s fine.

Drop the trunc-len here, though: since your reads are in mixed orientations, those 250 nt could align against any section of the target amplicon, not just the 250 nt at the front end of the amplicon.
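In other words, something like the sketch below (primer sequences are still placeholders); the min/max length filters are kept to discard spurious extractions:

```shell
# Extract the V-region amplicon WITHOUT --p-trunc-len, since mixed-orientation
# reads may cover either end of the amplicon (primers are placeholders)
qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer XXXXXXXXXX \
  --p-r-primer XXXXXXXXXX \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads v2_ref-seqs.qza
```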

Yes! The defaults were optimized for 16S and ITS sequences, as described in the main publication for the q2-feature-classifier plugin, which covers all of the classification methods developed for the plugin.

But I’ve added new features since then, like options to use only top hits. Fiddle with the parameters a bit: since you have mock communities, you have an opportunity to evaluate whether altering the parameters improves your results (see the q2-quality-control plugin and tutorial for methods to evaluate accuracy).
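For example, one way to score each parameter setting against your ATCC mocks is q2-quality-control's composition evaluation. This is a sketch, assuming you have imported the expected mock composition as a FeatureTable[RelativeFrequency] artifact; expected_mock.qza and observed_mock.qza are placeholder names:

```shell
# Compare the observed mock community composition (from your pipeline) to the
# known/expected ATCC composition at each taxonomic level
qiime quality-control evaluate-composition \
  --i-expected-features expected_mock.qza \
  --i-observed-features observed_mock.qza \
  --o-visualization mock_evaluation.qzv
```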

As noted above, there is no classifier training step: classify-consensus-vsearch is the classification step. This is distinct and separate from classify-sklearn, which requires a classifier training step because it uses machine learning models to predict taxonomic affiliation from k-mer frequency profiles. classify-consensus-vsearch just uses VSEARCH to align against a database, followed by a consensus taxonomy assignment step in q2-feature-classifier. No training required! The overview tutorial has a flowchart depicting the differences between these methods.

In practice it means this:

This is not a classifier, this is your taxonomic classification of each input sequence

This will not work, since this is a FeatureData[Taxonomy], not a trained classifier.

This performs an exact match search with classify-consensus-vsearch, and anything that does not 100% match a reference sequence is re-classified with classify-sklearn. It is still a somewhat experimental method.

In your case, probably not: since the reads are in mixed orientation, you will have trouble orienting them for both the exact-match step and classify-sklearn. The exact match requires a 100% end-to-end match, so reads need to be trimmed at precisely the same positions as the reference; since your reads are single-end, mixed-orientation, and truncated at 250 nt, you cannot know whether a given read will align to the front or back end of the simulated reference amplicons.

But in general yes this could work if paired-end reads are used (whether or not they are in mixed orientation) or single-end reads that are in the same orientation.

So bottom line for your analysis: just run classify-consensus-vsearch (either on the extracted reads or not) and you are all done!
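Concretely, using the artifact names from earlier in the thread, the whole classification could look something like this (depending on your QIIME 2 release, classify-consensus-vsearch may also require an --o-search-results output):

```shell
# One-step classification: no training, just alignment + consensus assignment
qiime feature-classifier classify-consensus-vsearch \
  --i-query merged_v2_rep_seqs.qza \
  --i-reference-reads v2_ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification v2_taxonomy.qza

# Inspect the resulting taxonomy
qiime metadata tabulate \
  --m-input-file v2_taxonomy.qza \
  --o-visualization v2_taxonomy.qzv
```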

Thank you so much @Nicholas_Bokulich !
This helped classify things a lot. Just to make sure I am 100% clear: if I am running vsearch, what artifacts do inputs 2 and 3 come from in the qiime feature-classifier classify-consensus-vsearch command?
qiime feature-classifier classify-consensus-vsearch \
  --i-query merged_v2_rep_seqs.qza \ # the representative-sequences artifact from DADA2 after all runs are merged
  --i-reference-reads ARTIFACT FeatureData[Sequence] # where does this come from?
  --i-reference-taxonomy ARTIFACT FeatureData[Taxonomy] # reference taxonomy labels; where does this come from?
  --o-classification v2_feature_classification.qza

Thank you!


These are the imported reference sequences (e.g., from Greengenes), optionally after extracting the target amplicon.

Imported reference taxonomy that matches the sequences that you are inputting.
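If you have not already imported those, for Greengenes 13_8 it would look something like the following; the input paths are placeholders for wherever you unpacked the reference files:

```shell
# Import the Greengenes 99% OTU reference sequences
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path 99_otus.qza

# Import the matching taxonomy file (headerless two-column TSV)
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path ref-taxonomy.qza
```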

I hope that helps!
