Hello,
I am working with Ion Torrent data with @Jen_S and am trying to create a DADA2 workflow in QIIME2 to compare with our previous QIIME1 pipeline. Some of our earlier workflow issues have already been resolved in this forum (thanks @Nicholas_Bokulich!), but we are stuck at the point where we want to train our classifier.
Our datasets are Ion Torrent single-end amplicon reads in mixed orientation, targeting two different V regions (4 primers total). Each run includes ATCC mock communities sequenced alongside the other samples: 2 mocks (1 even, 1 staggered) per run across 14 runs (28 mock samples in total).
Here is the workflow:
• Data import into QIIME2 as .qza
• Cutadapt performed for 2 forward and 2 reverse regions on demultiplexed files
• Run DADA2 for each run and each primer
Note: we just found out about qiime dada2 denoise-pyro, which we plan to use instead of qiime dada2 denoise-single (what we were using previously).
• Merge all of our forward and reverse reads for each V region using qiime feature-table merge with the --p-overlap-method sum flag to get merged_v2_table.qza and merged_v2_rep_seqs.qza
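For reference, a minimal sketch of the merge step in the last bullet. The per-orientation file names (v2_fwd_table.qza, v2_rev_table.qza, v2_fwd_rep_seqs.qza, v2_rev_rep_seqs.qza) are placeholders, not our actual file names, and the sketch assumes the representative sequences are merged with qiime feature-table merge-seqs. The DRY_RUN variable and the merge_v2 function name are only conveniences for this example: by default the commands are printed rather than executed.

```shell
# Print the commands by default; run with DRY_RUN= (empty) to execute them.
DRY_RUN=${DRY_RUN-echo}

merge_v2() {
  # Merge the two feature tables; "sum" adds the counts for any
  # feature/sample pair that is present in both tables.
  $DRY_RUN qiime feature-table merge \
    --i-tables v2_fwd_table.qza \
    --i-tables v2_rev_table.qza \
    --p-overlap-method sum \
    --o-merged-table merged_v2_table.qza

  # Merge the matching representative sequences (placeholder file names).
  $DRY_RUN qiime feature-table merge-seqs \
    --i-data v2_fwd_rep_seqs.qza \
    --i-data v2_rev_rep_seqs.qza \
    --o-merged-data merged_v2_rep_seqs.qza
}

merge_v2
```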
Next we want to train our classifier, but we are running into a couple of issues/questions:
We read that sk-learn does not perform well on mixed orientation reads, so we were thinking about using qiime feature-classifier classify-consensus-vsearch
In the feature classifier tutorial, there is a sequence extraction step from Greengenes (which we plan to use). We were able to import the OTUs FASTA into 99_otus.qza along with the reference taxonomy, but are stuck at the “extract reads” step below:
qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer XXXXXXXXXX \
  --p-r-primer XXXXXXXXXX \
  --p-trunc-len 250 \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads v2_ref-seqs.qza
The parts we would adjust for our primers are the two primer sequences (XXXXXXXXXX), the truncation length (250), and the v2_ prefix on the output file name. Can we supply both our forward and reverse primers since we are using the merged table, even though we aren’t working with paired-end sequences? Our DADA2 trunc-len was 250, so we want to keep that consistent.
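To make concrete what we mean by mixed orientation: a read that starts from the reverse primer is the reverse complement of the same amplicon read from the forward primer. A minimal shell sketch of that relationship is below. The helper name revcomp and the example sequence are made up for illustration (it is not one of our primers), and the sketch assumes only standard awk and tr.

```shell
# Reverse-complement a DNA sequence, including the IUPAC degenerate bases
# often found in primers (R/Y, M/K, B/V, D/H; S, W and N map to themselves).
revcomp() {
  printf '%s\n' "$1" |
    awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
    tr 'ACGTRYMKBDHVacgtrymkbdhv' 'TGCAYRKMVHDBtgcayrkmvhdb'
}

# Made-up example sequence, not one of our primers.
revcomp "AACGTY"   # prints RACGTT
```

This is only meant to illustrate the strand issue; it is not a step in our pipeline.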
Then we train our classifier using classify-consensus-vsearch. Do you recommend that we start with the defaults, as the tutorial does? I have listed the classifier training/testing commands below, with my understanding of the data we need in parentheses under each parameter.
qiime feature-classifier classify-consensus-vsearch
--i-query ARTIFACT FeatureData[Sequence] Sequences to classify taxonomically.
( --i-query merged_v2_rep_seqs.qza  # output from DADA2 after table merge )
--i-reference-reads ARTIFACT FeatureData[Sequence] reference sequences.
( --i-reference-reads v2_ref-seqs.qza  # from extract-reads command )
--i-reference-taxonomy ARTIFACT FeatureData[Taxonomy] reference taxonomy labels.
( --i-reference-taxonomy ref-taxonomy.qza  # import from Greengenes )
--o-classification ARTIFACT FeatureData[Taxonomy]
( --o-classifier v2_classifier.qza )
qiime feature-classifier classify-sklearn \
  --i-classifier v2_classifier.qza \
  --i-reads merged_v2_rep_seqs.qza \
  --o-classification v2_taxonomy.qza
( v2_classifier.qza is from the command above; merged_v2_rep_seqs.qza is the output from DADA2 after table merge )
qiime metadata tabulate \
  --m-input-file v2_taxonomy.qza \
  --o-visualization v2_taxonomy.qzv
Do these commands look correct?
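For comparison, here is how we would expect the vsearch step to look as a single command, going by the help text quoted above (which lists --o-classification as the output). This is a sketch, not something we have validated. The numeric values are what we understand to be the defaults in the 2020.2 release, so please check them against qiime feature-classifier classify-consensus-vsearch --help. We included --p-strand both because of our mixed orientation reads. DRY_RUN and the function name classify_v2 are only conveniences for this example, so that the command is printed rather than executed by default.

```shell
# Print the command by default; run with DRY_RUN= (empty) to execute it.
DRY_RUN=${DRY_RUN-echo}

classify_v2() {
  # Values for perc-identity, maxaccepts and min-consensus are assumed
  # defaults (to be confirmed with --help), written out for clarity.
  $DRY_RUN qiime feature-classifier classify-consensus-vsearch \
    --i-query merged_v2_rep_seqs.qza \
    --i-reference-reads v2_ref-seqs.qza \
    --i-reference-taxonomy ref-taxonomy.qza \
    --p-strand both \
    --p-perc-identity 0.8 \
    --p-maxaccepts 10 \
    --p-min-consensus 0.51 \
    --o-classification v2_taxonomy.qza
}

classify_v2
```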
Finally, while researching the above we came across a new pipeline, qiime feature-classifier classify-hybrid-vsearch-sklearn: https://docs.qiime2.org/2020.2/plugins/available/feature-classifier/classify-hybrid-vsearch-sklearn/
What is the difference between this and the standard vsearch classifier? Can we use it on our mixed orientation reads? If I understand correctly, this pipeline combines the “Train the classifier” and “Test the classifier” steps, since it uses a pre-trained sklearn classifier from the data resources. Is that correct?
qiime feature-classifier classify-hybrid-vsearch-sklearn
--i-query ARTIFACT FeatureData[Sequence] Sequences to classify taxonomically.
( --i-query merged_v2_rep_seqs.qza  # output from DADA2 after table merge )
--i-reference-reads ARTIFACT FeatureData[Sequence] reference sequences.
( --i-reference-reads v2_ref-seqs.qza  # from extract-reads command )
--i-reference-taxonomy ARTIFACT FeatureData[Taxonomy] reference taxonomy labels.
( --i-reference-taxonomy ref-taxonomy.qza  # import from Greengenes )
--i-classifier ARTIFACT TaxonomicClassifier Pre-trained sklearn taxonomic classifier for classifying the reads.
( --i-classifier gg-13-8-99-nb-classifier.qza  # from QIIME2 data resources )
--o-classification ARTIFACT FeatureData[Taxonomy] The resulting taxonomy classifications.
( --o-classification v2_taxonomy.qza )
qiime metadata tabulate \
  --m-input-file v2_taxonomy.qza \
  --o-visualization taxonomy.qzv
Thanks so much for your help, I realize this is a long and multi-part question!