I have encountered an obstacle while training the Greengenes classifier to more accurately predict the taxonomy of my sequencing reads. Any help would be highly appreciated, thanks!
So, I realize that in the “qiime feature-classifier extract-reads” command, I need to provide the forward and reverse primers (without Fluidigm tags or barcodes; only the biological sequence). I requested this info from the sequencing centre. They sent me a list of 515f/806r primers, but they use staggered primer sequences. They have, in total, 4 sets of primers.
So here is what I have in mind: I generate a ref-seqs.qza file individually for each of the 4 sets, and then merge the .qza files to create a composite .qza file, on which I train a classifier using the Naive Bayes approach.
Am I correct in my thinking? Or is there an easier alternate way? Thanks!
I'm assuming by a "staggered approach" you mean that you have multiple regions for the same sequence? You can either use one region and classify with the corresponding classifier (and if you're doing 515-806, check out the resource page for some pre-trained options). Alternatively, if you want to use all five regions, you may find the Ion Torrent conversation really useful.
So, I thought about this yesterday after my post. To train my classifier, I extracted the appropriate portion of the GG database using the 515FP1 and 806RP1 sequences (minus the CS1 and CS2, of course), because I realized the other primers are just 1-2 nucleotides longer than the FP1 and RP1 sequences (which I assume to be the parent primers). Is this approach OK?
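To illustrate why extracting with the parent primers alone is reasonable, here is a minimal Python sketch of primer-based extraction from a reference sequence. It is not the extract-reads implementation: it uses exact IUPAC matching with no mismatches allowed (the real tool tolerates some), and the toy reference, the insert, and the "TG" stagger bases are made up for illustration. Only the two primer sequences are the standard 515F/806R ones.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]", "Y": "[CT]",
    "K": "[GT]", "V": "[ACG]", "H": "[ACT]", "D": "[AGT]", "B": "[CGT]",
    "N": "[ACGT]",
}
# Complement table that also handles the degenerate codes.
COMPLEMENT = str.maketrans("ACGTMRWSYKVHDBN", "TGCAKYWSRMBDHVN")

FWD_515F = "GTGCCAGCMGCCGCGGTAA"
REV_806R = "GGACTACHVGGGTWTCTAAT"


def reverse_complement(seq):
    """Reverse-complement a (possibly degenerate) DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a degenerate primer into a regex that matches exactly."""
    return re.compile("".join(IUPAC[base] for base in primer))


def extract_region(reference, fwd_primer, rev_primer):
    """Return the sequence between the two primer sites, or None.

    The reverse primer anneals to the opposite strand, so we search the
    reference for its reverse complement, downstream of the forward site.
    """
    fwd = primer_regex(fwd_primer).search(reference)
    if fwd is None:
        return None
    rev = primer_regex(reverse_complement(rev_primer)).search(reference, fwd.end())
    if rev is None:
        return None
    return reference[fwd.end():rev.start()]


# Toy reference (hypothetical): flank + 515F site + insert + 806R site + flank.
insert = "TACGGAGGGTGCAAGCGTTAATCGG"
reference = (
    "TTACG" + "GTGCCAGCAGCCGCGGTAA" + insert + "ATTAGATACCCTGGTAGTCC" + "ACGCC"
)

# The parent primers locate the region.
parent_hit = extract_region(reference, FWD_515F, REV_806R)

# A primer carrying extra 5' stagger bases (hypothetical "TG") need not match
# the reference, because those bases are spacers and not part of the 16S gene.
staggered_hit = extract_region(reference, "TG" + FWD_515F, REV_806R)

print(parent_hit == insert, staggered_hit)
```

In this toy case the parent primers recover the insert, while the stagger-prefixed primer finds no site under exact matching. All stagger variants bind the same site, so the parent pair already describes the region they amplify.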
Thanks for the clarification! If I were you, I think I’d still just use a pre-trained 515-806 classifier. It’s going to be good enough, especially for your first analysis. (I regularly use generic V34 classifiers for my data, even though they don’t use my exact primers.) It saves you time, confusion, and computational pain.
Sounds like they used an approach similar to Lundberg et al. 2013. If you’ve not done so already, just be sure to use q2-cutadapt to remove the 515/806 primer sequences from your reads first. Note that you only have to enter the actual 515/806 sequences themselves (i.e. GTGCCAGCMGCCGCGGTAA & GGACTACHVGGGTWTCTAAT), not every permutation of the additional staggered bases. Cutadapt will remove all the extra bases up to and including the primer sequence. Of course, this assumes the primer sequences are part of your reads.
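As a rough illustration of the point about staggered bases, here is a minimal Python sketch of 5' primer trimming. It is not Cutadapt itself: it uses exact IUPAC matching with no error tolerance, and the stagger bases and the biological sequence are invented for the example. It only shows why a single core primer is enough to bring every stagger variant back to the same start position.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]", "Y": "[CT]",
    "K": "[GT]", "V": "[ACG]", "H": "[ACT]", "D": "[AGT]", "B": "[CGT]",
    "N": "[ACGT]",
}

FWD_515F = "GTGCCAGCMGCCGCGGTAA"


def trim_front(read, primer):
    """Drop everything up to and including the first primer match.

    Reads without a primer match are returned unchanged in this sketch.
    """
    pattern = "".join(IUPAC[base] for base in primer)
    match = re.search(pattern, read)
    return read[match.end():] if match else read


# Hypothetical data: the same biological sequence behind 0-3 stagger bases.
biological = "TACGGAGGGTGCAAGCGTTAATCGG"
primer_in_read = "GTGCCAGCAGCCGCGGTAA"  # one concrete instance of 515F (M = A)
staggers = ["", "T", "GT", "AGT"]
reads = [stagger + primer_in_read + biological for stagger in staggers]

trimmed = [trim_front(read, FWD_515F) for read in reads]
print(all(t == biological for t in trimmed))
```

Whatever the number of stagger bases in front of the primer, each read is cut at the end of the same primer match, so all trimmed reads start at the same position.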
Thank you for your help! Yes, I think I did what @SoilRotifer said: I used only the actual primer sequences (the sequencing center had already highlighted the CS1 and CS2 regions in green, so I knew the actual primer sequence, which matches the one you refer to).
Also, @jwdebelius, I could not use a pre-trained classifier because I was getting an error message for the “scikit learner”, so I had to retrain on my database. I downloaded the 99% OTU and taxonomy files from the GG_13_8 release, extracted my region of interest (without any truncation, so as not to exclude reference read files), and re-trained on the extracted region using the Naive Bayes classifier to generate a rep-seqs.qza file. Now everything seems good. Is this approach correct?