feature-classifier command

Hi,
I have encountered an obstacle while training the Greengenes classifier to more accurately predict taxonomy of my sequencing reads. Any help would be highly appreciated, thanks :smiley: !

So, I realize that in the “qiime feature-classifier extract-reads” command, I need to provide the forward and reverse primers (without Fluidigm tags or barcodes; only the biological sequence). I requested this information from the sequencing centre, and they sent me a list of 515f/806r primers, but they use staggered primer sequences. In total, they have 4 sets of primers.

So here is what I have thought: I generate a ref-seqs.qza file individually for each of the 4 sets, and then merge the .qza files into a composite .qza file, which I use to train a Naive Bayes classifier.

Am I correct in my thinking? Or is there an easier alternative? Thanks!

Hi @anirban.mcgill,

Welcome to the :qiime2: forum!

I’m assuming by a “staggered approach” you mean that you have multiple regions for the same sequence? You can either use one region and classify with the corresponding classifier (and if you’re doing 515-806, check out the resource page for some pre-trained options). Alternatively, if you want to use all five regions, you may find the Ion Torrent conversation really useful.

Best,
Justine

Hi @jwdebelius

Thank you for your response! QIIME2 is amazing :smiley:
What I meant was that for the 515f/806r region, the sequencing center used these staggered primer sequences:

515FP1-CS1 ACACTGACGACATGGTTCTACAGTGCCAGCMGCCGCGGTAA
515FP2-CS1 ACACTGACGACATGGTTCTACATGTGCCAGCMGCCGCGGTAA
515FP3-CS1 ACACTGACGACATGGTTCTACAACGTGCCAGCMGCCGCGGTAA
515FP4-CS1 ACACTGACGACATGGTTCTACACTAGTGCCAGCMGCCGCGGTAA

806RP1-CS2 TACGGTAGCAGAGACTTGGTCTGGACTACHVGGGTWTCTAAT
806RP2-CS2 TACGGTAGCAGAGACTTGGTCTTGGACTACHVGGGTWTCTAAT
806RP3-CS2 TACGGTAGCAGAGACTTGGTCTACGGACTACHVGGGTWTCTAAT
806RP4-CS2 TACGGTAGCAGAGACTTGGTCTCTAGGACTACHVGGGTWTCTAAT

So, I thought about this yesterday after my post. To train my classifier, I extracted the appropriate portion of the GG database using the 515FP1 and 806RP1 sequences (minus the CS1 and CS2 tags, of course), because I realized the other primers are just 1-3 nucleotides longer than the FP1 and RP1 sequences (which I assume to be the parent primers). Is this approach OK?
Thanks,
Anirban
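
For anyone following along, the relationship between the four primers in each set can be checked with a few lines of Python. This is only a minimal sketch: `split_staggers` is a hypothetical helper, and it assumes that the CS1/CS2 tag is the prefix shared by every primer in a set and that the shortest remainder (P1) is the biological primer.

```python
# Primer names and full sequences as listed in the post above.
FORWARD = {
    "515FP1-CS1": "ACACTGACGACATGGTTCTACAGTGCCAGCMGCCGCGGTAA",
    "515FP2-CS1": "ACACTGACGACATGGTTCTACATGTGCCAGCMGCCGCGGTAA",
    "515FP3-CS1": "ACACTGACGACATGGTTCTACAACGTGCCAGCMGCCGCGGTAA",
    "515FP4-CS1": "ACACTGACGACATGGTTCTACACTAGTGCCAGCMGCCGCGGTAA",
}
REVERSE = {
    "806RP1-CS2": "TACGGTAGCAGAGACTTGGTCTGGACTACHVGGGTWTCTAAT",
    "806RP2-CS2": "TACGGTAGCAGAGACTTGGTCTTGGACTACHVGGGTWTCTAAT",
    "806RP3-CS2": "TACGGTAGCAGAGACTTGGTCTACGGACTACHVGGGTWTCTAAT",
    "806RP4-CS2": "TACGGTAGCAGAGACTTGGTCTCTAGGACTACHVGGGTWTCTAAT",
}

# Assumption: each tag is the prefix common to all primers in its set.
CS1 = "ACACTGACGACATGGTTCTACA"
CS2 = "TACGGTAGCAGAGACTTGGTCT"


def split_staggers(primers, tag):
    """Strip the tag, then split each primer into (stagger, core).

    The core is taken to be the shortest tag-free remainder; every other
    primer must end with it, and whatever precedes it is the stagger.
    """
    remainders = {}
    for name, seq in primers.items():
        if not seq.startswith(tag):
            raise ValueError(f"{name} does not start with the expected tag")
        remainders[name] = seq[len(tag):]

    core = min(remainders.values(), key=len)
    staggers = {}
    for name, rest in remainders.items():
        if not rest.endswith(core):
            raise ValueError(f"{name} does not end with the core primer")
        staggers[name] = rest[: len(rest) - len(core)]
    return core, staggers


fwd_core, fwd_staggers = split_staggers(FORWARD, CS1)
rev_core, rev_staggers = split_staggers(REVERSE, CS2)

print("forward core:", fwd_core, "staggers:", fwd_staggers)
print("reverse core:", rev_core, "staggers:", rev_staggers)
```

If the helper runs without raising, all four primers in a set share one biological primer and differ only in a short spacer between the tag and that primer, so a single forward/reverse pair is enough for extracting the reference region.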

Hi @anirban.mcgill,

Thanks for the clarification! If I were you, I think I’d still just use a pre-trained 515-806 classifier. It’s going to be good enough, especially for your first analysis. (I regularly use generic V34 classifiers on my own data, even though they weren’t built with my exact primers.) It saves you time, confusion, and computational pain.

Best,
Justine

Hi @anirban.mcgill,

Sounds like they used an approach similar to Lundberg et al. 2013. If you’ve not done so already, just be sure to use q2-cutadapt to remove the 515/806 primer sequences from your reads first. Note that you only have to enter the actual 515/806 sequences themselves (i.e. GTGCCAGCMGCCGCGGTAA & GGACTACHVGGGTWTCTAAT), not every permutation of the additional staggered bases. Cutadapt will remove all the extra bases up to and including the primer sequence. Of course, this assumes the primer sequences are a part of your reads.

-Mike
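
To illustrate why one primer sequence covers every stagger, here is a toy Python sketch of 5' primer trimming. It is not how cutadapt is implemented: the real tool tolerates mismatches, whereas this version requires an exact match against the IUPAC-expanded primer. `trim_front` and the example reads are made up for illustration.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))


def trim_front(read, primer):
    """Drop everything up to and including the first primer match.

    Reads without a match are returned unchanged.
    """
    match = primer_regex(primer).search(read)
    return read[match.end():] if match else read


FWD = "GTGCCAGCMGCCGCGGTAA"
insert = "TACGGAGGGTGCAAGCGTTAATCGG"  # made-up amplicon fragment

# Simulated reads: stagger bases + primer (M resolved to C) + insert.
reads = [s + "GTGCCAGCCGCCGCGGTAA" + insert for s in ["", "T", "AC", "CTA"]]
trimmed = [trim_front(r, FWD) for r in reads]

for before, after in zip(reads, trimmed):
    print(len(before), "->", len(after))
```

Because the match is not anchored to the first base of the read, any stagger bases in front of the primer are discarded together with it, so all four simulated reads end up starting at the same position.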

Hi @jwdebelius, @SoilRotifer

Thank you for your help! Yes, I think I did what @SoilRotifer said: I used only the actual primer sequences (the sequencing center had already highlighted the CS1 and CS2 regions in green, so I knew the actual primer sequence, which matches the one you refer to).

Also, @jwdebelius, I could not use a pre-trained classifier because I was getting an error message for the “scikit learner”. So, I had to retrain on my database. I downloaded the 99% OTU sequence and taxonomy files from the GG_13_8 release, extracted my region of interest (without any truncation, so as not to exclude reference reads), and re-trained on the extracted region using the Naive Bayes classifier to generate a rep-seqs.qza file. After that, everything seemed to work. Is this approach correct?

Thanks!
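
As a rough picture of what "extracting the region of interest" from a reference means, the following Python sketch cuts out the stretch between a forward primer and the reverse complement of a reverse primer. It is a toy under stated assumptions, not the q2-feature-classifier implementation: `extract_region` is a hypothetical helper, matching is exact against IUPAC codes, and the reference sequence is invented.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement of a (possibly degenerate) sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def to_regex(primer):
    return re.compile("".join(IUPAC[base] for base in primer))


def extract_region(reference, fwd_primer, rev_primer):
    """Return the sequence between the two primer sites, or None."""
    fwd = to_regex(fwd_primer).search(reference)
    if fwd is None:
        return None
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer site.
    rev = to_regex(revcomp(rev_primer)).search(reference, fwd.end())
    if rev is None:
        return None
    return reference[fwd.end():rev.start()]


FWD = "GTGCCAGCMGCCGCGGTAA"
REV = "GGACTACHVGGGTWTCTAAT"

# Invented reference: flank + forward site + region + reverse site + flank.
region = "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAG"
reference = "AAAA" + "GTGCCAGCAGCCGCGGTAA" + region + "ATTAGAAACCCCGGTAGTCC" + "TTTT"

print(extract_region(reference, FWD, REV))
```

Only the biological primers are used here; no stagger bases or CS1/CS2 tags are needed, because those never occur in the reference sequences.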
