Training classifier without primers information

SoilRotifer · September 18, 2020, 5:36pm

You are fine using the full-length SILVA references for classification. New SILVA 138 classifiers, and the files used to make them, are available on the Data Resources page. But if you'd like to make an amplicon-specific classifier, you can do a few things:

Extract positions from the curated SILVA alignment.
- You can try aligning the first 16 bp of the forward read and 24 bp of the reverse read (reverse complement the reverse read first) to a subset of the curated SILVA alignment, then approximate the amplicon boundaries from the alignment positions and extract that region, as I outline here. See step 7. Note: many of these steps were ported over to RESCRIPt, except for the positional alignment extraction step I'm referring you to. Hopefully, we'll have a native QIIME 2 way of doing this soon.
- This is probably a simpler approach: align a subset of your ESVs (i.e. output from DADA2 / deblur) to a subset of the curated, aligned SILVA reference database, note the start and end positions, and extract those columns from the alignment, as I outline in the link above. This assumes you have already used the trimming options in DADA2 / deblur to trim the primers off.
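To make the "note the positions and extract" step concrete, here is a minimal Python sketch. It is not the procedure from the linked tutorial, just an illustration of the idea. The function name `extract_region` and the column numbers are hypothetical, and it assumes the alignment is already loaded as a dict of equal-length strings that use `-` and `.` as gap characters.

```python
def extract_region(aligned, start, end):
    """Slice alignment columns [start, end) and strip gap characters.

    aligned: dict mapping sequence ID -> aligned sequence string
    start, end: 0-based column positions noted from aligning your ESVs
                (hypothetical values; you have to determine them yourself)
    """
    region = {}
    for seq_id, seq in aligned.items():
        # Keep only the chosen columns, then drop '-' and '.' gap symbols.
        fragment = seq[start:end].replace("-", "").replace(".", "")
        # Skip references that have no bases at all in this region.
        if fragment:
            region[seq_id] = fragment
    return region


# Toy alignment: "ref2" has no data in the selected columns and is dropped.
toy = {"ref1": "..AC-GT..", "ref2": "........."}
print(extract_region(toy, 2, 7))
```

With a real alignment you would read the aligned FASTA into the dict first and write the result back out as unaligned FASTA.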
Probably the easiest: use a popular set of V3-V4 primers (which, in all likelihood, are quite similar to the proprietary primers that were used) to extract the amplicon region:
- 341F: 5'-CCTACGGGNGGCWGCAG-3'
- 805R: 5'-GACTACHVGGGTATCTAATCC-3'
- If you want to see how closely these extracted reference sequences match up with your ESVs, you can align a subset of both the reference sequences and the ESVs to see how much longer or shorter one set is than the other, then modify accordingly.
- Caveat: there may be differences in primer bias when using the above primers to extract the sequences from the reference data set as compared to the primers actually used to sequence your data.
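
For anyone curious what primer-based extraction does under the hood, here is a rough Python sketch using only the standard library. It is a simplification and rests on a few assumptions: the function names are made up for illustration, the reference sequences use the DNA alphabet (T, not U), and primers must match exactly apart from their degenerate bases. In practice the `extract-reads` action in q2-feature-classifier handles this kind of extraction and tolerates mismatches.

```python
import re

# IUPAC degenerate bases expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement a DNA string, including IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regex pattern string."""
    return "".join(IUPAC[base] for base in primer)


def extract_amplicon(seq, fwd, rev):
    """Return the region between the primers (primers excluded), or None."""
    f = re.search(primer_regex(fwd), seq)
    if f is None:
        return None
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer match.
    r = re.search(primer_regex(reverse_complement(rev)), seq[f.end():])
    if r is None:
        return None
    return seq[f.end():f.end() + r.start()]


FWD_341F = "CCTACGGGNGGCWGCAG"
REV_805R = "GACTACHVGGGTATCTAATCC"

# Made-up toy reference: flank + 341F site + insert + 805R site + flank.
toy_ref = "AAAA" + "CCTACGGGAGGCAGCAG" + "TTTTGGGG" + "GGATTAGATACCCTGGTAGTC" + "CCCC"
print(extract_amplicon(toy_ref, FWD_341F, REV_805R))
```

References where either primer site is missing or too divergent return `None` here, which is one place the primer bias mentioned above shows up.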

I hope this helps!
Mike