Training a classifier on the V3-V4 region

arar · June 30, 2019, 9:01am

Hello,
1- I downloaded the V3-V4 classifier that @Mehrbod_Estaki trained
(the link: Available: Pre-trained classifier of V3-V4 (341F, 805R) region with gg_99).
It downloaded as a .qza file that I can't extract. I want to get the FASTA file and the taxonomy file.

I also downloaded a classifier trained on SILVA V3-V4 from this link: silva_132_99_v3v4.qza - Google Drive
I extracted it, but I still don't know where to find the .fasta file or the taxonomy file.

2- I read all the posts about training feature classifiers and the training classifier tutorial, but I still can't fully understand the difference between single-end reads and paired-end reads
(from the tutorial: Note: The --p-trunc-len parameter should only be used to trim reference sequences if query sequences are trimmed to this same length or shorter. Paired-end sequences that successfully join will typically be variable in length. Single-end reads that are not truncated at a specific length may also be variable in length. For classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed.)
**What is the meaning of training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed?** Does it mean that if my reads are, for example, 350 bp after trimming, the command will be as follows?
qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --p-trunc-len 0 \
  --p-min-length 300 \
  --p-max-length 400 \
  --o-reads ref-seqs.qza

Mehrbod_Estaki · June 30, 2019, 11:22pm

Hi @arar,
The link you're using is a bit outdated. I recently uploaded a newer version of the V3-V4 Greengenes classifier, which you should use instead, since it was trained using the most recent version of scikit-learn.

Unless you are using this outside of qiime2 I would recommend just keeping it as is. This classifier is ready to be used in qiime2 pipelines. If you are looking for the underlying fasta file to use elsewhere you can find it in the data folder after you've unzipped the artifact.
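
For anyone who wants to look inside an artifact without QIIME 2, here is a minimal sketch. It relies on a .qza file being an ordinary zip archive whose contents sit under a single top-level UUID folder, with the payload in its data subfolder. The function names and the example paths are hypothetical, chosen only for illustration. Which files show up depends on the artifact type: a reference-sequence artifact should hold a FASTA file, while a trained classifier artifact may hold a serialized scikit-learn pipeline instead.

```python
import zipfile


def list_artifact_data(qza_path):
    """Return the archive members stored under <uuid>/data/ in a .qza file."""
    with zipfile.ZipFile(qza_path) as archive:
        members = []
        for name in archive.namelist():
            parts = name.split("/")
            # Expected layout: <uuid>/data/<file>; skip directory entries.
            if len(parts) > 2 and parts[1] == "data" and not name.endswith("/"):
                members.append(name)
        return members


def extract_artifact(qza_path, out_dir):
    """Unzip the whole artifact into out_dir (same effect as a plain unzip)."""
    with zipfile.ZipFile(qza_path) as archive:
        archive.extractall(out_dir)


# Hypothetical usage (paths are placeholders):
#   print(list_artifact_data("ref-seqs.qza"))
#   extract_artifact("ref-seqs.qza", "ref-seqs-unzipped")
```

If the listing shows no FASTA file, the artifact is probably the trained classifier itself and not the reference sequences it was built from.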

As for your other requests I'm not sure exactly what you're asking but here is some info with regards to the topic.
The reference sequences (Greengenes or Silva) are made up of the complete 16S region. It is recommended that instead of training your classifier on the full 16S region, you train it only on the specific region which your primers targeted. In your case, that is the V3-V4 region. So, in the first step of preparing the linked classifier, I extracted the V3-V4 region from the full 16S references using the extract-reads action.
The primers I used were
F: CCTACGGGNGGCWGCAG,
R: GACTACHVGGGTATCTAATCC
Min length 30 and no max-length. I didn't set a max length because with paired-end reads the V3-V4 region we extract is variable in length. If we set a max length of, say, 400 here, then your actual (query) sequences, which are also paired-end, may include reads longer than 400 bp, and this makes classifying them less accurate. So it's better for your reference sequences to be at least as long as your query sequences, or to cover the same region (recommended).
If one were to use single-end reads and trim them all to 250 bp, then we know there is no length variability, so we could do the same with our classifier: extract the same region and trim the references to 250 bp as well.
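
To make the idea of extracting a region at the primer sites more concrete, here is a rough sketch in plain Python. It is not how extract-reads is implemented: it assumes exact primer matches, searches one orientation only, returns the sequence between the primers, and applies the length filters before truncation. All of those are simplifications of mine, and the function names are hypothetical.

```python
import re

# IUPAC degenerate bases expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including degenerate bases."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regex fragment."""
    return "".join(IUPAC[base] for base in primer)


def extract_region(ref, f_primer, r_primer, min_length=50, max_length=0, trunc_len=0):
    """Return the region between the primers, or None if it is filtered out.

    max_length=0 and trunc_len=0 mean "no limit" and "no truncation".
    """
    pattern = re.compile(
        primer_regex(f_primer) + "(.*?)" + primer_regex(reverse_complement(r_primer))
    )
    match = pattern.search(ref)
    if match is None:
        return None  # primers not found in this reference
    amplicon = match.group(1)
    if len(amplicon) < min_length:
        return None
    if max_length and len(amplicon) > max_length:
        return None
    if trunc_len:
        amplicon = amplicon[:trunc_len]  # only sensible for fixed-length queries
    return amplicon
```

With max_length left at 0, references of any length in the region are kept, which is the paired-end case described above. Setting trunc_len to 250 mirrors the single-end case where every query read is trimmed to 250 bp.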

You actually don't need to do this step at all because the classifier in question has already been extracted for that region and has already been trained. You can just skip to the taxonomy assignment step using them. But if you are asking for the sake of learning, then I would say just don't set a max-length and leave the min-length at something low like 50.

arar · July 5, 2019, 5:54am

Thank you very much, that is really valuable. I will try the classifier again.