training classifier on v3-v4 region

Hi @arar,
The link you're using is a bit outdated. I recently uploaded a newer version of the V3-V4 Greengenes classifier, which you should use instead, since it was trained with the most recent version of scikit-learn.

Unless you are using this outside of qiime2, I would recommend just keeping it as is. This classifier is ready to be used in qiime2 pipelines. If you are looking for the underlying fasta file to use elsewhere, you can find it in the data folder after you've unzipped the artifact.
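If you'd rather peek inside the artifact without unzipping it by hand, here is a minimal sketch. It assumes the usual artifact layout, where the .qza file is a zip archive with a single top-level folder that contains a data folder. The function name `list_artifact_data` and the file name `classifier.qza` are just placeholders I made up for illustration.

```python
import zipfile


def list_artifact_data(path):
    """Return the archive members stored under <top-level folder>/data/."""
    with zipfile.ZipFile(path) as archive:
        members = []
        for name in archive.namelist():
            parts = name.split("/")
            # Keep files only (skip directory entries) whose second path
            # component is "data", e.g. "<top-level folder>/data/<file>".
            if len(parts) > 2 and parts[1] == "data" and not name.endswith("/"):
                members.append(name)
        return members


if __name__ == "__main__":
    # "classifier.qza" is a hypothetical path; point it at your own artifact.
    for member in list_artifact_data("classifier.qza"):
        print(member)
```

Whatever it prints is what you would find in the data folder after unzipping.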

As for your other requests, I'm not sure exactly what you're asking, but here is some info on the topic.
The reference sequences (Greengenes or Silva) are made up of the complete 16S region. It is recommended that, instead of training your classifier on the full 16S region, you train it only on the specific region that your primers targeted, which in your case is the V3-V4 region. So, as the first step of preparing the linked classifier, I extracted the V3-V4 region from the full 16S references using the extract-reads action.
The primers I used were
F: CCTACGGGNGGCWGCAG,
R: GACTACHVGGGTATCTAATCC
with a min-length of 30 and no max-length. I did not set a max length because with paired-end reads the V3-V4 region we extract is variable in length. If we were to set a max length of, say, 400 here, then your actual (query) sequences, which are also paired-end, may include reads that are longer than 400 bp, and this makes classifying them less accurate. So it's better for your reference sequences to cover the same region as your reads (recommended), or to be longer.
If one were to use single-end reads and trim them all to 250 bp, then we know there is no length variability, so we could do the same with our classifier: extract the same region and trim the reference reads to 250 bp as well.
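
To make the idea concrete, here is a rough Python sketch of what primer-based extraction with a length filter looks like. This is not how extract-reads is actually implemented; it is only an illustration. It assumes exact matches to the degenerate primers (no mismatches allowed), drops the primers from the extracted region, and all function names are made up for this example.

```python
import re

# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including degenerate IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regex with one character class per base."""
    return "".join("[" + IUPAC[base] + "]" for base in primer)


def extract_region(seq, f_primer, r_primer, min_length=30, max_length=None):
    """Return the stretch between the two primers, or None if it is not found
    or falls outside the length limits. Assumption: exact primer matches only."""
    fwd = re.search(primer_regex(f_primer), seq)
    if fwd is None:
        return None
    # The reverse primer binds the opposite strand, so on the reference
    # sequence we look for its reverse complement downstream of the forward hit.
    downstream = seq[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(r_primer)), downstream)
    if rev is None:
        return None
    region = downstream[:rev.start()]
    if len(region) < min_length:
        return None
    if max_length is not None and len(region) > max_length:
        return None
    return region


F_PRIMER = "CCTACGGGNGGCWGCAG"
R_PRIMER = "GACTACHVGGGTATCTAATCC"
```

With `max_length=None` every region that is at least `min_length` long is kept, however long it is. That is the point made above: a reference region longer than a fixed cap would simply be dropped from the training set.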

You actually don't need to do this step at all, because the classifier in question was already extracted for that region and already trained. You can just skip to the taxonomy assignment step and use it there. But if you are asking for the sake of learning, then I would say just don't set a max-length and leave the min-length at something low like 50.
