I am comparing different amplicon regions (V1-V3, V3-V4, and V4-V5) on the same mock community, which contains 8 bacterial and 2 fungal species. As part of the workflow, I trained my own feature classifiers using the Silva 138 SSURef NR99 full-length sequences and the corresponding taxonomy file available on the data resources page. I extracted the appropriate regions using the primer sequences to train each classifier. The process worked for V1-V3 and V3-V4, but it does not appear to be working for the V4-V5 region: all the reads are either unassigned or assigned only to the domain level.
These are the commands I used to extract the appropriate amplicon region from the reference database and train the feature classifier. The primers correspond to positions 515-926 and I've double-checked to ensure the sequences are correct.
Can you provide more details on how your data was sequenced? Some sequencing facilities provide data in "mixed orientation", meaning a portion of your reads are flipped in the other direction. When you run the naive Bayes classifier on such reads, many of them will be poorly classified or unassigned, because they are not in the same orientation as the reference sequences. If this is the case, you'll have to correctly orient your reads prior to classification, and it is probably a good idea to do this before denoising.
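As a quick way to check for mixed orientation before denoising, something along these lines could count how many raw forward reads begin with each primer. This is only a minimal sketch: it assumes the primers have not yet been trimmed from the reads, the function names are made up for illustration, and the primer sequences shown are the commonly published 515F-Y/926R pair, which may not be the exact ones used here, so substitute your own.

```python
# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def starts_with_primer(read, primer, max_mismatches=2):
    """Return True if the read begins with the (possibly degenerate) primer."""
    if len(read) < len(primer):
        return False
    # zip() stops at the end of the primer, so only the read prefix is compared.
    mismatches = sum(
        base not in IUPAC.get(code, "")
        for base, code in zip(read.upper(), primer.upper())
    )
    return mismatches <= max_mismatches


def orientation_counts(reads, fwd_primer, rev_primer, max_mismatches=2):
    """Count reads that start with the forward primer, the reverse primer, or neither."""
    counts = {"forward": 0, "reverse": 0, "unknown": 0}
    for read in reads:
        if starts_with_primer(read, fwd_primer, max_mismatches):
            counts["forward"] += 1
        elif starts_with_primer(read, rev_primer, max_mismatches):
            counts["reverse"] += 1
        else:
            counts["unknown"] += 1
    return counts


# Assumed primer sequences (commonly published 515F-Y / 926R); replace with yours.
FWD_PRIMER = "GTGYCAGCMGCCGCGGTAA"
REV_PRIMER = "CCGYCAATTYMTTTRAGTTT"

# Toy reads for illustration only, not real data.
toy_reads = [
    "GTGCCAGCAGCCGCGGTAAACGT",   # begins with the forward primer
    "CCGTCAATTCATTTGAGTTTACGT",  # begins with the reverse primer
    "ACGTACGTACGTACGTACGTACGT",  # begins with neither
]
print(orientation_counts(toy_reads, FWD_PRIMER, REV_PRIMER))
```

If a sizeable share of the forward reads starts with the reverse primer, the run is probably in mixed orientation and the reads need re-orienting before classification.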
If that is not the case, and the taxonomic assignments for the other amplicon regions make sense, then it is likely that the V4V5 primers do not match the reference sequences very well, that is, there are too many mismatches. Using primers to extract your target region can be an issue in some cases, as we warn here. I recommend trying this approach to building your classifier.
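To get a rough feel for how well a primer matches a given reference sequence, one could slide it along the reference and record the fewest mismatches found. The sketch below is a simplified stand-in for illustration, not how QIIME 2 performs the extraction; the function names and the toy reference are hypothetical. Note that the reverse primer has to be reverse-complemented before it is searched on the reference strand.

```python
# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the degenerate IUPAC codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBVDHN", "TGCAYRSWMKVBHDN")


def reverse_complement(seq):
    """Reverse-complement a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def best_primer_hit(reference, primer):
    """Return (fewest mismatches, 0-based start) of the best primer placement.

    Returns None if the reference is shorter than the primer.
    """
    reference = reference.upper()
    primer = primer.upper()
    best = None
    for start in range(len(reference) - len(primer) + 1):
        window = reference[start:start + len(primer)]
        mismatches = sum(
            base not in IUPAC.get(code, "")
            for base, code in zip(window, primer)
        )
        if best is None or mismatches < best[0]:
            best = (mismatches, start)
    return best


# Toy reference for illustration only: 4 bases of padding, a forward primer
# site, 4 more bases of padding.
toy_reference = "TTTT" + "GTGCCAGCAGCCGCGGTAA" + "CCCC"
print(best_primer_hit(toy_reference, "GTGYCAGCMGCCGCGGTAA"))
```

Running this over a sample of the reference sequences, for the forward primer and the reverse-complemented reverse primer, would show whether a large share of references carries several mismatches at either primer site.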
Thank you for the helpful information. I will be sure to follow up with those methods. The version of QIIME2 currently running on our HPC is quite out of date and will need to be updated before I proceed.
However, while I was waiting for a software update, I recalled I had previously analyzed this data using the older SILVA 132 reference data but with a newer version of QIIME2 than the one I am currently running. This analysis was done in an identical manner, where the amplicon region was extracted using the primer sequence. These extracted reads were then used as the input to train the classifier. This was very successful, as you can see in the bar plot.
For due diligence, I took those extracted reads and the taxonomy file from the SILVA 132 release that worked for me previously, and used them to train the feature classifier using the older version of QIIME2 I am currently running. To my surprise, it did not work at all. I got the same issue, where everything was only classified to the domain level. Very strange.
I am not sure what could be causing this. It's the same samples and the same region, extracted using the same primers from the same reference database (SILVA 132). The only differences are the version of QIIME2 and slightly different denoising parameters.
Would you be willing to share with me your QZVs? You can do so here or via private DM. Based on discussions with another moderator, some thoughts:
Check, and perhaps change, the length values in your feature-classifier extract-reads command. I think --p-min-length 315 might be too long and --p-max-length 515 might be too short. I'd set these to --p-min-length 250 --p-max-length 600 to allow some breathing room.
Try running rescript evaluate-seqs on the sequences you are using to construct your reference database. This will give us an idea of the length distribution of your reference sequences, and might signal if your min / max lengths need adjusting.
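To illustrate why the length distribution matters for the min / max settings, here is a minimal sketch that reads sequence lengths from FASTA-formatted text and reports what fraction would survive a given length window. It is not a replacement for rescript evaluate-seqs; the helper names are hypothetical, the toy records are made up, and treating both bounds as inclusive is an assumption.

```python
def fasta_lengths(lines):
    """Yield the length of each sequence in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def fraction_kept(lengths, min_length, max_length):
    """Fraction of sequences whose length lies within [min_length, max_length]."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    kept = sum(min_length <= n <= max_length for n in lengths)
    return kept / len(lengths)


# Toy FASTA records for illustration only; sequences may span several lines.
toy_fasta = [
    ">seq_a", "ACGT", "AC",
    ">seq_b", "ACGTACGTAC",
    ">seq_c", "A",
]
toy_lengths = list(fasta_lengths(toy_fasta))
print(toy_lengths)
print(fraction_kept(toy_lengths, 5, 10))
```

With real data, the lines could come from an open file handle on the FASTA file produced by exporting the reference sequences artifact, and the two windows to compare would be 315 to 515 and 250 to 600. If the narrower window discards a large fraction of the V4-V5 amplicons, that would support loosening the limits.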