Hello @Imindias,
If you know that your sequences are 16S, that pre-trained classifier should be good enough. Training a classifier on reference sequences trimmed to the exact primer region only boosts accuracy slightly for 16S, so the 30% of sequences that are not classifying are probably not caused by the pre-trained classifier. It is much more likely that these are non-target DNA (e.g., contaminants).
Are your reads in mixed orientations (i.e., both forward and reverse reads on the 16S)? That can cause trouble for classify-sklearn (which must then be trained on mixed-orientation reads) but not for classify-consensus-vsearch or blast. So that could also explain the 30% unclassified. (More details below.)
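One rough way to check for mixed orientations before classifying is to compare each read, and its reverse complement, against a reference sequence that is known to be in the forward orientation. The following is a minimal Python sketch of that idea, not how QIIME 2 or vsearch does it. The function names, the k-mer size of 8, and the example sequence are all assumptions chosen for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def kmers(seq, k=8):
    """Return the set of all k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(read, reference, k=8):
    """Guess whether a read matches the reference as-is or reverse-complemented.

    Counts k-mers shared with the reference in each orientation and
    returns "forward", "reverse", or "unknown" on a tie.
    """
    ref_kmers = kmers(reference, k)
    fwd_hits = len(kmers(read, k) & ref_kmers)
    rev_hits = len(kmers(reverse_complement(read), k) & ref_kmers)
    if fwd_hits > rev_hits:
        return "forward"
    if rev_hits > fwd_hits:
        return "reverse"
    return "unknown"


# Hypothetical example: a short stretch of 16S-like sequence as the reference.
reference = "AGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACG"
read_a = reference[10:40]              # a read in the reference orientation
read_b = reverse_complement(read_a)    # the same read, flipped

for name, read in [("read_a", read_a), ("read_b", read_b)]:
    print(name, guess_orientation(read, reference))
```

If a sizeable share of reads come back as "reverse", the dataset is likely mixed-orientation, which fits the symptoms described above.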
See this thread. That user is also using Ion Torrent data, and it sounds like they ran into issues similar to yours, in particular with mixed-orientation reads.
Based on that thread, my advice is to use the vsearch classifier with greengenes 99% OTUs. You can use the --p-threads parameter to run the search on multiple threads, speeding up this analysis, if your system can support that. Unfortunately, this approach can be time-consuming. In my experience it should take far less than 36 hours to complete on a normal-sized run, but perhaps you have a very large dataset? Aligning against full-length 16S sequences will also dramatically increase runtime, since alignment is computationally expensive. So trimming the reference reads to your primers should seriously speed up this analysis.
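To illustrate what trimming the reference reads to your primers means, here is a minimal Python sketch that cuts a reference sequence down to the region between a forward and a reverse primer. It is a simplified illustration only: it uses exact string matching and ignores degenerate bases and mismatches. In QIIME 2 itself, the extract-reads action of the feature-classifier plugin handles this step. The primer and reference sequences below are made-up placeholders, not the Ion Torrent kit primers.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def extract_amplicon(reference, fwd_primer, rev_primer):
    """Return the reference region between the two primers, or None.

    The forward primer is searched as given; the reverse primer is
    searched as its reverse complement, downstream of the forward primer.
    Primer sequences themselves are excluded from the result.
    """
    ref = reference.upper()
    fwd = fwd_primer.upper()
    rev_rc = reverse_complement(rev_primer.upper())

    start = ref.find(fwd)
    if start == -1:
        return None  # forward primer site not found
    start += len(fwd)

    end = ref.find(rev_rc, start)
    if end == -1:
        return None  # reverse primer site not found downstream
    return ref[start:end]


# Placeholder sequences for illustration only (not real primers).
fwd_primer = "ACGTAC"
rev_primer = "CATGCC"
reference = "TTTT" + "ACGTAC" + "GGGTTTCCCAAA" + "GGCATG" + "TTTT"

print(extract_amplicon(reference, fwd_primer, rev_primer))
```

The trimmed reference is much shorter than the full-length sequence, which is why searching against it is faster.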
I do not know what primers the Ion Torrent 16S kit uses. @MMC_northS do you know? It sounds like you may be using the same kit for Ion Torrent sequencing.
I hope that helps!