I am about to train my own classifier for the primer set 341F (CCTACGGGNGGCWGCAG) and 805R (GACTACHVGGGTATCTAATCC), and I have two questions. First, which database do you recommend, SILVA or Greengenes? Second, I used different values to truncate my forward and reverse reads during denoising. According to the tutorial, “For classification of paired-end reads and untrimmed single-end reads, we recommend training a classifier on sequences that have been extracted at the appropriate primer sites, but are not trimmed”. I am a bit confused about what “…but are not trimmed” means. Shouldn’t I use the trimmed rep-seqs for this?
SILVA vs. Greengenes: it depends, and sometimes it depends on the reviewer. Training a classifier with either database is straightforward, and doing both is not much extra work, so I would recommend putting both into your pipeline. SILVA is updated regularly and is generally more widely accepted; Greengenes is more conservative and hasn’t been updated since … 2013(?), I am not sure.
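Where to truncate depends on the quality at the ends of your forward and reverse reads. As for the tutorial quote, I believe it is saying NOT to trim the reference sequences used to train the classifier: extract them at the primer sites, but do not cut them down to a fixed length.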
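To make "extracted at the primer sites, but not trimmed" a bit more concrete, here is a minimal Python sketch of the idea. It is only an illustration, not how the feature-classifier plugin does it: it requires exact matches to the degenerate primers (a real tool would presumably tolerate mismatches), and `extract_amplicon` and `toy_reference` are made-up names for this example. The point is that the region between the primers is returned at whatever length it has, with no fixed-length cut applied.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

FWD_PRIMER = "CCTACGGGNGGCWGCAG"      # 341F, from the question
REV_PRIMER = "GACTACHVGGGTATCTAATCC"  # 805R, from the question


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regex (exact matching only)."""
    return "".join(IUPAC[base] for base in primer)


def extract_amplicon(reference, fwd=FWD_PRIMER, rev=REV_PRIMER):
    """Return the region between the primer sites, or None if not found.

    The result is NOT truncated to a fixed length: its length is whatever
    the distance between the two primer sites is in this reference.
    """
    pattern = re.compile(
        primer_regex(fwd) + "(.*?)" + primer_regex(reverse_complement(rev))
    )
    match = pattern.search(reference)
    return match.group(1) if match else None


# Hypothetical toy reference: flank + 341F site + insert + 805R site + flank.
toy_reference = (
    "AAAA" + "CCTACGGGAGGCAGCAG" + "TTTTGGGGCCCC"
    + "GGATTAGATACCCTGGTAGTC" + "AAAA"
)
amplicon = extract_amplicon(toy_reference)
```

Trimming would be an extra step on top of this, for example `amplicon[:n]` for some fixed `n`, and that is the step the tutorial quote seems to advise against for paired-end data, where the merged read length varies.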
Following up on my previous comment about using the pretrained classifier silva-138-99: I managed to get rid of the sklearn version error and the script ran for 5 h; however, at the end it failed with an Errno 17 (file exists) error: Plugin error from feature-classifier:
You may have to set up your .bashrc / .bash_profile with your TMPDIR path? Different HPC systems allow / don't allow certain setups. So you should check with your system admins about how to dynamically set up your temporary paths.
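A quick way to see which temporary directory a Python process would actually pick is sketched below. This assumes the tool resolves its temp directory through Python's standard `tempfile` module, which I have not verified for this particular step; `resolve_tmpdir` is a helper name invented for this example. `tempfile.gettempdir()` looks at `TMPDIR`, then `TEMP`, then `TMP`, then platform defaults, and takes the first directory it can actually create files in, so a `TMPDIR` that does not exist or is not writable on the compute node may be skipped without any warning.

```python
import os
import tempfile


def resolve_tmpdir(tmpdir_value):
    """Return the directory tempfile would use if TMPDIR were tmpdir_value.

    The previous TMPDIR setting is restored afterwards, so calling this
    does not change the environment of the current process.
    """
    old_value = os.environ.get("TMPDIR")
    os.environ["TMPDIR"] = tmpdir_value
    tempfile.tempdir = None  # drop the cached answer so TMPDIR is re-read
    try:
        return tempfile.gettempdir()
    finally:
        if old_value is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = old_value
        tempfile.tempdir = None


# What the current environment resolves to (run this inside the job itself).
print("TMPDIR is set to:", os.environ.get("TMPDIR"))
print("tempfile will use:", tempfile.gettempdir())
```

If the two printed values differ when this runs inside the actual job, either the variable is not reaching the job's environment or the directory is not usable from that node. Both are things the system admins should be able to confirm.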
Thanks. I thought that export TMPDIR=/home/farhad1990/faststorage/data/temp/ would direct the analysis to my own temp directory, but now I can see that for some reason it is still using the default one. I will contact our system admin. Have a good weekend!