Classifier training parameters

Crap. I have everything backwards... Three things:

  1. @BenKaehler and @Nicholas_Bokulich should have received a direct message from me with a link to the reference sequences and taxonomy files. Hopefully this helps with troubleshooting.

  2. I was wrong about which of my classifier tests actually finished:

The classifier script that finished was for the trimmed references, not the untrimmed ones. It is this trimmed dataset that is vastly smaller: just 10,073 sequences as opposed to 2 million. Both the input to classifier training and the finished classifier that I thought had too few reads (~10k rather than 2 million) came from this trimmed dataset. So, as you'd expect, my trained classifier and input read set were both just 10k sequences. No technical glitches, just a dumb user.

Yet I'm unclear why so few sequences made the cutoff - the parameters I used were:

qiime feature-classifier extract-reads \
  --i-sequences "$REFSEQ" \
  --p-f-primer GGTCAACAAATCATAAAGATATTGG \
  --p-r-primer GGWACTAATCAATTTCCAAATCC \
  --p-trunc-len 181 \
  --p-min-length 160 \
  --p-max-length 220 \
  --o-reads ref_seqs_all_trim.qza

I expect my amplicons to all be about 180 bp long, so I chose those parameters to target the correct fragment sizes. Perhaps I'm making a mistake with how the forward and reverse primers are supposed to be supplied. Are the f-primer and r-primer both supposed to be given in the 5' --> 3' orientation? Or is the r-primer supposed to be the reverse complement, as with most read trimming programs? It's not clear in the documentation, but perhaps that's just one of those things you're supposed to know.
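To make the two candidate orientations concrete, here is a minimal sketch that prints the reverse complement of the reverse primer, so both forms can be tried side by side. The `revcomp` helper is a hypothetical name of my own, not part of QIIME 2, and this does not settle which form extract-reads expects. It assumes standard `awk` and `tr` are available.

```shell
# revcomp: hypothetical helper (not a QIIME 2 command).
# Prints the reverse complement of a primer, including IUPAC ambiguity codes.
revcomp() {
  printf '%s\n' "$1" |
    # Reverse the string one character at a time.
    awk '{ for (i = length($0); i > 0; i--) printf "%s", substr($0, i, 1); print "" }' |
    # Complement each base; W, S and N are their own complements and pass through.
    tr 'ACGTRYKMBVDHacgtrykmbvdh' 'TGCAYRMKVBHDtgcayrmkvbhd'
}

# Reverse primer exactly as passed to --p-r-primer above.
revcomp GGWACTAATCAATTTCCAAATCC
# prints GGATTTGGAAATTGATTAGTWCC
```

Running extract-reads once with each form of the r-primer and comparing how many sequences survive would show which orientation matches the references.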

  3. Am I correct that neither the extract-reads nor the naive-classifier function is multithreaded?

Thanks!
