Training the Classifier-ITS data

Hello,

I have a quick question regarding training the classifier. I am working with ITS data, and the note in the tutorial states that "training on the UNITE reference database does NOT benefit from extracting/trimming reads to primer sites, [as such, it is recommended to train] UNITE classifiers on the full reference sequences".

My questions are the following:

  1. How can I be sure that my training worked, or what confidence level should I attain before I continue running the taxonomy analysis on my data?

  2. Since I am not using the trimmed data, would my code then omit the --p-trunc-len parameter?

  3. I am working with paired-end, demultiplexed Illumina files, so the 5' ends have already been removed. What should I enter for --p-f-primer and --p-r-primer? Would it be the complete sequence without the pad?

THANKS!

If the taxonomy classifications you receive look reasonable, it worked. If something doesn’t look right, let’s discuss then.

Just use the default parameter settings. These were benchmarked on fungal mock community datasets, and they work well for pretty much all amplicon sequences. You could check out the q2-feature-classifier paper to read about more specific settings for fungi, but the defaults should “just work”.

You can just skip extract-reads altogether; that’s what the note above is about. In our experience, extracting, e.g., the ITS1 domain does not measurably increase accuracy with the method used in classify-sklearn.

I hope that helps!

Hi Nicholas,

Just to verify that I am understanding correctly: when following the tutorial, I would completely skip the “Extract Reference Reads” portion and go directly to "Train the Classifier", correct?

If so, the "Train the Classifier" step requires a file named _ref-seqs.qza_, which is created during the "Extract reference reads" step. If I do not need to run extract-reads, what file do I use for training?

Update: I ran the following code and was able to "train the classifier", but per the previous comments, I am unsure if I did it correctly. If you could please let me know whether it was done properly, I would really appreciate it. Note: I did get a lot of unknowns. Is this common?

Thanks

Correct.

You would use the untrimmed reference sequences, in other words the sequences that would be input to extract-reads if you wanted to go that route.

The idea is that the classifier is being trained to learn the taxonomy of some reference sequences — any kind of FeatureData[Sequence] artifact that has a matching FeatureData[Taxonomy] artifact. These sequences can be trimmed — by extract-reads — or you can use the untrimmed sequences in the same way. Does that make sense?
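To make that concrete, here is a minimal sketch of assembling the training command with the untrimmed reference sequences, so extract-reads never appears. The helper name and the three file names are hypothetical placeholders for your own imported UNITE artifacts; the script only builds and prints the command, it does not run QIIME 2.

```python
import shlex


def build_fit_classifier_command(ref_seqs, ref_taxonomy, classifier_out):
    """Return the argument list for training a naive Bayes classifier.

    ref_seqs is the untrimmed FeatureData[Sequence] artifact, i.e. the file
    that would otherwise have been the input to extract-reads.
    ref_taxonomy is the matching FeatureData[Taxonomy] artifact.
    """
    return [
        "qiime", "feature-classifier", "fit-classifier-naive-bayes",
        "--i-reference-reads", ref_seqs,
        "--i-reference-taxonomy", ref_taxonomy,
        "--o-classifier", classifier_out,
    ]


# Hypothetical file names; substitute your own artifacts.
command = build_fit_classifier_command(
    "unite-ref-seqs.qza", "unite-ref-taxonomy.qza", "unite-classifier.qza"
)
print(shlex.join(command))
```

In an environment where QIIME 2 is installed, the same list could be passed to `subprocess.run(command, check=True)`, or you can simply paste the printed line into your terminal.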

The overview tutorial might help clarify this process. The flowchart gives an example using extract-reads but you can imagine the same process bypassing that step.

If you get “satisfying” results, it worked, and you probably did things correctly. Some unknowns are normal, but many unknowns can indicate a larger problem (search around this forum for good explanations of what can go wrong; usually it is using the wrong classifier). You can share a qiime taxa barplot QZV here if you want some reassurance, but it sounds like you probably have just some unknowns, which is normal. These are probably non-target DNA, e.g., plant DNA that is amplified by ITS primers. You can filter out all unknowns and proceed.
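If you want a rough number for "some" versus "many", one option is to export the taxonomy artifact to a TSV and count how many features carry the unassigned label. The sketch below assumes the exported file has the columns `Feature ID`, `Taxon`, and `Confidence`, and that unclassified features are labelled `Unassigned`; check your own file, since both are assumptions here. The rows are made up for illustration, and the function name is hypothetical.

```python
import csv
import io


def unassigned_fraction(taxonomy_tsv, label="Unassigned"):
    """Return the fraction of features whose Taxon equals `label`.

    taxonomy_tsv is any open text file (or file-like object) holding a
    tab-separated taxonomy table with a 'Taxon' column.
    """
    reader = csv.DictReader(taxonomy_tsv, delimiter="\t")
    total = 0
    unassigned = 0
    for row in reader:
        total += 1
        if row["Taxon"].strip() == label:
            unassigned += 1
    # Avoid dividing by zero when the table has no rows.
    return unassigned / total if total else 0.0


# Made-up rows for illustration only; replace with open("taxonomy.tsv").
example = io.StringIO(
    "Feature ID\tTaxon\tConfidence\n"
    "feat1\tk__Fungi;p__Ascomycota\t0.99\n"
    "feat2\tUnassigned\t0.41\n"
    "feat3\tk__Fungi;p__Basidiomycota\t0.95\n"
    "feat4\tUnassigned\t0.38\n"
)
fraction = unassigned_fraction(example)
print(f"{fraction:.0%} of features are {'Unassigned'}")
```

Note that this counts features, not reads, so a barplot can still look quite different if the unassigned features are rare or abundant. For the filtering itself, `qiime taxa filter-table` with `--p-exclude` is one way to drop them.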

I hope that helps!

So what you are saying is, skip this step

image

But use the 85_otus.qza file for the following step, to train the classifier, correct? Of course, I would use my own files, but per this tutorial, these are the files I am interchanging, correct? So my code would look as follows:

image

Lastly, the "Test the classifier" portion is the same as the Taxonomy step in the Moving Pictures tutorial, correct? I mean the commands are the same, so I just want to verify.

correct

correct

correct

yes — the sequence classification stage would look the same no matter how the classifier is trained. “Test the classifier” is probably a misleading name there.
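As a rough illustration, the classification command takes the trained classifier and your representative sequences, whichever way the classifier was trained. The helper name and file names below are hypothetical placeholders; the script only builds and prints the command.

```python
import shlex


def build_classify_command(classifier, rep_seqs, taxonomy_out):
    """Return the argument list for classifying representative sequences."""
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", rep_seqs,
        "--o-classification", taxonomy_out,
    ]


# Hypothetical file names; substitute your own artifacts.
classify_command = build_classify_command(
    "unite-classifier.qza", "rep-seqs.qza", "taxonomy.qza"
)
print(shlex.join(classify_command))
```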

I hope that helps! Let me know if you are still having concerns after following those steps.
