Should I truncate all sequences to the same length for ITS taxonomy classification?

Dear QIIME team,

First, thanks for all your unfailing support!

Before assigning taxonomy to my ITS data, should I truncate all my sequences to the same length before using the classifier? I heard that people using mothur need to drop the shorter sequences and truncate all sequences to the same length, so that all sequences have the same starting point and are more or less aligned. Overall, my question is: is it OK to use sequences that vary in length (200~450 bp) for taxonomy assignment, and is there any bias associated with this? Could you please share anything I should be aware of when dealing with ITS data?


Hi @hongwei2017,
No, you do not need to truncate the sequences prior to taxonomic classification in QIIME 2. Just make sure that the reference database you are using encompasses the full length of the ITS sequences that you are attempting to classify.
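
To compare the length range of your sequences against the reference, a quick summary of lengths can help. The following is only a minimal sketch in plain Python, not part of QIIME 2: the function names `fasta_lengths` and `length_range` are hypothetical, and it assumes the sequences have been exported to FASTA text. The toy records stand in for real data.

```python
from typing import Dict, Iterable, Tuple


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence id: length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first word of the header as the id; fall back to a
            # positional name if the header is empty.
            parts = line[1:].split()
            current = parts[0] if parts else f"record_{len(lengths)}"
            lengths[current] = 0
        elif current is not None:
            # Sequence data may be wrapped over several lines.
            lengths[current] += len(line)
    return lengths


def length_range(lengths: Dict[str, int]) -> Tuple[int, int]:
    """Return (shortest, longest) sequence length."""
    values = list(lengths.values())
    return min(values), max(values)


# Toy records (assumption: stand-ins for exported query or reference reads).
example = [">seq1", "ACGTACGT", "ACGT", ">seq2 some description", "ACGTA"]
print(length_range(fasta_lengths(example)))  # (5, 12)
```

Running the same summary on the query sequences and on the reference sequences gives a rough sense of whether the reference spans the region being classified. It does not check that the two cover the same part of the ITS.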

The issue with truncating prior to alignment in mothur sounds more like an issue with conforming length prior to OTU picking, but I am not certain based on your description. I would be interested to see the original articles recommending this if you could post links here.



Hi @Nicholas_Bokulich,
Thanks for this information. I was not confident in the taxonomy results I got. I saw one very dominant group in my feature table, Fungi_Ascomycota, with a proportion of ~50%. It was classified only to the phylum level and no deeper, which is one of my concerns. Is this normal? I don’t know how to loosen the parameters to get deeper classifications in this case.


Hi @hongwei2017,
Check out this preprint, which describes optimal classifier settings for ITS classification with q2-feature-classifier (focus on the results for “naive bayes” if you are using the classify-sklearn method).

Deeper classification is not always better, depending on what your goals are. If your samples contain poorly resolved or unidentified species, then shallow classifications may be the most accurate result, and deeper classifications could be false positives (e.g., overclassification). If false positives are not a major concern (e.g., you would rather erroneously classify an unknown sequence to a near neighbor if the true class is unknown than receive an ambiguous shallow assignment), check out the “high recall” settings described in that preprint — and make sure you report this in your results to make it clear that you are using settings prone to high false-positive errors.
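
To illustrate the trade-off described above, here is a toy sketch of how a confidence threshold can limit classification depth. It is not the q2-feature-classifier implementation: the function `truncate_taxonomy` is hypothetical, and the per-rank confidence values are invented for the example. In practice the threshold would correspond to the confidence setting of classify-sklearn (`--p-confidence` on the command line).

```python
from typing import List, Tuple


def truncate_taxonomy(ranks: List[Tuple[str, float]], threshold: float) -> str:
    """Keep ranks from the top down until confidence drops below threshold."""
    kept = []
    for name, confidence in ranks:
        if confidence < threshold:
            break  # everything deeper is discarded as too uncertain
        kept.append(name)
    return "; ".join(kept) if kept else "Unassigned"


# Invented confidences for one sequence (assumption, for illustration only).
example = [
    ("k__Fungi", 0.99),
    ("p__Ascomycota", 0.95),
    ("c__Sordariomycetes", 0.62),
    ("o__Hypocreales", 0.41),
]

# A stricter threshold stops at phylum; a looser one reports class as well.
print(truncate_taxonomy(example, 0.7))  # k__Fungi; p__Ascomycota
print(truncate_taxonomy(example, 0.5))  # k__Fungi; p__Ascomycota; c__Sordariomycetes
```

Lowering the threshold yields deeper labels, but the extra ranks are exactly the ones the classifier was least sure about, which is where overclassification comes from.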

If you are using the optimal ITS settings described in that preprint and still get very shallow classifications, I would stick with them and focus on sequence variants instead of taxonomic groups for differentiating sample types.

Good luck!
