Can species/genus/family boundary values be modified for analysis of functional marker genes?

SandraQ2 · July 22, 2020, 1:54pm

Hello QIIME team,

I hope I am posting under a suitable category. I am analyzing a functional marker gene for which I have generated a custom database encompassing seven taxonomic ranks (the same as for 16S rRNA genes). I used dada2 for quality filtering and joining of Illumina 2x 250 bp PE reads. Taxon assignment (via classify-consensus-vsearch) in QIIME2 runs smoothly. However, my output relative abundance table does not show any taxa such as "Genus name, other". My sequences are either classified at the species level or, if not, they go straight into the "Unassigned" bin. I assume this may be due to the higher discriminating power of a functional marker compared to, e.g., 16S rRNA genes. Is it possible to set thresholds different from the defaults for species-, genus-, family-level etc. assignment, so that fewer sequences end up unassigned and it becomes clearer at which rank those particular sequences diverge from the reference sequences?

Thank you very much for your reply.
Sandra

Nicholas_Bokulich · July 22, 2020, 2:59pm

Hi @SandraQ2,
Indeed, this sounds like an issue with adapting the classify-consensus-vsearch classifier to the functional gene you are working with. I recommend modifying some of the parameters to see how this changes your results; this is more or less the answer to the question posed in the title of your topic. If you have a mock community or other test data to work with, it would be ideal for guiding your selections.

Great! Probably due to the better resolution of your marker gene

Could be a parameter issue, e.g., see the perc-identity and query-cov parameters.

Or could be a database issue, e.g., if the species in the database do not encompass all species you are detecting.

Or it could be an issue with non-target DNA! I recommend reading some of the archival topics on this forum for more details on troubleshooting unclassified features... see Frequent Questions and "Best of the QIIME 2 Forum"

Please let us know what you find!

SandraQ2 · August 17, 2020, 1:46am

Hello @Nicholas_Bokulich,

Thank you very much for your reply and suggestions. I went back and modified the parameters you indicated above; in particular, I tested various perc-identity thresholds from 90% up to 97%. I do see an increasing number of unassigned sequences when increasing perc-identity, but I still do not observe any sequences classified only to the genus, family, or order level. I also analysed the unassigned sequences in more depth. I found that about 50% of the unassigned representative sequences are non-target sequences, as you suggested. The other half appears to stem from the target gene, and they are similar enough to be placed in a phylogenetic tree of the target gene. It would be great if they could be taxonomically assigned to "genus,other" or "family,other", so that it is easier to follow up on potentially novel sequence types. In my current dataset, though, the total number of representative sequences that represent novel sequence types is very low.

Thank you very much again and kind regards,
Sandra

Nicholas_Bokulich · August 21, 2020, 8:53am

This must be a characteristic of the functional gene that you are targeting. Either individual species have a very high degree of differentiation, or else the database is very incomplete.

You could drop the perc-identity threshold much lower (80%?) and see what happens. Underclassifications (i.e., assignments that stop at genus, family, or a higher rank) will only occur if a sequence aligns to more than one reference sequence and those references have distinct taxonomies...
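To illustrate why an assignment only stops above species level when several hits disagree, here is a minimal Python sketch of consensus taxonomy assignment. It is a simplified illustration and not the actual QIIME 2 implementation: the function name `consensus_taxonomy`, the example lineages, and the 0.51 consensus fraction are assumptions chosen for the example.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage supported by at least `min_consensus` of the hits.

    hit_taxonomies: semicolon-delimited lineage strings, one per accepted hit.
    (Hypothetical helper for illustration; not part of QIIME 2.)
    """
    if not hit_taxonomies:
        return "Unassigned"
    lineages = [[r.strip() for r in t.split(";")] for t in hit_taxonomies]
    total = len(lineages)
    consensus = []
    candidates = lineages
    while True:
        depth = len(consensus)
        # Count the labels at the next rank among hits that agree so far.
        labels = Counter(l[depth] for l in candidates if len(l) > depth)
        if not labels:
            break
        label, count = labels.most_common(1)[0]
        # Stop descending once the hits no longer agree strongly enough.
        if count / total < min_consensus:
            break
        consensus.append(label)
        candidates = [l for l in candidates if len(l) > depth and l[depth] == label]
    return ";".join(consensus) if consensus else "Unassigned"


# One hit: the full species-level lineage is kept.
one_hit = consensus_taxonomy(["k__A;g__X;s__x1"])
# Two hits from different species of the same genus: truncated to genus.
two_hits = consensus_taxonomy(["k__A;g__X;s__x1", "k__A;g__X;s__x2"])
# No hits at all: unassigned.
no_hits = consensus_taxonomy([])
```

With a single accepted hit there is nothing to disagree with, so the result is either a species-level label or "Unassigned", which matches the pattern described in this thread.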

Alternatively, maybe what you want to do is define a dynamic similarity threshold to determine whether it is a species match, genus, family, etc. This would be rather laborious but you could do multiple passes with classify-consensus-vsearch with different thresholds, something like this:

classify all sequences at perc-identity X (where X = the threshold you want to consider for a species match) --> output A
separate the unclassified sequences in output A from the classified ones --> unclassified A
classify unclassified A at perc-identity Y (where Y = the threshold you want to consider for a genus match) --> output B
separate the unclassified sequences in output B from the classified ones --> unclassified B
classify unclassified B at perc-identity Z (where Z = the threshold you want to consider for a family match) --> output C
et cetera
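
The multi-pass procedure above could be sketched in Python roughly as follows. This is only an outline under stated assumptions: `classify` stands in for one run of classify-consensus-vsearch at a given perc-identity, and `toy_classify`, `BEST_HIT`, `PASSES`, and the threshold values are hypothetical placeholders, not QIIME 2 functions or recommended settings. Truncating each label to the rank of the pass is also an assumption about how the results would be merged.

```python
def truncate(taxonomy, depth):
    """Keep only the first `depth` ranks of a semicolon-delimited lineage."""
    return ";".join(taxonomy.split(";")[:depth])


def multipass_classify(seq_ids, classify, passes):
    """Classify in several passes, strictest threshold first.

    classify: callable(seq_ids, perc_identity) -> {seq_id: taxonomy string};
              a placeholder for one classifier run (assumption).
    passes:   list of (depth, perc_identity) pairs, e.g. species, genus, family.
    """
    assignments = {}
    remaining = list(seq_ids)
    for depth, perc_identity in passes:
        if not remaining:
            break
        result = classify(remaining, perc_identity)
        still_unassigned = []
        for sid in remaining:
            tax = result.get(sid, "Unassigned")
            if tax == "Unassigned":
                # Carry this sequence over to the next, more lenient pass.
                still_unassigned.append(sid)
            else:
                # Trust the label only down to the rank of this pass.
                assignments[sid] = truncate(tax, depth)
        remaining = still_unassigned
    for sid in remaining:
        assignments[sid] = "Unassigned"
    return assignments


# Toy stand-in for the classifier: best-hit identity and lineage per sequence.
BEST_HIT = {
    "seq1": (0.99, "k__A;f__F;g__G;s__s1"),
    "seq2": (0.92, "k__A;f__F;g__G;s__s2"),
    "seq3": (0.85, "k__A;f__F;g__H;s__s3"),
    "seq4": (0.60, "k__A;f__F;g__H;s__s4"),
}


def toy_classify(seq_ids, perc_identity):
    return {
        sid: BEST_HIT[sid][1] if BEST_HIT[sid][0] >= perc_identity else "Unassigned"
        for sid in seq_ids
    }


# (depth, threshold): species at X, genus at Y, family at Z (values are made up).
PASSES = [(4, 0.97), (3, 0.90), (2, 0.80)]
merged = multipass_classify(BEST_HIT, toy_classify, PASSES)
```

In a real analysis each pass would be a separate classifier run followed by filtering the representative sequences down to the unassigned ones, and the thresholds X, Y, and Z would ideally be calibrated against a mock community or reference sequences of known relatedness.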

Let me know if that works!