Dealing with numbers at genus level when using collapse taxa

Hi QIIME community,

I am looking to use the taxa collapse plugin to collapse my feature table, but I have run into an issue with extra data at the genus level in my taxonomy strings. For example, I see things like:
D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Nostocales;D_4__Nostocaceae;D_5__Aphanizomenon MDT14a
D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Nostocales;D_4__Nostocaceae;D_5__Aphanizomenon NIES81

The genus is Aphanizomenon; however, there are extra strain IDs at the end, so these two Aphanizomenon features will not collapse into one genus. Is there any way to get around this without having to modify the whole taxonomy table manually? I have around 10,000 features. These were identified with a classifier trained on SILVA.

Thanks in advance for the help!

Hey @jjankowiak,

Unfortunately, no. As written, that taxonomy scheme declares them to be two different genera, so collapsing at the genus level will keep them separate.

I wonder if something went slightly wrong upstream of the classifier. It seems strange to me that there would be whitespace in the taxonomy label, so I wonder if the taxonomy used was mapping taxonomy strings to OTUs (rather than vice-versa).

Could you provide your reference FeatureData[Taxonomy], i.e. the one used to train the classifier?

I think that's just a quirk with SILVA — lots of whitespace in the taxonomy labels, and lots of taxa with strain IDs in the taxonomy label.

@jjankowiak SILVA is better in other ways (e.g., it is updated more recently and more frequently), but this is one reason why I prefer Greengenes in many cases: the taxonomy labels are more uniform.

The best course of action would be to modify the taxonomy labels before training the classifier, e.g., to remove strain designations.
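
As a rough illustration of that kind of cleanup, here is a minimal Python sketch. It is not part of QIIME 2, and the function names are made up for this example. It assumes the reference taxonomy is a tab-separated text file (feature ID, then taxonomy string), that the genus is the rank prefixed with D_5__ as in the strings above, and that everything after the first whitespace in the genus label is a strain designation. That last assumption will also truncate multi-word labels such as "Candidatus ..." names, so check the output before retraining.

```python
def strip_strain(taxon, genus_prefix="D_5__"):
    """Return the taxonomy string with the genus label cut at its first whitespace.

    genus_prefix is an assumption based on the D_5__ labels shown in this thread.
    """
    cleaned = []
    for rank in taxon.strip().split(";"):
        rank = rank.strip()
        if rank.startswith(genus_prefix):
            # Keep only the first whitespace-delimited token of the genus label.
            tokens = rank[len(genus_prefix):].split()
            rank = genus_prefix + (tokens[0] if tokens else "")
        cleaned.append(rank)
    return ";".join(cleaned)


def clean_taxonomy_lines(lines):
    """Yield 'feature-id<TAB>taxon' lines with strain designations removed."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        feature_id, taxon = line.split("\t", 1)
        yield feature_id + "\t" + strip_strain(taxon)


# Hypothetical feature IDs paired with the two strings from the question.
examples = [
    "ref1\tD_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;"
    "D_3__Nostocales;D_4__Nostocaceae;D_5__Aphanizomenon MDT14a",
    "ref2\tD_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;"
    "D_3__Nostocales;D_4__Nostocaceae;D_5__Aphanizomenon NIES81",
]
for row in clean_taxonomy_lines(examples):
    print(row)
```

After this step both example rows carry the same genus label, so they would fall into a single group when collapsed at that level. The cleaned file would still need to be re-imported and the classifier retrained on it.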

I hope that helps!
