important features heatmap - taxonomy

LenaLapidot · July 16, 2019, 4:25pm

Hello,
Thank you for the great new release that includes the classifier!
I got the results for the important features heatmap. However, heatmap.qzv shows the full sequences instead of the taxonomy of the features. Is it possible to show the actual taxa names in this figure?

Thank you,
Best,
Lena

Mehrbod_Estaki · July 16, 2019, 6:56pm

Hi @LenaLapidot,
If you use the taxa collapse plugin to collapse your feature table down to whatever taxonomic level you want then those names should appear in your heatmap.
Keep us posted!

Nicholas_Bokulich · July 16, 2019, 7:08pm

Thanks for reminding me, @LenaLapidot. This is something I've been meaning to do this release.

I just made the changes to allow this, but my gut feeling is it will not make it into the 2019.7 release (out at the end of this month). Keep an eye on the changelog that comes out with that release, and/or this pull request to keep track of this feature.

In the meantime, here is a convoluted workaround (no guarantees it will work "as is"!):

1. Export your taxonomy classifications.
2. Concatenate your feature IDs and taxonomy IDs and add this as an additional column to your taxonomy.tsv file. Maybe something like this would work (or maybe not!): `tr '\t' ': ' < taxonomy.tsv | paste taxonomy.tsv - > new-taxonomy.tsv`
3. Re-annotate your feature IDs. Use `qiime feature-table group --i-table table.qza --p-axis feature --m-metadata-column 'feature-id: Taxon' --m-metadata-file new-taxonomy.tsv --o-grouped-table table-with-new-feature-ids.qza`
4. Use that new table to make your heatmap.

Again, no guarantees this will work as-is! The goal is to get a taxonomy text file that looks like this:

feature-id	Taxon	feature-id: Taxon
feature1	Bacteria	feature1: Bacteria
feature2	Bacteria;Listeria	feature2: Bacteria;Listeria
et cetera

Then whatever is in the third column (which can be named whatever you like!) will be used to relabel your feature IDs. As long as it is a unique value for each feature, you will keep the unique ASVs and just give them new names. So you can also try making this file by hand (e.g., in Excel) rather than following the nasty hack I've ad libbed above.

Collapsing with taxa collapse, as @Mehrbod_Estaki suggested, would work but is probably not what you want, since you would potentially be collapsing individual ASVs that are important features in their own right. If your ASVs all have unique taxonomies, you could just follow my convoluted workaround starting at step 3 and using your taxonomy.qza file, with no need to attempt the rusty hack of steps 1-2.
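For reference, the three-column file described above could also be generated with a few lines of Python instead of the tr/paste pipeline. This is only a sketch under assumptions: the function name `add_label_column` is hypothetical, the input is assumed to be tab-separated with the feature ID in the first column and the taxon in the second, and any further columns (such as a confidence score) are dropped. The rows in `example` are made up to match the feature1/feature2 example.

```python
def add_label_column(taxonomy_text: str) -> str:
    """Append a '<feature id>: <taxon>' column to tab-separated taxonomy text."""
    lines = [ln for ln in taxonomy_text.splitlines() if ln.strip()]
    # Header taken from the target file shown in the post above.
    out = ["feature-id\tTaxon\tfeature-id: Taxon"]
    seen = set()
    for line in lines[1:]:  # skip the original header row
        fields = line.split("\t")
        feature_id, taxon = fields[0], fields[1]
        label = f"{feature_id}: {taxon}"
        # Labels must be unique, otherwise grouping would merge features.
        if label in seen:
            raise ValueError(f"duplicate label: {label}")
        seen.add(label)
        out.append(f"{feature_id}\t{taxon}\t{label}")
    return "\n".join(out) + "\n"


# Made-up rows for illustration; the input header and third column are assumed.
example = (
    "Feature ID\tTaxon\tConfidence\n"
    "feature1\tBacteria\t0.99\n"
    "feature2\tBacteria;Listeria\t0.87\n"
)
new_taxonomy = add_label_column(example)
print(new_taxonomy, end="")
```

To use it on a real file, read the exported taxonomy.tsv into a string, pass it through the function, and write the result to new-taxonomy.tsv. Because every label starts with the feature ID, the uniqueness check should only fail if the input repeats a feature ID.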

LenaLapidot · July 17, 2019, 5:41am

Thank you for the great explanation!
I'll try it in the next couple of hours and keep you posted

LenaLapidot · July 25, 2019, 9:09am

Ok, so I've tried both suggested solutions.
If I collapse the table, the overall accuracy decreases.

I've tried creating the table manually. I get this error:
Inputs:
--i-table ARTIFACT FeatureTable[Frequency]
The table to group samples or features on. [required]
Parameters:
--p-axis TEXT Choices('sample', 'feature')
Along which axis to group. Each ID in the given axis
must exist in metadata. [required]
--m-metadata-file METADATA
--m-metadata-column COLUMN MetadataColumn[Categorical]
A column defining the groups. Each unique value will
become a new ID for the table on the given axis.
[required]
--p-mode TEXT Choices('mean-ceiling', 'median-ceiling', 'sum')
How to combine samples or features within a group.
sum will sum the frequencies across all samples or
features within a group; mean-ceiling will take the
ceiling of the mean of these frequencies;
median-ceiling will take the ceiling of the median of
these frequencies. [required]
Outputs:
--o-grouped-table ARTIFACT FeatureTable[Frequency]
A table that has been grouped along the given axis.
IDs on that axis are replaced by values in the
metadata column. [required]
Miscellaneous:
--output-dir PATH Output unspecified results to a directory
--verbose / --quiet Display verbose output to stdout and/or stderr during
execution of this action. Or silence output if
execution is successful (silence is golden).
--citations Show citations and exit.
--help Show this message and exit.

                There was a problem with the command:

(1/1) Missing option "--p-mode".

LenaLapidot · July 25, 2019, 9:22am

I chose `--p-mode sum`, then reran the classifier, and the accuracy results dropped significantly...

Nicholas_Bokulich · July 25, 2019, 12:10pm

Yep, as I mentioned above, collapsing by taxonomy can merge individual ASVs that are important features in their own right.

In other words, you are losing information that is potentially diagnostic. In your case, it appears so! That is information in its own right: ASVs that share the same taxonomy are differentially abundant in different classes.

So give the workaround I listed above a try — or we shall see if this change makes it into next week's release. You may just need to install my development branch of q2-sample-classifier to get access to all these new features.
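A toy sketch of the point above, with made-up counts and a hypothetical helper `group_features_sum` (this is not QIIME 2 code, only an illustration of summing by label): when two ASVs share a taxonomy string, grouping on that string merges them and hides their opposite abundance patterns, whereas grouping on unique "feature ID: taxon" labels only renames them.

```python
from collections import defaultdict


def group_features_sum(table, labels):
    """Sum per-sample counts of all features that share the same label."""
    grouped = defaultdict(lambda: defaultdict(int))
    for feature_id, counts in table.items():
        for sample_id, count in counts.items():
            grouped[labels[feature_id]][sample_id] += count
    return {label: dict(counts) for label, counts in grouped.items()}


# Made-up counts: two ASVs with the same taxonomy but opposite patterns.
table = {
    "feature1": {"sampleA": 10, "sampleB": 0},
    "feature2": {"sampleA": 0, "sampleB": 10},
}
taxonomy_only = {"feature1": "Bacteria;Listeria", "feature2": "Bacteria;Listeria"}
unique_labels = {fid: f"{fid}: {tax}" for fid, tax in taxonomy_only.items()}

collapsed = group_features_sum(table, taxonomy_only)
relabeled = group_features_sum(table, unique_labels)

print(collapsed)  # {'Bacteria;Listeria': {'sampleA': 10, 'sampleB': 10}}
print(relabeled)  # two rows, counts unchanged, only the IDs differ
```

In the collapsed table the two samples look identical, so a classifier has nothing left to separate them. In the relabeled table each group contains exactly one feature, so summing changes nothing.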

LenaLapidot · July 25, 2019, 12:33pm

I did the workaround and it seems to work!
Just to verify: when I ran feature-table group, is it OK that I chose `--p-mode sum`?

Also, it seems there are no differences when I run the classifier with or without the optimize feature selection option. Does this make sense?
Thank you!

Nicholas_Bokulich · July 25, 2019, 12:43pm

Yes, `--p-mode sum` is fine.

Yes, that is normal. Having more information (i.e., all features) usually does not make the models perform worse. Rather, optimizing feature selection aims to scrape away the uninformative features so that you can determine whether a model with far fewer features is still just as accurate (or nearly as accurate) as the full model. Most estimators compute feature importances anyway and use that information in their predictions. Optimize feature selection is really just an automated way to have sample-classifier report the minimum set of features that maximizes accuracy.

LenaLapidot · July 25, 2019, 1:06pm

Got it, thank you for the great explanation!
If I understand correctly, I can see the minimum set of features that maximizes accuracy in the model summary.qzv?

Nicholas_Bokulich · July 25, 2019, 1:09pm

The model summary will show you a plot of the recursive feature elimination results (i.e., how model accuracy changes as a function of the number of features), but not which features those are.

You can use the heatmap or metadata tabulate actions to see what the top important features actually are.
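As a rough illustration of how to read that accuracy-versus-feature-count curve, the sketch below picks the smallest feature count that reaches the best accuracy. The function name `smallest_best_feature_count` and the accuracy values are made up, and this is not the plugin's own code, only a toy version of "the minimum set of features that maximizes accuracy".

```python
def smallest_best_feature_count(rfe_curve):
    """Return the fewest features whose accuracy equals the best accuracy seen."""
    best_accuracy = max(rfe_curve.values())
    return min(n for n, acc in rfe_curve.items() if acc == best_accuracy)


# Made-up curve for illustration: number of features -> model accuracy.
rfe_curve = {5: 0.70, 10: 0.85, 20: 0.90, 50: 0.90, 100: 0.90}

print(smallest_best_feature_count(rfe_curve))  # 20
```

Here the models with 20, 50, and 100 features are equally accurate, which matches the observation earlier in the thread that accuracy can look the same with or without feature selection.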

LenaLapidot · July 25, 2019, 1:14pm

Awesome, I understand it now.
Thanks again and have a great day.
