Heat Map Variables Sequences Do Not Sum to Same Amount

Good day,

I have generated a heat map comparing OTUs between different sources:

My professor is asking why qqq's total number of sequences doesn't add up to the same amount as abc and xyz. I'm not sure how to explain to him why this is.

I believe this has to do with the importance scores. I have many more samples for abc and xyz in this dataset, with very few for qqq. Additionally, qqq has a very low diversity compared to sources abc and xyz. Therefore, there shouldn't be many OTUs with high abundance from qqq because these samples contribute less to the dataset's importance score calculations.

Is this correct? My professor suggested I rarefy the data so that each sample uses the same number of sequences, as a higher number of sequences in abc's and xyz's samples might be contributing to the higher amounts of sequences observed in the heatmap. After rarefying, my output was basically the same.

I used the following commands to generate the heat map:

qiime sample-classifier classify-samples \
--i-table taxa-levels/table-l7.qza \
--m-metadata-file Metadata_File.txt \
--m-metadata-column source \
--p-random-state 666 \
--p-n-jobs 1 \
--output-dir machine_learning_classifier/sample-classifier-results/


qiime sample-classifier heatmap \
--i-table taxa-levels/table-l7.qza \
--i-importance machine_learning_classifier/sample-classifier-results/feature_importance.qza \
--m-sample-metadata-file Metadata_File.txt \
--m-sample-metadata-column source \
--p-group-samples \
--p-feature-count 30 \
--p-color-scheme Greens \
--o-heatmap machine_learning_classifier/sample-classifier-results/heatmap-by-source_Greens.qzv \
--o-filtered-table machine_learning_classifier/sample-classifier-results/heatmap-by-source.qza

I'd appreciate anyone's comments or feedback on how to explain why qqq's total number of sequences is not equivalent to abc's and xyz's. Also, does anyone have any resources on how importance scores are evaluated? I searched the web but couldn't find anything specific.

Thank you so much.

Hi @Chantel ,
The heatmaps output by q2-sample-classifier display only the top N most important features (here, --p-feature-count 30), not all features.

Because only the most important features are shown, the totals per source will not match. Even if the total sequence counts for all samples are the same going in, the counts for individual features will likely differ, so a subset of features will not sum to the same amount.

This is partly why rarefying did not change your output: after rarefying you have the same number of sequences in each sample, but not in each feature. So the subset of important features will still not sum to the same amount across all samples.
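Here is a rough sketch of that point in plain Python. The sample names, OTU names, counts, and the choice of "top" features are all made up for illustration (they are not your data and not what the classifier would select), and the rarefy helper is a hypothetical stand-in for QIIME 2's rarefaction rather than its actual implementation.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from a {feature: count} dict."""
    # Expand the counts into one entry per read, then draw `depth` of them.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    picked = rng.sample(pool, depth)
    out = {feature: 0 for feature in counts}
    for feature in picked:
        out[feature] += 1
    return out


# Hypothetical toy table: sample -> feature counts.
table = {
    "abc_1": {"otu1": 500, "otu2": 300, "otu3": 150, "otu4": 50},
    "xyz_1": {"otu1": 400, "otu2": 350, "otu3": 200, "otu4": 250},
    "qqq_1": {"otu1": 10, "otu2": 5, "otu3": 5, "otu4": 480},
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
depth = 400
rarefied = {sample: rarefy(counts, depth, rng) for sample, counts in table.items()}

# Assumption: pretend the classifier ranked otu1 and otu2 as most important.
top_features = ["otu1", "otu2"]

# Every sample has the same total after rarefying...
totals = {s: sum(c.values()) for s, c in rarefied.items()}
# ...but the totals over the top features alone still differ.
subset_totals = {s: sum(c[f] for f in top_features) for s, c in rarefied.items()}

print("all features:", totals)
print("top features:", subset_totals)
```

Every sample totals 400 reads over all features, but qqq_1 contributes far fewer reads to the two "important" features than abc_1 does, because most of its reads sit in a feature that was not selected.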

Random forests are also fairly robust to differences in sequence counts, so unless sequencing depths are very skewed across samples, I would not expect rarefying to change the results.

This post gives a nice explanation, and also a link for where to learn more about the algorithms used:

Good luck!