I have generated a heat map comparing OTUs between different sources:
My professor is asking why qqq's total number of sequences doesn't add up to the same amount as abc's and xyz's. I'm not sure how to explain this to him.
I believe this has to do with the importance scores. I have many more samples for abc and xyz in this dataset, with very few for qqq. Additionally, qqq has a very low diversity compared to sources abc and xyz. Therefore, there shouldn't be many OTUs with high abundance from qqq because these samples contribute less to the dataset's importance score calculations.
Is this correct? My professor suggested I rarefy the data so that each sample uses the same number of sequences, as a higher number of sequences in abc's and xyz's samples might be contributing to the higher amounts of sequences observed in the heatmap. After rarefying, my output was basically the same.
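To show what I suspect is happening, here is a toy sketch in Python. It is not my data and not QIIME 2 code. All counts, sample names, and the depth of 1000 are made up. It assumes the reads that dominate qqq fall in a feature that is not among the 30 shown in the heatmap. Under that assumption, every sample has the same total after rarefying, but the totals over the displayed features still differ between sources, and abc also has more samples contributing.

```python
import numpy as np

DEPTH = 1000      # made-up rarefaction depth, identical for every sample
N_FEATURES = 100  # made-up number of features in the full table
TOP_N = 30        # mirrors --p-feature-count 30 (features shown in the heatmap)

def toy_sample(shown_count, hidden_index, hidden_count):
    """Build one toy sample: `shown_count` reads in each displayed feature,
    plus `hidden_count` reads in a single feature that is not displayed."""
    counts = np.zeros(N_FEATURES, dtype=int)
    counts[:TOP_N] = shown_count          # features 0..TOP_N-1 are "displayed"
    counts[hidden_index] = hidden_count   # must be >= TOP_N to stay hidden
    return counts

# Hypothetical samples: three for abc, one low-diversity sample for qqq.
samples = {
    "abc_1": toy_sample(30, 50, 100),
    "abc_2": toy_sample(30, 50, 100),
    "abc_3": toy_sample(30, 50, 100),
    "qqq_1": toy_sample(2, 99, 940),
}
groups = {"abc": ["abc_1", "abc_2", "abc_3"], "qqq": ["qqq_1"]}

# Every sample sums to the same depth, as it would after rarefying.
per_sample_totals = {name: int(c.sum()) for name, c in samples.items()}

# Totals restricted to the displayed features, summed within each source.
group_shown = {
    g: int(sum(samples[s][:TOP_N].sum() for s in members))
    for g, members in groups.items()
}

print(per_sample_totals)
print(group_shown)
```

If this picture is right, it would explain why rarefying barely changed my heatmap: rarefying equalizes totals over all features per sample, not totals over the 30 displayed features per source.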
I used the following commands to generate the heat map:
qiime sample-classifier classify-samples \
  --i-table taxa-levels/table-l7.qza \
  --m-metadata-file Metadata_File.txt \
  --m-metadata-column source \
  --p-random-state 666 \
  --p-n-jobs 1 \
  --output-dir machine_learning_classifier/sample-classifier-results/
qiime sample-classifier heatmap \
  --i-table taxa-levels/table-l7.qza \
  --i-importance machine_learning_classifier/sample-classifier-results/feature_importance.qza \
  --m-sample-metadata-file Metadata_File.txt \
  --m-sample-metadata-column source \
  --p-group-samples \
  --p-feature-count 30 \
  --p-color-scheme Greens \
  --o-heatmap machine_learning_classifier/sample-classifier-results/heatmap-by-source_Greens.qzv \
  --o-filtered-table machine_learning_classifier/sample-classifier-results/heatmap-by-source.qza
I'd appreciate any comments or feedback on how to explain why qqq's total number of sequences is not equal to abc's and xyz's. Also, does anyone have any resources on how importance scores are calculated? I searched the web but couldn't find anything specific.
Thank you so much.