Questions about sample-classifier

I have a few questions about sample-classifier.
(1) I have run "qiime sample-classifier classify-samples", but do not understand the results. Could anybody help me interpret the results?
Here is my command:
qiime sample-classifier classify-samples \
  --i-table BI-feature-table-c10l60.qza \
  --m-metadata-file metadata4_BI.tsv \
  --m-metadata-category distance \
  --output-dir BI-SML

I have run similar commands a few times for different feature tables and obtained different Accuracy Ratios. One result is attached. visualization.qzv (261.9 KB)

(2) I also ran maturity-index and regress-samples, but both gave errors.
qiime sample-classifier maturity-index \
  --i-table BI-feature-table-c10l60.qza \
  --m-metadata-file metadata4_BI.tsv \
  --p-category distance \
  --p-group-by distance \
  --p-control D500 \
  --output-dir BI-maturity-index

Plugin error from sample-classifier:

Cannot have number of splits n_splits=5 greater than the number of
samples: 4.

qiime sample-classifier regress-samples \
  --i-table BI-feature-table-c10l60.qza \
  --m-metadata-file metadata4_BI.tsv \
  --m-metadata-category distance \
  --output-dir BI-regress

Plugin error from sample-classifier:

could not convert string to float: 'D090'

For your information, there are 42 samples (7 distances × 6 samples per distance) in the feature table. The 7 distances are D000, D015, D030, D060, D090, D125 and D500.


Hi @eDNA,
Thank you for posting your questions.

The description of the outputs is given in the docs (scroll down below the command block). Basically, the heatmap (and the confusion matrix table below it) shows the fraction of times that samples in each true group were assigned to each predicted group. See more details here.

Getting different accuracy ratios across runs is absolutely normal: there are a number of random factors at play here, in subsampling training/test samples, in subsampling features, and in training the models. Small amounts of variation are normal and fine. You could supply a random number seed with --p-random-state to get replicable results.
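To illustrate why a seed makes results replicable, here is a minimal Python sketch of a seeded train/test split. It is not how sample-classifier is implemented; the function name `split_samples`, the sample IDs, and the 20% test fraction are all assumptions made for the example.

```python
import random

def split_samples(sample_ids, test_fraction=0.2, seed=None):
    """Shuffle sample IDs and split them into (train, test) lists."""
    # With seed=None the generator is seeded unpredictably, so each run
    # can hold out a different set of samples.
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 42 hypothetical sample IDs, matching the 7 x 6 design described above.
samples = [f"S{i:02d}" for i in range(42)]

train_a, test_a = split_samples(samples, seed=123)
train_b, test_b = split_samples(samples, seed=123)

# The same seed holds out the same samples, so a model trained and
# scored on these splits would see identical data in both runs.
print(test_a == test_b)
```

Without a fixed seed, a different set of samples is likely to be held out on each run, which is one source of the small run-to-run differences in accuracy.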

Regarding (2): these methods are not appropriate for the data types that you are providing. Regression methods only accept numerical values for the metadata category being predicted, and the distance metadata column that you are providing is not numerical, which is why 'D090' could not be converted to a float. If appropriate, you could convert these to numerical values (e.g., if 0, 15, 30, etc., indicate meters or some other unit of distance from something).
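If the labels really do encode a numeric distance, the conversion could be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper names, the new column name `distance_numeric`, and the two-row example table are hypothetical, and the sketch assumes a plain tab-separated file with a single header row and no extra comment or type rows.

```python
import csv
import io

def distance_label_to_number(label):
    """Convert a label such as 'D090' to 90.0 by dropping the leading 'D'."""
    if not label.startswith("D") or not label[1:].isdigit():
        raise ValueError(f"unexpected distance label: {label!r}")
    return float(label[1:])

def add_numeric_column(tsv_text, source="distance", target="distance_numeric"):
    """Return the TSV text with an extra numeric column derived from `source`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + [target]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[target] = distance_label_to_number(row[source])
        writer.writerow(row)
    return out.getvalue()

# Hypothetical two-sample metadata table, for illustration only.
example = "#SampleID\tdistance\nS01\tD000\nS02\tD090\n"
converted = add_numeric_column(example)
print(converted)
```

The original distance column is kept, so the categorical labels remain available for classify-samples while the new numeric column could be used for regression.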

Furthermore, it looks like maturity-index is inappropriate for the data you are using. I recommend reading over the docs and the original paper for the maturity-index method to understand what this method measures. category and group-by should be different metadata columns; e.g., category will usually be a measure of time (distance might also work) and group-by should be, e.g., two different treatment groups that are measured at the same points described in category. This method will determine whether the group-by groups follow different trajectories of change as category changes.

I hope that helps! Let us know if you have further questions.
