Filter features conditionally not working as intended

I believe the filter-features-conditionally plugin does not work the way its description suggests. I understood the description to mean that a feature is removed only if it fails both conditions (abundance and prevalence). But in my testing, a feature is removed if it fails prevalence alone, irrespective of abundance.

Here is an example where the control sample (rightmost bar in the barplot) has 4 high-abundance features, with prevalence set to 1% and abundance at 0.5%.

Here is the same dataset with prevalence set to 5% (and abundance at 0.5%), where two high-abundance features were removed from the control sample (rightmost bar in the barplot).

Hi @Alxdu,

I think there is maybe a miscommunication here, but when I go in and check the code, the function is behaving as expected.

I think the first step is clarifying the expected behavior.

The abundance criterion must be met in at least the prevalence proportion of samples. So, if there are n samples, an abundance threshold a, and a prevalence threshold p, the function retains features where $\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(a_i > a) \geq p$, with a_i being the feature's relative abundance in sample i. (If you want a condition on prevalence alone, you can set a = 0.) This is exactly how it's reported in the docs, if you specifically look at prevalence.

From the function description in the docs:

Filter features based on the relative abundance in a certain portion of samples (i.e., features must have a relative abundance of at least abundance in at least prevalence number of samples). Any samples with a frequency of zero after feature filtering will also be removed.

So, both criteria must be met, but they must be met under those specific conditions. (If you want a joint filtering where both conditions must be met, filter-features may be a better function for you.)
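As a minimal sketch of the rule described above (plain NumPy, not the plugin's actual code), the filter can be thought of as a per-feature mask. The function name `conditional_keep_mask` and the toy table are hypothetical, and the sketch uses `>=` at the abundance cutoff to match the "at least" wording in the docs; how the real function treats the exact boundary is an assumption here.

```python
import numpy as np

def conditional_keep_mask(counts, abundance, prevalence):
    """Return a boolean mask of features to keep.

    counts: 2-D array of raw counts, rows = samples, columns = features.
    Assumes every sample has at least one read (no all-zero rows).
    """
    counts = np.asarray(counts, dtype=float)
    # Convert each sample (row) to relative abundances.
    rel = counts / counts.sum(axis=1, keepdims=True)
    # Fraction of samples in which each feature reaches the abundance cutoff.
    frac_passing = (rel >= abundance).mean(axis=0)
    # Keep a feature only if that fraction reaches the prevalence cutoff.
    return frac_passing >= prevalence

# Hypothetical toy table: 4 samples x 3 features.
counts = np.array([
    [90, 10, 0],
    [97, 0, 3],
    [99, 0, 1],
    [100, 0, 0],
])

# The second feature is abundant (10%) in only 1 of 4 samples, so it is
# dropped at prevalence=0.5 even though it clears the abundance cutoff there.
mask = conditional_keep_mask(counts, abundance=0.05, prevalence=0.5)
print(mask)
```

Lowering `prevalence` to 0.25 in this toy table keeps the second feature, since one passing sample out of four is then enough.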

In your specific case, the visualization you're using hampers determining performance. My experience has been that stacked barcharts work best when you have features that have a relative abundance of at least 1% in the samples you're visualizing, and they really only allow you to visualize 12 groups. In this particular case, the ability to make the diagnosis is hampered by the large number of repeating colors, the lack of a legend, possibly collapsed data (I'm guessing genus or family), and not having the interactive ability to mouse over and see which colors correlate across samples. Add to this that it's, again, really hard to see low-abundance features (my visualization threshold is usually 1%), and picking things out is difficult.

If you want to verify the function is working correctly, I would recommend either changing your visualization approach or increasing your abundance threshold.



Justine, thank you for the feedback. I have a better understanding of how the conditions are set.

I am attaching the same figures with the four features highlighted (A, B, C, D from top to bottom). The control sample (a mix of four species) is on the left now. You will notice that features B and C are present only in the control. Here prevalence is set to 1% (i.e., 30 samples * 0.01 = 0.3) and abundance at 0.5%. Taxa are not collapsed; the barplot is visualized at rank 7.

And here is the same data where prevalence was set to 5% (30 samples * 0.05 = 1.5).

So, features B and C fail the conditional prevalence here, but I expected them to still be retained because they were above the abundance cut-off.
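To make that arithmetic explicit, here is a small sketch in plain Python (not the plugin's code). It assumes B and C reach the abundance cutoff in exactly one of the 30 samples (the control); the helper names are hypothetical.

```python
import math

n_samples = 30
# Assumption: features B and C clear the 0.5% abundance cutoff only in the control.
samples_passing = 1

def min_samples_required(prevalence, n):
    """Smallest whole number of samples that satisfies the prevalence fraction."""
    return math.ceil(prevalence * n)

def is_retained(passing, prevalence, n):
    """True if the fraction of passing samples reaches the prevalence cutoff."""
    return passing / n >= prevalence

for prevalence in (0.01, 0.05):
    needed = min_samples_required(prevalence, n_samples)
    kept = is_retained(samples_passing, prevalence, n_samples)
    print(f"prevalence={prevalence}: need {needed} sample(s), retained={kept}")
```

At 1% a single sample is enough (0.3 rounds up to 1), while at 5% the feature would need to clear the abundance cutoff in at least 2 samples (1.5 rounds up to 2).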

Thanks for the feedback. I understand the explanation you provided, but it's not what I expected going from the plugin description. Does my explanation make sense to you? Is that a functionality that can be implemented, such that niche features (i.e., contaminants) which may be highly abundant with very low prevalence can still be retained, while low-abundance and low-prevalence noise can still be removed?

I think this might be a point of confusion: viewing your features as taxa is collapsing them. You might have more than one feature that was classified as the same species/genus/etc.

What would be more helpful here is a feature-table heatmap of the table pre- and post-filtering.


Negative. There are no features below rank 7. Data was imported at the feature-table level, where anything below rank 7 was already collapsed at that node. As far as qiime can see, rank 7 is the lowest feature classification level (i.e., no tax ids below species rank).
Thank you for the feedback; it would be a valid observation if that were the case.

Hi @Alxdu,

I think your high abundance/low prevalence case is addressed in the filter-features function I linked above; most low-prevalence taxa are also low abundance. Depending on your sequencing depth, etc., you could easily select for features that are abundant in a single sample as long as they have enough reads.
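As a rough illustration of that idea (plain NumPy, not the plugin's code), filtering on total read count keeps a feature that is abundant in a single sample while dropping sparse noise. The function name and toy counts are hypothetical; the `min_frequency` and `min_samples` parameter names are meant to mirror the filter-features options as I understand them, so treat the exact semantics as an assumption.

```python
import numpy as np

def total_frequency_keep_mask(counts, min_frequency=0, min_samples=0):
    """Return a boolean mask of features to keep.

    counts: 2-D array of raw counts, rows = samples, columns = features.
    """
    counts = np.asarray(counts)
    # Total reads assigned to each feature across all samples.
    total_reads = counts.sum(axis=0)
    # Number of samples in which each feature is observed at all.
    n_observed = (counts > 0).sum(axis=0)
    return (total_reads >= min_frequency) & (n_observed >= min_samples)

# Hypothetical toy table: 3 samples x 3 features.
counts = np.array([
    [500, 2, 50],   # control: first feature is abundant only here
    [0,   0, 60],
    [0,   1, 40],
])

keep = total_frequency_keep_mask(counts, min_frequency=100)
print(keep)
```

In this toy table the first feature survives on its 500 reads from one sample, while the second (3 reads in total) is removed.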

If you're looking for contaminant filtering specifically, I recommend looking into decontam and the forum discussion on decontamination. It's a more complex topic than what you're describing, and I think that's a good place to start.