Q2-perc-norm - reduction in taxa

Hind_Sbihi · November 28, 2018, 12:17am

In reference to my previous question, I should let you know that the issue was just the quotes!

Adding quote=F will solve the problem

I successfully ran your plugin. Now the issue is that the number of sOTUs drops from ~300 to ~30.

This ten-fold reduction between the number of taxa that go in and the number that come out is huge.
It is possibly because, when the data were processed, I had to filter so that each OTU is present in at least 5% of my samples.

Can I reduce the threshold in your plugin?

I'd like the OTU frequency threshold to be at 1% or 5%.

I suppose I could try to play with the python code but I would not know where to start to get this working in the plugin.

Thank you very much for your help!!

Hind

cduvallet · November 28, 2018, 6:02pm

Glad to hear that it is working now!

I can add the threshold as a parameter to input in the plugin today or tomorrow - it should be fairly straightforward to do. I'll reply on this topic when that new version is out.

While you could reduce the threshold to allow OTUs which are only present in 1-5% of your samples, this is actually highly inadvisable. The percentile normalization method converts OTU relative abundances to their relative rank across all controls. If 95% of your samples have a zero abundance for a given OTU, then 95% of your samples will have the same rank. In reality, we actually add noise to zero values in order to prevent ties in the ranks. That means that in the best case scenario, these OTUs will just be converted to random noise. In the worst case scenario, you will find spurious results because too many of your samples were zero for too many of your OTUs.
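To make the zero problem concrete, here is a minimal Python sketch. It is not the plugin's actual code: the function names, the noise scale (`noise_max`), and the uniform noise distribution are all assumptions made for illustration. It jitters zeros, ranks every sample against the controls, and shows that an OTU present in only 5% of samples gets percentile values for its zero samples that are determined entirely by the random noise.

```python
import random
from bisect import bisect_right

def percentile_of(value, sorted_reference):
    """Percent of reference values that are <= value."""
    return 100.0 * bisect_right(sorted_reference, value) / len(sorted_reference)

def percentile_normalize(abundances, is_control, rng, noise_max=1e-9):
    """Rank every sample against the controls, after jittering zeros.

    noise_max is an assumed scale, chosen to sit below any real abundance.
    """
    # Replace zeros with tiny random values so they do not all tie in rank.
    jittered = [a if a > 0 else rng.uniform(0, noise_max) for a in abundances]
    controls = sorted(v for v, c in zip(jittered, is_control) if c)
    return [percentile_of(v, controls) for v in jittered]

rng = random.Random(0)
n = 200
is_control = [i % 2 == 0 for i in range(n)]  # alternate controls and cases

# Sparse OTU: non-zero in only 5% of samples (the first 10 of 200).
sparse = [0.01 if i < 10 else 0.0 for i in range(n)]
normalized = percentile_normalize(sparse, is_control, rng)

# Percentiles assigned to the 190 samples whose true abundance was zero.
zero_percentiles = [p for p, a in zip(normalized, sparse) if a == 0]
print(len(set(zero_percentiles)), min(zero_percentiles), max(zero_percentiles))
```

The zero samples end up spread across most of the percentile range even though they carry no information, which is the "random noise" outcome described above.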

I wrote a blog post explaining the issue with zeros, published on microBEnet and my own website - hopefully it can clarify more why we don't recommend using this method on very sparse OTUs.

Also CCing @seangibbons in case he has anything to add.

cduvallet · November 28, 2018, 6:56pm

I actually just realized that this functionality is already in the plugin! You can change the threshold for filtering OTUs with the --p-otu-thresh parameter. This parameter is used as follows: an OTU must be present in at least X fraction of cases or controls to be retained in the analysis (where X is a float between 0 and 1 given to --p-otu-thresh). So you should be able to modulate this filtering behavior with the current version of the plugin. Let me know if it's not working though.

This functionality happens in line 42 of the _percentile_normalize.py file.
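For anyone who wants to see the filtering rule spelled out, here is a rough Python sketch of the logic as described above. It is an illustration only, not a copy of the code in `_percentile_normalize.py`: the function name `otus_to_keep`, the dict-of-lists table layout, and the use of `>=` at the boundary are assumptions.

```python
def otus_to_keep(table, is_case, otu_thresh):
    """Return OTUs present in at least otu_thresh fraction of cases OR controls.

    table: dict mapping OTU id -> list of abundances, one per sample.
    is_case: list of booleans, True for case samples, False for controls.
    """
    keep = []
    for otu, abundances in table.items():
        case_vals = [a for a, c in zip(abundances, is_case) if c]
        ctrl_vals = [a for a, c in zip(abundances, is_case) if not c]
        # Fraction of samples in each group where the OTU is non-zero.
        frac_case = sum(a > 0 for a in case_vals) / len(case_vals)
        frac_ctrl = sum(a > 0 for a in ctrl_vals) / len(ctrl_vals)
        if frac_case >= otu_thresh or frac_ctrl >= otu_thresh:
            keep.append(otu)
    return keep

is_case = [True] * 4 + [False] * 4
table = {
    "otu_common":    [1, 2, 0, 3, 4, 0, 1, 2],  # 75% of cases, 75% of controls
    "otu_case_only": [1, 1, 0, 0, 0, 0, 0, 0],  # 50% of cases, 0% of controls
    "otu_rare":      [1, 0, 0, 0, 0, 0, 0, 0],  # 25% of cases, 0% of controls
}

print(otus_to_keep(table, is_case, 0.3))  # ['otu_common', 'otu_case_only']
print(otus_to_keep(table, is_case, 0.6))  # ['otu_common']
```

Lowering the value passed to --p-otu-thresh corresponds to lowering `otu_thresh` here, which lets progressively sparser OTUs through.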

seangibbons · November 28, 2018, 7:11pm

Completely agree with everything Claire said.

Based on an email exchange from yesterday, I think Hind is interested in generating an ordination plot from the percentile-normalized data. The current implementation of the script is not optimized for this purpose, but rather for differential abundance testing. Including the low-occurrence OTUs is not productive for differential abundance testing (e.g. it increases your multiple test correction penalty). However, by excluding the taxa that show up as mostly zeros you'll actually introduce batch effects into an ordination of the percentile-normalized data (i.e. the taxa that show up as zeros are very different across batches). Thus, if you want to make a PCoA or something similar, it's best not to remove any of the low-occurrence taxa (just include them as noise).

cduvallet · November 30, 2018, 8:41pm

Okay, just as a bit of added clarification after some offline discussion with Sean for anyone who stumbles on this thread later:

From @seangibbons: "The reason that these rare taxa may "drive" batch effects in ordination plots is that if you run the percentile normalization separately in each study (as you should), then each study will have its own unique signature of OTUs that get removed. 95% of your data for OTU1 might be zeros in study A, but it might only be 30% in study B. Then, when you pool many studies, the zeros drive a strong study-specific clustering (i.e. OTU1 appears to not have been detected in study A, but is present in Study B). This will look like ‘batch effects’."

I'm personally not sure what the right way to approach this is, but two options are to not do any filtering (as @Hind_Sbihi proposed doing), or to build the ordination plot only from taxa that were retained in all studies. Either way, you'll definitely need to be careful that batch-confounded patterns of OTU presence/absence are not driving any signal that you see, since this might still be the case. And we all agree that no quantitative analyses (like differential abundance testing) should be done on these rare taxa.
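As a toy illustration of the second option, the sketch below filters two made-up studies separately and then keeps only the OTUs that survived in both. The data, the threshold, and the helper `retained_otus` are hypothetical, and the filter is simplified to a single prevalence cutoff per study rather than the plugin's case/control rule.

```python
def retained_otus(study_table, thresh):
    """OTUs that are non-zero in at least `thresh` fraction of one study's samples.

    study_table: dict mapping OTU id -> list of abundances within that study.
    """
    return {
        otu
        for otu, vals in study_table.items()
        if sum(v > 0 for v in vals) / len(vals) >= thresh
    }

# OTU1 is 95% zeros in study A but only 30% zeros in study B.
study_a = {"OTU1": [0] * 19 + [5], "OTU2": [1] * 20}
study_b = {"OTU1": [0] * 6 + [3] * 14, "OTU2": [2] * 20}

thresh = 0.3
kept_a = retained_otus(study_a, thresh)  # {'OTU2'}
kept_b = retained_otus(study_b, thresh)  # {'OTU1', 'OTU2'}

shared = kept_a & kept_b                      # safe to use in a pooled ordination
study_specific = (kept_a | kept_b) - shared   # would separate samples by study

print(sorted(shared))          # ['OTU2']
print(sorted(study_specific))  # ['OTU1']
```

Here OTU1 is exactly the kind of taxon Sean describes: it is dropped in study A and kept in study B, so pooling the two filtered tables would separate samples by study on that OTU alone.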

Hind_Sbihi · December 5, 2018, 6:15pm

@cduvallet and @seangibbons: THANK YOU BOTH!! Your explanations make a lot of sense.
After playing with the threshold level and going as low as 1% (yikes!), we have decided to run our analyses without percentile normalization. We will adjust for batch in the subsequent statistical analysis (e.g. DESeq2, edgeR) when examining which species are driving differences between our healthy and non-healthy groups.
Again, thanks so much for taking the time.

seangibbons · December 5, 2018, 6:36pm

Sounds good. But rest assured, despite the 'batchiness' in ordination space, your data have still been successfully batch-corrected from the standpoint of differential abundance tests. The ordination plot isn't important for this correction; the batchiness is just a quirk/artifact of how the data are filtered. So if you're worried that the method isn't working correctly for your differential abundance tests, you can still move forward with this type of analysis. You'll probably be fine with DESeq2, especially if the difference you're looking for is slight (DESeq2 assumes that most things in your dataset are not changing - if the difference between case and control involves a global shift in the system, then DESeq2's assumptions are violated). Good luck!