Phylogenize error 1 in R code

pbradz · May 11, 2020, 9:40pm

Hi @Pablo_V, based on that I don't think diversity should be a huge problem for you. For read depth I think you can use the "summarize" command in the "feature-table" plugin as described here. Let me know how that works! -p

Pablo_V · May 11, 2020, 9:52pm

Hi @pbradz,

One question: is --p-pctmin the same as --p-minimum?

Also, the median depth of my data is 35K reads. So I guess it's not shallow, right?

pbradz · May 11, 2020, 10:02pm

Hi @Pablo_V, yes, for 16S data that should be plenty!

--p-pctmin and --p-minimum actually do pretty different things. The --p-pctmin parameter filters out trees without a certain number of species observed, while the --p-minimum parameter filters out genes that are nearly always present or absent. Specifically, if you set --p-minimum 3, then in order to be tested, any gene needs to be absent in at least 3 species and present in at least 3 species. The reason is that if a gene is near-universally present or absent, then we wouldn't have enough power to detect an association anyway. Make sense?
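To make the --p-minimum rule concrete, here is a minimal Python sketch of that kind of presence/absence filter. It is not phylogenize's own code: the function name `passes_minimum`, the gene names, and the presence vectors are all hypothetical and only illustrate the rule described above.

```python
def passes_minimum(presence, minimum=3):
    """Return True if a gene is present in at least `minimum` species
    and absent in at least `minimum` species (hypothetical helper)."""
    n_present = sum(1 for p in presence if p)
    n_absent = len(presence) - n_present
    return n_present >= minimum and n_absent >= minimum


# Hypothetical presence (1) / absence (0) of three genes across ten species.
genes = {
    "geneA": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # absent in only 1 species
    "geneB": [1, 1, 1, 0, 0, 0, 1, 0, 1, 0],  # present in 5, absent in 5
    "geneC": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # present in only 2 species
}

# Only genes that vary enough across species would go on to be tested.
tested = [name for name, presence in genes.items()
          if passes_minimum(presence, minimum=3)]
print(tested)  # ['geneB']
```

Here geneA and geneC are dropped because they are nearly always present or nearly always absent, so there would be too little variation to detect an association.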

Pablo_V · May 23, 2020, 9:17pm

Hi @pbradz,

I have successfully run phylogenize! Thanks a lot for your help.

However, I am still unsure how to interpret the results. What is the meaning of negative prevalence in the figures attached? And what is the unit of the y axis?

And regarding the phylogenetic tree, does the highlighted result mean that Paenibacillus is prevalent in 98% of the samples in this specific environment?

Thanks in advance!

Cheers,
Pablo

pbradz · May 23, 2020, 9:46pm

Hi @Pablo_V, great to hear it!

Good question. The prevalence density plots are actually on a logit scale on the x axis. Prevalence is essentially a probability (from 0 to 1), so before doing the regression, phylogenize applies a logit transformation, log(p / (1 - p)), so that the values look more like a normal distribution. That is why prevalences below 0.5 show up as negative numbers. To get the original prevalences back, you apply the inverse logit, exp(x) / (exp(x) + 1). So, a value of -2.5 would equal exp(-2.5) / (exp(-2.5) + 1), or around 0.08.
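If you want to convert values read off the x axis yourself, a small Python sketch of the two transformations might look like this. The function names `logit` and `inv_logit` are just illustrative choices, not part of phylogenize.

```python
import math


def logit(p):
    """Map a prevalence in (0, 1) onto the logit scale."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Map a logit-scale value back to a prevalence in (0, 1)."""
    return math.exp(x) / (math.exp(x) + 1)


# A value of -2.5 on the x axis corresponds to a prevalence of about 8%.
print(round(inv_logit(-2.5), 3))  # 0.076

# A logit of 0 corresponds to a prevalence of exactly 50%.
print(inv_logit(0.0))  # 0.5
```

Anything to the left of 0 on the plot is therefore a prevalence below 50%, and anything to the right is above 50%.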

These are density plots, like a continuous version of a histogram, so the y axis is density. The best way to think about it is probably that the area under the curve between two points on the x axis gives you the proportion of species that fall in that prevalence range. In this case the graph is telling you that most but not all of the Proteobacteria were low-prevalence, at or near the limit of detection (which is based on the number of samples you have).
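As a rough illustration of the "area equals proportion" idea, here is a Python sketch that uses a plain histogram in place of the smoothed density curve. The logit-prevalence values and bin edges are made up for the example, and `histogram_density` is a hypothetical helper, not something phylogenize provides.

```python
def histogram_density(values, edges):
    """Return one bar height per bin, scaled so that
    height * bin width equals the fraction of values in that bin."""
    n = len(values)
    heights = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if lo <= v < hi)
        heights.append(count / (n * (hi - lo)))
    return heights


# Hypothetical logit-scale prevalences for ten species.
values = [-4.2, -4.0, -3.9, -3.8, -3.5, -3.1, -2.4, -1.0, 0.5, 2.0]
edges = [-5.0, -3.0, -1.0, 1.0, 3.0]

heights = histogram_density(values, edges)

# Area of each bar = proportion of species in that prevalence range.
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:])]
print(heights)
print(areas)
```

In this toy data six of the ten species fall between -5 and -3, so that bar has an area of about 0.6, and the areas of all bars sum to 1. A density plot is the same idea with the bars smoothed into a curve.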

And yes, in this example, this Paenibacillus species was estimated to have a prevalence of 98.7% in the environment you selected. This could actually mean it was always detected: phylogenize uses pseudocounts to avoid values of 100% or 0%. This is for two related reasons: 1. those values are infinite after transformation and 2. it would probably be overconfident to say that a microbe is "always" or "never" detected. So phylogenize both uses and reports slightly moderated estimates of prevalence.
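To show why a pseudocount keeps the logit finite, here is a minimal Python sketch. Please treat it as an assumption-laden illustration: it uses a generic add-one pseudocount, which is not necessarily the estimator phylogenize uses internally, and the sample count of 75 is hypothetical rather than taken from your data.

```python
import math


def moderated_prevalence(detected, total, pseudocount=1.0):
    """Generic pseudocount estimate (an assumption for illustration,
    not necessarily phylogenize's exact method)."""
    return (detected + pseudocount) / (total + 2 * pseudocount)


def logit(p):
    """Map a prevalence in (0, 1) onto the logit scale."""
    return math.log(p / (1 - p))


# Hypothetical case: a microbe detected in all 75 of 75 samples.
detected, total = 75, 75

raw = detected / total  # exactly 1.0, so logit(raw) would divide by zero
moderated = moderated_prevalence(detected, total)

print(raw)                  # 1.0
print(round(moderated, 3))  # 0.987
print(math.isfinite(logit(moderated)))  # True
```

With the raw estimate of 1.0 the logit is undefined, while the moderated estimate stays slightly below 100% and transforms to an ordinary finite number.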
