I’m using q2-picrust2’s custom-tree-pipeline as described in the q2-picrust2 tutorial, and the default max-nsti cutoff described there is 2. I noticed that the Hidden state prediction page says “sequences with extremely high NSTI values (e.g. > 1) should be removed”, and further down on that page: “There is no clear cut-off for a high NSTI values, but a good rule of thumb is that sequences placed with NSTI > 0.15 will be less reliable.”
Is the default max-nsti=2 reasonable for human fecal samples, or is it better to use something more stringent as described on the HSP wiki page, such as
Thanks for your insight!
I actually was questioning this as well last week and totally forgot to follow up on it. Thanks for the reminder!
Pinging q2-picrust2’s developer @gmdouglas for his input.
Hey @jairideout and @Mehrbod_Estaki,
Sorry for the confusion - what was written on that page was from before I tested out a number of different NSTI cut-offs on different datasets. I’m now suggesting a max NSTI cut-off of 2, which should eliminate what are essentially junk sequences that can’t be placed in the reference tree. I’ve changed that wiki page to reflect the difference.
That being said, the choice of max NSTI had little impact on the concordance between 16S-predicted functions and metagenomics-identified functions (except when throwing out >90% of ASVs). This was true even for environmental samples, so using either cut-off should have very little impact on the metagenome-wide predicted function abundances.
@gmdouglas Awesome, thanks for the details and for updating the docs!
In the qiime picrust2 command I am also using --p-max-nsti 2 (the default value). From the discussion it is clear that this will eliminate the junk sequences that can’t be placed into the phylogenetic tree. It is therefore important to know how many sequences (% of ASVs) were actually used to predict the functions for a given dataset. Is there any option by which users can get this information?
It would be more precise to report the proportion of data used for the prediction.
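In case it helps, here is a minimal sketch of how that proportion could be computed by hand. It assumes you already have each ASV's NSTI value and its total read count as plain Python dicts. The function name `nsti_retention` and both inputs are hypothetical, not part of q2-picrust2 or PICRUSt2, and I'm assuming that an ASV is kept when its NSTI is less than or equal to the cut-off.

```python
def nsti_retention(nsti_by_asv, counts_by_asv, max_nsti=2.0):
    """Return (fraction of ASVs kept, fraction of reads kept) for a cut-off.

    Both inputs are hypothetical: dicts keyed by ASV ID that hold each
    ASV's NSTI value and its total read count across all samples.
    """
    if not nsti_by_asv:
        raise ValueError("no ASVs supplied")

    # Assumption: an ASV is kept when its NSTI does not exceed the cut-off.
    kept = [asv for asv, nsti in nsti_by_asv.items() if nsti <= max_nsti]

    # ASVs missing from the count table contribute zero reads.
    total_reads = sum(counts_by_asv.get(asv, 0) for asv in nsti_by_asv)
    kept_reads = sum(counts_by_asv.get(asv, 0) for asv in kept)

    asv_fraction = len(kept) / len(nsti_by_asv)
    read_fraction = kept_reads / total_reads if total_reads else 0.0
    return asv_fraction, read_fraction


# Toy values for illustration only, not real data.
nsti = {"ASV1": 0.02, "ASV2": 0.10, "ASV3": 0.40, "ASV4": 2.50}
counts = {"ASV1": 500, "ASV2": 300, "ASV3": 150, "ASV4": 50}

# Compare the cut-offs mentioned in this thread.
for cutoff in (2.0, 1.0, 0.15):
    asv_frac, read_frac = nsti_retention(nsti, counts, max_nsti=cutoff)
    print(f"max NSTI {cutoff}: {asv_frac:.0%} of ASVs, {read_frac:.0%} of reads kept")
```

Reporting the read fraction alongside the ASV fraction seems worthwhile, since a handful of discarded low-abundance ASVs can be a large share of ASVs but a small share of the data.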