Q2-picrust2: recommended max NSTI value

jairideout · March 26, 2019, 9:10pm

I'm using q2-picrust2's custom-tree-pipeline as described in the q2-picrust2 tutorial, and the default max-nsti cutoff described there is 2. I noticed that the Hidden state prediction page says "sequences with extremely high NSTI values (e.g. > 1) should be removed", and further down on that page: "There is no clear cut-off for a high NSTI values, but a good rule of thumb is that sequences placed with NSTI > 0.15 will be less reliable."

Is the default max-nsti=2 reasonable for human fecal samples, or is it better to use something more stringent as described on the HSP wiki page, such as max-nsti=1?

Thanks for your insight!

Mehrbod_Estaki · March 26, 2019, 10:26pm

I actually was questioning this as well last week and totally forgot to follow up on it. Thanks for the reminder!
Pinging q2-picrust2's developer @gmdouglas for his input.

gmdouglas · March 26, 2019, 11:18pm

Hey @jairideout and @Mehrbod_Estaki,

Sorry for the confusion. What was written on that page predates my testing of a number of different NSTI cut-offs on different datasets. I'm now suggesting a max NSTI cut-off of 2, which should eliminate what are essentially junk sequences that can't be placed in the reference tree. I've changed that wiki page to reflect this.

That being said, the choice of max NSTI cut-off had little impact on the concordance between 16S-predicted functions and metagenomics-identified functions (except when the cut-off threw out >90% of ASVs). This was true even for environmental samples, so using either cut-off should have very little impact on the metagenome-wide predicted function abundances.

Gavin
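
For readers who want to see what a cut-off like this does in practice, here is a minimal sketch. It is not PICRUSt2's own code. It assumes you already have one NSTI value per ASV (the ASV IDs and NSTI values below are made up), and that an ASV is dropped only when its NSTI is greater than the cut-off. The function name `filter_by_max_nsti` is hypothetical.

```python
def filter_by_max_nsti(nsti_by_asv, max_nsti):
    """Return the ASV IDs whose NSTI is at or below max_nsti (assumed rule)."""
    return [asv for asv, nsti in nsti_by_asv.items() if nsti <= max_nsti]


# Made-up NSTI values, for illustration only.
example_nsti = {
    "ASV1": 0.02,
    "ASV2": 0.12,
    "ASV3": 0.40,
    "ASV4": 1.30,
    "ASV5": 2.70,
}

# Compare the cut-offs mentioned in this thread.
for cutoff in (2, 1, 0.15):
    kept = filter_by_max_nsti(example_nsti, cutoff)
    print(f"max NSTI {cutoff}: kept {len(kept)} of {len(example_nsti)} ASVs")
```

With these toy values, only the cut-off of 0.15 removes more than half of the ASVs, which is the kind of aggressive filtering the post above cautions about.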

jairideout · March 27, 2019, 7:32pm

@gmdouglas Awesome, thanks for the details and for updating the docs!

Ashok_Kumar_Sharma · July 11, 2019, 5:29pm

In the qiime picrust2 command I am also using --p-max-nsti 2 (the default value). From the discussion it is clear that this will eliminate the junk sequences which can't be placed into the phylogenetic tree. It is now important to know how many sequences (% of ASVs) were actually used to predict the functions for a given dataset. Is there any option available by which users can get this information?
It would be more precise to report the proportion of data used for the prediction.

Ashok
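
One way to work out this proportion by hand, sketched under stated assumptions: given a per-ASV NSTI value and a per-ASV read count, compute the share of ASVs and the share of reads that pass the cut-off. This is not a built-in q2-picrust2 option. The function name `retained_fractions` is hypothetical, the data are made up, and the sketch assumes that an ASV with no NSTI value could not be placed in the tree and counts as dropped.

```python
def retained_fractions(nsti_by_asv, reads_by_asv, max_nsti=2.0):
    """Return (fraction of ASVs kept, fraction of reads kept).

    Assumes an ASV is kept when its NSTI is <= max_nsti, and that an ASV
    without an NSTI value was not placed and is treated as dropped.
    """
    kept = {
        asv
        for asv in reads_by_asv
        if asv in nsti_by_asv and nsti_by_asv[asv] <= max_nsti
    }
    total_reads = sum(reads_by_asv.values())
    kept_reads = sum(reads_by_asv[asv] for asv in kept)
    asv_fraction = len(kept) / len(reads_by_asv) if reads_by_asv else 0.0
    read_fraction = kept_reads / total_reads if total_reads else 0.0
    return asv_fraction, read_fraction


# Made-up values, for illustration only. ASV4 has no NSTI value.
sample_nsti = {"ASV1": 0.05, "ASV2": 0.30, "ASV3": 2.50}
sample_reads = {"ASV1": 600, "ASV2": 300, "ASV3": 50, "ASV4": 50}

asv_frac, read_frac = retained_fractions(sample_nsti, sample_reads, max_nsti=2.0)
print(f"{asv_frac:.0%} of ASVs and {read_frac:.0%} of reads retained")
```

Reporting both numbers is useful because a few dropped ASVs can still represent a large or a negligible share of the reads.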

colinbrislawn · December 9, 2023, 4:35pm

An off-topic reply has been split into a new topic: What it means???? I am not sure about good or bad results??!!!

Please keep replies on-topic in the future.