Thank you for all the support. I am very close to wrapping up this project, but I do not have a clear picture of some of the analysis I have done. If someone from the forum could explain how this particular step works, that would be really helpful.
I have varied the "--p-perc-identity" parameter to generate different cluster files, and I am confused by the outcome. I was expecting that when the value of that parameter is 1%, the number of Feature IDs would be similar to the number of unclustered IDs. Is there a default threshold value for that command? I have read the help text, which states that the value must be between 0 and 1.
Secondly, when the value is 100%, the number of IDs drops sharply compared with the outcome at 97%. Could anyone explain these changes?
Closed-reference OTU clustering is really just finding, for each query sequence, the top reference match that has at least X% similarity to it. If no reference sequence reaches X% similarity, that query sequence is dropped.
So you see the # of OTUs drop as % id increases because fewer sequences match the reference sequences with at least that much similarity.
100% identity causes a dramatic decrease because you are telling the program to drop any sequence that does not have an exact match in the reference.
1% OTU clustering does not make any practical sense. But it does make sense why the number of OTUs ≠ the number of unclustered sequences. You are setting a very low clustering threshold, but it is still choosing the top hit as the representative OTU — many query sequences may have the same reference sequence match, so will still be clustered to the same OTU even if the percent identity is very low.
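To make the behavior above concrete, here is a minimal Python sketch of the top-hit-plus-threshold logic. It is a toy model, not the actual vsearch implementation: the `identity` function (fraction of matching positions in equal-length strings), the `closed_reference` helper, and all sequences and IDs are made up for illustration, and real tools compute identity from alignments.

```python
def identity(a, b):
    """Toy identity: fraction of positions that match (hypothetical measure)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def closed_reference(queries, references, perc_identity):
    """Assign each query to its best reference hit, or drop it.

    Returns (clusters, dropped): clusters maps reference ID -> query IDs,
    dropped lists the query IDs whose best hit is below the threshold.
    """
    clusters = {}
    dropped = []
    for qid, qseq in queries.items():
        # Only the single best-scoring reference is considered per query.
        best_ref, best_id = max(
            ((rid, identity(qseq, rseq)) for rid, rseq in references.items()),
            key=lambda hit: hit[1],
        )
        if best_id >= perc_identity:
            clusters.setdefault(best_ref, []).append(qid)
        else:
            dropped.append(qid)
    return clusters, dropped


# Made-up reference and query sequences (illustrative only).
references = {
    "refA": "ACGTACGTAC",
    "refB": "TTGGCCAATT",
    "refC": "GGGGGCCCCC",
}
queries = {
    "q1": "ACGTACGTAC",  # exact match to refA
    "q2": "ACGTACGTAA",  # 90% identical to refA
    "q3": "ACGTTCGTTA",  # 70% identical to refA
    "q4": "TTGGCCAATT",  # exact match to refB
    "q5": "TTGGCCTTAA",  # 60% identical to refB
    "q6": "GGGGGCCCCA",  # 90% identical to refC
}

for threshold in (0.01, 0.9, 1.0):
    clusters, dropped = closed_reference(queries, references, threshold)
    print(threshold, "OTUs:", len(clusters), "dropped:", len(dropped))
```

With this toy data, a threshold of 0.01 keeps all six queries but still yields only three OTUs, because several queries share the same best reference. A threshold of 0.9 drops q3 and q5, and a threshold of 1.0 keeps only the two exact matches, so refC disappears as an OTU altogether.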
Thank you for the clarification. I understand that 1% clustering does not make sense in terms of clustering logic. But if a very low value is not appropriate, how is the optimal clustering value determined?
The optimal clustering value is going to be subjective for this specific case. How do you define optimal and how do you evaluate that? In the case of closed-reference clustering prior to picrust, you do not want to set a low clustering threshold because then you will admit sequences that do not have a close match to reference sequences. That defeats the purpose of picrust, which requires relatively close matches so that a 16S sequence can be reliably linked to a full genome sequence.
But at this point this is really turning into a picrust issue — I recommend contacting the picrust developers if you have more questions about optimal clustering settings prior to picrust.