Trying to link the OTU Numbers from feature table (biom file) with the Feature ID (#q2:types)

Hi QIIME 2 Forum members,

Thank you for all the support. I am very close to wrapping up this project, but I do not have a clear picture of some of the analysis I have done. If someone from the forum could explain how this particular step works, that would be really helpful.

So here is the issue:

  BIOM.Files       rows columns
  Unclustered      3765      72
  1% Clustered     2485      72
  85% Clustered    2474      72
  90% Clustered    2446      72
  95% Clustered    2320      72
  97% Clustered    2166      72
  100% Clustered    623      72

I used the following command to generate the clustered files:

# then perform clustering for PICRUSt

qiime vsearch cluster-features-closed-reference \
  --i-sequences Final_Aug_2018/MiSeq_data_Aug18_V1_rep_seq_min_8_freq_min_3_samples.qza \
  --i-table Final_Aug_2018/MiSeq_data_Aug18_V1_table_min_8_freq_min_3_samples.qza \
  --i-reference-sequences gg_13_5_otu_99.qza \
  --p-perc-identity 1 \
  --p-threads 0 \
  --output-dir Final_Aug_2018/PICRUST_table_min_8_freq_min_3_samples

I tuned the "--p-perc-identity" parameter to generate the different clustered files, and I am confused by the outcome. I expected that with the parameter at 1%, the number of Feature IDs would be similar to the number of unclustered IDs. Are there any default threshold values in that command? The help text states that the value must be between 0 and 1.

Secondly, when the value is 100%, the number of IDs drops sharply compared with the result at 97%. Could anyone explain these changes?

Closed-reference OTU clustering is really just finding, for each query sequence, the top reference match that has at least X% similarity to it. If no reference sequence is found with at least X% similarity, that query sequence is dropped.

So you see the number of OTUs drop as the percent identity increases, because fewer query sequences match a reference sequence with at least that much similarity.

100% identity causes a dramatic decrease because you are telling the program to drop any sequence that does not have an exact match in the reference.

1% OTU clustering does not make any practical sense. But it does make sense why the number of OTUs ≠ the number of unclustered sequences. You are setting a very low clustering threshold, but it is still choosing the top hit as the representative OTU. Many query sequences may have the same reference sequence as their top match, so they will still be clustered to the same OTU even if the percent identity is very low.
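
To make the two effects above concrete, here is a minimal toy sketch in Python. It is not how vsearch works internally: the sequences, the IDs, and the position-by-position `identity` function are all made up for illustration (vsearch uses real alignments and a heuristic search), and `closed_reference` and `summarize` are hypothetical helper names. It only mimics the logic described above: take the top reference hit for each query, and drop the query if that hit falls below the threshold.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions in two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(queries: dict, references: dict, perc_identity: float) -> dict:
    """Map each query ID to its top reference hit; drop queries below the threshold."""
    assignments = {}
    for qid, qseq in queries.items():
        # Top hit = reference with the highest identity to this query.
        best_ref, best_id = max(
            ((rid, identity(qseq, rseq)) for rid, rseq in references.items()),
            key=lambda pair: pair[1],
        )
        if best_id >= perc_identity:
            assignments[qid] = best_ref
    return assignments


def summarize(assignments: dict) -> tuple:
    """Return (number of distinct OTUs, number of query sequences kept)."""
    return len(set(assignments.values())), len(assignments)


# Hypothetical toy data: 3 reference sequences, 8 unclustered query sequences.
REFERENCES = {"refA": "AAAAAAAAAA", "refB": "CCCCCCCCCC", "refC": "GGGGGGGGGG"}
QUERIES = {
    "q1": "AAAAAAAAAA",  # exact match to refA
    "q2": "AAAAAAAAAT",  # 90% to refA
    "q3": "AAAAAAAATT",  # 80% to refA
    "q4": "CCCCCCCCCC",  # exact match to refB
    "q5": "CCCCCCCCCG",  # 90% to refB
    "q6": "GGGGGGGGTT",  # 80% to refC
    "q7": "GGGGGGTTTT",  # 60% to refC
    "q8": "TTTTTTTTTT",  # no match to any reference
}

for threshold in (0.01, 0.85, 0.97, 1.0):
    n_otus, n_kept = summarize(closed_reference(QUERIES, REFERENCES, threshold))
    print(f"perc_identity={threshold}: {n_otus} OTUs from {n_kept} of {len(QUERIES)} queries")
```

In this toy data, even at a threshold of 0.01 the 8 query sequences collapse to 3 OTUs, because several queries share the same top hit. Raising the threshold then drops queries, and eventually whole OTUs, that have no close enough reference match.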

I hope that helps!

Hi @Nicholas_Bokulich

Thank you for the clarification. I understand that 1% clustering does not make sense in terms of clustering logic. But if I am setting a very low clustering value, how is the optimal clustering value determined?

The optimal clustering value is going to be subjective for this specific case. How do you define optimal, and how do you evaluate that? In the case of closed-reference clustering prior to PICRUSt, you do not want to set a low clustering threshold, because then you will admit sequences that do not have a close match to the reference sequences. That defeats the purpose of PICRUSt, which requires relatively close matches so that a 16S sequence can be reliably linked to a full genome sequence.

But at this point this is really turning into a PICRUSt issue. I recommend contacting the PICRUSt developers if you have more questions about optimal clustering settings prior to PICRUSt.
