Trying to link the OTU Numbers from feature table (biom file) with the Feature ID (#q2:types)

dheeman00 · September 21, 2018, 8:29pm

Hi QIIME 2 Forum members,

Thank you for all the support. I am very close to wrapping up this project, but I do not have a clear picture of some of the analysis I have done. If someone from the forum could explain how this particular step works, that would be really helpful.

So here is the issue:

  BIOM.Files       rows columns
  Unclustered      3765      72
  1% Clustered     2485      72
  85% Clustered    2474      72
  90% Clustered    2446      72
  95% Clustered    2320      72
  97% Clustered    2166      72
  100% Clustered    623      72

I had used the following command to generate the clustering files:

# then perform clustering for PiCrust

qiime vsearch cluster-features-closed-reference \
  --i-sequences Final_Aug_2018/MiSeq_data_Aug18_V1_rep_seq_min_8_freq_min_3_samples.qza \
  --i-table Final_Aug_2018/MiSeq_data_Aug18_V1_table_min_8_freq_min_3_samples.qza \
  --i-reference-sequences gg_13_5_otu_99.qza \
  --p-perc-identity 1 \
  --p-threads 0 \
  --output-dir Final_Aug_2018/PICRUST_table_min_8_freq_min_3_samples

I varied the "--p-perc-identity" parameter to generate the different clustered files, and I am confused by the outcome. I expected that with the parameter at 1%, the number of Feature IDs would be close to the number of unclustered IDs. Does that command apply any default threshold? I have read the help text, and it states that the value must be between 0 and 1.

Secondly, when the value is 100%, the number of IDs drops sharply compared with the 97% result. Could anyone explain these changes?

Nicholas_Bokulich · September 21, 2018, 9:19pm

Closed-reference OTU clustering is really just finding, for each query sequence, the top reference match with at least X% similarity. If no reference sequence is within X% similarity, that query sequence is dropped.

So you see the # of OTUs drop as % id increases because fewer sequences match the reference sequences with at least that much similarity.

100% identity causes a dramatic decrease because you are telling the program to drop any sequence that does not have an exact match in the reference.

1% OTU clustering does not make any practical sense. But it does make sense why the number of OTUs ≠ the number of unclustered sequences. You are setting a very low clustering threshold, but it is still choosing the top hit as the representative OTU — many query sequences may have the same reference sequence match, so will still be clustered to the same OTU even if the percent identity is very low.
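To make that concrete, here is a toy Python sketch of the logic described above. It is not how vsearch is implemented: the sequences, the IDs, and the `identity` and `closed_reference` helpers are made up for illustration, and identity is simplified to the fraction of matching positions between equal-length strings, whereas vsearch computes identity from pairwise alignments.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences (toy metric)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def closed_reference(queries, references, perc_identity):
    """Map each query to its best reference hit; drop it if that hit is below the threshold."""
    assignments = {}
    for qid, qseq in queries.items():
        # Pick the single top-scoring reference for this query.
        best_ref, best_id = max(
            ((rid, identity(qseq, rseq)) for rid, rseq in references.items()),
            key=lambda hit: hit[1],
        )
        if best_id >= perc_identity:
            assignments[qid] = best_ref
    return assignments


# Hypothetical reference database and query features.
references = {
    "ref_A": "ACGTACGTAC",
    "ref_B": "TTTTGGGGCC",
    "ref_C": "GGGGGGGGGG",
}
queries = {
    "q1": "ACGTACGTAC",  # exact match to ref_A
    "q2": "ACGTACGTAA",  # 90% identical to ref_A
    "q3": "ACGTACGAAA",  # 80% identical to ref_A
    "q4": "TTTTGGGGCA",  # 90% identical to ref_B
    "q5": "TTTTGGCCAA",  # 60% identical to ref_B
    "q6": "GGGGGAAAAA",  # 50% identical to ref_C
}

for threshold in (0.01, 0.85, 0.97, 1.0):
    hits = closed_reference(queries, references, threshold)
    print(f"perc_identity={threshold}: "
          f"{len(hits)} queries kept, {len(set(hits.values()))} OTUs")
```

With these toy sequences, the lowest threshold keeps all six queries but still collapses them into three OTUs, because several queries share the same top hit. The 100% threshold keeps only the one query that matches a reference exactly.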

I hope that helps!

dheeman00 · September 22, 2018, 2:27pm

Hi @Nicholas_Bokulich,

Thank you for the clarification. I understand that 1% clustering does not make sense in terms of clustering logic. But if a very low value is not appropriate, how is the optimal clustering value determined?

Nicholas_Bokulich · September 24, 2018, 4:12pm

The optimal clustering value is going to be subjective for this specific case. How do you define optimal and how do you evaluate that? In the case of closed-reference clustering prior to picrust, you do not want to set a low clustering threshold because then you will admit sequences that do not have a close match to reference sequences. That defeats the purpose of picrust, which requires relatively close matches so that a 16S sequence can be reliably linked to a full genome sequence.

But at this point this is really turning into a picrust issue — I recommend contacting the picrust developers if you have more questions about optimal clustering settings prior to picrust.
