How are feature IDs generated in cluster-features-open-reference for non-reference OTUs?

I previously processed an amplicon sequencing dataset using qiime vsearch cluster-features-open-reference with the UNITE database as the reference, clustering sequences at 97% similarity.

After clustering, I noticed that many of the resulting features have IDs that do not correspond to any UNITE reference sequence, for example:

002e85f000a392b796df1ac132abcf3ea2ca6c57

I now have a second high-throughput sequencing dataset that I would like to compare with the first one. My question is about the interpretation and stability of these non-reference feature IDs:

  • Are these feature IDs randomly generated for each run?

  • Or, if I run cluster-features-open-reference on a different dataset but use the same reference database and parameters, will the same feature ID always represent the same underlying sequence (or OTU)?

  • In other words, are these IDs deterministic and reproducible across datasets, or are they dataset-specific?

Understanding this is important for me to decide whether feature IDs produced by open-reference clustering can be directly compared across independently processed datasets, or whether I need to explicitly merge representative sequences or reference files to ensure consistent OTU identities.

Any clarification on how these feature IDs are generated and how they should be interpreted would be greatly appreciated.

Thank you!

Hello!

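Feature IDs in QIIME 2 are hashes of the sequences, meaning that the same sequence will have the same ID across different studies. DADA2 uses MD5 hashes (32 hex characters); a 40-character ID like the one above has the length of a SHA-1 digest.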

If the representative sequences of your OTUs are exactly identical across studies, they will have the same ID. But even a single-nucleotide difference will produce a different ID.
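To illustrate, here is a minimal Python sketch of hash-based IDs. The `feature_id` helper and the toy sequences are hypothetical, and it assumes the ID is simply the hex digest of the raw sequence string; the exact normalization a given QIIME 2 plugin applies before hashing is not shown here.

```python
import hashlib


def feature_id(sequence: str, algorithm: str = "sha1") -> str:
    """Hex digest of a sequence, used as a stand-in for a feature ID.

    Hypothetical helper: assumes the ID is the digest of the plain
    sequence string, with no extra normalization.
    """
    return hashlib.new(algorithm, sequence.encode("ascii")).hexdigest()


# Toy representative sequences from two independently processed studies.
study_a = ["ACGTACGTAC", "TTGACCGTAA"]
study_b = ["ACGTACGTAC", "TTGACCGTAT"]  # second sequence differs by one base

ids_a = {feature_id(s) for s in study_a}
ids_b = {feature_id(s) for s in study_b}

# Only the exactly identical sequence yields a shared ID.
shared = ids_a & ids_b
print(len(shared))  # 1

# Digest lengths distinguish the two hash types.
print(len(feature_id(study_a[0], "md5")))   # 32
print(len(feature_id(study_a[0], "sha1")))  # 40
```

So the ID itself is deterministic; whether two studies share an ID depends only on whether they ended up with the exact same representative sequence.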

So the IDs are reproducible across studies. I am less sure about the clustering itself: highly similar sequences from different studies, which would fall into one OTU if clustered together, may end up in slightly different OTUs when each study is clustered separately. Why not use ASVs instead (the DADA2 and cutadapt settings should be identical for both datasets)?

Hope that helps.

Best,

I'm just jumping in to echo this point from @timanix - if you want the IDs to be consistent, your features should be ASVs, not clustered OTUs.

I’ll try it. Thank you very much!!!
