I previously processed an amplicon sequencing dataset using qiime vsearch cluster-features-open-reference with the UNITE database as the reference, clustering sequences at 97% similarity.
After clustering, I noticed that many of the resulting features have IDs that do not correspond to any UNITE reference sequence, for example:
002e85f000a392b796df1ac132abcf3ea2ca6c57
I now have a second high-throughput sequencing dataset that I would like to compare with the first one. My question is about the interpretation and stability of these non-reference feature IDs:
- Are these feature IDs randomly generated for each run?
- Or, if I run cluster-features-open-reference on a different dataset but use the same reference database and parameters, will the same feature ID always represent the same underlying sequence (or OTU)?
- In other words, are these IDs deterministic and reproducible across datasets, or are they dataset-specific?
Knowing this will help me decide whether feature IDs produced by open-reference clustering can be compared directly across independently processed datasets, or whether I need to explicitly merge the representative sequences or reference files to keep OTU identities consistent.
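One way I could test this myself is sketched below. The example ID above is 40 hexadecimal characters long, which is the length of a SHA-1 digest, so the assumption being checked is that each non-reference ID is the SHA-1 hash of that feature's representative sequence. I have not confirmed this for my data. The function names and the `rep_seqs` mapping (feature ID to sequence string, e.g. parsed from an exported FASTA file) are hypothetical and not part of QIIME 2.

```python
import hashlib
import re

# A SHA-1 digest rendered as hex is exactly 40 characters from 0-9a-f.
SHA1_HEX = re.compile(r"[0-9a-f]{40}")


def looks_like_sha1(feature_id: str) -> bool:
    """Return True if the ID has the shape of a hex-encoded SHA-1 digest."""
    return SHA1_HEX.fullmatch(feature_id) is not None


def sequence_sha1(sequence: str) -> str:
    """Hash the sequence string exactly as given (case-sensitive)."""
    return hashlib.sha1(sequence.encode("ascii")).hexdigest()


def check_ids(rep_seqs: dict) -> dict:
    """Map each feature ID to whether it equals the SHA-1 of its sequence.

    rep_seqs is a hypothetical {feature_id: sequence} mapping.
    Assumption under test: ID == sha1(sequence).
    """
    return {fid: sequence_sha1(seq) == fid for fid, seq in rep_seqs.items()}
```

If every non-reference ID passes this check in both datasets, the IDs would depend only on sequence content, and identical IDs would imply identical representative sequences. If the check fails, the IDs are produced some other way and this sketch says nothing about their stability.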
Any clarification on how these feature IDs are generated and how they should be interpreted would be greatly appreciated.
Thank you!