The reference sequences used in open-reference clustering of features.


I want to cluster my DADA2 results at an identity of 0.97. The option --i-reference-sequences asks me to provide reference sequences, but I’m confused about which type of reference sequences to use.

I think there are four types of reference sequences:

The first two types are the full-length reference sequences in the QIIME release of the Silva database, for example silva_132_99_16S.fna and silva_132_97_16S.fna.

The other two types are generated by extracting the region amplified by my own primers from the above full-length reference sequences, similar to what is done when training classifiers. For example, only the specific region (V4) of the full-length reference sequences is retained.

Which one do you recommend? I think the full-length silva_132_97_16S.fna is OK.

Merry Christmas and happy new year 2020!



Hi @nmgduan,
The four types you describe can all be used as reference sequences for open-reference OTU clustering. Which one to use depends on your goals:

The 97 and 99 indicate the similarity threshold used for clustering the raw reference sequences into OTUs. For open-reference OTU clustering of your sequences, you should probably use the same % identity for both the reference and query sequences. For example, if you plan to cluster your sequences into 97% OTUs, then the 97% reference sequences are fine, but if you plan to cluster at 99%, then the 97% reference OTUs may be lacking some detail.
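
As a rough sketch of how the matching threshold looks on the command line, assuming you use `qiime vsearch cluster-features-open-reference`: the file names `table.qza` and `rep-seqs.qza` and the output names are placeholders I am assuming for your DADA2 outputs, and the Silva FASTA has to be imported as a `FeatureData[Sequence]` artifact first. The small `run` helper is only there for illustration; it prints each command unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Placeholder file names (assumptions): adjust to your own DADA2 outputs.
TABLE=table.qza
REP_SEQS=rep-seqs.qza
REF_FASTA=silva_132_97_16S.fna
REF_QZA=silva_132_97_16S.qza
PERC_ID=0.97   # keep in line with the reference OTU threshold (the "97" set)

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Import the full-length Silva reference FASTA as a QIIME 2 artifact.
run qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path "$REF_FASTA" \
  --output-path "$REF_QZA"

# Open-reference clustering of the DADA2 features against the 97% reference.
run qiime vsearch cluster-features-open-reference \
  --i-table "$TABLE" \
  --i-sequences "$REP_SEQS" \
  --i-reference-sequences "$REF_QZA" \
  --p-perc-identity "$PERC_ID" \
  --o-clustered-table table-or-97.qza \
  --o-clustered-sequences rep-seqs-or-97.qza \
  --o-new-reference-sequences new-ref-seqs-or-97.qza
```

If you decide to cluster at 99% instead, you would change both `REF_FASTA` (to the 99% set) and `PERC_ID` (to 0.99) together.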

You do not need to trim the reference sequences used for OTU clustering, though trimming them will increase speed.
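
If you do want to trim, here is a minimal sketch using `qiime feature-classifier extract-reads`, the same step used when preparing reads for classifier training. It assumes the reference has already been imported as a `FeatureData[Sequence]` artifact (the name `silva_132_97_16S.qza` is a placeholder), and the 515F/806R V4 primers below are only an assumed example, so substitute your own primer sequences. As above, the illustrative `run` helper prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Placeholder names and example primers (assumptions): replace with your own.
REF_QZA=silva_132_97_16S.qza
REF_V4_QZA=silva_132_97_16S-V4.qza
F_PRIMER=GTGCCAGCMGCCGCGGTAA    # 515F, assumed example
R_PRIMER=GGACTACHVGGGTWTCTAAT   # 806R, assumed example

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Keep only the region of each reference sequence amplified by the primers.
run qiime feature-classifier extract-reads \
  --i-sequences "$REF_QZA" \
  --p-f-primer "$F_PRIMER" \
  --p-r-primer "$R_PRIMER" \
  --o-reads "$REF_V4_QZA"
```

The trimmed artifact can then be passed to `--i-reference-sequences` in place of the full-length one.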

I hope that helps!

Thanks for your prompt reply!
