I want to cluster the results from DADA2 at an identity of 0.97, but the --i-reference-sequences option asks me to provide reference sequences. I’m confused about which type of reference sequences to use.
I think there are four types of reference sequences:
The first two types are the full-length reference sequences in the QIIME version of the SILVA database, for example silva_132_99_16S.fna and silva_132_97_16S.fna.
The other two types are generated by extracting a region from the above full-length reference sequences with my own primers, similar to what is done when training classifiers, for example keeping only the V4 region of each full-length reference sequence.
Which one do you recommend? I think the full-length silva_132_97_16S.fna is OK.
The four types of reference sequences you describe are all indeed reference sequences that you can use for open-ref OTU clustering. It all depends on your goals:
The 97 and 99 indicate the similarity threshold used for clustering the raw reference sequences into OTUs. For open-ref OTU clustering of your sequences, you should probably use the same % id for both the reference and query sequences. That is, if you plan to cluster your sequences into 97% OTUs, then the 97% ref seqs are fine, but if you plan to cluster at 99%, then the 97% reference OTUs may be lacking some detail.
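As a rough sketch of what matching the two thresholds looks like in practice, something like the following should work for 97% clustering. The file names table.qza, rep-seqs.qza and all output names are placeholders I made up for your DADA2 outputs, and the small run helper is only there so the commands are printed instead of executed unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: open-reference clustering of DADA2 output at 97% identity.
# All .qza file names are placeholders (assumptions), not from the original post.
set -eu

PERC_ID="0.97"                     # should match the 97% SILVA reference below
REF_FASTA="silva_132_97_16S.fna"   # full-length 97% reference from the question

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Import the reference FASTA as a FeatureData[Sequence] artifact.
run qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path "$REF_FASTA" \
  --output-path silva_132_97_16S.qza

# Cluster the DADA2 table and representative sequences against the reference.
run qiime vsearch cluster-features-open-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences silva_132_97_16S.qza \
  --p-perc-identity "$PERC_ID" \
  --o-clustered-table table-or-97.qza \
  --o-clustered-sequences rep-seqs-or-97.qza \
  --o-new-reference-sequences new-ref-seqs-or-97.qza
```

If you decide to cluster at 99% instead, you would swap in silva_132_99_16S.fna and set PERC_ID to 0.99 so the two stay consistent.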
You do not need to trim the reference sequences used for OTU clustering, though trimming them will speed up clustering.
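If you do want to trim, a minimal sketch would reuse the same extract-reads step that is used when training classifiers. The primers below are the commonly used 515F/806R V4 pair and are only an illustration; substitute your own primers. The .qza file names are placeholders, and the run helper just prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: optionally trim the reference to the amplified region before clustering.
# Primers and file names are illustrative assumptions; use your own.
set -eu

F_PRIMER="GTGCCAGCMGCCGCGGTAA"    # 515F (example V4 forward primer)
R_PRIMER="GGACTACHVGGGTWTCTAAT"   # 806R (example V4 reverse primer)

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Extract the primer-bounded region from the imported full-length reference.
run qiime feature-classifier extract-reads \
  --i-sequences silva_132_97_16S.qza \
  --p-f-primer "$F_PRIMER" \
  --p-r-primer "$R_PRIMER" \
  --o-reads silva_132_97_16S_V4.qza
```

The trimmed artifact can then be passed to --i-reference-sequences in place of the full-length one.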