Using Q2 for functional gene analysis

Hi @Analissa_Sarno,
Thanks for posting your question!

You can follow the same steps that you use for your query sequences, i.e., dereplicate with vsearch dereplicate-sequences. The outputs from that action can be input to cluster-features-de-novo.
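To make concrete what dereplication does, here is a minimal Python sketch. It is a toy illustration of the idea, not how vsearch is implemented: identical sequences are collapsed to a single representative and their counts are summed. The function name, the example reads, and the choice to ignore case are all assumptions made for this example.

```python
from collections import Counter


def dereplicate(seqs):
    """Collapse identical sequences; return {sequence: count}, most abundant first."""
    # Upper-casing is an assumption of this toy so that "acgt" and "ACGT" match.
    counts = Counter(s.upper() for s in seqs)
    return dict(counts.most_common())


# Hypothetical reads, for illustration only.
reads = ["ACGT", "acgt", "ACGA", "ACGT"]
unique = dereplicate(reads)
print(unique)
```

Each unique sequence that survives this step is what would then be passed on to clustering, if you decide to cluster at all.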

Do you really want to use OTU picking for any of these, though? It seems to me that if you have a small and well-curated reference database, you would probably just want to dereplicate the sequences. Even then, any kind of dereplication/clustering will create issues, because you will need to decide how to handle the taxonomic/other annotations (e.g., clustered seqs may have conflicting annotations). The simplest thing to do is probably just to use the reference database as-is for taxonomy classification, without any clustering. You can get an idea of the difficulty involved in reformatting a reference database in this forum post, and such steps are not really supported in QIIME 2 (currently, at least).
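As a rough illustration of the annotation problem, the sketch below groups reference sequences by identical sequence and reports any group that carries more than one distinct annotation. This is a hypothetical helper written for this post, not a QIIME 2 function, and the IDs, sequences, and taxonomy strings are invented.

```python
from collections import defaultdict


def find_conflicts(seqs, taxonomy):
    """Return {sequence: sorted annotations} for sequences with >1 distinct annotation."""
    by_seq = defaultdict(set)
    for seq_id, seq in seqs.items():
        by_seq[seq].add(taxonomy[seq_id])
    return {s: sorted(t) for s, t in by_seq.items() if len(t) > 1}


# Invented reference data: r1 and r2 are identical but annotated differently.
ref_seqs = {"r1": "ACGT", "r2": "ACGT", "r3": "TTGA"}
ref_tax = {"r1": "g__A; s__x", "r2": "g__A; s__y", "r3": "g__B; s__z"}
conflicts = find_conflicts(ref_seqs, ref_tax)
print(conflicts)
```

Any sequence reported here would need a decision (keep one annotation, truncate to the shared rank, etc.) before a dereplicated database could be used, which is part of why using the database as-is is simpler.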

Similarly, do you want to use OTU clustering for your query sequences? Since you are working with functional genes, the OTU cluster thresholds don't really hold much meaning ("operational taxonomic units") and you are probably interested in SNPs, so you should probably be working with actual sequence variants. The dereplicated sequences are probably enough, since you are using other QC methods (I assume these are specific to these functional genes, so the QC-esque properties of OTU picking are probably unnecessary).

If you want to go further with this and are concerned that, e.g., residual sequence error may be present in your query seqs, I would recommend giving dada2 or deblur a try for resolving ASVs.

Sounds like you have already imported your reference database into FeatureData[Taxonomy] and FeatureData[Sequence] artifacts. These can be used directly for any of the classification methods in q2-feature-classifier. E.g., you could train a new feature classifier to classify sequences with classify-sklearn, or you could use classify-consensus-blast or classify-consensus-vsearch for alignment-based methods.
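For example, an alignment-based run could be assembled along these lines. This is only a sketch: the `.qza` file names and the output directory are placeholders I made up, and the exact set of options and outputs can differ between QIIME 2 releases, so check `qiime feature-classifier classify-consensus-vsearch --help` in your own environment. The function only builds the command line and does not run it.

```python
import shlex


def consensus_vsearch_cmd(query, ref_reads, ref_taxonomy, out_dir):
    """Build (but do not run) a classify-consensus-vsearch command line."""
    return [
        "qiime", "feature-classifier", "classify-consensus-vsearch",
        "--i-query", query,                      # FeatureData[Sequence] to classify
        "--i-reference-reads", ref_reads,        # reference FeatureData[Sequence]
        "--i-reference-taxonomy", ref_taxonomy,  # reference FeatureData[Taxonomy]
        "--output-dir", out_dir,                 # let QIIME 2 name the outputs
    ]


# Placeholder file names, for illustration only.
cmd = consensus_vsearch_cmd(
    "rep-seqs.qza", "ref-seqs.qza", "ref-taxonomy.qza", "vsearch-classification"
)
print(shlex.join(cmd))
```

In an environment with QIIME 2 installed, the printed command can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`.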

Any of these should work for your sequences, but the alignment-based methods are probably more commonly used for functional gene classifications. It's worth giving both approaches a shot and comparing the results (I usually warn against these comparisons since you don't really know which is "best" unless you are testing on datasets with known compositions, but with that caveat it is worth taking a look! :wink: ).

I hope that helps!