Using Q2 for functional gene analysis

Analissa_Sarno · January 3, 2018, 3:37pm

Hi Q2 team!

I’m trying to use a Q2 workflow to analyze functional gene amplicon sequencing data (Illumina MiSeq reads). This is what I have put together so far based on related posts (thank you for the detailed postings):

1. I merged my reads in CASPER and used an in-house script to demultiplex, trim, and quality filter.
2. Import combined_seqs.fna with qiime tools import.
3. Dereplicate using qiime vsearch dereplicate-sequences.
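
For readers following along, the import and dereplication steps above can be sketched as the command lines below. This is only a sketch: the output file names (seqs.qza, table.qza, rep-seqs.qza) are hypothetical, the flags are the ones I know from the QIIME 2 OTU clustering tutorial and may differ between releases, and the import step assumes the FASTA headers use QIIME 1 style sample labels. The Python wrapper only assembles and prints the commands; it does not run QIIME 2.

```python
import shlex


def import_cmd(fasta, out_qza):
    """Command line for importing demultiplexed, quality-filtered reads."""
    return [
        "qiime", "tools", "import",
        "--input-path", fasta,
        "--output-path", out_qza,
        "--type", "SampleData[Sequences]",
    ]


def dereplicate_cmd(seqs_qza, table_qza, rep_seqs_qza):
    """Command line for collapsing identical reads into unique features."""
    return [
        "qiime", "vsearch", "dereplicate-sequences",
        "--i-sequences", seqs_qza,
        "--o-dereplicated-table", table_qza,
        "--o-dereplicated-sequences", rep_seqs_qza,
    ]


# Only combined_seqs.fna comes from the post; the .qza names are placeholders.
commands = [
    import_cmd("combined_seqs.fna", "seqs.qza"),
    dereplicate_cmd("seqs.qza", "table.qza", "rep-seqs.qza"),
]

if __name__ == "__main__":
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them against `qiime vsearch dereplicate-sequences --help` for the installed release before running anything.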

Here is where the problem begins:
There is a reference database (small, but it has been cited several times in the literature in the past year). How do I format the database for use in open-reference clustering? Currently the database is in two files, a FeatureData[Taxonomy] and a FeatureData[Sequence].

I’m also starting to think about the classification steps.
Thank you for your help!

Nicholas_Bokulich · January 4, 2018, 4:39pm

Hi @Analissa_Sarno,
Thanks for posting your question!

You can follow the same steps that you use for your query sequences, i.e., dereplicate with vsearch dereplicate-sequences. The outputs from that action can be input to cluster-features-de-novo.

Do you really want to use OTU picking for any of these, though? It seems to me that if you have a small and well-curated reference database, you would at most want to dereplicate the sequences. Even then, any kind of dereplication or clustering will create issues, because you will need to decide how to handle the taxonomic and other annotations (e.g., clustered seqs may have conflicting annotations). The simplest thing to do is probably to use the reference database as-is for taxonomy classification, without any clustering. You can get an idea of the difficulty involved in reformatting a reference database in this forum post, and such steps are not really supported in QIIME 2 (currently, at least).
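
To make the annotation problem concrete, here is a minimal sketch of one way to reconcile conflicting taxonomy within a cluster: keep only the ranks that every member agrees on. This is not something QIIME 2 does for you, and the reference IDs, cluster name, and lineages below are made up for illustration.

```python
def lowest_common_taxonomy(lineages):
    """Return the shared leading ranks of several semicolon-delimited lineages."""
    split = [[rank.strip() for rank in lineage.split(";")] for lineage in lineages]
    common = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # first rank where the cluster members disagree
        common.append(ranks[0])
    return "; ".join(common) if common else "Unassigned"


# Hypothetical reference annotations and a hypothetical cluster.
taxonomy = {
    "ref1": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
    "ref2": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "ref3": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
}
clusters = {"cluster_A": ["ref1", "ref2", "ref3"]}

consensus = {
    cluster_id: lowest_common_taxonomy([taxonomy[m] for m in members])
    for cluster_id, members in clusters.items()
}

if __name__ == "__main__":
    print(consensus)
```

The class-level annotation is lost for the whole cluster as soon as one member disagrees, which is the kind of information loss that using the database unclustered avoids.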

Similarly, do you want to use OTU clustering for your query sequences? Since you are working with functional genes, the OTU cluster thresholds don't really hold much meaning ("operational taxonomic units") and you are probably interested in SNPs... so you should probably be running with actual sequence variants. The dereplicated sequences are probably enough, since you are using other QC methods (and I assume these are specific to these functional genes so the QC-esque properties of OTU picking are probably unnecessary).
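
As a toy illustration of that point (plain Python, not vsearch itself): two variants that differ by a single SNP survive dereplication as separate features, but at 99% identity they would fall into the same OTU under a 97% threshold. The sequences, read counts, and the 97% threshold are assumptions chosen for the example.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads, keeping a count for each unique sequence."""
    return Counter(reads)


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy function only handles equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up 100 nt variants that differ at one position (G -> T at index 50).
variant_a = "ACGT" * 25
variant_b = variant_a[:50] + "T" + variant_a[51:]

reads = [variant_a] * 3 + [variant_b] * 2
unique = dereplicate(reads)
pairwise = identity(variant_a, variant_b)
would_merge_at_97 = pairwise >= 0.97

if __name__ == "__main__":
    print(len(unique), "unique sequences after dereplication")
    print("pairwise identity:", pairwise)
    print("merged by 97% clustering:", would_merge_at_97)
```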

If you want to go further with this and are concerned that, e.g., residual sequence error may be present in your query seqs, I would recommend giving DADA2 or Deblur a try for resolving ASVs.

Sounds like you have already imported your reference database into FeatureData[Taxonomy] and FeatureData[Sequence] artifacts. These can be used directly for any of the classification methods in q2-feature-classifier. E.g., you could train a new feature classifier to classify sequences with classify-sklearn, or you could use classify-consensus-blast or classify-consensus-vsearch for alignment-based methods.
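
A rough sketch of what those options look like on the command line, assembled in Python so they can be printed and checked first. All .qza file names and output directory names are hypothetical, and the flags are the ones I know from q2-feature-classifier; outputs have changed between releases, so the consensus methods below use --output-dir rather than naming each output. Nothing here runs QIIME 2.

```python
import shlex

# Hypothetical artifact names for the reference database and query sequences.
REF_SEQS = "ref-seqs.qza"      # FeatureData[Sequence]
REF_TAX = "ref-taxonomy.qza"   # FeatureData[Taxonomy]
QUERY = "rep-seqs.qza"         # dereplicated query sequences

BASE = ["qiime", "feature-classifier"]


def sklearn_cmds():
    """Train a naive Bayes classifier on the reference, then classify the query."""
    fit = BASE + [
        "fit-classifier-naive-bayes",
        "--i-reference-reads", REF_SEQS,
        "--i-reference-taxonomy", REF_TAX,
        "--o-classifier", "classifier.qza",
    ]
    classify = BASE + [
        "classify-sklearn",
        "--i-classifier", "classifier.qza",
        "--i-reads", QUERY,
        "--o-classification", "taxonomy-sklearn.qza",
    ]
    return [fit, classify]


def consensus_cmds(method):
    """Alignment-based consensus classification with 'blast' or 'vsearch'."""
    if method not in ("blast", "vsearch"):
        raise ValueError("method must be 'blast' or 'vsearch'")
    return [BASE + [
        f"classify-consensus-{method}",
        "--i-query", QUERY,
        "--i-reference-reads", REF_SEQS,
        "--i-reference-taxonomy", REF_TAX,
        "--output-dir", f"consensus-{method}",
    ]]


commands = {
    "sklearn": sklearn_cmds(),
    "blast": consensus_cmds("blast"),
    "vsearch": consensus_cmds("vsearch"),
}

if __name__ == "__main__":
    for name, cmds in commands.items():
        for cmd in cmds:
            print(shlex.join(cmd))
```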

Any of these should work for your sequences, but the alignment-based methods are probably more commonly used for functional gene classifications. It's worth giving both a shot and comparing the results (I usually warn against these comparisons, since you don't really know which is "best" unless you are testing on datasets with known compositions, but with that caveat it is worth taking a look!).
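
One simple way to do that comparison, once the two sets of assignments have been exported to feature-ID-to-taxonomy tables, is to count how often they agree and list where they differ. This is a minimal sketch with made-up feature IDs and lineages; it measures agreement between methods only and says nothing about which one is correct.

```python
def compare_assignments(a, b):
    """Return (fraction of shared features with identical labels, disagreements)."""
    shared = sorted(set(a) & set(b))
    disagreements = {
        fid: (a[fid], b[fid]) for fid in shared if a[fid] != b[fid]
    }
    if not shared:
        return 0.0, disagreements
    agreement = (len(shared) - len(disagreements)) / len(shared)
    return agreement, disagreements


# Hypothetical assignments from two classification methods.
sklearn_tax = {
    "feat1": "k__Bacteria; p__Proteobacteria",
    "feat2": "k__Bacteria; p__Firmicutes",
    "feat3": "k__Bacteria; p__Cyanobacteria",
    "feat4": "k__Bacteria",
}
vsearch_tax = {
    "feat1": "k__Bacteria; p__Proteobacteria",
    "feat2": "k__Bacteria; p__Firmicutes",
    "feat3": "k__Bacteria; p__Cyanobacteria",
    "feat4": "k__Bacteria; p__Actinobacteria",
}

agreement, disagreements = compare_assignments(sklearn_tax, vsearch_tax)

if __name__ == "__main__":
    print(f"agreement: {agreement:.2f}")
    for fid, (left, right) in disagreements.items():
        print(fid, "|", left, "|", right)
```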

I hope that helps!
