Hello, I'm working with Illumina short reads of fish environmental DNA (amplified with 12S MiFish primers). I've successfully generated a de novo phylogenetic tree of OTU clusters using qiime phylogeny fasttree, after generating a MAFFT alignment and masking noisy positions. However, I'd like to include some reference sequences, either by using the mitohelper QZA-formatted reference database or by adding some reference sequences to my rep_seqs.qza file. Is there a way to achieve this? Or is this best done outside of QIIME 2?
From my understanding, I won't be able to use q2-fragment-insertion because there isn't a SeppReferenceDatabase available or validated for the reference database I'm using, according to the qiime2 forum post here.
Then use the search box in QIIME 2 View (upper right of the visualization), or scroll through the taxonomy presented in the 12S-tax-derep-uniq.qzv file, and write the IDs you'd like to keep into a text file. We'll call it seq-ids-to-keep.txt. Use the following format:
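A minimal sketch of that layout, assuming the usual QIIME 2 metadata convention of a single ID column under a header such as `feature-id`. The accessions are placeholders, and `format_id_file` is only an illustrative helper name, not part of QIIME 2:

```python
from pathlib import Path


def format_id_file(ids, header="feature-id"):
    """Return a one-column listing: the header first, then one ID per line."""
    return "\n".join([header, *ids]) + "\n"


# Placeholder accessions; substitute the IDs you picked in QIIME 2 View.
ids_to_keep = ["ACCESSION_1", "ACCESSION_2", "ACCESSION_3"]

if __name__ == "__main__":
    # Writes the file name used in this thread into the current directory.
    Path("seq-ids-to-keep.txt").write_text(format_id_file(ids_to_keep))
```

The resulting file can be passed to any QIIME 2 action that accepts a metadata file for filtering by ID.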
Now you can simply merge your reference sequences (the amplicon region of the mitohelper database we extracted earlier) with your OTUs/ESVs (here called my-otus.qza) using the merge command:
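A rough sketch of how the filter and merge steps might be assembled, written as Python argument lists that map one-to-one onto the `qiime feature-data filter-seqs` and `qiime feature-data merge-seqs` command lines. Only seq-ids-to-keep.txt and my-otus.qza come from this thread; every other file name is an assumption made for illustration:

```python
import shlex

# Hypothetical names: 12S-ref-seqs.qza, 12S-ref-seqs-to-keep.qza, refs-and-otus.qza.
# From the thread: seq-ids-to-keep.txt and my-otus.qza.

# Step 1: keep only the reference sequences whose IDs are listed in the text file.
filter_refs = [
    "qiime", "feature-data", "filter-seqs",
    "--i-data", "12S-ref-seqs.qza",
    "--m-metadata-file", "seq-ids-to-keep.txt",
    "--o-filtered-data", "12S-ref-seqs-to-keep.qza",
]

# Step 2: merge the retained references with the OTU/ESV sequences.
merge_seqs = [
    "qiime", "feature-data", "merge-seqs",
    "--i-data", "12S-ref-seqs-to-keep.qza",
    "--i-data", "my-otus.qza",
    "--o-merged-data", "refs-and-otus.qza",
]

for cmd in (filter_refs, merge_seqs):
    # Paste the printed line into a shell, or run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

The merged artifact would then go through the same alignment, masking, and fasttree steps used for the original de novo tree.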
I've added one additional step, which is to merge the FeatureData[Taxonomy] associated with my OTUs with the FeatureData[Taxonomy] associated with the reference database:
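For illustration, the taxonomy merge might look like the sketch below, again as a Python argument list mirroring the `qiime feature-data merge-taxa` command line. The names my-otus-taxonomy.qza and merged-taxonomy.qza are assumptions; 12S-tax-derep-uniq.qza is the reference taxonomy named in this thread:

```python
import shlex

# my-otus-taxonomy.qza and merged-taxonomy.qza are hypothetical names;
# 12S-tax-derep-uniq.qza is the reference taxonomy mentioned in the thread.
merge_taxa = [
    "qiime", "feature-data", "merge-taxa",
    "--i-data", "my-otus-taxonomy.qza",
    "--i-data", "12S-tax-derep-uniq.qza",
    "--o-merged-data", "merged-taxonomy.qza",
]

# Paste the printed line into a shell, or run it with subprocess.run(merge_taxa, check=True).
print(shlex.join(merge_taxa))
```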
I couldn't find a way to filter the FeatureData[Taxonomy] artifact of the reference database (12S-tax-derep-uniq.qza) prior to merging. I did find this post on the topic.
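One possible workaround outside QIIME 2, sketched under the assumption that the taxonomy has been exported (for example with `qiime tools export`) to a tab-separated file whose first row is a header and whose first column holds the feature IDs. The function name and the example rows are made up for illustration:

```python
def filter_taxonomy_tsv(tsv_text, keep_ids):
    """Keep the header line plus every row whose first column is in keep_ids."""
    keep = set(keep_ids)
    lines = tsv_text.splitlines()
    header, rows = lines[0], lines[1:]
    kept = [row for row in rows if row.split("\t", 1)[0] in keep]
    return "\n".join([header, *kept]) + "\n"


# Made-up example content in the exported taxonomy.tsv layout.
example = (
    "Feature ID\tTaxon\n"
    "REF_1\tk__A; p__B\n"
    "REF_2\tk__A; p__C\n"
    "REF_3\tk__A; p__D\n"
)

print(filter_taxonomy_tsv(example, ["REF_1", "REF_3"]))
```

The filtered file could then be re-imported as FeatureData[Taxonomy] before merging.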
However, in the end it doesn't seem to matter, because I am importing the tree, feature table, and taxonomy into a phyloseq object in R to draw a tree with the plot_tree function. According to the phyloseq documentation: "OTUs and samples are included in the combined object only if they are present in all components. For instance, extra “leaves” on the tree will be trimmed off when that tree is added to a phyloseq object."
If you have any suggestions on how to circumvent this problem, I'd love to know. The end goal is to draw a phylogenetic tree of my OTUs that contains some reference sequences (with taxonomic labels or accession numbers for reference sequences).
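The pruning rule quoted above can be modelled in a few lines. This is only a toy model in Python, not phyloseq code: the IDs are invented, and the padding step at the end is an untested assumption about one way to keep reference tips, namely giving each reference an all-zero row in the feature table before building the phyloseq object:

```python
def shared_ids(tree_tips, table_ids, taxonomy_ids):
    """IDs present in every component (a toy model of the quoted pruning rule)."""
    return set(tree_tips) & set(table_ids) & set(taxonomy_ids)


# Invented IDs: two OTUs plus reference sequences.
tree_tips = {"OTU_1", "OTU_2", "REF_1", "REF_2"}
taxonomy_ids = {"OTU_1", "OTU_2", "REF_1", "REF_2", "REF_3"}
table_ids = {"OTU_1", "OTU_2"}  # references never appear in the feature table

# Reference tips are dropped because the feature table lacks them;
# REF_3 is dropped because it is only in the taxonomy.
print(sorted(shared_ids(tree_tips, table_ids, taxonomy_ids)))

# Hypothetical workaround: add the reference tips to the table (as all-zero rows).
padded_table_ids = table_ids | {tip for tip in tree_tips if tip.startswith("REF_")}
print(sorted(shared_ids(tree_tips, padded_table_ids, taxonomy_ids)))
```

This also shows why leaving extra reference entries in the merged taxonomy is harmless: anything absent from the tree or the table falls out of the intersection.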
The RESCRIPt plugin (not currently installed as part of QIIME 2, but installation instructions are available on the forum) has an action to filter a taxonomy based on a list of IDs or a search term. See this tutorial:
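As a rough idea of what ID-based filtering with that action might look like (this is not taken from the tutorial itself), here is a Python argument list mirroring the `qiime rescript filter-taxa` command line. The output name 12S-tax-to-keep.qza is an assumption; the other two file names come from this thread. For filtering by search term, check `qiime rescript filter-taxa --help` in your installed version:

```python
import shlex

# 12S-tax-to-keep.qza is a hypothetical output name.
# 12S-tax-derep-uniq.qza and seq-ids-to-keep.txt are named earlier in the thread.
filter_taxa = [
    "qiime", "rescript", "filter-taxa",
    "--i-taxonomy", "12S-tax-derep-uniq.qza",
    "--m-ids-to-keep-file", "seq-ids-to-keep.txt",
    "--o-filtered-taxonomy", "12S-tax-to-keep.qza",
]

# Paste the printed line into a shell, or run it with subprocess.run(filter_taxa, check=True).
print(shlex.join(filter_taxa))
```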
RESCRIPt, by the way, could also be used to programmatically download reference sequences and taxonomies directly from NCBI based on an Entrez search query, so that could also be an option if you only want to grab a limited number of accessions rather than all of MitoFish.
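A hedged sketch of such a download, as a Python argument list mirroring the `qiime rescript get-ncbi-data` command line. The Entrez query and both output names are assumptions chosen purely for illustration; a query this broad could return a large number of records, so narrow it to the taxa or accessions you actually need:

```python
import shlex

# Illustrative Entrez query only; adjust the search terms to your marker and taxa.
ncbi_query = "12S[Title] AND Actinopterygii[Organism]"

# ncbi-12S-seqs.qza and ncbi-12S-tax.qza are hypothetical output names.
get_ncbi = [
    "qiime", "rescript", "get-ncbi-data",
    "--p-query", ncbi_query,
    "--o-sequences", "ncbi-12S-seqs.qza",
    "--o-taxonomy", "ncbi-12S-tax.qza",
]

# shlex.join quotes the query so the printed line is safe to paste into a shell.
print(shlex.join(get_ncbi))
```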
And I agree with your point above that "there should be programmatic ways of accomplishing the same thing." For this particular reference database, I've used a Python tool called mitohelper for this purpose (functions get record and get alignment).
I didn't know about these functionalities of the RESCRIPt plugin! Thank you for this information. This looks like another viable option to obtain references & associated taxonomies.