Will qiime2 support functional gene analysis in the future?

sixvable · August 18, 2019, 1:45pm

Will qiime2 support functional gene analysis in the future?
I could not find any useful tutorial for functional gene analysis.

And the Fungene Pipeline is not really useful because it cannot be installed easily.

colinbrislawn · August 19, 2019, 2:05pm

Qiime2 is not specific to 16S! It can work on any gene targeted with PCR right now, including functional genes.

Here's Nick's summary of the process:

So basically, you can follow the same pipeline as for 16S amplicons, and just use a custom database at the taxonomy assignment step.

Colin

mcreyno2 · August 19, 2019, 3:42pm

Thank you @sixvable for initiating a good discussion!

Dear @colinbrislawn,

I am also a user with functional gene datasets. I am still a novice with these types of analyses compared to 16S, so please forgive me if some of what I mention is not 100% accurate/sound. Tutorials for this kind of work are not as plentiful or as useful as the q2 docs/forum!

...

Thanks for directing both @sixvable and me to Nick B.'s answer regarding running functional gene datasets on the qiime2 platform. The idea of running against your own database does indeed make sense; however, I have some further points to clarify below...

I'd be interested in yours or @Nicholas_Bokulich's thoughts on whether a given functional gene's (e.g. mcrA) manually curated database should be in nucleotide or amino acid format? (I'm interested in your response from a "qiime2 framework perspective", primarily).

I think the answer is clear... ideally one would classify functional gene reads at the amino acid level, since tools that correct frameshifts caused by sequencing errors (Framebot) are often used upstream, and these convert the seqs to amino acids. Yet, last I checked, I do not believe Qiime2 is capable of reading amino acid codes. This is kind of unfortunate since I'd be very interested in using the naive-bayes classifier, for example.

It seems to me that one way to achieve this goal in Qiime2 would be to only classify at DNA level, after first correcting for frameshifts at the amino acid level and then backtranslating to nucleotides outside of qiime2 environment. But this is taking "one step forward and two steps back", which is why I have stuck with analyzing functional genes outside of Qiime2 (e.g. in RDP's Fungene Pipeline).

Hopefully this can initiate some good discussion points, because I imagine others would also like to run both marker/functional gene data in qiime2!

Mark Reynolds

colinbrislawn · August 19, 2019, 4:38pm

Right now, Qiime only supports DNA/RNA and not Amino Acid data. Mark, if your database is of AA genes (and not the DNA that encodes these genes), you might have better luck outside of Qiime.

On the other hand, if you do have the DNA coding regions behind the genes, Qiime 2 would work just fine. All we would need is a good tutorial about how to use Qiime 2 with 'custom PCR primers that target custom genes.'
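A minimal sketch of preparing such a custom DNA reference, assuming (hypothetically) that the curated sequences sit in a FASTA file whose headers look like `>ID taxonomy string`. It splits that into a plain FASTA and a two-column, tab-separated taxonomy table, the kind of pair that could then be imported into QIIME 2 as `FeatureData[Sequence]` and `FeatureData[Taxonomy]`. The header layout, IDs, and sequences below are made up for illustration.

```python
def split_reference(fasta_text):
    """Split an annotated FASTA into (plain FASTA text, id<TAB>taxonomy text).

    Assumes headers of the form '>ID taxonomy string' (an illustrative
    convention, not a format required by any particular database).
    """
    fasta_lines, taxonomy_lines = [], []
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Everything before the first space is the ID, the rest is taxonomy.
            seq_id, _, taxonomy = line[1:].partition(" ")
            fasta_lines.append(f">{seq_id}")
            taxonomy_lines.append(f"{seq_id}\t{taxonomy.strip() or 'Unassigned'}")
        else:
            fasta_lines.append(line.upper())
    return "\n".join(fasta_lines) + "\n", "\n".join(taxonomy_lines) + "\n"


# Made-up records for illustration only.
example = """>ref1 k__Archaea; g__Methanosarcina
atgccgtctacc
>ref2 k__Archaea; g__Methanobacterium
ATGCCCTCGACG
"""
seqs, taxonomy = split_reference(example)
print(seqs)
print(taxonomy)
```

The two strings would be written to separate files before import; a real database would of course need curated, full-length records rather than these toy ones.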

Given that proteomics is the future, it might be interesting to build q2-types for Amino Acids so that programs like FunGene that use AA could become Qiime2 plugins in the future. Here, the lead dev describes why Qiime 2 is DNA only for the time being:

sixvable · August 20, 2019, 2:32am

Hello guys!

I do agree with @mcreyno2's opinion that

one way to achieve this goal in Qiime2 would be to only classify at DNA level, after first correcting for frameshifts at the amino acid level and then backtranslating to nucleotides outside of qiime2 environment. But this is taking “one step forward and two steps back”

I want to cite the Fungene pipeline to explain this situation. I hope this does not violate the rules, since RDPtools is a competing pipeline to qiime2.

There are 12 processes in the Fungene pipeline. I have worked through the FGP tutorial, and I am sure that the first 4 processes (quality control) and the last 3 processes (analysis) can be implemented efficiently in qiime2 (e.g. with the vsearch plugin). But frameshift correction and the HMMER3 alignment, which are based on AAs, cannot easily be achieved. This also makes producing a feature table and representative sequences based on AAs more difficult.

Still, some researchers have tried combining both pipelines. Here is a complex example:
Evaluation of Primers Targeting the Diazotroph Functional Gene and Development of NifMAP – A Bioinformatics Pipeline for Analyzing nifH Amplicon Data (Frontiers)

I would like to run functional gene data in qiime2, so I sincerely hope that the qiime2 team can develop not only a pipeline but also a strategy for functional gene analysis. Love this community!

Sixvable

Nicholas_Bokulich · August 20, 2019, 2:18pm

Hi everyone,
Great discussion!

It sounds like the frameshift tool is used to find and correct indel errors that cause frameshifts when you translate DNA->protein. Using a denoising method in QIIME 2, like dada2, will also find and correct indel errors. So is the frameshift correction step essential at all for analyzing functional genes in QIIME 2? Even with indels, if the goal is to identify a gene and/or what species it came from then classifying at the DNA level will provide better resolution. So it sounds like the only thing QIIME 2 is not equipped to do right now is to translate DNA->protein to look at things like protein homology. (But I do not do functional gene analysis so please forgive and enlighten my ignorance if classification is preferred at protein level)
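One way to probe this question on real data would be to check whether denoised sequences still show signs of a frameshift. Below is a rough sketch, assuming the amplicon's reading frame is known (the `frame` offset is an assumption you would have to supply for your primers). It is a crude heuristic that only looks for premature stop codons; it is not what Framebot does, which compares reads against protein references.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons; codons with non-ACGT bases become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def looks_frameshifted(dna, frame=0):
    """Heuristic: flag a premature stop codon in the expected frame."""
    protein = translate(dna[frame:])
    # A stop is only expected, if at all, as the very last codon.
    return "*" in protein[:-1]


intact = "ATGTTGACCGGT"   # toy in-frame read
shifted = "ATGTGACCGGT"   # same read with one T deleted
print(looks_frameshifted(intact), looks_frameshifted(shifted))
```

A read that passes this check can still carry a frameshift that happens not to create a stop codon, so this only gives a lower bound on the problem.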

If that is the case, then functional gene analysis can be done entirely in QIIME 2 following the same steps as in the tutorials, and then optionally exporting the DNA sequences after classification to perform any additional steps not supported in QIIME 2, e.g., translating to protein to look at homology.

@sixvable you are correct — all of the upstream and downstream steps in the Fungene pipeline can be easily replicated in QIIME 2 but currently nothing exists for the Framebot or HMMER aligner (which might not be essential — sounds like other alignment methods could accomplish the same goal).

It would be great to see a functional gene analysis plugin! Maybe the Framebot developers would even be interested in developing a plugin?

Community-developed plugins like this are the best way forward — QIIME 2 is designed to support easy integration of community-developed plugins so that experts in different fields (e.g., functional gene analysis) can integrate their tools with QIIME 2 quickly and easily, exposing their functionality to the many users of QIIME 2.

Definitely does not violate any rules here! We are not competitors exactly — we are all independent groups of researchers aiming at the same goal, streamlined use of bioinformatics tools. It is valuable to learn from each other and improve interoperability of our platforms.

mcreyno2 · August 20, 2019, 4:51pm

Dear all,

Thanks for the replies!

@Nicholas_Bokulich, I've thought about integrating deblur into this draft Illumina/functional gene pipeline for the positive filtering step. But I was naive and didn't think to thoroughly compare/check what framebot does "under the hood" (e.g. correcting frameshifts caused by the insertion or deletion of nt in a sequencing read) relative to what deblur does "under the hood".

To answer your question, yes, I believe frameshift-corrected reads are required for functional gene analysis (in/outside of Qiime2). I need to investigate further to be confident, but my informal training in analyzing functional genes suggests classifying at the Amino Acid level rather than the nucleotide level (e.g., all of fungene's databases intended for framebot's frameshift correction are at the Amino Acid level). Here's an excerpt from Wang et al. 2013 introducing the Framebot tool:
"Because protein-coding genes often evolve at a higher rate than
rRNA, while the encoded protein sequence evolves at a lower rate,
it can be advantageous to compare protein sequences. However indels, which are common sequencing artifacts, cause frameshifts
and lead to a corrupt protein translation downstream from the
artifact".

Also, as @sixvable's paper shows, there seems to be no "standard operating procedure" available for DNA vs AA classification of functional gene Illumina reads. Nonetheless, I greatly appreciate your and @colinbrislawn's insight here. I will revisit the deblur vs framebot comparison to see if they are achieving the same endpoint on protein-coding DNA such as our mcrA amplicons (e.g. correction of a frameshift vs. merely flagging it and removing it from the pool of original seqs).

Lastly, I have sent an email to RDP staff at Michigan State University to loop them in, in an attempt to promote a community-developed plugin as you recommended. I do know that RDP's wrapper python script for fungene seems to rely heavily on Java...

Sincerely,

Mark Reynolds

Nicholas_Bokulich · August 20, 2019, 8:23pm

My naive interpretation: comparing protein sequences seems to be advantageous for some applications, e.g., comparing protein homology, but for classification (identifying the species and/or gene) the DNA sequence is still going to contain more information than the protein. E.g., two species may have identical protein sequences, but their genes contain substitutions that allow us to differentiate the species better than the protein sequence would. Perhaps I need to read that article more thoroughly, but I am still missing why DNA should be translated to protein prior to classification tasks, if indels are removed upstream (by denoising). I understand the need for frameshift correction if protein translation is necessary, but I am just not yet convinced that protein translation is needed for classification.

These can always be translated to DNA to create databases that you can currently use in QIIME 2.

Also check out dada2. I believe deblur just flags erroneous sequences for removal but dada2 will correct indels.

Thanks! It would be great to hear from them (maybe they can chime in on this forum discussion), and it would be awesome if they are interested in working on a plugin, in which case I and others could certainly help them get started. The fact that fungene is written in Java is NOT a barrier to writing a QIIME 2 plugin.

sixvable · August 21, 2019, 2:36am

Thank u Nick! @Nicholas_Bokulich

I believe what Mark wants to say is that what defines a feature (ASV/OTU) for a functional gene is based on AAs, not nts.

Let me give a simple example.
There are two hypothetical protein-coding genes, each 15 nt long (five codons, including a stop codon):
ATGCCGTCTACCTGA
ATGCCCTCGACGTAA
They differ at 4 of the 15 nt positions, but when you translate them into AA sequences:
MPST*
MPST*
As you can see, they actually have the same AA sequence, which means that most of the time they share the same function and structure. I believe they should be the same feature.
Since genes evolve quickly, especially in the environment, while their AA sequences may evolve more slowly, and since functional gene analysis focuses on real function, I believe we should use AA rather than nt sequences to define a feature.
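The toy example above can be checked with a few lines of Python. This is only a sketch using the standard genetic code; the helper name `translate` is made up here and is not part of QIIME 2 or RDPtools.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons in frame 0 ('*' marks a stop codon)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


seq_a = "ATGCCGTCTACCTGA"
seq_b = "ATGCCCTCGACGTAA"

# Count positions where the two DNA sequences disagree.
nt_diffs = sum(a != b for a, b in zip(seq_a, seq_b))

print(translate(seq_a), translate(seq_b), nt_diffs)
```

At the nt level these are two distinct sequences; at the AA level they collapse into one.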

About the frameshift correction by Framebot:

First, it corrects insertions, deletions and mismatches using a custom AA reference dataset (based on your gene, such as nifH or mcrA). Then you can choose a de novo correction (not the default parameter), which I believe is more similar to DADA2.
Besides that, they align the corrected AAs to a custom AA reference model using HMMER3 and filter out the AAs that fail to align. Finally, they cluster all the AAs into features (with a tool they call mcClust, which I think has a lot in common with uparse) using aligned sequences rather than raw sequences.

I believe it is still different from a normal rRNA analysis. Maybe I am wrong.

Nicholas_Bokulich · August 21, 2019, 2:06pm

So that makes sense to me if the goal is to determine what unique protein sequences exist, with differences in sequence possibly indicating differences in structure. (and to be clear, I am not arguing against this technique or the utility of implementing this in a QIIME 2 plugin, just trying to learn)

DNA-level analysis would still provide more informative features for species identification, however (e.g., following the example I gave above where two species or subspecies may share the same AA sequence but differ at the DNA level, allowing differentiation).

It could be possible to "have it both ways": perform DNA-level analysis, and use unique DNA sequences as the feature definition. Then add the translated AA sequences as feature metadata (instead of first translating and using AA sequence as the feature definition) — that way DNA sequences can be kept separate and used for taxonomic classification and the AA sequences can be used for other tasks (e.g., looking at protein homology).
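A minimal sketch of that idea, assuming the DNA features are available as a simple mapping of feature ID to sequence (for example after exporting representative sequences). The feature IDs and the `aa_sequence` column name are hypothetical, and the `feature-id` header is, to my understanding, one of the ID column names QIIME 2 metadata files accept; check the metadata documentation before relying on it.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def aa_feature_metadata(features):
    """Return TSV text: one row per DNA feature with its translated AA sequence."""
    rows = ["feature-id\taa_sequence"]
    for feature_id, dna in features.items():
        rows.append(f"{feature_id}\t{translate(dna)}")
    return "\n".join(rows) + "\n"


# Hypothetical feature IDs; real ones are typically sequence hashes.
features = {"asv1": "ATGCCGTCTACCTGA", "asv2": "ATGCCCTCGACGTAA"}
table = aa_feature_metadata(features)
print(table)
```

The DNA features stay the unit of analysis, and grouping rows by the `aa_sequence` column shows which DNA features encode the same protein.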

Yes, framebot is definitely distinct from a typical 16S analysis! But the other steps in the fungene pipeline seem quite similar. It would be awesome to see a QIIME 2 plugin for framebot and/or other func gene analysis tools!

thanks again for the discussion! learning a lot about func gene analysis methods...