new plugin for q2: is it possible to have mothur dependency ?

Francis29029 · May 13, 2020, 8:43am

Hello all,

The purpose of this post if to get some guidance for making a new plugin for qiime2: we have implemented a pipeline for taxonomic annotation at species/sub-species level for the Bifidobacterium genus (based on 16s sequencing). Our process is described here: https://doi.org/10.1111/1462-2920.14274).

Technically peaking our current “Bifidobacterium species & sub-species annotation” is an R script and uses as entry point an alignment (made in mothur with the align.seqs function) of the representatives sequences (from Bifidobacterium genus) on aligned Silva database.

This alignment done in mothur with align.seqs is absolutely key and we did not find any exact equivalent in q2 (possibly the MAFTT will be updated soon but in our tests (see here) we do not obtain the same results with MAFFT (vs align.seqs).

We were wondering if we could use, in a q2 plugin, mothur as a dependency ? (or eventually only the mothur align.seqs function ?)

Thanks a lot

FRancis

thermokarst · May 13, 2020, 2:29pm

Hi @Francis29029!

Of course! Generally speaking, we recommend that all plugin dependencies be published as conda packages - mothur is available on the bioconda channel, so you're all set there. You'll be able to register the appropriate citations within your plugin, so the mothur team (and any other tools you integrate) will be cited as appropriate!

We are busy working on the revamped Library (link) - one of the major goals there is to allow for plugin developers to get automatic building and testing of their conda packages, so keep your eyes peeled for further information on that (I'm going to get back to work on it as soon as I finish responding here!).

If you have any questions, or run into any hurdles, you know where to find us.

PS - bookmark dev.qiime2.org, there are some developer docs and notes available there

Nicholas_Bokulich · May 13, 2020, 3:57pm

@Francis29029,

You could probably build your plugin by recycling a lot from q2-alignment: e.g., see how mafft subprocesses are called here, and how the different types and formats are handled under the hood. You would want to do something similar for calling mothur subprocesses in python.

Good luck!

Francis29029 · May 27, 2020, 1:50am

Hello thermokarst,
Thanks for your answer, it's re-assuring to know that this align.seqs step in mothur is "plugin compatible" using bioconda. According this page the mothur version 1.39.5 is available in bioconda (this is the version we are using). I will work on that and definitively keep an eye on dev.qiime2.org
Thanks.

Francis29029 · May 27, 2020, 1:53am

Thanks Nicholas, it's a good idea to get inspired by looking another plugin

lea_si · June 19, 2020, 3:08pm

Hi there,

I am working with Francis on this integration of mothur's align.seqs() and following the example of mafft in q2-alignment, I got it kind of working thanks to your guidance in this thread!

Where I am stuck now is that mothur automatically generates output files, meaning I have no control on their name/path. I can retrieve the output file, but don't really know what to do with it... See where I am at so far, and the "One command missing here" comment:

import qiime2
from pathlib import Path
import subprocess
from q2_types.feature_data import DNAFASTAFormat, AlignedDNAFASTAFormat

# Inspired from https://github.com/qiime2/q2-alignment/blob/master/q2_alignment/_mafft.py
def run_mothur(cmd, verbose=True):
    if verbose:
        print("Running external command line application. This may print "
              "messages to stdout and/or stderr.")
        print("The command being run is below. This command cannot "
              "be manually re-run as it will depend on temporary files that "
              "no longer exist.")
        print("\nCommand:", end=' ')
        print(cmd, end='\n\n')
        subprocess.run(cmd, check=True, shell=True)

def align_mothur(sequences: DNAFASTAFormat)-> AlignedDNAFASTAFormat:
    sequences_fp = str(sequences)
    sequences_escaped_fp=sequences_fp.replace("-", "\-") # Mothur does not allow hyphens in input names
    result = AlignedDNAFASTAFormat()
	
    cmd = "mothur \"#set.dir(debug="+str(Path(sequences_fp).parent)+");align.seqs(fasta="+sequences_escaped_fp+", reference="+str(Path(__file__).parent)+"/assets/silva.bacteria.fasta);\""
    run_mothur(cmd)
	
    # Path to output file is in alignment_fp defined below:
    alignment_fp=str(Path(sequences_fp).parent/(Path(sequences_fp).stem+".align"))
    # One command missing here: How do I return the content of alignment_fp into result, which is an AlignedDNAFASTAFormat() object?
    return result

Once again, I'm sure it's a question of types conversion which I definitely have issues with... Thanks in advance for your help

thermokarst · June 23, 2020, 11:21pm

Hi! This is still on my radar, but we are knee-deep in releasing the 2020.6 version of QIIME 2 - once we get that out the door I should have some time to lend a hand - if you don't hear from me by this time next week feel free to ping me here - thanks!

thermokarst · June 30, 2020, 3:16pm

Hey @lea_si, is there a code repo somewhere that I could look at?

lea_si · July 1, 2020, 2:22pm

Hi @thermokarst ! Good opportunity for me to finally learn how git works

First very rough scripts are here: https://github.com/leasi/q2-moulinette

You need to unzip q2_moulinette/assets/silva.bacteria.fasta.zip in the q2_moulinette/assets/ directory (file is too big for Github and I did not manage to make git lfs work behind our firewall)

Let me know if you need anything else!

Nicholas_Bokulich · July 2, 2020, 2:30am

Thanks @lea_si!

This is one of those problems where writing the code with copious comments seemed more productive. I have issued a pull request to your github repository — note I have not tested as I don't have mothur installed! — but these changes should get you moving. You can pull down my branch, give it a spin, and if you are still lost let me know (and let me know what error you are getting) and I can revise (maybe I'll even install mothur this time )

I put inline comments to explain what I changed (or at least where you had commented in questions), but let me know if anything is unclear and @thermokarst or I can elaborate.

Let me know if that helps!

lea_si · July 2, 2020, 7:56am

Dear @Nicholas_Bokulich,

Thank you so much for your code review and corrections (incl. me overlooking config lines from Claire's plugin I used as template I'll definitely do some code cleanup before releasing a more complete version !) I'm not very fluent in Python yet so it definitely helps to get your experienced input. And I just tested your edits, they work fine!

Now I have a more conceptual / "QIIME2 philosophy" question : for our plugin to work, we can only use one specific version of the SILVA database as a reference alignement ; this is why I put the reference fasta file as an asset and imposed it in the align_mothur function. I understand the appeal to generalize align_mothur to any kind of input, but then the next steps of our plugin would not work using another reference.

What would you advise here for the development of our plugin? Keeping the restricted reference input, or allowing all kinds of reference but then somehow (how?) restricting the use of the next steps to only if "our" reference was used?

Hope my question is clear

Nicholas_Bokulich · July 2, 2020, 2:56pm

Glad to help, doubly glad to hear my changes worked without testing!

Is that because that alignment is necessary for your published method (e.g., it is a curated alignment of Bifidobacterium?) or because mothur will break if the alignment does not have some special characteristics? I assume the former, but if the latter then let me know.

I think leave that method flexible, but then you can create a QIIME 2 Pipeline to run your complete Bif classification method, which will have that specific alignment hardcoded as the reference. You already started such a pipeline, and so I'd recommend layering in additional actions to that pipeline to replicate your Bif classifier. So the full pipeline could look like this:

classify query sequences
filter to only Bifs (what you currently have in the pipeline)
align to reference (calling the specific alignment reference)
do whatever else needs doing.

So you can keep that file in assets. Instead of zipping it, make it a QZA (which will zip it). That way it can be easily loaded and manipulated as part of the pipeline.

Would that work for your purposes?

lea_si · July 2, 2020, 3:41pm

First option indeed, the SNVs we use to discriminate amongst Bifs species/subspecies were defined on specific positions of that alignment in particular.

This is crystal clear and it all starts to make sense now (generating a pipeline and using the general align_mothur method in it). In theory this would work perfectly, now I just have to code it! I will let you know when the plugin is complete in a separate thread (or if I get stuck again )

Thank you again for the great guidance!