MAFFT Error: non-zero exit status 1

Purrsia_Felidae · October 6, 2018, 2:37am

Dear Qiime2 White Wizards:

I thought I would follow up on my previous post: Mafft 'returned non-zero exit status 1' ERROR - #9

First of all, THANK you for adding the --parttree option to the MAFFT alignment. I thought I would note that I am still running into MAFFT errors:

I ran the following command, which produced the following error after 20 hours of run time.

qiime alignment mafft --i-sequences rep-seqs.qza --o-alignment mafft_aligned-rep-seqs.qza --p-parttree --p-n-threads 0

Plugin error from alignment:

Command '['mafft', '--preservecase', '--inputorder', '--thread', '-1', '--parttree', '/var/folders/z4/jfbc76zs7_l8q_4xb8rgtmf00000gn/T/qiime2-archive-m77v99nb/1193edc7-297d-4a84-975c-0fa9b22d246d/data/dna-sequences.fasta']' returned non-zero exit status 1
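
The wording of this error matches Python's subprocess.CalledProcessError, which suggests the plugin only reports that the external mafft process failed, while the actual reason is in whatever MAFFT wrote to stderr (rerunning the qiime command with --verbose should normally show that output). A minimal sketch of the mechanism, using a hypothetical stand-in child process instead of MAFFT itself:

```python
import subprocess
import sys

# Hypothetical stand-in for the MAFFT call: a child process that writes a
# diagnostic to stderr and exits with status 1.
failing_cmd = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('simulated: out of memory'); sys.exit(1)",
]


def run_and_capture(cmd):
    """Run cmd; return (error message or None, captured stderr text)."""
    try:
        done = subprocess.run(cmd, check=True, stderr=subprocess.PIPE, text=True)
        return None, done.stderr
    except subprocess.CalledProcessError as err:
        # str(err) is the generic "returned non-zero exit status" message;
        # err.stderr holds the tool's own explanation of the failure.
        return str(err), err.stderr


message, details = run_and_capture(failing_cmd)
print(message)
print("stderr from the tool:", details)
```

The generic message alone does not say whether the cause was memory, a malformed input, or something else; the captured stderr is where that information would be.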

Based on previous suggestions, I decided to go to MAFFT directly and ran the following on the same .fna sequences (~1.5 million sequences):

mafft --parttree --thread 1 seqs.fna > mafft_out
Wall Time Used : 4-00:00:14
State : TIMEOUT (exit code 0)
CPU Efficiency : 0.00%
Memory Requested : 300.00 GB (300.00 GB/node)
Memory Used : 0.00 MB (estimated maximum)

I tried multiple versions of this command, altering my parameters. Each time I increased the --thread parameter, it would error out due to lack of memory (the memory requirement would go through the roof!). The problem is, we don't have > 300 GB of memory available for our group on our cluster (costs are very high for this), so I have to use our cluster's burst mode, which puts a time limit on jobs; hence the timeout error.

It seems like MAFFT requires an enormous amount of memory to run, both via MAFFT and also via the Qiime2 option.

With Qiime 1 we could choose our alignment options (e.g. clustalw, mafft, muscle, pynast). Any plans on maybe having an option to use a different aligner besides MAFFT?

Many thanks for even reading this far!

ebolyen · October 9, 2018, 5:47pm

Hey @Purrsia_Felidae,

Funny you should mention alternative aligners: @epruesse is working on a SINA aligner as we speak. However, given the size of your dataset (and assuming your ultimate goal for alignment is to generate a phylogeny), you could try the SEPP algorithm provided by this plugin:

https://github.com/biocore/q2-fragment-insertion

Of course, we could always use more help, so if you have a bit of Python experience, you could always create some PRs for q2-alignment adding more aligners (or even create your own plugin that does whatever you want).

epruesse · October 9, 2018, 10:37pm

1.5 million sequences will be hard on progressive alignment tools such as clustalw, mafft or muscle. Just building the guide tree defining the alignment order will take insane memory. That's why Infernal, NAST and SINA exist - to build alignments of that size.
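
To put a rough number on that: a back-of-the-envelope sketch, assuming a naive guide tree built from a full pairwise distance matrix with one 4-byte value per sequence pair. Both assumptions are mine for illustration; real tools, including MAFFT's --parttree mode, use approximations precisely to avoid storing all of this.

```python
def pairwise_matrix_size(n_sequences, bytes_per_value=4):
    """Return (number of unordered pairs, bytes to store one value per pair)."""
    pairs = n_sequences * (n_sequences - 1) // 2
    return pairs, pairs * bytes_per_value


# 1.5 million sequences, as in this thread
pairs, size_bytes = pairwise_matrix_size(1_500_000)
print(f"{pairs:,} pairs, about {size_bytes / 1e12:.1f} TB at 4 bytes each")
# prints: 1,124,999,250,000 pairs, about 4.5 TB at 4 bytes each
```

Because the pair count grows with the square of the number of sequences, doubling the input roughly quadruples this figure.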

Questions:

Can I assume you have already dereplicated 16S amplicon sequences in your rep-seqs.qza?
How much memory does your server have available?
Would you try some beta software to gain speed?
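
In case "dereplicated" is unclear: it means collapsing identical sequences so each unique sequence is aligned only once. A toy sketch of the idea, with made-up read IDs and sequences; in practice this is done by a dedicated tool (a denoiser or vsearch, for example), not by a script like this.

```python
def dereplicate(records):
    """Group read IDs by identical sequence (case-insensitive), first-seen order."""
    groups = {}
    for read_id, sequence in records:
        groups.setdefault(sequence.upper(), []).append(read_id)
    return groups


# Hypothetical reads, for illustration only
reads = [("r1", "ACGT"), ("r2", "acgt"), ("r3", "ACGA"), ("r4", "ACGT")]
unique = dereplicate(reads)
print(len(reads), "reads ->", len(unique), "unique sequences")
# prints: 4 reads -> 2 unique sequences
```

Only the unique sequences need to go to the aligner; the ID lists keep track of how many reads each one represents.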

Waiting for the next Qiime2 release may not help you too much. But running SINA outside of Qiime2 isn't that hard. You could just do something like this:

conda install sina
wget  -O- https://www.arb-silva.de/fileadmin/arb_web_db/release_132/ARB_files/SILVA_132_SSURef_NR99_13_12_17_opt.arb.gz \
 | gunzip -c > silva_refnr_132.arb
sina -i input.fasta --db silva_refnr_132.arb -o output.fasta

It may take a while, though. I've got a newer version of SINA in beta that'll multithread internally and manage to align a few million sequences a day on a dual-Xeon type server.

Purrsia_Felidae · October 13, 2018, 2:03am

epruesse:

Howdy! Thank you very much for your response.

Yes, I am passing dereplicated sequences with this command:

qiime alignment mafft --i-sequences rep-seqs.qza --o-alignment mafft_aligned-rep-seqs.qza --p-parttree --p-n-threads 0

where rep-seqs.qza contains my dereplicated sequences.

I don't mind trying your new SINA in beta that will multithread. I will have to have our cluster upload it as I think if I run it on my computer with my 1.5 million seqs, it will take it until the end of this century to complete.

JB

Purrsia_Felidae · October 14, 2018, 9:25pm

Ebolyen:

Thank you for your response. I looked into the q2-fragment-insertion command and noticed that the default reference phylogeny is Greengenes 13_8 at 99%. This db is getting quite old. I am wondering if Qiime2 has this available for the silva_132 nr database? I tried passing the silva_132 nr.fasta sequences as --i-reference-alignment and the taxonomy file associated with those sequences as --i-reference-phylogeny (both obtained from the Qiime folder on the SILVA website), but this isn't right.

If there aren't any silva_132 nr seqs available to pass into this command, may I ask - how can I obtain the required input for the --i-reference-phylogeny parameter with the silva_132 nr fasta sequences?

Many thanks!

ebolyen · October 15, 2018, 11:26pm

Hi @Purrsia_Felidae,

For those inputs you'll need to import your reference data as an Artifact.

The reference alignment will be a FeatureData[AlignedSequence] type, and I think the file you have should work (no need to specify --input-format here; the default is fasta).

For the reference phylogeny, you are looking for a newick file. That should be imported as Phylogeny[Rooted] (same deal with --input-format, default is a newick file).
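
The two imports described above can be sketched as follows. The file names are hypothetical placeholders, and as far as I know the aligned-sequence type is spelled FeatureData[AlignedSequence] in QIIME 2. The script only builds and prints the `qiime tools import` commands, so they can be checked before being run in an environment where QIIME 2 is installed.

```python
import shlex


def import_command(semantic_type, input_path, output_path):
    """Build the argument list for one `qiime tools import` call."""
    return [
        "qiime", "tools", "import",
        "--type", semantic_type,
        "--input-path", input_path,
        "--output-path", output_path,
    ]


# Hypothetical file names; substitute your own reference files.
commands = [
    import_command("FeatureData[AlignedSequence]",
                   "reference-alignment.fasta", "reference-alignment.qza"),
    import_command("Phylogeny[Rooted]",
                   "reference-tree.nwk", "reference-tree.qza"),
]

for cmd in commands:
    # shlex.quote protects the square brackets from the shell
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Note that the second input has to be an actual tree in newick format; a taxonomy table is a different kind of file and will not import as a phylogeny.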

Hope that helps!

JoseM · November 2, 2018, 11:15am

Thank you guys, I went through the same issues and got them fixed thanks to this post!
Jose
