Having trouble with alignment: mafft, SEPP, & SINA

Hello QIIME users,

I’m trying to analyze sequences with QIIME2 that a collaborator previously analyzed with QIIME1, but I’m having trouble with alignment. I’m starting out with file.fna that contains ~1.5 million sequences. I’m running QIIME2-2021.2 using miniconda3/4.8.5 via my institution’s HPC.

All signs point to success with the following:

qiime tools import
--input-path file.fna
--output-path seqs.qza
--type 'SampleData[Sequences]'

qiime vsearch dereplicate-sequences
--i-sequences seqs.qza
--o-dereplicated-table table.qza
--o-dereplicated-sequences rep-seqs.qza

qiime feature-table tabulate-seqs
--i-data rep-seqs.qza
--o-visualization rep-seqs.qzv

My intention was to do this next:

qiime phylogeny align-to-tree-mafft-fasttree
--i-sequences rep-seqs.qza
--o-alignment aligned-rep-seqs.qza
--o-masked-alignment masked-aligned-rep-seqs.qza
--o-tree unrooted-tree.qza
--o-rooted-tree rooted-tree.qza

However, once I try to use mafft, I start getting error messages.

qiime alignment mafft
--verbose
--i-sequences rep-seqs.qza
--p-parttree
--o-alignment aligned-rep-seqs.qza

returns:
nseq = 1463689
groupsize = 1463690, partsize=50

mafft --dpparttree might give a better result, although slow.
mafft --fastaparttree is also available if you have fasta34.


splittbfast (nuc) Version 7.475
alg=M, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
1 thread(s)

picksize = 50
groupsize = -1
sueff_global = 0.100000
inputfile = infile
alloclen = 75324 in main
/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/bin/mafft: line 2747: 96023 Killed "$prefix/splittbfast" $legacygapopt -Z $algopt $splitopt $partorderopt $parttreeoutopt $memopt $seqtype $model -f "-"$gop -Q $spfactor -h $aof -p $partsize -s $groupsize $treealg $outnum -i infile > pre 2>> "$progressfile"
Traceback (most recent call last):
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/commands.py", line 329, in call
results = action(**arguments)
File "", line 2, in mafft
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_alignment/_mafft.py", line 128, in mafft
return _mafft(sequences_fp, None, n_threads, parttree, False)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_alignment/_mafft.py", line 100, in _mafft
run_command(cmd, result_fp)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_alignment/_mafft.py", line 26, in run_command
subprocess.run(cmd, stdout=output_f, check=True)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['mafft', '--preservecase', '--inputorder', '--thread', '1', '--parttree', '/tmp/qiime2-archive-d1_7r8b5/548013d5-28de-4c35-8a04-16698eb66cda/data/dna-sequences.fasta']' returned non-zero exit status 1.

Plugin error from alignment:

Command '['mafft', '--preservecase', '--inputorder', '--thread', '1', '--parttree', '/tmp/qiime2-archive-d1_7r8b5/548013d5-28de-4c35-8a04-16698eb66cda/data/dna-sequences.fasta']' returned non-zero exit status 1.

See above for debug info.

I was unable to work out how to add --dpparttree as suggested at the top of the output included above.


I’ve tried to use SEPP to get around this:

wget https://data.qiime2.org/2021.8/common/sepp-refs-silva-128.qza

qiime fragment-insertion sepp
--i-representative-sequences rep-seqs.qza
--i-reference-database sepp-refs-silva-128.qza
--o-tree insertion-tree.qza
--o-placements insertion-placements.qza

But that returns the following error:

Traceback (most recent call last):
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/commands.py", line 329, in call
results = action(**arguments)
File "", line 2, in sepp
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_fragment_insertion/_insertion.py", line 77, in sepp
debug)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_fragment_insertion/_insertion.py", line 53, in _run
subprocess.run(cmd, check=True, cwd=cwd)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run-sepp.sh', '/tmp/qiime2-archive-i02jjmed/548013d5-28de-4c35-8a04-16698eb66cda/data/dna-sequences.fasta', 'q2-fragment-insertion', '-x', '1', '-A', '1000', '-P', '5000', '-a', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/aligned-dna-sequences.fasta', '-t', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/tree.nwk', '-r', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/raxml-info.txt', '-b', '1']' returned non-zero exit status 1.

Plugin error from fragment-insertion:

Command '['run-sepp.sh', '/tmp/qiime2-archive-i02jjmed/548013d5-28de-4c35-8a04-16698eb66cda/data/dna-sequences.fasta', 'q2-fragment-insertion', '-x', '1', '-A', '1000', '-P', '5000', '-a', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/aligned-dna-sequences.fasta', '-t', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/tree.nwk', '-r', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/raxml-info.txt', '-b', '1']' returned non-zero exit status 1.


Lastly, I’ve tried to use SINA outside of QIIME:

wget -O- https://www.arb-silva.de/fileadmin/arb_web_db/release_138_1/ARB_files/SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz
| gunzip -c > SILVA_refnr.arb

sina -i file.fna -r SILVA_refnr.arb -o aligned_seqs.fasta

Which appears to successfully run SINA 1.7.2 and return the output file. From there, I have not been able to import the aligned_seqs.fasta file into QIIME2 as an artifact. I’ve tried several different importable formats and types options, but not had any luck.

qiime tools import
--input-path aligned_seqs.fasta
--output-path aligned_rep_seqs.qza
--input-format AlignedDNAFASTAFormat
--type FeatureData[AlignedSequence]

There was a problem importing aligned_seqs.fasta:

aligned_seqs.fasta is not a(n) AlignedDNAFASTAFormat file:

Invalid character '' at position 37839 on line 1288468 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTURYKMSWBDHVN.-.

qiime tools import
--input-path aligned_seqs.fasta
--output-path aligned_rep_seqs.qza
--type FeatureData[AlignedSequence]

There was a problem importing aligned_seqs.fasta:

aligned_seqs.fasta is not a(n) AlignedDNAFASTAFormat file:

Invalid character '' at position 37839 on line 1288468 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTURYKMSWBDHVN.-.

qiime tools import
--input-path aligned_seqs.fasta
--output-path aligned-sequences.qza
--type 'SampleData[Sequences]'

There was a problem importing aligned_seqs.fasta:

aligned_seqs.fasta is not a(n) QIIME1DemuxFormat file


Is there a way forward with any of these paths? Thank you for your help!
Beck

1 Like

Hi @BeckWehrle!

First off, this looks like two separate sets of problems to me (at least right now) - let's treat them as independent for the time being. In the future please open separate issues, this makes it easier for folks to search for help on the forum in the future.

Issue 1

mafft

The error:

Are you running on an HPC? If so, have you requested sufficient resources? This message is implying to me that your HPC system killed the mafft job. When this happens to me on my HPC its because I have used too much RAM and need to request more - but this could be anything - when I peek at the source code for mafft, the lines implied there are working with temporary directories - possibly you have filled your temp dir up and your system is booting the job out for going out-of-bounds. Chat with your sysadmin for advice here.

sepp

Same story as above - reviewing the sepp source, exit code 1 is associated with an inability to create a temporary directory/file on the system.

Issue 2

The crux of this error message is that there is at least one instance of the following characters in this file: '' - that is not a valid base/nt/iupac code - my guess is SINA is writing that to your output (since the import doesn't fail when importing the data pre-SINA), possibly on accident. QIIME 2 is saying "hey, there is garbage in this file, we don't want you to run into issues later, so we just won't import this file." You'll need to review the SINA docs to see if there is a way to control the creation of the output file in such a way that limits those spurious characters, or you will need to add some kind of intermediate cleanup step.

1 Like

Hi, following this topic (as it's very interesting), I think I might some insight into the issue 1; MAFFT tends to be killed by SLURM (I haven't tried in other queue systems tbh) as it tries to write in folders (in the /tmp folders) where regular users in a HPC don't have permission to write. The way to fix this is to run MAFFT in --silent mode.

1 Like

Thanks @thermokarst! This makes a lot of sense– I'm new to using an HPC as well, so it looks like I've got a lot more to figure out there, too. I had not thought about how much RAM a job used as I'd naively assumed that running it in any part of an HPC would more than accommodate anything I would want to do. I'm going to look through the forum to see if I can find some ideas for how to estimate computing needs, but if you have any guidance there that would be helpful!

1 Like

For accountability, I'm checking in with half the solution.

mafft: I tried running the mafft step with 1GB of memory in both verbose and quiet modes, and then with 10GB and 100GB of memory (in quiet mode as I didn't explicitly include quiet or verbose options– @WeedCentipede just want to make sure that I'm interpreting your suggestion correctly as "–quiet" as I didn't see a "–silent mode" in the mafft documentation). I'm still getting out of memory errors and am waiting to hear back from my system admin about what allocations of memory and CPUs are reasonable for me to use for this task.

SINA: Success! I did not expect the aligned_seqs.fasta output file to be so large (170GB) and did not originally have enough space for it. The full file imported just fine using:

    qiime tools import
    --input-path aligned_seqs.fasta
    --output-path aligned_rep_seqs.qza
    --input-format AlignedDNAFASTAFormat
    --type FeatureData[AlignedSequence]

and I ended up with an aligned_rep_seqs.qza file that's much closer to my size expectations at 1.6GB.

I'll post a follow up on mafft/SEPP as I get that information.

1 Like

Hey @BeckWehrle,
It's indeed --quiet (sorry for the lapsus).
Well, if the stout to weird tmp dirs isn't the problem, I don't know what it might be.

I'll point out for future readers of this post, the --quiet flag is for use with mafft, not with q2-alignment.

TL;DR: this worked
sina -i file.fna -r SILVA_refnr.arb --fasta-write-dna --log-file=sinaDNAlog.txt -o aligned_seqsDNA.fasta

Here's where this has shaken out:

  • I have been unable to be successful using the qiime alignment mafft because of out of memory errors on my institution's HPC/SLURM scheduler. I last tried using 187GB of memory (the max I can access), which from the following seff output of the efficiency, it looks like was still not enough. It went for almost 8 hours then failed. I'm calling this one a dead end for me, but perhaps these estimates will be helpful to someone else in the future.
    Cores per node: 32
    CPU Utilized: 07:48:24
    CPU Efficiency: 3.12% of 10-10:24:00 core-walltime
    Job Wall-clock time: 07:49:30
    Memory Utilized: 172.84 GB
    Memory Efficiency: 92.43% of 187.00 GB

  • qiime fragment-insertion sepp continues to fail with the same error code as in the original post. It seemed happy to run with 10GB of memory until that failure, but I'm giving up here because I got the next method to work.

  • the SINA command outside of qiime was a success! But update from my previous post on that. The code as written produced an invalid file because SINA defaults to using the SILVA reference database as RNA instead of DNA. So for the purposes of continuing on to masking and making a tree, it's important to add --fasta-write-dna to end up with a compatible file.

sina -i file.fna -r SILVA_refnr.arb --fasta-write-dna --log-file=sinaDNAlog.txt -o aligned_seqsDNA.fasta

The log-file option was useful as when using 6GB (10GB) it took 42hrs (37hrs) to complete and I could see that things were moving forward.
From there the following worked beautifully:

qiime tools import
--input-path aligned_seqsDNA.fasta
--output-path aligned_rep_seqsDNA.qza
--input-format AlignedDNAFASTAFormat
--type FeatureData[AlignedSequence]

I have been timing out on my masking step, but I will assume this is unrelated to the alignment process and call SINA alignment a success!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.