Having trouble with alignment: mafft, SEPP, & SINA

Hello QIIME users,

I’m trying to analyze sequences with QIIME2 that a collaborator previously analyzed with QIIME1, but I’m having trouble with alignment. I’m starting out with file.fna that contains ~1.5 million sequences. I’m running QIIME2-2021.2 using miniconda3/4.8.5 via my institution’s HPC.

All signs point to success with the following:

qiime tools import
--input-path file.fna
--output-path seqs.qza
--type 'SampleData[Sequences]'

qiime vsearch dereplicate-sequences
--i-sequences seqs.qza
--o-dereplicated-table table.qza
--o-dereplicated-sequences rep-seqs.qza

qiime feature-table tabulate-seqs
--i-data rep-seqs.qza
--o-visualization rep-seqs.qzv

My intention was to do this next:

qiime phylogeny align-to-tree-mafft-fasttree
--i-sequences rep-seqs.qza
--o-alignment aligned-rep-seqs.qza
--o-masked-alignment masked-aligned-rep-seqs.qza
--o-tree unrooted-tree.qza
--o-rooted-tree rooted-tree.qza

However, once I try to use mafft, I start getting error messages.

qiime alignment mafft
--verbose
--i-sequences rep-seqs.qza
--p-parttree
--o-alignment aligned-rep-seqs.qza

returns:
nseq = 1463689
groupsize = 1463690, partsize=50

mafft --dpparttree might give a better result, although slow.
mafft --fastaparttree is also available if you have fasta34.


splittbfast (nuc) Version 7.475
alg=M, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
1 thread(s)

picksize = 50
groupsize = -1
sueff_global = 0.100000
inputfile = infile
alloclen = 75324 in main
/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/bin/mafft: line 2747: 96023 Killed "$prefix/splittbfast" $legacygapopt -Z $algopt $splitopt $partorderopt $parttreeoutopt $memopt $seqtype $model -f "-"$gop -Q $spfactor -h $aof -p $partsize -s $groupsize $treealg $outnum -i infile > pre 2>> "$progressfile"
Traceback (most recent call last):
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/commands.py", line 329, in call
results = action(**arguments)
File "", line 2, in mafft
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_alignment/_mafft.py", line 128, in mafft
return _mafft(sequences_fp, None, n_threads, parttree, False)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_alignment/_mafft.py", line 100, in _mafft
run_command(cmd, result_fp)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_alignment/_mafft.py", line 26, in run_command
subprocess.run(cmd, stdout=output_f, check=True)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['mafft', '--preservecase', '--inputorder', '--thread', '1', '--parttree', '/tmp/qiime2-archive-d1_7r8b5/548013d5-28de-4c35-8a04-16698eb66cda/data/dna-sequences.fasta']' returned non-zero exit status 1.

Plugin error from alignment:

Command '['mafft', '--preservecase', '--inputorder', '--thread', '1', '--parttree', '/tmp/qiime2-archive-d1_7r8b5/548013d5-28de-4c35-8a04-16698eb66cda/data/dna-sequences.fasta']' returned non-zero exit status 1.

See above for debug info.

I was unable to work out how to add --dpparttree as suggested at the top of the output included above.


I’ve tried to use SEPP to get around this:

wget https://data.qiime2.org/2021.8/common/sepp-refs-silva-128.qza

qiime fragment-insertion sepp
--i-representative-sequences rep-seqs.qza
--i-reference-database sepp-refs-silva-128.qza
--o-tree insertion-tree.qza
--o-placements insertion-placements.qza

But that returns the following error:

Traceback (most recent call last):
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2cli/commands.py", line 329, in call
results = action(**arguments)
File "", line 2, in sepp
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in callable_executor
output_views = self._callable(**view_args)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_fragment_insertion/_insertion.py", line 77, in sepp
debug)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/site-packages/q2_fragment_insertion/_insertion.py", line 53, in _run
subprocess.run(cmd, check=True, cwd=cwd)
File "/opt/apps/miniconda3/4.8.5/envs/qiime2-2021.2/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run-sepp.sh', '/tmp/qiime2-archive-i02jjmed/548013d5-28de-4c35-8a04-16698eb66cda/data/dna-sequences.fasta', 'q2-fragment-insertion', '-x', '1', '-A', '1000', '-P', '5000', '-a', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/aligned-dna-sequences.fasta', '-t', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/tree.nwk', '-r', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/raxml-info.txt', '-b', '1']' returned non-zero exit status 1.

Plugin error from fragment-insertion:

Command '['run-sepp.sh', '/tmp/qiime2-archive-i02jjmed/548013d5-28de-4c35-8a04-16698eb66cda/data/dna-sequences.fasta', 'q2-fragment-insertion', '-x', '1', '-A', '1000', '-P', '5000', '-a', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/aligned-dna-sequences.fasta', '-t', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/tree.nwk', '-r', '/tmp/qiime2-archive-d1k6qkge/e44b5e78-31e5-4a0f-9041-494bc3ca2df2/data/raxml-info.txt', '-b', '1']' returned non-zero exit status 1.


Lastly, I’ve tried to use SINA outside of QIIME:

wget -O- https://www.arb-silva.de/fileadmin/arb_web_db/release_138_1/ARB_files/SILVA_138.1_SSURef_NR99_12_06_20_opt.arb.gz
| gunzip -c > SILVA_refnr.arb

sina -i file.fna -r SILVA_refnr.arb -o aligned_seqs.fasta

Which appears to successfully run SINA 1.7.2 and return the output file. From there, I have not been able to import the aligned_seqs.fasta file into QIIME2 as an artifact. I’ve tried several different importable formats and types options, but not had any luck.

qiime tools import
--input-path aligned_seqs.fasta
--output-path aligned_rep_seqs.qza
--input-format AlignedDNAFASTAFormat
--type FeatureData[AlignedSequence]

There was a problem importing aligned_seqs.fasta:

aligned_seqs.fasta is not a(n) AlignedDNAFASTAFormat file:

Invalid character '' at position 37839 on line 1288468 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTURYKMSWBDHVN.-.

qiime tools import
--input-path aligned_seqs.fasta
--output-path aligned_rep_seqs.qza
--type FeatureData[AlignedSequence]

There was a problem importing aligned_seqs.fasta:

aligned_seqs.fasta is not a(n) AlignedDNAFASTAFormat file:

Invalid character '' at position 37839 on line 1288468 (does not match IUPAC characters for this sequence type). Allowed characters are ACGTURYKMSWBDHVN.-.

qiime tools import
--input-path aligned_seqs.fasta
--output-path aligned-sequences.qza
--type 'SampleData[Sequences]'

There was a problem importing aligned_seqs.fasta:

aligned_seqs.fasta is not a(n) QIIME1DemuxFormat file


Is there a way forward with any of these paths? Thank you for your help!
Beck

1 Like

Hi @BeckWehrle!

First off, this looks like two separate sets of problems to me (at least right now) - let's treat them as independent for the time being. In the future please open separate issues, this makes it easier for folks to search for help on the forum in the future.

Issue 1

mafft

The error:

Are you running on an HPC? If so, have you requested sufficient resources? This message is implying to me that your HPC system killed the mafft job. When this happens to me on my HPC its because I have used too much RAM and need to request more - but this could be anything - when I peek at the source code for mafft, the lines implied there are working with temporary directories - possibly you have filled your temp dir up and your system is booting the job out for going out-of-bounds. Chat with your sysadmin for advice here.

sepp

Same story as above - reviewing the sepp source, exit code 1 is associated with an inability to create a temporary directory/file on the system.

Issue 2

The crux of this error message is that there is at least one instance of the following characters in this file: '' - that is not a valid base/nt/iupac code - my guess is SINA is writing that to your output (since the import doesn't fail when importing the data pre-SINA), possibly on accident. QIIME 2 is saying "hey, there is garbage in this file, we don't want you to run into issues later, so we just won't import this file." You'll need to review the SINA docs to see if there is a way to control the creation of the output file in such a way that limits those spurious characters, or you will need to add some kind of intermediate cleanup step.

1 Like

Hi, following this topic (as it's very interesting), I think I might some insight into the issue 1; MAFFT tends to be killed by SLURM (I haven't tried in other queue systems tbh) as it tries to write in folders (in the /tmp folders) where regular users in a HPC don't have permission to write. The way to fix this is to run MAFFT in --silent mode.

1 Like

Thanks @thermokarst! This makes a lot of sense– I'm new to using an HPC as well, so it looks like I've got a lot more to figure out there, too. I had not thought about how much RAM a job used as I'd naively assumed that running it in any part of an HPC would more than accommodate anything I would want to do. I'm going to look through the forum to see if I can find some ideas for how to estimate computing needs, but if you have any guidance there that would be helpful!

1 Like