Alignment Problem

akknight216 · June 16, 2017, 12:33pm

Hi,

I'm trying to use alignment mafft as follows:

qiime alignment mafft --i-sequences rep-seqsMerged_060817_250.qza --o-alignment aligned-rep-seqsMerged.qza

And am getting this error:

Plugin error from alignment:

  InPath('/tmp/q2-AlignedDNAFASTAFormat-e5oap_ts') is not formatted as a
  AlignedDNAFASTAFormat file.

I'll rerun it with the verbose flag, but it takes several days to error out. Is this an issue you've seen before?

Best,

Anna

ebolyen · June 16, 2017, 4:20pm

Hey @akknight216!

Sorry I don't think we've seen this before. It looks like QIIME 2 is rejecting the result of MAFFT's output which is pretty strange.

Could you provide a sample of a few of the sequences in your rep-seqsMerged_060817_250.qza? (You can use qiime tools export to get at that file quickly, it should just contain a single dna-sequences.fasta file.)

I'm wondering if these aren't DNA sequences for some reason. As outside of an invalid character, the only other reason QIIME 2 would have rejected the output would be if the first few sequences' lengths did not match in the alignment.

P.S. Newer versions of QIIME 2 will write out the verbose output to a temporary file if something goes wrong automatically, so you won't need to re-run with --verbose anymore!

akknight216 · June 16, 2017, 7:41pm

Here's the top of that file:

>0000115810252d3dc36afd4850f547b4
CCTACGGGGGGCAGCAGTGAGGAATATTGGTCAATGGACGAGAGTCTGAACCAGCCAAGTAGCGTGAAGGATGAAGGTTCTATGGATTGTAAACTTCTTTTATACGGGAATAAACGAATCCACGCGTGGATTTTTGCATGTACCGTATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGGTTGTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAATTGATACTGGCAGTCTTGAGTACAGTTGAGGTAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTAACCTGTAACTGACATTGATGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTC
>000015ef4b9bea490aec4ad4b1b4a206
CCTACGGGAGGCAGCAGTGGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGAAGACGGTCTTCGGATTGTAAAGCTCTGTTATACGGGACGAACGGCAAGGTGGTAAATAGCCATCATGAGTGACGGTACCGTAAGAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTATGGGGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGTAGGCGGTTTTTTAAGTCAGGTGTCAAAGCGTGGAGCTTAACTCCATTAAGCACTTGAAACTGAAAGACTTGAGTGAAGGAGAGGAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCGGTGGCGAAGGCGGCTCTCTGGTCTGTTACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGGATTAGATACCCTTGTAGTC
>00001ec4dd32ff4f17b97a1eb14fb3b6
CCTACGGGTGGCACTGCATTCCCTATAACCTTGTATCGTGACTGTAGAGGGATTGCGTCAAAGCTGAATTCATCGGGAAAAGATTGAATTCTTGCTATTTCTCTTACCTTGTATCTTCTGTTTTCTATCGGGTGGGTTATTCCGCAATTCTCAGGCTGCGCCGAGGCGGTGATGGTACCGTTGATTTCCCCTAGCGCGAAACGACGGTAGAAATTTGGCGAATGATACCTCTTTGGATCGTCATAAATCCTCTTGAAGCGAGGAGGAAGTTTGTCCGGATCAATGTTTTTCCATGACCTTCCCTCTTTGAAGGCTTCTGGAATATCAGAAATCTGAAAGGGTTGCTTAAAGAAAGGGAGAGCCTCCGCCCCAGCCCTGCACGGACCGATCATATTAACCATGTTTTGAGTTGCCGCGGTGTACTCCCAGTAGTC
>00001f0a34f5b35fd2433056c01e7c50
CCTACGGGTGGCAGCAGTAGGGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAGCTCTGTTGTTGGTGAAGAAGAACAGGGGTAGTAACTGACCTTTGTTTGACGGTAATCAATTAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGCTCGATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTCGAGCTTGAGTACAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCGGTGGCGAAGGCGGCTCTCTGGTCTGTTACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGGATTAGATACCCTGGTAGTC

akknight216 · June 16, 2017, 7:43pm

The descriptive lines do start with the ">", I don't know why it isn't showing up in the pasted version.

Also, just out of curiosity since this is so strange, I recently reran dada2 and the only difference was whether I trimmed 10 nucleotides off the front of the forward read, and suddenly I went from 30,000 features with the trimming to >300,000 features. Is that to be expected? And potentially related to this problem? Though the feature table does write out normally.

ebolyen · June 16, 2017, 8:38pm

(The forum is eating that character to make a quote block I changed it to a code block so now it's treated literally.)

Those sequences seem fine. Would you be able to send me that qza file (in a DM), I'd like to try and reproduce this locally. There is probably something up with MAFFT here.

That seems reasonable, the first few nucleotides are usually pretty poor quality. Does trimming the reverse in the same way reduce the feature count further? The 30,000 sounds much more reasonable. This shouldn't impact the alignment however in practice (other than really poor runtime).

ebolyen · June 19, 2017, 6:00pm

Just a general update, I've received credentials to look into this problem further.

dgerrity · June 27, 2017, 7:55pm

I received the same error while trying to use alignment mafft. I can also send you my qza file if necessary, but I will wait to see what you find out with Anna's dataset. A debug .log file was also saved in case that is helpful.

ebolyen · June 27, 2017, 8:26pm

@dgerrity, how large is your .qza? It could potentially serve as a simpler test-case.

dgerrity · June 27, 2017, 11:10pm

@ebolyen, my .qza is 9.1 MB. Is it possible that the Feature ID format is causing a problem? I attached a screenshot based on the corresponding .qzv visualization. I started with a file provided by my sequencing lab.

ebolyen · June 28, 2017, 6:40pm

@akknight216, I think I know what's going on at this point.

I've been running MAFFT for the last day and it's already taken about 155gb of memory. I suspect what is happening is that MAFFT is running out of memory and not returning an appropriate exit code. Looking into the changelog for MAFFT, there does seem to be a suspicious entry under v7.310 2017/3/17 which is newer than the MAFFT on bioconda. But I couldn't find any further details about what specifically was wrong with the exit code.

So what I suspect is happening is that MAFFT is getting a failure from malloc and is accidentally returning 0 as the exit code, which means q2-alignment doesn't realize anything has gone wrong. From there it tries to read the alignment fasta file which is where MAFFT would have put the results, but it is an empty file because MAFFT never finished and so QIIME 2 complains that the format looks wrong.

Looking further into your data-set (and the provenance) it seems that MAFFT is being asked to align ~400,000 sequences which is a pretty heroic task. The general workflow you have seems fine, but I think you're losing out on a lot of the quality filtering that DADA2 can provide. I would probably recommend setting a trunc_len and a trim_left where the quality of your reads starts dropping significantly. This will results in far fewer features.

If you have some other motivation for avoiding this quality control, we're going to hopefully have a vsearch plugin shortly which will allow you to cluster your features based on sequence identity. This would also solve the issue of too many features to align, although my understanding is improving the quality control is probably the more robust approach generally so I would only try clustering if you still had too many features after stricter quality-control.

P.S. I'm going to let the alignment run a little bit longer on our server as we haven't completely run out of memory yet (about 50% of the way there) on the off-chance it finishes, but with an alignment of this size I'm not super optimistic.

@dgerrity, it is possible that the IDs are causing problems, but they shouldn't be. Would you be able to post your log file? Otherwise, I suspect you are running into the same issues as @akknight216. How many features do you have that are being aligned?

ebolyen · June 28, 2017, 8:19pm

Thanks for the logfile @dgerrity (sent in a DM).

After inspecting it, it does appear that a child-process of MAFFT is getting killed by a SIGKILL (almost certainly because it runs out of memory) and it also looks like MAFFT is not then returning that exit code, so our plugin has no idea anything went wrong and passes along the empty file as the result.

We can probably make the error reported by QIIME 2 better by upgrading our MAFFT binary.

But in either case, the actual solution here is reduce the number of features with some better quality-control (or get a really-really big computer to run this on). Without better quality-control you can expect the runtime of this step to be measured in weeks.

Thanks everyone!

jairideout · June 28, 2017, 9:26pm

Linking to corresponding issues on our tracker, we'll follow up here when they're resolved. This won't solve the general problem of aligning a huge number of sequences, but we can at least make the QIIME 2 framework detect empty files, and also try upgrading our version of MAFFT to handle exit codes more robustly. Thanks!

github.com/qiime2/qiime2

Identify when a format is empty

opened 08:24PM - 28 Jun 17 UTC

ebolyen

enhancement

**Improvement Description** If a given sniffer/validator for a format fails, it… would be pretty handy to check to see if there is anything in the file and report whether it was empty or not. It would have made debugging [this issue](https://github.com/qiime2/q2-alignment/issues/34) much simpler. **Comments** This might tie into the more general notion of "why" a format has failed a check, but it seems like a common enough situation to potentially bake-in. **References** [this issue](https://github.com/qiime2/q2-alignment/issues/34)

github.com/qiime2/q2-alignment

MAFFT binary returns exit code 0 when its own child process is killed (OOM)

opened 08:21PM - 28 Jun 17 UTC

closed 04:44PM - 23 Oct 17 UTC

ebolyen

type:bug

This [came up on the forum](https://forum.qiime2.org/t/alignment-problem/708/12)…. I believe upgrading our mafft binary will fix this, as there is a [changelog note](http://mafft.cbrc.jp/alignment/software/changelog.html) that says: > + Fixed a long-standing problem on exit code. But as of right now users will get this error: ``` InPath('/tmp/q2-AlignedDNAFASTAFormat-e5oap_ts') is not formatted as a AlignedDNAFASTAFormat file. ``` which is misleading.

benjjneb · June 28, 2017, 10:41pm

That is not expected behavior unless... do you have primers/linker/non-bio-nts in your reads? If so, you need to remove them (whether by trimming off the primer with trim-left or with a dedicated tool).

DADA2 (and any method that infers exact sequences) will interpret the ambiguous nucleotides that are in most primers as biological variation, which can artefactually multiply the number of features you get out at the end. For normal data, w/o primer sequence, the number of features should change relatively little when chopping off 10 nts.

akknight216 · July 6, 2017, 2:24pm

Thanks for looking into this! The scary thing was that were were using the dada2 filtering- truncating at 250 for the forward and reverse. Our sequence quality drops off before that, but if I cut too much more off the reads won't overlap.

I think we've addressed the QC issue in a different way- I finally figured out how to filter for the presence of a primer sequence first, discard the pair if one is missing a primer, and then use the trimming parameter F(10,250) and R(0,250). That has us down to about 15,000 features, which seems more reasonable. But by the end of that pipeline we only keep ~5% of our original reads. We loose 25% due to lack of primer, and other 70% after dada2 quality filtering.

ebolyen · July 6, 2017, 11:22pm

If failure to get your reads to overlap is looking like an issue, you can always "fall back" to single-end.

qiime dada2 denoise-single will accept paired-end data, but it will just use the forward reads so that you can easily try it out and see if it makes a substantial difference.

benjjneb · July 7, 2017, 3:24am

This is a sign of something going wrong: There is a problem if you are losing a large majority of your reads during the dada2 denoising step. My suspicion is that you still have primers on your reads, which interferes with both denoising and chimera detection and can cause way too many reads to be lost, but non-overlap is another possibility.

The easiest way to diagnose your issue would be if you can share the fastq file from one of your samples?

system · August 7, 2017, 9:24am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.

jairideout · October 27, 2017, 6:41pm

The latest release of QIIME 2 (2017.10) supports a newer version of MAFFT (7.310), which gracefully handles out-of-memory errors, among other types of errors that can arise while aligning sequences.