Sorry, I don’t think we’ve seen this before. It looks like QIIME 2 is rejecting MAFFT’s output, which is pretty strange.
Could you provide a sample of a few of the sequences in your rep-seqsMerged_060817_250.qza? (You can use qiime tools export to get at that file quickly; it should contain just a single dna-sequences.fasta file.)
I’m wondering if these aren’t DNA sequences for some reason. Aside from an invalid character, the only other reason QIIME 2 would have rejected the output is if the first few sequences’ lengths did not match in the alignment.
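If it helps to rule those out, here is a rough sanity check one could run on the exported dna-sequences.fasta. This is only a sketch of the two conditions described above, not QIIME 2’s actual validation code, and the IUPAC alphabet used is my assumption:

```python
# Rough sanity check for an exported FASTA file: flags characters outside
# the IUPAC DNA alphabet, and records whose lengths differ from the first
# record's length (which would break an alignment).
IUPAC_DNA = set("ACGTRYSWKMBDHVN.-")

def check_fasta(lines):
    """Return (bad_chars_by_id, ids_with_mismatched_length)."""
    seqs, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            seqs[current] = []
        else:
            seqs[current].append(line.upper())
    joined = {sid: "".join(parts) for sid, parts in seqs.items()}
    bad = {sid: sorted(set(s) - IUPAC_DNA)
           for sid, s in joined.items() if set(s) - IUPAC_DNA}
    lengths = [len(s) for s in joined.values()]
    first_len = lengths[0] if lengths else 0
    mismatched = [sid for sid, s in joined.items() if len(s) != first_len]
    return bad, mismatched

# Example: 'X' is not a valid IUPAC DNA code.
bad, mismatched = check_fasta([">seq1", "ACGT", ">seq2", "ACXT"])
print(bad)         # {'seq2': ['X']}
print(mismatched)  # []
```

To check a real export, pass an open file handle: `check_fasta(open("dna-sequences.fasta"))`.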
P.S. Newer versions of QIIME 2 will automatically write the verbose output to a temporary file if something goes wrong, so you won’t need to re-run with --verbose anymore!
The descriptive lines do start with “>”; I don’t know why it isn’t showing up in the pasted version.
Also, just out of curiosity since this is so strange: I recently reran dada2 and the only difference was whether I trimmed 10 nucleotides off the front of the forward read. I went from ~30,000 features with the trimming to >300,000 features without it. Is that to be expected? And is it potentially related to this problem? The feature table does write out normally, though.
(The forum is eating that character to make a quote block; I changed it to a code block so now it’s treated literally.)
Those sequences seem fine. Would you be able to send me that qza file (in a DM)? I’d like to try to reproduce this locally. There is probably something up with MAFFT here.
That seems reasonable; the first few nucleotides are usually pretty poor quality. Does trimming the reverse read in the same way reduce the feature count further? The 30,000 sounds much more reasonable. In practice, however, this shouldn’t impact the alignment (other than a really poor runtime).
I received the same error while trying to use alignment mafft. I can also send you my qza file if necessary, but I will wait to see what you find out with Anna’s dataset. A debug .log file was also saved, in case that is helpful.
@ebolyen, my .qza is 9.1 MB. Is it possible that the Feature ID format is causing a problem? I attached a screenshot based on the corresponding .qzv visualization. I started with a file provided by my sequencing lab.
I’ve been running MAFFT for the last day and it’s already taken about 155 GB of memory. I suspect that MAFFT is running out of memory and not returning an appropriate exit code. Looking at the changelog for MAFFT, there does seem to be a suspicious entry under v7.310 2017/3/17, which is newer than the MAFFT on bioconda, but I couldn’t find any further details about what specifically was wrong with the exit code.
So what I suspect is happening is that MAFFT is getting a failure from malloc and is accidentally returning 0 as the exit code, which means q2-alignment doesn’t realize anything has gone wrong. From there it tries to read the alignment FASTA file where MAFFT would have put the results, but the file is empty because MAFFT never finished, so QIIME 2 complains that the format looks wrong.
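For anyone curious, this failure mode is easy to guard against in a wrapper. A minimal sketch, purely my own illustration (not q2-alignment’s actual code; the function name and error messages are made up):

```python
import os
import subprocess

def run_aligner(cmd, output_path):
    """Run an external aligner, but don't trust a zero exit code if the
    promised output file is missing or empty (the suspected MAFFT
    out-of-memory case described above)."""
    result = subprocess.run(cmd, capture_output=True)
    if result.returncode != 0:
        raise RuntimeError(f"aligner failed with exit code {result.returncode}")
    if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
        raise RuntimeError("aligner exited 0 but produced an empty output file")
    return output_path
```

The second check is the one that would have caught this: the exit code alone looked fine, but the result file was empty.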
Looking further into your dataset (and the provenance), it seems that MAFFT is being asked to align ~400,000 sequences, which is a pretty heroic task. The general workflow you have seems fine, but I think you’re losing out on a lot of the quality filtering that DADA2 can provide. I would probably recommend setting a trunc_len and a trim_left where the quality of your reads starts dropping significantly. This will result in far fewer features.
If you have some other motivation for avoiding this quality control, we’re hopefully going to have a vsearch plugin shortly which will allow you to cluster your features based on sequence identity. This would also solve the issue of having too many features to align. That said, my understanding is that improving the quality control is generally the more robust approach, so I would only try clustering if you still had too many features after stricter quality control.
P.S. I’m going to let the alignment run a little longer on our server, since we haven’t completely run out of memory yet (about 50% of the way there), on the off chance it finishes, but with an alignment of this size I’m not super optimistic.
@dgerrity, it is possible that the IDs are causing problems, but they shouldn’t be. Would you be able to post your log file? Otherwise, I suspect you are running into the same issues as @akknight216. How many features do you have that are being aligned?
After inspecting it, it does appear that a child process of MAFFT is getting killed by a SIGKILL (almost certainly because it runs out of memory), and it also looks like MAFFT is not then returning that exit code, so our plugin has no idea anything went wrong and passes along the empty file as the result.
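As an aside, a parent process can see that signal directly if the wrapper propagates it: on POSIX, Python’s subprocess module reports a child killed by a signal as a negative return code, -9 for SIGKILL, the same signal the kernel’s OOM killer sends. A small POSIX-only demonstration (unrelated to MAFFT’s actual wrapper script):

```python
import signal
import subprocess
import sys

# Spawn a child that kills itself with SIGKILL, mimicking the OOM killer.
child = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
# The parent sees the negated signal number, not a normal exit code.
print(child.returncode)  # -9
```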
We can probably make the error reported by QIIME 2 better by upgrading our MAFFT binary.
But in either case, the actual solution here is to reduce the number of features with some better quality control (or get a really, really big computer to run this on). Without better quality control you can expect the runtime of this step to be measured in weeks.
Linking to the corresponding issues on our tracker; we’ll follow up here when they’re resolved. This won’t solve the general problem of aligning a huge number of sequences, but we can at least make the QIIME 2 framework detect empty files, and also try upgrading our version of MAFFT to handle exit codes more robustly. Thanks!
That is not expected behavior unless… do you have primers/linkers/non-biological nucleotides in your reads? If so, you need to remove them (whether by trimming off the primer with trim-left or with a dedicated tool).
DADA2 (and any method that infers exact sequences) will interpret the ambiguous nucleotides that are in most primers as biological variation, which can artefactually multiply the number of features you get out at the end. For normal data, without primer sequence, the number of features should change relatively little when chopping off 10 nts.
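A toy illustration of that mechanism (made-up reads, not anyone’s real data): one degenerate base in an untrimmed primer turns a single biological sequence into multiple distinct “exact” sequences, and trimming the primer collapses them again.

```python
# Reads that all come from the same template, but carry a primer whose
# degenerate base Y was synthesized as either C or T.
PRIMER_LEN = 8  # primer ACGTACYT
reads = ["ACGTAC" + y + "T" + "GGGGCCCCAAAA" for y in ("C", "T", "C", "T")]

# An exact-sequence method sees each untrimmed variant as a separate
# feature; after trimming the primer, they collapse to one.
untrimmed_features = set(reads)
trimmed_features = {r[PRIMER_LEN:] for r in reads}

print(len(untrimmed_features))  # 2 "features" from primer variation alone
print(len(trimmed_features))    # 1 feature once the primer is removed
```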
Thanks for looking into this! The scary thing was that we were using the dada2 filtering, truncating at 250 for the forward and reverse. Our sequence quality drops off before that, but if I cut much more off, the reads won’t overlap.
I think we’ve addressed the QC issue in a different way: I finally figured out how to filter for the presence of a primer sequence first, discarding the pair if either read is missing a primer, and then use the trimming parameters F(10,250) and R(0,250). That has us down to about 15,000 features, which seems more reasonable. But by the end of that pipeline we only keep ~5% of our original reads: we lose 25% due to lack of primer, and another 70% after dada2 quality filtering.
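For anyone following along, the primer-presence filter described above can be sketched like this (toy Python with a made-up primer; in practice a dedicated, IUPAC-aware trimming tool with mismatch tolerance is the right way to do this):

```python
# Keep a read pair only if both reads begin with their expected primer.
# Degenerate IUPAC codes in the primer (e.g. Y = C or T) are expanded.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT"}

def starts_with_primer(read, primer):
    if len(read) < len(primer):
        return False
    return all(base in IUPAC[p] for base, p in zip(read, primer))

def filter_pairs(pairs, fwd_primer, rev_primer):
    """Discard the pair if either read is missing its primer."""
    return [(f, r) for f, r in pairs
            if starts_with_primer(f, fwd_primer)
            and starts_with_primer(r, rev_primer)]
```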
This is a sign of something going wrong: losing a large majority of your reads during the dada2 denoising step indicates a problem. My suspicion is that you still have primers on your reads, which interferes with both denoising and chimera detection and can cause far too many reads to be lost, but non-overlap is another possibility.
The easiest way to diagnose your issue would be for you to share the fastq file from one of your samples.