Alignment error message when input has more than 1,000,000 sequences

Hi, I am trying to align my sequences by running the mafft command, but it fails with this error:
inputfile = orig
1108985 x 392 - 150 d
nthread = 10
The number of sequences must be < 1000000
Please try the --parttree option for such large data.

How can I add this option when running this command?

Hi @yiwen! The q2-alignment plugin, which wraps the mafft executable, doesn’t currently expose that MAFFT option. I created an issue to keep track of this feature, but this is probably low on our priority list (see my explanation below for details). As a workaround, you can run the mafft command directly with --parttree and import the alignment into QIIME 2 following this section of the importing guide.
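The workaround above could be sketched roughly as follows. This is only an illustration, not what q2-alignment does internally: the file names in the usage note are hypothetical, and it assumes `mafft` and `qiime` are both on your PATH. `--parttree` comes from the error message, `--thread` sets the thread count, and `--preservecase` is included on the assumption that you want the input letter case kept in the alignment.

```python
import subprocess
from typing import List


def build_mafft_command(input_fasta: str, threads: int = 1) -> List[str]:
    """Build a MAFFT command line that uses the --parttree option."""
    return [
        "mafft",
        "--parttree",        # option suggested by the MAFFT error message
        "--preservecase",    # assumption: keep the input letter case
        "--thread", str(threads),
        input_fasta,
    ]


def build_import_command(aligned_fasta: str, output_qza: str) -> List[str]:
    """Build the QIIME 2 command that imports an existing alignment."""
    return [
        "qiime", "tools", "import",
        "--type", "FeatureData[AlignedSequence]",
        "--input-path", aligned_fasta,
        "--output-path", output_qza,
    ]


def align_and_import(input_fasta: str, aligned_fasta: str,
                     output_qza: str, threads: int = 1) -> None:
    """Run MAFFT (it writes the alignment to stdout), then import the result."""
    with open(aligned_fasta, "w") as out:
        subprocess.run(build_mafft_command(input_fasta, threads),
                       stdout=out, check=True)
    subprocess.run(build_import_command(aligned_fasta, output_qza), check=True)
```

For example, `align_and_import("rep-seqs.fasta", "aligned-rep-seqs.fasta", "aligned-rep-seqs.qza", threads=10)` would run both steps; the importing guide remains the reference for the import itself.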


Backing up a step, I wanted to note that’s a lot of sequences to align! We typically see users aligning far fewer sequences (e.g. hundreds or thousands) because these sequences are intended to be representative sequences. A representative sequence is a single sequence corresponding to a feature (e.g. ASV or OTU) in the feature table. Thus, if you’re attempting to align more than a million representative sequences, that means you have more than a million features in the feature table. That’s a lot more features than we usually see. The only time I’ve seen millions of features is when performing open-reference OTU picking on the Earth Microbiome Project data set, which likely suffers from the “OTU inflation” phenomenon discussed in various publications.
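If it helps, here is a minimal sketch for checking how many sequences a FASTA file contains before trying to align it. It counts header lines (those starting with `>`) and compares the total against the limit quoted in the error message above; the limit value is taken from that message and may differ in other MAFFT versions, and the function names are just illustrative.

```python
from typing import Iterable

# Limit taken from the error message in this thread:
# "The number of sequences must be < 1000000"
MAFFT_DEFAULT_LIMIT = 1_000_000


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def fits_default_mafft(n_sequences: int,
                       limit: int = MAFFT_DEFAULT_LIMIT) -> bool:
    """Return True if the sequence count is strictly below the limit."""
    return n_sequences < limit
```

With a file on disk you could call `count_fasta_records(open("rep-seqs.fasta"))` (hypothetical file name). A count in the hundreds or thousands is what we would usually expect for representative sequences.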

If you can provide some more details about where these sequences came from (i.e. are they representative sequences or something else?), I can provide some direction. If you haven’t already reviewed these guides, I recommend checking out the Getting Started guide and the Moving Pictures tutorial, which cover topics such as representative sequences and feature tables.

Thanks for replying.

They were representative sequences based on the Moving Pictures tutorial. In the end, I excluded more sequences in the quality filtering step, and it worked.

Thank you for explaining what a reasonable number of representative sequences looks like.

Cheers!

Great, thanks for reporting back @yiwen!

Did you observe millions of representative sequences when using the Moving Pictures tutorial data set, or were you following the tutorial’s steps using your own data?

Using the Moving Pictures tutorial data set with DADA2, we’re observing 759 representative sequences (i.e. 759 features) using QIIME 2 2017.12. I just want to make sure you’re not observing a larger number of representative sequences if you were using the tutorial data set. Thanks!

Hi,

In my last post, I was trying to point out that I followed the tutorial’s steps using my own data. Sorry about the confusion.

Cheers!

No problem, thanks for clarifying!

Hi @jairideout, if I have to run mafft separately on my FASTA files and then import the alignment, is there still a way to dereplicate the sequences? I know that in the tutorial that step is performed before the alignment.

Thanks,
Zach

Hi @Zach_Burcham! There isn’t a way to dereplicate the aligned sequences after importing them as FeatureData[AlignedSequence]. The unaligned and aligned FeatureData sequences are intended to be representative sequences, which have already been pre-processed in some way (e.g. with dereplication, OTU clustering, and/or denoising).

If you haven’t already, I recommend reviewing the Getting Started guide and the Moving Pictures tutorial, which cover these topics, including how to perform a complete analysis starting with raw sequence data and producing “downstream” plots, statistics, etc.

If you have further questions, please create new forum topics for those. Thanks!
