Found 139 sequences that are exactly identical to other sequences in the alignment

I am running a command to create a tree using the raxml-rapid-bootstrap pipeline from the tutorial. When I run the command, it generates a warning message: "Found 139 sequences that are exactly identical to other sequences in the alignment", and provides an alternate file without the duplicate sequences. My questions:

Can I continue with this command, even with the duplicate sequences? (It is still running.)
How do I use the alternate file (given in FASTA format)?

Here is the command I ran:
qiime phylogeny raxml-rapid-bootstrap \
  --i-alignment Clinical/masked-aligned-rep-seqs.qza \
  --p-seed 1723 \
  --p-rapid-bootstrap-seed 9384 \
  --p-bootstrap-replicates 100 \
  --p-substitution-model GTRCAT \
  --o-tree Clinical/raxml-cat-bootstrap-tree.qza \
  --verbose

Thanks

Hi @emezhibovsky,

You can indeed continue with your analyses. If I remember correctly, raxml:

  • Simply ignores the duplicate sequences during tree searches, as they are redundant data.
  • However, these should still appear in your final tree (likely as a polytomy).

Duplicates like these are often the result of masking your sequence alignment. The removal of masked columns can reduce noise in the alignment, which is sometimes beneficial. However, masking can also drastically increase the similarity of the sequences to each other, often to the point of making them identical.
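If you want to see how many sequences became identical after masking, here is a minimal sketch that counts exact duplicates in a FASTA alignment. The function name `count_duplicate_seqs` is made up for this example, and the comparison (case-insensitive, gaps kept as-is) is my assumption; raxml may treat ambiguity codes differently, so the number may not match its warning exactly.

```shell
# Hypothetical helper: count sequences that exactly match an earlier
# sequence in a FASTA file (multi-line records are joined first).
# Usage: count_duplicate_seqs alignment.fasta
count_duplicate_seqs() {
  awk '
    # On each header line, tally the record that just ended.
    /^>/ { if (seq != "") { if (seen[seq]++) dups++ }; seq = ""; next }
    # Sequence line: strip whitespace, upper-case, and append.
    { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
    # Tally the final record and print the duplicate count.
    END { if (seq != "") { if (seen[seq]++) dups++ }; print dups + 0 }
  ' "$1"
}
```

To run it on your data, you would first export the artifact with `qiime tools export --input-path Clinical/masked-aligned-rep-seqs.qza --output-path exported-alignment` and then call `count_duplicate_seqs` on the FASTA file inside that directory (I believe it is named `aligned-dna-sequences.fasta`, but check the export output).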

Masking is still a somewhat touchy subject as outlined here.

One extra tip: try adding --p-raxml-version AVX2 to your command. This should greatly speed up your run time, provided your CPU supports AVX2 instructions.
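Before switching to the AVX2 build, it is worth confirming that your processor actually supports it. A rough sketch for Linux, assuming the CPU flags are listed in /proc/cpuinfo; `cpu_has_avx2` is just a name I picked for this example, and the check will not work on macOS:

```shell
# Hypothetical helper: succeed if the given cpuinfo file (default:
# /proc/cpuinfo, Linux only) lists the avx2 flag as a whole word.
cpu_has_avx2() {
  grep -qw avx2 "${1:-/proc/cpuinfo}" 2>/dev/null
}

if cpu_has_avx2; then
  echo "AVX2 found: --p-raxml-version AVX2 should work"
else
  echo "AVX2 not found: leave --p-raxml-version at its default"
fi
```

If the check fails, simply leave the option off and the command will run as before, just more slowly.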

Hope this helps!
-Mike
