Import amino acid sequences?

Hi everybody,

I was looking to import amino acid sequences with
qiime tools import \
  --input-path aa.fna \
  --output-path aa.qza \
  --type 'FeatureData[Sequence]'
but an error message made me wonder whether I can run downstream analyses on clean amino acid sequences in Q2 at all. I would like to (1) make a tree with MAFFT and (2) apply the diversity core-metrics command to these data. I do have a BIOM table of the data converted to a feature table and would now just need the sequences as an artifact. Any comments would be great, thanks!

steffen

Hello Steffen,

Great question. I know QIIME was built to work with amplicon DNA sequences, so I’m also curious to know if it can work with amino acids. MAFFT works with both, but I’m not sure about the other parts of the pipeline.

Can you also post the error thrown by this specific command?

Colin

@colinbrislawn and others.

Here is the error message. Of course I can convert the sequences to lowercase letters with a quick one-liner, but would this have any effect later on, for example being interpreted as masked sequences?

An unexpected error has occurred:
Invalid characters in sequence: ['E', 'F', 'I', 'L', 'P', 'Q'].
Valid characters: ['V', 'W', 'T', 'S', 'Y', 'H', 'D', 'C', 'A', 'B', 'R', 'K', 'N', '-', '.', 'M', 'G']
Note: Use lowercase if your sequence contains lowercase characters not in the sequence's alphabet.
See above for debug info.

Thank you!
steffen

Update: All-lowercase sequences in the .fna file generate this error:

Is there another possible way to import aa sequences? If not, exactly which algorithms of the qiime diversity core-metrics-phylogenetic command are not compatible with aa?

Hello Steffen,

That error makes it pretty clear that QIIME 2 is throwing an error for every non-DNA letter it’s seeing, so amino acids are being rejected outright.
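
As a rough illustration of what the error is reporting, here is a small Python sketch. It does not use QIIME 2 itself; the alphabet is copied from the "Valid characters" list in the error message (which matches the IUPAC nucleotide codes plus the gap characters), and the function name and the example sequences are made up for this illustration.

```python
# Valid characters copied from the error message quoted in this thread.
# They correspond to the IUPAC nucleotide codes (A, C, G, T plus degenerate
# codes) together with the gap characters '-' and '.'.
DNA_ALPHABET = set("ACGTRYSWKMBDHVN-.")


def invalid_characters(sequence, alphabet=DNA_ALPHABET):
    """Return the sorted characters in `sequence` that are not in `alphabet`."""
    return sorted(set(sequence) - alphabet)


# Hypothetical protein fragment, used only to illustrate the check.
protein = "MKELFPIQ"
print(invalid_characters(protein))  # ['E', 'F', 'I', 'L', 'P', 'Q']

# A nucleotide sequence with a degenerate base and a gap passes the same check.
print(invalid_characters("ACGTN-ACGT"))  # []
```

Note that amino acid letters such as M, K or D also happen to be degenerate DNA codes, which would explain why only some of the amino acid letters show up in the "Invalid characters" list.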

Keep in mind that you can do MSA with MAFFT then tree building with FastTree2 totally outside of Qiime. Considering that qiime only accepts DNA, it looks like that may be your best option at this point.

Let’s see what the Qiime devs recommend. Maybe this is something they want to add, or maybe this is not part of the plan for Qiime.

Colin

Thanks @colinbrislawn

The stand-alone applications will be helpful. Do you have an idea how to calculate UniFrac outside of QIIME 2? I am testing the QIIME 1 command beta_diversity.py right now. Sorry that this question goes beyond QIIME 2. Any advice is highly appreciated though!

cheers,
steffen

I’ve used the Phyloseq R package to calculate UniFrac distances before, but let’s see what the Qiime devs recommend.

Colin

I tested QIIME 1 and it worked perfectly, since at that step I am not actually working with the aa sequences but only with OTU numbers and a distance matrix.

Thank you @colinbrislawn.

I’m glad you got this working!

It makes sense that the other steps don’t care about AA vs DNA. The tree only has the names of the features, and UniFrac gives you the distance between samples. So you can do MSA -> tree, then tree + sample table = UniFrac distance matrix.
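
To show why the sequences themselves are not needed at this step, here is a toy Python sketch of unweighted UniFrac computed from nothing but a tree and a table of which features occur in which sample. It is only an illustration under assumptions: the tree, the table, and all function names are hypothetical, and this is not the code that QIIME or phyloseq actually run.

```python
# Hypothetical toy tree: each node maps to (parent, branch length to parent).
# Tip names are feature IDs only; no sequence data is stored in the tree.
TREE = {
    "otu1": ("n1", 1.0),
    "otu2": ("n1", 1.0),
    "n1": ("root", 0.5),
    "otu3": ("root", 2.0),
}

# Hypothetical feature table: sample -> set of observed feature IDs.
TABLE = {
    "sampleA": {"otu1", "otu2"},
    "sampleB": {"otu2", "otu3"},
}


def branches_to_root(tip, tree):
    """Return the nodes whose parent branch lies on the path from tip to root."""
    path = set()
    node = tip
    while node in tree:
        path.add(node)
        node = tree[node][0]
    return path


def covered_branches(features, tree):
    """Return all branches (named by their child node) covered by a sample."""
    covered = set()
    for tip in features:
        covered |= branches_to_root(tip, tree)
    return covered


def unweighted_unifrac(features_a, features_b, tree):
    """Branch length unique to one sample divided by total observed length."""
    a = covered_branches(features_a, tree)
    b = covered_branches(features_b, tree)
    unique = sum(tree[node][1] for node in a ^ b)
    total = sum(tree[node][1] for node in a | b)
    return unique / total if total else 0.0


distance = unweighted_unifrac(TABLE["sampleA"], TABLE["sampleB"], TREE)
print(f"unweighted UniFrac(sampleA, sampleB) = {distance:.3f}")
```

The only inputs are feature IDs, branch lengths, and presence per sample, so it does not matter whether the tree was built from a DNA or an amino acid alignment.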

Very cool,
Colin

Hi @steff1088,
Good question, thanks for bringing this up. You and @colinbrislawn are correct - at the moment we don’t support sequences of amino acids, but it is something that we could support. Out of curiosity, would you mind describing your use case? I’m going to assume in this message that all of your amino acid sequences are homologous with one another (i.e., they are sequences of the same protein) - if that is not the case, let me know.

The reason that we currently don’t allow sequences of amino acids is that some actions in QIIME 2 assume that the sequence is nucleotide. This includes methods in the q2-vsearch, q2-alignment, q2-phylogeny, and q2-feature-classifier plugins, and the feature-table tabulate-seqs visualizer (which creates links to BLAST sequences against a nucleotide database). To support amino acids, we would need a new semantic type to indicate that the sequences are of amino acids (probably something like FeatureData[ProteinSequence]) so that actions that assume a nucleotide sequence don’t mistakenly operate on an amino acid sequence, and then of course methods to operate on the new type.
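
To make the idea of a separate semantic type a bit more concrete, here is a toy Python sketch. It is not QIIME 2's actual type system or API: the Artifact class and the nucleotide_only_action function are hypothetical names, and FeatureData[ProteinSequence] is only the tentative type name mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    """Toy stand-in for an artifact: data tagged with a semantic type name."""
    semantic_type: str
    sequences: dict


def nucleotide_only_action(artifact):
    """Hypothetical action that assumes nucleotide input."""
    if artifact.semantic_type != "FeatureData[Sequence]":
        raise TypeError(
            f"expected FeatureData[Sequence], got {artifact.semantic_type}"
        )
    return f"processed {len(artifact.sequences)} nucleotide sequence(s)"


# Hypothetical example data for both kinds of sequence.
dna = Artifact("FeatureData[Sequence]", {"otu1": "ACGTACGT"})
protein = Artifact("FeatureData[ProteinSequence]", {"otu1": "MKELFPIQ"})

print(nucleotide_only_action(dna))
try:
    nucleotide_only_action(protein)
except TypeError as err:
    print(f"rejected: {err}")
```

With the type tag checked up front, a protein artifact is refused before a nucleotide-specific action ever looks at the sequence content.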

I think you and @colinbrislawn have mostly made it to this point, but what I would recommend for now is building your tree and feature table outside of QIIME 2, and then importing both of those. You can import your tree as illustrated here (this is for an unrooted tree - you can use qiime phylogeny midpoint-root to root that tree, or if your tree is already rooted you can import it with --type 'Phylogeny[Rooted]'). Any of the methods downstream of here should work fine with your data.

@gregcaporaso Thank you very much, that was very helpful. I have tested this approach and it does work after importing feature table and tree.

On the use of amino acids in QIIME: the amino acid sequences are from the same protein, in this case MCR from the methanogenesis pathway. I am very interested in functional genes (other than MCR, NOS, NIR, AMO,…) and, because of their high conservation, it made sense to me to conduct classification and diversity analyses at the amino acid level for these genes. Can anybody think of other potential benefits of using amino acid sequences?

Thanks for your input!

cheers,
steffen

Thanks for the update @steff1088, glad it’s working for you! We’ll keep this in mind as we consider adding support for amino acid sequences in the future.
