Importing protein sequences into version 2021.2

mcreyno2 · March 2, 2021, 10:51pm

Hi q2-team!

I was excited to see that version 2021.2 includes q2-types that can read protein/amino acid sequences. However, I am getting errors when importing an unaligned .fasta file, which contains typical FASTA headers and amino acid sequence lines. I am referring to this announcement about protein types (QIIME 2 2021.2 is now available).

Can I confirm that the formats/q2-types listed in the announcement are immediately available? For instance, running a qiime tools import command yields an error: No format: ProteinFASTAFormat. The import command I ran is below:

$ qiime tools import --type 'SampleData[Sequences]' --input-path protein_fasta_qiime2_test/non_chimeric_prot_corr_q2_import_test.fasta --output-path sequences.qza --input-format ProteinFASTAFormat

Let me know if I can provide any other details to help clarify the error or workflow. For example, running the command below does not yield a list with "ProteinFASTAFormat", "ProteinSequence", or "FASTAFormat"... only "DNAFASTAFormat":

$ qiime tools import --show-importable-formats

Thanks in advance for the assistance!

Cheers,

Mark

misialq · March 3, 2021, 1:27pm

Hi @mcreyno2,

Really exciting to see the first protein format users - thanks for trying that out!

Yes, they should be available as of version 2021.2.

That is unexpected. Just to confirm, I tried it now with a fresh installation of QIIME 2, and in the list of formats I can see both ProteinFASTAFormat and AlignedProteinFASTAFormat. Could you double-check that you ran that command with the right QIIME 2 version? Just execute qiime info to see all the versions.

Protein sequences (unaligned and aligned) can only be imported as FeatureData[ProteinSequence] or FeatureData[AlignedProteinSequence]. So in your example you would need to do something like:

$ qiime tools import --type "FeatureData[ProteinSequence]" --input-path protein_fasta_qiime2_test/non_chimeric_prot_corr_q2_import_test.fasta --output-path sequences.qza
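Before importing, it can help to confirm that the file really holds amino acid records rather than DNA. The following is a minimal sketch using only the Python standard library. The function name `looks_like_protein_fasta` and the accepted alphabet are assumptions made for illustration; this is not the validation that QIIME 2 itself performs on import.

```python
import io

# Assumed alphabet: the 20 standard residues, common ambiguity/rare codes, and '*' for stop.
PROTEIN_CHARS = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ*")
# Characters that would also be valid in a plain DNA record.
DNA_CHARS = set("ACGTN")


def looks_like_protein_fasta(handle):
    """Return True if all sequence lines use protein characters and at
    least one residue falls outside the DNA alphabet."""
    saw_header = False
    saw_non_dna = False
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            saw_header = True
            continue
        if not saw_header:
            return False  # sequence data before any header
        residues = set(line.upper())
        if not residues <= PROTEIN_CHARS:
            return False  # unexpected character for a protein record
        if residues - DNA_CHARS:
            saw_non_dna = True
    return saw_header and saw_non_dna


# Usage with an in-memory record (hypothetical sequence).
print(looks_like_protein_fasta(io.StringIO(">seq1\nMKVLAAGIV\n")))
```

A short peptide made up only of A, C, G, T, or N would be reported as not protein-like by this heuristic, so treat a False result as a prompt to inspect the file rather than as proof.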

To confirm that those two types are available in your installation you can also run:

$ qiime tools import --show-importable-types

Hope that helps. Let us know in case it still doesn't work.

Cheers,
Michal

mcreyno2 · March 9, 2021, 5:36pm

Hi @misialq,

Thank you for your reply. I am eager to work with protein seqs in qiime2 now.

Your assistance was very helpful. I learned that what I thought was qiime2-2021.2 was actually built under a previous qiime version (2019.10)... so that explains why those importable types and formats were missing. Now that the version issue is resolved, I was able to test your suggestion.
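The version mix-up above can be guarded against with a small check on the release string reported by qiime info. This is a rough sketch: the helper names `release_tuple` and `has_protein_types` are hypothetical, and the only fact taken from this thread is that the protein types are available as of 2021.2.

```python
def release_tuple(version):
    """Parse a release string such as '2021.2' or '2021.2.0' into (year, month)."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def has_protein_types(version, first_release="2021.2"):
    """Return True if `version` is at least the release that introduced protein types."""
    return release_tuple(version) >= release_tuple(first_release)


# Usage: compare the environment's release against 2021.2.
print(has_protein_types("2019.10"))
print(has_protein_types("2021.2.0"))
```

Comparing integer tuples instead of raw strings matters here, because a plain string comparison would rank "2021.11" below "2021.2".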

Indeed, I am now able to import amino acid rep-seqs with no issues! This is tremendous progress and allows me to compare and contrast between DNA/AA codes for a functional gene amplicon of interest.

I will try it when I have more time... but in theory, I could train a naive bayes classifier using amino acid codes and classify these AA rep seqs with it? Correct?

Thanks again. I really appreciate the help and new updates from the qiime2 team!

Mark

Nicholas_Bokulich · March 9, 2021, 6:24pm

Alas no — because those methods do not yet accept the appropriate semantic type.

This would be a fairly "easy" fix (provided that the underlying methods can handle AA sequences) if you are not afraid of a little Python.

If you want to contribute to q2-feature-classifier and any other plugins, let us know — we can get you started with the changes that are needed to support these new protein types, and we could really use a hand in testing out what methods can support them.

mcreyno2 · March 9, 2021, 7:29pm

Hi @Nicholas_Bokulich,

This makes sense. Upon re-reading the 2021.2 announcement, it does seem like importing protein seqs via .fasta was the first step (with plans to expand which plugins accept protein seq artifacts?). So perhaps I was getting ahead of myself about what the current qiime2-2021.2 framework is capable of.

Hmmm.... this sounds like an interesting and useful task. I do not have training in Python coding... but would be interested in assisting in any way I can! I have glanced at the classifier.py script on the GitHub page (q2-feature-classifier/q2_feature_classifier at dev · qiime2/q2-feature-classifier · GitHub). I see spots where one can include the new types Michal referred me to (FeatureData[ProteinSequence]). But I doubt it's that easy.

Regardless, I am available to test/compare the final qiime2 products. Additionally, I can ask around to see if there are some Python experts I know who might be interested in assisting.

Nicholas_Bokulich · March 10, 2021, 5:41am

Unfortunately a little bit, yes — the new semantic type just means that QIIME 2 has a notion of what AA sequences look like, but support for this type needs to be manually added to individual plugins.

We do actually have one plugin that accepts this type at the moment and does interesting things:
https://library.qiime2.org/plugins/q2-protein-pca/30/

It might be really helpful to have you "kick the tires" after we add some experimental updates... I am not sure what our timeline is on this, but @misialq and I might get in touch at some point to see if you want to alpha test this for us.

misialq · March 11, 2021, 3:09pm

Hey @mcreyno2,

As @Nicholas_Bokulich pointed out, it would be fantastic if you'd be willing to test some of the protein-based features, once we add them in the future. For now though, do you think you could tell us just a bit more about your use case? Sounds like you want to train a classifier based on protein sequences - any particular reason to use those rather than DNA? What would your classification target be? Are there any other places/plugins/actions where you see application for protein sequence input?

Thanks!

ahfitzpa · March 18, 2021, 3:21pm

I am also interested in using the naive bayes classifier on amino acid sequences. I work with Caliciviridae, and for norovirus genogroup GII, classification is usually based on amino acid sequence rather than nucleotide sequence due to high genetic diversity. The classification is good, but I suspect it would be improved by using aa! I need 100% agreement with the Norovirus Genotyping Tool.

mcreyno2 · March 18, 2021, 5:25pm

I would be very interested in this! I have per-feature rep seqs files I can use for any alpha testing. I'm sure I can also coerce DNA amplicons "off the sequencer" to amino acid sequences if needed for any testing of clustering pipelines, etc. Feel free to communicate with me in a private message and I can provide my email address. I will try the q2-protein-pca plugin in the meantime.

Sure! Yes, I have a database of amino acid sequences for a subunit of the enzyme involved in methane generation in archaea (mcrA). Similar to @ahfitzpa, I am following literature suggesting that DNA coding for a non-universal protein should be classified as amino acids, since the amino acid sequences evolve faster than methanogenic 16S genes, for example. I am still working on a thorough comparison studying this gene as a DNA or amino acid sequence (@Nicholas_Bokulich, I, and another had a good chat about this previously on a qiime2 forum post - Will qiime2 support functional gene analysis in the future?).

I currently use the Fungene Pipeline, which has a step to convert DNA amplicons to amino acids, correcting for frameshifts in the reading frame, and provides nearest-neighbor classification. This seems robust... but I'd be interested in q2-feature-classifier for increased confidence when training a classifier on the amplicon region the primers cover. Another reason is that nearest-neighbor classification always classified every rep seq to the full ranking of the database (to genus level). I like the naive-bayes classifier because it seems to stop at a given rank when it lacks the confidence to go deeper taxonomically.
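The "stop at a given rank" behaviour described above can be illustrated with a toy sketch. It assumes a per-rank confidence is already available for each assignment; the function name `truncate_taxonomy`, the 0.7 threshold, and the confidence values are all hypothetical, and this is not the q2-feature-classifier implementation.

```python
def truncate_taxonomy(ranks, confidences, threshold=0.7):
    """Keep leading ranks whose confidence meets the threshold and stop
    at the first rank that falls below it."""
    kept = []
    for label, confidence in zip(ranks, confidences):
        if confidence < threshold:
            break  # do not report this rank or anything deeper
        kept.append(label)
    return kept


# Usage with made-up confidences for one rep seq.
ranks = ["k__Archaea", "p__Euryarchaeota", "c__Methanomicrobia", "g__Methanosarcina"]
confidences = [0.99, 0.95, 0.81, 0.42]
print("; ".join(truncate_taxonomy(ranks, confidences)))
```

A nearest-neighbor approach that always returns the full lineage of the best hit corresponds to setting the threshold to 0 in this sketch.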

Other q2-plugins in which accepting amino acid sequences could be advantageous might be q2-diversity, longitudinal (and associated plugins), Vsearch, ANCOM, and q2-phylogeny. Of course, all aspects of qiime2 would be great, but here are a few examples without providing a laundry list.

Thanks q2-team for your assistance. And welcome @ahfitzpa to the conversation and to the qiime2 forum.

Nicholas_Bokulich · March 18, 2021, 7:09pm

Thanks @mcreyno2 and @ahfitzpa ! These sound like very good justifications.

The actions in these plugins mostly take feature tables and phylogenies, not sequences... so technically these already support protein sequences if these are represented in a feature table. In your case, you are converting DNA to protein, so the base feature table should work (possibly collapsing the table based on the protein translation?). Phylogeny is another issue, see below...
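Collapsing a feature table based on the protein translation could look roughly like the sketch below. It is a plain-Python illustration, not a QIIME 2 action: the table is assumed to be a nested dict of counts, translation uses the standard genetic code in reading frame 1 with no frameshift correction, and the names `translate` and `collapse_by_translation` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate frame 1; unknown codons become 'X', a trailing partial codon is dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def collapse_by_translation(table, sequences):
    """Sum counts of DNA features that share the same protein translation.

    table: {feature_id: {sample_id: count}}
    sequences: {feature_id: DNA sequence}
    """
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature_id, counts in table.items():
        protein = translate(sequences[feature_id])
        for sample_id, count in counts.items():
            collapsed[protein][sample_id] += count
    return {protein: dict(counts) for protein, counts in collapsed.items()}


# Usage: two synonymous DNA features (hypothetical) merge into one protein feature.
table = {"f1": {"s1": 3}, "f2": {"s1": 2, "s2": 5}}
sequences = {"f1": "ATGGCC", "f2": "ATGGCT"}
print(collapse_by_translation(table, sequences))
```

Real amplicons would need the reading frame and any frameshifts handled first, for example by the upstream pipeline that produced the corrected protein sequences.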

Alas, vsearch does not support protein sequences so we cannot help there. See here: Vsearch for clustering protein sequences? · Issue #427 · torognes/vsearch · GitHub

same as for q2-diversity... a feature is a feature is a feature...

We have been talking about this for a while with @misialq and @SoilRotifer, so getting support for protein alignments/phylogenies with q2-alignment + q2-phylogeny might be possible some day.

Awesome! We will be in touch. Not sure when, probably some day when you least suspect it

mcreyno2 · March 18, 2021, 7:10pm

I was able to progress on this just now with my per-feature amino acid rep seqs file. It was really great to finally get a q2-visual using protein sequences! And it wasn't too foreign in the context of typical q2 commands. But I am still wrapping my head around the residue conservation of the pca loadings linked to the crystal structure from the protein database.

misialq · March 24, 2021, 4:27pm

Glad to hear you tried it out, @mcreyno2! As to the crystal structure mapping, I'd consider this still a bit experimental. It may be finicky at times, particularly for large structures. Feel free to reach out if anything is unclear or if you have any feedback on the usability (perhaps in a separate thread)!