Classifying Sanger sequences

Our lab does a fair number of culture-based microbial studies. I’m curious whether there is a streamlined way to import Sanger sequences and classify their taxonomy in Q2. I used to do this in QIIME 1 because it was a convenient, code-based (and therefore replicable) method for working with multiple reference databases (e.g. bacteria and fungi).

But the best I’ve been able to figure out in Q2 so far is to de novo cluster the sequences at 100% similarity so that I have a valid input for the taxonomic classification steps, and then jump through some hoops to get the data back into R, where I can relink each original sequence ID with its taxonomic classification from Q2.
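For reference, the relinking step I do in R looks roughly like the Python sketch below. It assumes the new feature IDs are MD5 hex digests of the (uppercased) sequence text, which is what they appear to be in my artifacts; please correct me if that assumption is wrong.

```python
# Sketch: map hashed feature IDs back to the original FASTA IDs.
# Assumption (not confirmed): feature IDs are MD5 digests of the
# uppercased sequence, matching what vsearch-style relabeling produces.
import hashlib


def read_fasta(path):
    """Yield (seq_id, sequence) pairs from a FASTA file."""
    seq_id, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                seq_id, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def md5_feature_id(sequence):
    """MD5 hex digest of the uppercased sequence (assumed hashing scheme)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def relink(original_fasta, taxonomy):
    """Given {hashed feature ID: taxon}, return {original FASTA ID: taxon}."""
    return {
        seq_id: taxonomy[md5_feature_id(seq)]
        for seq_id, seq in read_fasta(original_fasta)
        if md5_feature_id(seq) in taxonomy
    }
```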

The best-case scenario would be importing a multi-FASTA that I could then classify, with the original FASTA IDs retained in the taxonomy table. Any thoughts?

Hey there @sformel!

How did you import these data? What steps in QIIME 2 did you take prior to de novo clustering? Did you use vsearch dereplicate-sequences? If so, it looks like that replaces feature IDs with hashes of the sequence, and it doesn’t look like we have enabled any functionality to disable that.

Let us know how you processed these files and we can try to provide some more specific feedback - thanks!

Thanks for taking the time to help me! I will request a new feature to turn off replacing the feature IDs with hashes of the sequence, that would be very helpful for what I’m trying to do.

I did use vsearch dereplicate-sequences to dereplicate my sequences after using qiime tools import to import the original multifasta.

If I don’t go through the de novo OTU picking step, I get a memory error during taxonomy classification. I’m using the SILVA classifier with sklearn. This is a little strange, because I only have ~400 sequences and 64 GB of RAM. When I do go through the OTU picking step, however, it works without a problem.

After unzipping the artifact that results from dereplication, I noticed that each FASTA sequence has been broken into lines of 80 characters. The original file was unchanged after import (804 lines, which makes sense), but the dereplicated file has 6,177 lines. Could this be causing problems?

I was trying to not overwhelm you with error logs or excessive information, but please let me know if there is anything I can give you that would clarify what I’m seeing.

Sorry, one other piece of information that might be useful. When I try to classify taxonomy with my multi-FASTA after importing it with qiime tools, I get the error:

Plugin error from feature-classifier: Argument to parameter 'reads' is not a subtype of FeatureData[Sequence]

I couldn’t figure out what piece of information was missing, so I moved on to the dereplication and clustering steps to see what else I could come up with. If I could use this original FASTA as the input for classification and avoid dereplication/OTU picking, that would be ideal.

The memory issues are more likely related to the size of SILVA than to the number of your sequences. Several discussions and strategies can be found here:

Oh bummer. I don't think this is related to the memory issues. @colinbrislawn just reminded me that we have an open issue for this: Set --fasta_width 0 on relevant actions · Issue #48 · qiime2/q2-vsearch · GitHub, in case you are curious.

That file was probably imported as type SampleData[Sequences], which is correct for what these data represent. Taxonomic classification needs to operate on FeatureData[Sequence], which is the same "kind" of data (a FASTA file of sequences) but represents a different axis of the data (features rather than samples). This is where semantic types really help guide you to the right "next step" in your analysis! In your case, dereplication was the right choice, since it yielded FeatureTable[Frequency] and FeatureData[Sequence] artifacts for you.
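To illustrate the idea (this is a toy sketch, not QIIME 2's actual type machinery): an action declares the artifact type it accepts, and passing an artifact of the wrong semantic type fails up front, before any computation runs — which is exactly the error you saw.

```python
# Toy illustration of semantic-type gating (not QIIME 2's real machinery):
# artifacts carry a type tag, and an action rejects mismatched inputs.

def classify_sklearn(artifact):
    """Mock classifier that insists on FeatureData[Sequence] input."""
    if artifact["type"] != "FeatureData[Sequence]":
        raise TypeError(
            "Argument to parameter 'reads' is not a subtype of "
            f"FeatureData[Sequence] (got {artifact['type']})")
    return "ok"


imported = {"type": "SampleData[Sequences]"}      # direct qiime tools import
dereplicated = {"type": "FeatureData[Sequence]"}  # after dereplication

assert classify_sklearn(dereplicated) == "ok"
try:
    classify_sklearn(imported)  # reproduces the error in this thread
except TypeError:
    pass
```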

Keep us posted! :qiime2: :t_rex:

Thanks again for the thoughtful and quick responses. I’ll see what kind of progress I can make and send an update in the next few weeks.
