representative seqs in a taxonomy affiliation

Hi,

I got a total of 10,554 rep-seqs through dada2 quality control from 4.5 million raw seqs. The rep-seqs were then clustered further into 3,300 rep-seqs via Vsearch, with 95% identity as the parameter. The resulting 3,300 rep-seqs (sequence features) were then classified using the following command.

$ qiime feature-classifier classify-sklearn \
  --i-classifier //a user-made classifier.qza \
  --i-reads //3300-rep-seqs.qza \
  --o-classification //taxonomy.qza

The taxonomy.qza was visualized at the kpcofgs ranks.
The species-level file showed that the 3,300 rep-seqs are affiliated with a total of 74 species. Now I would like to see which seqs (or a representative seq) are involved in the individual species. Can I?

Thanks,

Hee-Sung

Hello @baehsung,

I got a total of 10,554 rep-seqs through dada2 quality control from 4.5 million raw seqs. The rep-seqs were then clustered further into 3,300 rep-seqs via Vsearch, with 95% identity as the parameter.

Dada2's purpose is to output denoised sequences that are already meant to represent true, error-corrected sequences from your sample. Clustering these afterwards is really only going to lower the resolution of your data. Do you have a specific reason for doing so?

Now I would like to see which seqs (or a representative seq) are involved in the individual species.

Can you explain more what you mean by this?

Thanks for your reply.

You are right, dada2 produces clean rep-seqs, which is very good for diversity and taxonomy analyses.

In the taxonomic analysis of the 10,554 rep-seqs, the majority appeared as unclassified species. I would like to see how they are related to the reference seqs from the database and display them in a phylogenetic tree. To make this easier, I tried to reduce the number of rep-seqs using Vsearch (95%, 90%, and so on). To see them in a phylogenetic tree, I need to know which seqs were assigned as unclassified species.

It would be wonderful if qiime2 could align the rep-seqs and reference seqs, make a phylogenetic tree, and provide a table listing the seqs affiliated with a specific taxon.

Best regards,

Hee-Sung

Hello @baehsung,

Let me make sure I understand your points. Correct me where I'm wrong:

  • When you say the majority of your representative sequences appear as unclassified species, you mean they are not classified to the species level, but they are classified to the genus (as opposed to completely unclassified)
  • You're clustering your representative sequences so that your phylogenetic tree has fewer nodes and is less overwhelming
  • You want a phylogenetic tree comprised of both your clustered representative sequences and the reference sequences used in the classification database
  • You want a table that for each representative sequence lists its taxonomic classification

Thanks for your discussion.

Q: When you say the majority of your representative sequences appear as unclassified species, you mean they are not classified to the species level, but they are classified to the genus (as opposed to completely unclassified)
A: Some of my rep-seqs are assigned only to Bacteria with no further classification (accounting for 10-30% within samples), and others are unassigned at any taxonomic level (10-40%). I'm curious what (or who) they are.

Q: You're clustering your representative sequences so that your phylogenetic tree has fewer nodes and is less overwhelming.
A: Right. I did this using Vsearch, reducing the number of rep-seqs from 10,554 to 3,300, followed by taxonomic affiliation using my own classifier.

Q: You want a phylogenetic tree comprised of both your clustered representative sequences and the reference sequences used in the classification database.
Q: You want a table that for each representative sequence lists its taxonomic classification.
A: Yes for both.

Again, it's quite odd that such a large portion of our seqs were affiliated with an unknown division and phylum. If those unclassified seqs can be identified, it would be possible to search the database for closer seqs and update the classifier with them, which would give a chance to analyze their phylogenetic position more accurately.

Once the unclassified seqs that make up the majority are known, it would also be possible to examine their phylogenetic position in other software, using the DNA and/or the deduced AA seqs.

Hee-Sung

Hello @baehsung,

It sounds to me like you are unsatisfied with your taxonomic classification results, and the phylogenetic tree you're describing is a troubleshooting measure. Would you agree? I think it will be easier and more worthwhile to directly investigate why you're getting so many unclassified features. There are many discussions on this forum about that topic.
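As a possible starting point for that investigation, here is a minimal Python sketch that lists the features that are unassigned or resolved no deeper than the domain. It rests on several assumptions: that the taxonomy artifact has been exported to a tab-separated file with "Feature ID", "Taxon" and "Confidence" columns, that ranks use prefixes such as k__ and p__, and that unclassified features carry the label "Unassigned". The sample rows and the function names (assigned_depth, poorly_classified) are made up for illustration, so adjust them to match your own classifier's output.

```python
import csv
import io

# Hypothetical stand-in for an exported taxonomy.tsv (all rows are invented).
SAMPLE_TSV = (
    "Feature ID\tTaxon\tConfidence\n"
    "seq1\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__reuteri\t0.98\n"
    "seq2\tk__Bacteria\t0.75\n"
    "seq3\tUnassigned\t0.41\n"
    "seq4\tk__Bacteria; p__Proteobacteria; c__; o__; f__; g__; s__\t0.88\n"
)


def assigned_depth(taxon):
    """Count leading ranks that carry a real name; empty ranks like 'c__' stop the count."""
    if taxon.strip().lower() in ("", "unassigned", "unclassified"):
        return 0
    depth = 0
    for rank in taxon.split(";"):
        rank = rank.strip()
        # Drop an assumed 'x__' rank prefix if present.
        name = rank.split("__", 1)[1] if "__" in rank else rank
        if not name:
            break
        depth += 1
    return depth


def poorly_classified(handle, max_depth=1):
    """Map feature IDs to their assignment depth, keeping only depth <= max_depth."""
    result = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        depth = assigned_depth(row["Taxon"])
        if depth <= max_depth:
            result[row["Feature ID"]] = depth
    return result


if __name__ == "__main__":
    # For a real file, replace the StringIO with open("taxonomy.tsv", newline="").
    for feature_id, depth in poorly_classified(io.StringIO(SAMPLE_TSV)).items():
        print(feature_id, depth)
```

The resulting feature IDs are the ones worth pulling out of the rep-seqs and comparing against a reference database directly.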

To create the discussed phylogenetic tree you'll have to first get the source sequences that were used to build the classifier. If you built the classifier yourself then you already know where they are. If not, then you'll have to do some research to see if and where the original sequences are hosted for the classifier. Then you would merge these sequences with your representative sequences using, e.g., qiime feature-table merge-seqs, and then align and construct a tree using the align-to-tree-mafft-fasttree pipeline. As for a table that shows the classification for your features, that's what your taxonomy artifact does.
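To turn the taxonomy table around so that it lists the sequences belonging to each taxon, a short Python sketch along these lines might help. It assumes the taxonomy artifact has been exported to a tab-separated file with "Feature ID", "Taxon" and "Confidence" columns; the sample rows, the LACTO constant and the function name features_by_taxon are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented taxon string and rows standing in for an exported taxonomy.tsv.
LACTO = (
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__reuteri"
)
SAMPLE_TSV = (
    "Feature ID\tTaxon\tConfidence\n"
    f"seq1\t{LACTO}\t0.98\n"
    "seq2\tk__Bacteria\t0.75\n"
    f"seq3\t{LACTO}\t0.91\n"
)


def features_by_taxon(handle):
    """Group feature IDs under the full taxon string they were assigned."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["Taxon"].strip()].append(row["Feature ID"])
    return dict(groups)


if __name__ == "__main__":
    # For a real file, replace the StringIO with open("taxonomy.tsv", newline="").
    for taxon, feature_ids in features_by_taxon(io.StringIO(SAMPLE_TSV)).items():
        print(taxon)
        for feature_id in feature_ids:
            print("   ", feature_id)
```

Grouping on the full taxon string keeps features with the same species name but different lineages apart; grouping on the last rank alone would merge them.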

Thanks Colin,

I'll try to do what you suggested.

Best regards,

Hee-Sung

