Greengenes2 database methods, which one to use?

Hi everyone,

The updated Greengenes2 database was released earlier this year. Following this tutorial, I tried to use it with two methods: naïve Bayes classification with the pre-constructed full-length Greengenes2 classifier, and phylogenetic classification for non-V4 data (a rough sketch of both commands follows the data description below). My data are as follows:

  • 16S rRNA, V1-V2 region
  • Paired-end reads, 320 bp after denoising
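
For reference, this is roughly how I ran the two approaches (a sketch only; the filenames are placeholders for my own artifacts and the GG2 release files, and the greengenes2 parameter names are taken from the tutorial as I recall them, so please check `--help` on each action):

```
# Method 1: naïve Bayes against the pre-built full-length GG2 classifier
qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs.qza \
  --i-classifier gg2-full-length.nb.qza \
  --o-classification taxonomy-nb.qza

# Method 2: non-V4 route: closed-reference mapping against the GG2 backbone,
# then taxonomy read off the GG2 reference
qiime greengenes2 non-v4-16s \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-backbone gg2-backbone-full-length.fna.qza \
  --o-mapped-table gg2-table.qza \
  --o-representatives gg2-rep-seqs.qza

qiime greengenes2 taxonomy-from-table \
  --i-reference-taxonomy gg2-taxonomy.qza \
  --i-table gg2-table.qza \
  --o-classification taxonomy-phylo.qza
```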

Something odd happened: the total number of reads I got from the two methods (naïve Bayes vs. phylogenetic) is completely different: 58,243 for the former and 11,456 for the latter. The phylogenetic method retained less than 20% of the reads kept under the naïve Bayes classifier, which I did not expect. I suspect the extra filtering step described in the phylogenetic classification tutorial might be the reason.
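
To compare the totals, I summarized the feature table from each route (same placeholder filenames as above; the total frequency is reported on the overview page of each visualization):

```
# table used for the naïve Bayes route (unchanged by classification)
qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table-denoised.qzv

# table produced by the non-v4-16s closed-reference mapping
qiime feature-table summarize \
  --i-table gg2-table.qza \
  --o-visualization table-gg2-mapped.qzv
```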

Apart from the large read loss, I also got 488 unclassified reads from the naïve Bayes classifier, while the alternative method gave only 2. At the genus level, I got 1,890 unique taxa from the naïve Bayes classifier and 1,864 from the phylogenetic method, with 1,476 overlapping between the two.
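
For anyone who wants to reproduce the comparison, the genus-level sets can be counted from the exported taxonomy files along these lines (a sketch; it assumes the standard taxonomy.tsv export with the lineage in the second column and GG2's '; '-separated ranks):

```
# export both taxonomy artifacts (each produces a taxonomy.tsv)
qiime tools export --input-path taxonomy-nb.qza    --output-path nb-tax
qiime tools export --input-path taxonomy-phylo.qza --output-path phylo-tax

# pull the genus field (6th rank) from each lineage and compare the unique sets
cut -f2 nb-tax/taxonomy.tsv    | awk -F'; ' 'NF >= 6 {print $6}' | sort -u > nb-genera.txt
cut -f2 phylo-tax/taxonomy.tsv | awk -F'; ' 'NF >= 6 {print $6}' | sort -u > phylo-genera.txt
comm -12 nb-genera.txt phylo-genera.txt | wc -l   # genera shared by both methods
```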

I also checked alpha diversity (Shannon) and, as expected, the two methods are statistically different given the different numbers of ASVs, which indicates that the much lower read count from the phylogenetic method is influencing the community diversity estimates.
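
Shannon was computed per feature table with the standard alpha diversity action, roughly as follows (sketch, same placeholder filenames):

```
# naïve Bayes route: original denoised table
qiime diversity alpha \
  --i-table table.qza \
  --p-metric shannon \
  --o-alpha-diversity shannon-denoised.qza

# phylogenetic route: GG2-mapped table
qiime diversity alpha \
  --i-table gg2-table.qza \
  --p-metric shannon \
  --o-alpha-diversity shannon-gg2-mapped.qza
```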

Attached below is a visualization of the top 20 genera, illustrating how the difference in read counts between the two methods affects the downstream metrics.
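
For context, a top-genera barplot like this can be generated per method with something like the following (sketch; metadata.tsv is a placeholder for the sample metadata, and the same command is repeated with table.qza / taxonomy-nb.qza for the naïve Bayes side):

```
qiime taxa barplot \
  --i-table gg2-table.qza \
  --i-taxonomy taxonomy-phylo.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxa-barplot-phylo.qzv
```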

I have three questions:

  • Why does the phylogenetic method lose ~80% of my reads?
  • I expected both methods to give broadly similar taxonomic assignments since they draw on the same reference database. What makes them different?
  • Which method do you recommend? The tutorial encourages the phylogenetic method, but I am now unsure which one to use.

Many thanks in advance!


Hi @emntsha,

The phylogenetic method currently assumes V4 ASVs (90/100/150 nt fragments), as that is the primary set of sequences that was placed. I would not expect any V1-V2 fragments to place, unless the depositors of data in Qiita were inaccurate about their preparation information. We are working on allowing placement of other variable regions, but that isn't ready yet.

For your data, I would recommend either Naive Bayes against the full-length sequences or closed-reference clustering with the non-v4-16s action.

Best,
Daniel


Hi @wasade,

Thanks for your reply. I did use closed-reference clustering with the non-v4-16s action for the phylogenetic method; the results I described above were generated that way. Or do you mean something different?

Also, since you said that V1-V2 fragments are not expected to place, does that mean the phylogenetic method removes reads that are not in the placement set because it contains no V1-V2 data? It was also mentioned that the non-v4-16s action performs closed-reference OTU picking against the full-length 16S sequences in GG2, so shouldn't V1-V2 be covered by the full-length sequences?

Please excuse my naive questions; I am completely new to the field and trying to understand how the pre-processing works.

Hi @emntsha,

I would not expect V1-2 fragments to exist in the tree already, so the phylogenetic taxonomy (e.g., using filter-features) should not work.

Most of the backbone sequences should have V1-2 but it's not unusual for "full length" 16S to miss some terminal positions. Do you have an example set of V1-2 sequences which do not recruit under closed reference to the backbone?

Best,
Daniel

Hi @wasade,

I can look for them, but right now I don't have such an example (I describe below how I plan to find candidates). Also, if the problem were fewer V1-V2 fragments in the backbone, I would not expect the two methods to share 1,476 identified taxa (78% of what the Naive Bayes classifier identified), so that alone does not really explain the low number of reads. What could explain the low read count besides missing V1-V2 fragments in the backbone?
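
To find candidate sequences that do not recruit, one option (a sketch; I am not sure whether non-v4-16s exposes the unmatched sequences directly) is to redo the closed-reference step with q2-vsearch against the same backbone, which does output them. The identity threshold here is a guess and the backbone filename a placeholder:

```
qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences gg2-backbone-full-length.fna.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table vsearch-gg2-table.qza \
  --o-clustered-sequences vsearch-gg2-rep-seqs.qza \
  --o-unmatched-sequences unmatched-seqs.qza

# view the V1-V2 ASVs that failed to recruit to the backbone
qiime feature-table tabulate-seqs \
  --i-data unmatched-seqs.qza \
  --o-visualization unmatched-seqs.qzv
```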


Hi @emntsha,

It would be easiest if you could share some data so I can evaluate them directly.

Best,
Daniel

Hi @wasade,

I can send some data to your email. Could I have your email address?

Hi @emntsha,
We usually like to keep forum-related questions on the forum. Please send your data as a DM to @wasade if you do not feel comfortable sharing it publicly.


Hi @cherman2,

Ah, thanks for the info. I wasn't aware that the forum has a DM feature. Will do!

