Resolving species in greengenes2

Hello @wasade

I want to use species names in my analysis but am not sure how reliable they are.

I found 8 ASVs identified by greengenes2 non-v4 as Escherichia fergusonii and 2 ASVs as E. albertii in my data, which are V3-V4 sequences processed with the dada2 paired-end protocol.

My colleague dismissed the findings, saying it is impossible to reliably classify them with V3-V4 data. When I BLAST, for example, the E. albertii ASVs, the first ASV has 100% sequence similarity to E. albertii but also to some E. coli strains, and the second has only 99.77% similarity, with E. coli and E. fergusonii. The alignment shows that the two E. albertii ASVs are 2 nt apart from each other and only 1 nt from most of the E. fergusonii ASVs.
On the Greengenes site, the E. albertii page says "16S V4 region is not species specific",
and when I try to check "Full Length Containing an ASV" the links do not work.
GenBank says:
The following terms were not found in Nucleotide: 000155105.1, RS-GCF-000155105.1-NZ-CH991859.1.

As I could not verify the assignments, I would probably drop all species names.
All Escherichia have a similar distribution in my data, so I would not lose much predictive power there, but I would then also lose other taxa, for example Phocaeicola_A_858004, where P. vulgatus and P. dorei show different associations with health traits, and a few others.
These seem to agree better with BLAST results and cluster together in Clustal Omega.

Ideally I want to prove that the names are assigned correctly, or find a way to sort out unreliable names.
I would be glad to hear any suggestions.

I am attaching the sequences and ASV IDs; I prepended them with shortened assigned names for easier comparison.

Escherichia_seq.txt (4.6 KB)
Phoca_seq.txt (8.7 KB)

Hi @Marsel_Murzabaev,

In Greengenes2, we've assessed uniqueness of 16S only for 515F-806R with the EMP primers. Even if we do observe that the region is species specific, that observation is in the context of the current reference database and inclusive of inherent taxon sampling bias, so it cannot be a definitive statement of species assignment.

Remember that even full length 16S is not assured to be a reliable species marker. See, e.g., Wang et al.

If you'd like to do a similar assessment to what we did with V4, you could use q2-feature-classifier's extract-reads action with your primers on the GG2 full length data. I think the extracted reads could then be used with q2-greengenes2's clade_v4_asv_assessment action instead of the V4 data. However, I have not tested this outside of the EMP primers.
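
To make the idea of such an assessment concrete, here is a minimal Python sketch. It is not the q2-greengenes2 implementation; it only illustrates the underlying check. It assumes you already have the extracted amplicon region for each reference record together with its species label. The function names and the toy sequences are hypothetical and for illustration only. A species is flagged when it shares an identical amplicon with at least one other species, which means an exact-match ASV could not tell them apart.

```python
from collections import defaultdict


def species_per_amplicon(amplicons):
    """Map each distinct amplicon sequence to the set of species carrying it.

    `amplicons` is an iterable of (species_label, sequence) pairs, for
    example extracted reference reads joined to their taxonomy (assumed input).
    """
    seq_to_species = defaultdict(set)
    for species, seq in amplicons:
        # Normalize case so identical sequences compare equal.
        seq_to_species[seq.upper()].add(species)
    return seq_to_species


def ambiguous_species(amplicons):
    """Return species sharing at least one identical amplicon with another species."""
    flagged = set()
    for species_set in species_per_amplicon(amplicons).values():
        if len(species_set) > 1:
            flagged |= species_set
    return flagged


# Toy data only: these short strings are made up, not real 16S sequences.
toy = [
    ("Escherichia coli", "ACGTACGT"),
    ("Escherichia albertii", "ACGTACGT"),
    ("Escherichia fergusonii", "ACGTACGA"),
    ("Phocaeicola vulgatus", "TTGACCGA"),
    ("Phocaeicola dorei", "TTGACCGG"),
]

print(sorted(ambiguous_species(toy)))
```

A species that is not flagged here is only unique relative to the records in the reference you supplied, so the same caveat about taxon sampling bias applies.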

Thank you for noting the broken links on the GG2 website. I need to revise the regular expression that matches the sequence identifiers. In the example provided, that looks like it corresponds to this record.

All the best,

