Hello @wasade
I want to use species names in my analysis but am not sure how reliable they are.
I found 8 ASVs identified by Greengenes2 (non-V4) as Escherichia fergusonii and 2 ASVs as E. albertii in my data, which are V3-V4 sequences processed with the DADA2 paired-end protocol.
My colleague dismissed the findings, saying it is impossible to reliably classify them with V3-V4 data. When I BLAST the E. albertii ASVs, for example, the first has 100% sequence similarity to E. albertii but also to some E. coli strains, and the second has only 99.77% similarity to E. coli and E. fergusonii. The alignment shows that the two E. albertii ASVs are 2 nt apart from each other and only 1 nt from most of the E. fergusonii sequences.
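A minimal sketch of how such nucleotide differences and identity percentages can be computed, assuming the sequences are already aligned to the same length (for example, exported from a Clustal Omega alignment). The function names and the toy sequences are hypothetical and are not taken from the attached files.

```python
def count_mismatches(a: str, b: str) -> int:
    """Count positions where two pre-aligned, equal-length sequences differ.

    A gap character aligned against a base is counted as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def percent_identity(a: str, b: str) -> float:
    """Percent of aligned positions that are identical."""
    return 100.0 * (len(a) - count_mismatches(a, b)) / len(a)


# Toy 10-nt sequences (hypothetical), differing only at the last position.
asv_1 = "ACGTACGTAC"
asv_2 = "ACGTACGTAT"

print(count_mismatches(asv_1, asv_2))   # 1
print(percent_identity(asv_1, asv_2))   # 90.0
```

On a real amplicon of a few hundred nucleotides, a single mismatch already gives an identity above 99.7%, so species labels separated by 1 or 2 nt sit very close to each other.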
On the Greengenes site, the E. albertii page says "16S V4 region is not species specific",
and when I try to check "Full Length Containing an ASV", the links do not work.
GenBank says:
The following terms were not found in Nucleotide: 000155105.1, RS-GCF-000155105.1-NZ-CH991859.1.
As I could not verify them, I would probably drop all species names.
All Escherichia have a similar distribution in my data, so I would not lose much predictive power here, but then I would also lose other taxa, for example Phocaeicola_A_858004, which shows different associations with health traits between P. vulgatus and P. dorei, and a few others.
These seem to agree better with BLAST results and cluster together in Clustal Omega.
Ideally I want to prove that the names are assigned correctly, or find a way to filter out unreliable names.
I would be glad to hear any suggestions.
I am attaching the sequences and ASV IDs; I prepended them with shortened identified names for easier comparison.
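One possible way to sort out unreliable names, sketched here under stated assumptions: flag every ASV whose aligned sequence lies within a small number of mismatches of an ASV carrying a different species label. The `max_diff` threshold, the function names, and the toy records are all hypothetical; this is only a heuristic screen and does not validate the Greengenes2 assignments themselves.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def flag_ambiguous(asvs: dict, max_diff: int = 2) -> set:
    """Return IDs of ASVs that are within max_diff mismatches of an ASV
    assigned to a different species.

    asvs maps an ASV ID to a (species_label, aligned_sequence) tuple.
    max_diff is an arbitrary illustrative cutoff, not a recommended value.
    """
    flagged = set()
    for (id1, (sp1, seq1)), (id2, (sp2, seq2)) in combinations(asvs.items(), 2):
        if sp1 != sp2 and hamming(seq1, seq2) <= max_diff:
            flagged.update((id1, id2))
    return flagged


# Toy records (hypothetical IDs and sequences, not the attached data).
asvs = {
    "asv1": ("E. albertii", "ACGTACGTAC"),
    "asv2": ("E. fergusonii", "ACGTACGTAT"),
    "asv3": ("P. vulgatus", "TTGACCGGAA"),
}

print(sorted(flag_ambiguous(asvs)))  # ['asv1', 'asv2']
```

ASVs that are flagged could be collapsed to genus level, while unflagged ones keep their species name; whether that cutoff is biologically meaningful for a given region is a separate question.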
Escherichia_seq.txt (4.6 KB)
Phoca_seq.txt (8.7 KB)