loss of taxonomic resolution at L4

MichelaRiba · April 14, 2023, 12:16pm

Good afternoon,

I am studying some different results from microbiome 16S analysis.

I am seeing low resolution at levels below L4. Specifically, I find unspecified "Bacteria", with no further classification, at more than 10% frequency using joined reads (12,000, due to filtering of low-quality reads). Is this normal?

What would be a standard acceptable number of starting sequences for proper classification?
What rate of unclassified bacteria would be acceptable?

I thank you so much,

Michela

crusher083 · April 14, 2023, 1:15pm

Hello!
This would be highly dependent on the studied environment, quality of sequencing, and quality of the reference database.
There are no guidelines for this; it is highly variable, as shown in the figure from "Cultivation-independent genomes greatly expand taxonomic-profiling capabilities of mOTUs across various environments" (Microbiome).

Cheers
Valentyn

MichelaRiba · April 14, 2023, 1:29pm

Hi,

Thanks. This is quite encouraging. In any case, my personal feeling is that starting from good-quality sequences, even coverage, ... would also help in setting some rule-of-thumb expected values.

I thank you so much,

Michela

VincentVasquez · April 26, 2023, 3:09pm

It is not uncommon to observe low resolution at taxonomic levels below L4 (order level, counting L1 as kingdom) when analyzing 16S microbiome data. This is because 16S rRNA gene sequencing cannot always provide sufficient resolution at lower taxonomic levels, due to sequence conservation and variability within and between bacterial species. Additionally, there may be limitations in the reference databases used for taxonomic classification, leading to unclassified or unidentified bacteria.

The number of starting sequences required for proper classification can vary depending on the sequencing platform, sequencing depth, and study design. However, as a general guideline, a minimum of 10,000-20,000 high-quality sequences per sample is recommended for reliable taxonomic classification using 16S rRNA gene sequencing.

Nicholas_Bokulich · April 27, 2023, 4:56am

Hi @MichelaRiba ,

I am writing to echo @crusher083 's advice that there is no "rule of thumb".

Taxonomic classification accuracy does not depend on the number of input query sequences in any way. You could classify a single query sequence without impacting classification accuracy.

Classification accuracy depends instead on the marker gene/region chosen, the comprehensiveness and quality of the reference sequence database, and the length and quality of the query sequences.

As @crusher083 mentioned, this will depend on the sample type. Unclassified reads mean that those reads did not match anything in the database. Most of the time in 16S surveys this is because these reads hit non-target DNA (e.g., host DNA), not because they are novel species.
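
To put a number on how many reads stop at a given level (for example the "Bacteria only" reads from the original question), something like the following Python sketch could be used. It is only an illustration under assumptions: it assumes semicolon-separated taxonomy strings with optional rank prefixes such as `k__` or `d__`, and the function names (`classified_depth`, `read_fraction_by_depth`) and the toy data are hypothetical, not part of any QIIME 2 API.

```python
from collections import defaultdict


def classified_depth(taxonomy: str) -> int:
    """Return how many leading ranks carry a name.

    "k__Bacteria; p__Firmicutes; c__" gives 2, "Unassigned" gives 0.
    """
    depth = 0
    for rank in taxonomy.split(";"):
        rank = rank.strip()
        # Strip an optional rank prefix such as "k__" or "d__".
        name = rank.split("__", 1)[1] if "__" in rank else rank
        if not name or name.lower() in {"unassigned", "unclassified"}:
            break
        depth += 1
    return depth


def read_fraction_by_depth(counts, taxonomy):
    """Map classification depth to the fraction of all reads at that depth."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to summarize")
    by_depth = defaultdict(int)
    for feature_id, n_reads in counts.items():
        by_depth[classified_depth(taxonomy[feature_id])] += n_reads
    return {depth: n / total for depth, n in sorted(by_depth.items())}


# Toy data, invented for illustration only.
counts = {"f1": 700, "f2": 200, "f3": 100}
taxonomy = {
    "f1": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales",
    "f2": "k__Bacteria",
    "f3": "Unassigned",
}
fractions = read_fraction_by_depth(counts, taxonomy)
print(fractions)
```

In this toy example, depth 1 corresponds to reads classified only as "Bacteria" and depth 0 to reads with no assignment at all. Pulling out the sequences behind those two groups and checking them against a broader database is one way to see whether they are non-target DNA.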

Good luck!

MichelaRiba · April 28, 2023, 11:07am

Hi,
With regard to classification accuracy, I was referring to the impact of uneven coverage on the ability to classify sequences, which is discussed in a different post. I experienced a loss of classification power (mostly OD1/unclassified).
I solved this using a subsampling strategy after excluding clear outliers. I was, however, concerned that I used "only" 30,000 starting sequences per sample, which dropped to nearly 12,000 after filtering, ...
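
For readers unfamiliar with the idea, subsampling to an even depth can be sketched in plain Python roughly as below. This is a minimal illustration and not the exact procedure used here: treating "outliers" as samples with fewer reads than the target depth is an assumption, and the function names (`subsample_counts`, `rarefy_table`) and the toy table are hypothetical.

```python
import random
from collections import Counter


def subsample_counts(counts, depth, seed=0):
    """Draw `depth` reads without replacement from one sample's counts."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"sample has {total} reads, fewer than depth {depth}")
    # Expand to one entry per read; fine for tens of thousands of reads.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))


def rarefy_table(table, depth, seed=0):
    """Drop samples with fewer than `depth` reads, then subsample the rest."""
    return {
        sample_id: subsample_counts(counts, depth, seed)
        for sample_id, counts in table.items()
        if sum(counts.values()) >= depth
    }


# Toy table, invented for illustration: sample -> feature -> read count.
table = {
    "s1": {"f1": 9000, "f2": 3000},
    "s2": {"f1": 20000, "f2": 10000},
    "s3": {"f1": 150, "f2": 50},  # low-depth sample, dropped below
}
even = rarefy_table(table, depth=12000)
print({sample_id: sum(c.values()) for sample_id, c in even.items()})
```

Every retained sample ends up with exactly 12,000 reads, and the low-depth sample is excluded rather than padded.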

Thanks a lot,

kind regards,

Michela