Diversity metrics differ before and after filtering

Hi all,
I’m interested in your thoughts.
Originally, I had done all of my microbiome analysis with mitochondria, chloroplasts, and contaminant sequences removed.

However, after finishing that analysis, I decided that I also wanted to filter my table to remove any features that were not annotated to at least the phylum level. I then redid the analysis and recomputed the diversity metrics.
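To make the filtering step concrete, here is a minimal sketch of the kind of filter I mean. It is not the actual command I ran: the feature IDs, counts, and Greengenes-style taxonomy strings (`k__...; p__...`) are made up for illustration, and `has_phylum` and `filter_table` are hypothetical helper names.

```python
# Hypothetical taxonomy assignments: feature ID -> taxonomy string.
# The "k__; p__" rank prefixes are an assumption about the reference database.
taxonomy = {
    "f1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "f2": "k__Bacteria",                      # kingdom only
    "f3": "Unassigned",                       # no annotation at all
    "f4": "k__Bacteria; p__Proteobacteria",
    "f5": "k__Bacteria; p__",                 # empty phylum label
}

# Hypothetical feature table: sample -> {feature ID: read count}.
table = {
    "S1": {"f1": 10, "f2": 3, "f3": 1},
    "S2": {"f1": 4, "f4": 7, "f5": 2},
}


def has_phylum(taxon: str) -> bool:
    """Return True if the taxonomy string has a non-empty phylum (p__) label."""
    for rank in taxon.split(";"):
        rank = rank.strip()
        if rank.startswith("p__") and len(rank) > len("p__"):
            return True
    return False


def filter_table(table: dict, taxonomy: dict) -> dict:
    """Keep only features annotated to at least the phylum level."""
    return {
        sample: {
            fid: count
            for fid, count in counts.items()
            if has_phylum(taxonomy.get(fid, ""))
        }
        for sample, counts in table.items()
    }


filtered = filter_table(table, taxonomy)
print(filtered)
```

In this toy table, only `f1` and `f4` survive the filter; everything without a phylum label is dropped before diversity metrics are recomputed.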

However, I am now concerned about how this has changed my diversity metrics, and I’m wondering what your thoughts on this process are. For example, across all samples I previously had a mean Jaccard distance of 0.95, and now I have a mean Jaccard distance of 0.68. I understand that features that are not annotated at the phylum level are not very informative taxonomically, but I’m unclear whether these features are adding resolution or adding noise. In other words: which set of diversity metrics is more reliable?

In general, the comparisons that were significant before are still significant now, but one of my key takeaways from my original analysis was that my samples had a very high Jaccard distance. By filtering out these reads, am I artificially changing the diversity metrics?

Thanks so much!


Hi @clairewill22 ,
Just based on this bit of information, I’d say that I trust the results more after filtering.

In general, if sequences fail to classify to at least the phylum level, they are more often than not junk, e.g., non-target DNA, PhiX, index jumpers, or host DNA slipping through. Of course, this depends a lot on the target region and other factors.

I recommend using NCBI BLAST (make sure to exclude uncultured sequences) to find the closest match for a few of these unclassified reads, to see whether they really are junk or true signal that is failing to classify for some other reason.
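For picking "a few" reads to check, a rough sketch like the following could help. It assumes you already have the unclassified feature sequences in a dictionary; the IDs and sequences here are made up, and `sample_fasta` is a hypothetical helper, not part of any library. The output is plain FASTA text that can be pasted into the BLAST web form.

```python
import random

# Hypothetical unclassified features: feature ID -> representative sequence.
unclassified_seqs = {
    "feat_a": "ACGTACGTTAGC",
    "feat_b": "TTGACCGATAGC",
    "feat_c": "GGCATCGATCGA",
    "feat_d": "CCGATTAGCATG",
    "feat_e": "ATATCGCGTAGC",
}


def sample_fasta(seqs: dict, n: int = 3, seed: int = 0) -> str:
    """Draw up to n sequences at random and format them as FASTA text."""
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    ids = sorted(seqs)                 # sort first for a deterministic order
    chosen = rng.sample(ids, min(n, len(ids)))
    return "".join(f">{fid}\n{seqs[fid]}\n" for fid in chosen)


fasta_text = sample_fasta(unclassified_seqs, n=3)
print(fasta_text)
```

A random draw is only one option; checking the most abundant unclassified features first is arguably just as reasonable.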

As for whether filtering is artificially changing your diversity metrics: probably not. If my hunch is correct, then keeping those reads is what artificially alters the diversity metrics, and they should be filtered out. But do run BLAST as I recommend above, just to rule out, e.g., technical errors leading to unclassified reads.
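A toy illustration of why this can happen, using made-up presence/absence data rather than anything from your study: if each sample carries its own set of spurious features that no other sample shares, those features enlarge the union but not the intersection, which pushes Jaccard distance toward 1. The `junk` prefix below is just a label for this example.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B| on presence/absence data."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_jaccard(samples: dict) -> float:
    """Mean Jaccard distance over all pairs of samples."""
    pairs = list(combinations(samples.values(), 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Made-up data: samples share classified taxa, but each has unique "junk".
samples = {
    "S1": {"taxA", "taxB", "taxC", "junk1", "junk2", "junk3"},
    "S2": {"taxA", "taxB", "taxD", "junk4", "junk5", "junk6"},
    "S3": {"taxA", "taxC", "taxD", "junk7", "junk8", "junk9"},
}

# Drop the sample-specific junk features, mimicking the taxonomy filter.
filtered_samples = {
    name: {f for f in feats if not f.startswith("junk")}
    for name, feats in samples.items()
}

before = mean_pairwise_jaccard(samples)
after = mean_pairwise_jaccard(filtered_samples)
print(f"mean Jaccard before filtering: {before:.2f}")
print(f"mean Jaccard after filtering:  {after:.2f}")
```

In this toy case the mean distance drops from about 0.80 to 0.50 once the unshared features are removed, even though the shared community is unchanged. Whether that is what happened in your data depends on what those reads actually are, hence the BLAST check.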

Good luck!

Thank you, @Nicholas_Bokulich!

I looked through the reads that were being filtered out, and you are right in that some of them were host DNA, and some were just unknown environmental DNA. A minority appeared to be distantly related to known bacteria, but yeah for the most part it was junk.

I appreciate the advice!!!

