Data has little taxonomy classification


I've run my data through the 18S pipeline and produced taxonomy classifications. When I view the results, little to none of the data is assigned to anything meaningful. Most of the data that I am interested in is binned as Unassigned or Eukaryota.

One of the most overrepresented species, which I would expect to be classified, is not. This is based on the FastQC reports for reads 1 and 2, where this sequence makes up ~20% of reads.

A prior sample set was able to produce assignments beyond unassigned or eukaryota.

What might cause this? Please let me know what additional information I can provide.

Thank you!

Hello again Andrew :wave:

That is strange!

Have you investigated what changed between these two samples / two cohorts? Different primers, different databases, and even different sequencing centers can cause unexpected changes that are eventually observed in the downstream taxonomy. :bar_chart:

Did anything change in your bioinformatics pipeline?
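If it helps to put numbers on the difference, here is a minimal sketch for tallying how deep the assignments go in each run. It assumes your taxonomy labels are semicolon-delimited strings and that unclassified features are labelled "Unassigned"; the function names and the example labels are made up for illustration, so adapt them to whatever your export actually looks like.

```python
from collections import Counter


def assignment_depth(taxon):
    """Count the non-empty ranks in a semicolon-delimited taxonomy string."""
    if taxon.strip().lower() == "unassigned":
        return 0
    ranks = [rank.strip() for rank in taxon.split(";")]
    # Skip empty placeholders such as "" or a bare prefix like "g__".
    return len([rank for rank in ranks if rank and not rank.endswith("__")])


def depth_profile(taxa):
    """Return the fraction of features found at each assignment depth."""
    counts = Counter(assignment_depth(taxon) for taxon in taxa)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {depth: n / total for depth, n in sorted(counts.items())}


# Made-up labels for illustration only, not real results.
example_run = ["Unassigned", "Eukaryota", "Eukaryota", "Eukaryota;SAR"]
print(depth_profile(example_run))
```

Running this on the labels from both sample sets should show whether the second run really stalls at depth 0 or 1 across the board, or only for a few abundant features.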

Hi Colin,

Everything is the same from library prep to the analysis. We sequenced in-house on a MiSeq. The only variables I can think of are the different sample sets, the number of samples within each set, and the fact that they were on separate sequencing runs: 43 samples for the well-classified set and 90 for the poorly classified set. Would the number of reads per sample per feature affect the classification this much, even for a sequence that makes up ~20% of all reads (raw reads, that is)? I've lowered the confidence value to 0.5 and then 0.3, and that changed very little.

It will cause other issues, but classification should be the same... Each sequence is classified on its own, so you can accurately classify a single sequence.

What is different about these samples biologically? It sounds like you expect some similar microbes and I'm also interested in how you expect them to be different and the reasons for that.

Thank you for pointing that out. That makes sense now.

These samples are human stool. The difference is that each sample comes from a different individual, and each set comes from a different locale. We are primarily interested in detecting helminths, mainly nematodes, as a way to confirm presence/absence where two diagnostic methods are discordant.

The sequences I mentioned, which are found at high frequency aside from bacteria, are derived from Blastocystis spp. It’s just a point of reference, because the first set (43) had no issue classifying that one, while the other set did, as it did with many others. Taxa composition will vary, of course.

I am certain that I used the exact same classifier, but I guess I should double-check that. Is there any other information that would be helpful in figuring this out? It seems like it is not a matter of the sequencing, but rather the samples, some mistake in the analysis pipeline, or the reference database?

It could be anything, which is why I'm asking more broadly about differences. I suppose it makes sense to double check that consistent methods stayed consistent (same primers, same database).
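One cheap way to double check the classifier is to compare checksums of the file used in each run. This is only a sketch: the helper names and the paths in the comment are hypothetical. A matching checksum tells you the two files are byte-for-byte identical; a mismatch only tells you the files differ, not why.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True if both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)


# Hypothetical usage; replace the paths with your own files:
# print(same_file_contents("run1/classifier.qza", "run2/classifier.qza"))
```

The same check works for the reference sequences and for primer files, if you keep separate copies per run.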

Is it possible that the second set that showed fewer classified reads did not have the expected microbe at all? What kinds of positive controls did you run?
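To see whether the overrepresented sequence from your FastQC report actually made it into the sequences being classified, you could search for it directly. Below is a rough sketch in plain Python, assuming you can export those sequences as FASTA; the function names and the toy sequences are made up. It checks both orientations, since the reported sequence may come from the reverse read.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_matches(query, fasta_lines):
    """Headers of records containing the query or its reverse complement."""
    query = query.upper()
    rc = reverse_complement(query)
    return [
        header
        for header, seq in read_fasta(fasta_lines)
        if query in seq.upper() or rc in seq.upper()
    ]


# Toy records for illustration only.
toy_fasta = [">feat1", "ACGTTTGACC", ">feat2", "GGGGCCCC"]
print(find_matches("TTGA", toy_fasta))
```

With a real file you would pass an open file handle instead of the toy list. If nothing matches, try an internal stretch of the query, since the start of the reported sequence may include primer bases that were trimmed later. Any feature IDs that do match can then be looked up in the taxonomy table to see what they were assigned.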