Data has little taxonomy classification

andrew_g · September 10, 2023, 3:52pm

Hello!

I've run my data through the 18S pipeline and produced taxonomy classifications. When I view the results, little to nothing is assigned to anything meaningful. Most of the data that I am interested in is binned as Unassigned or Eukaryota.

One of the most overrepresented species, which I would expect to be classified, is not. This is based on the FastQC reports, where reads 1 and 2 have this sequence at ~20% of reads.
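For reference, that share can also be checked straight from a FASTQ file rather than from the FastQC report. This is only a minimal sketch, assuming a standard four-lines-per-record FASTQ (plain or gzip-compressed) and an exact substring match; the function name is a hypothetical helper, not part of any pipeline:

```python
import gzip


def fraction_of_reads_containing(fastq_path, query):
    """Return the fraction of reads whose sequence contains `query` exactly."""
    # Pick the opener from the file extension (assumption: .gz means gzip).
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    query = query.upper()
    total = 0
    hits = 0
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a four-line FASTQ record, the second line is the sequence.
            if line_number % 4 == 1:
                total += 1
                if query in line.strip().upper():
                    hits += 1
    return hits / total if total else 0.0
```

Exact matching will undercount reads with sequencing errors, so the result is a rough lower bound rather than a replacement for the FastQC figure.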

A prior sample set did produce assignments beyond Unassigned or Eukaryota.

What might cause this? Please let me know what additional information I can provide.

Thank you!
Andrew

colinbrislawn · September 10, 2023, 4:07pm

Hello again Andrew

That is strange!

Have you investigated what changed between these two samples / two cohorts? Different primers, different databases, and even different sequencing centers can cause unexpected changes that are eventually observed in the downstream taxonomy.

Did anything change in your bioinformatics pipeline?

andrew_g · September 11, 2023, 7:47am

Hi Colin,

Everything is the same from library prep to the analysis. We sequenced in-house on a MiSeq. The only variables I can think of are the different sample sets, the number of samples within each set, and the fact that they were on separate sequencing runs: 43 samples for the well classified set, and 90 for the poorly classified set. Would the number of reads per sample per feature affect the classification this much, even for a sequence that makes up ~20% of all reads (raw reads, that is)? I've lowered the confidence value to 0.5 and then 0.3, and that changed very little.

colinbrislawn · September 14, 2023, 4:11am

Differences in read depth will cause other issues, but classification should be the same... Each sequence is classified on its own, so you can accurately classify a single sequence.

What is different about these samples biologically? It sounds like you expect some similar microbes and I'm also interested in how you expect them to be different and the reasons for that.

andrew_g · September 14, 2023, 3:04pm

Thank you for pointing that out. That makes sense now.

These samples are human stool. The difference is that each sample is from a different individual, and each set is from a different locale. We are primarily interested in detecting helminths, mainly nematodes, as a way to confirm presence/absence where two diagnostic methods are discordant.

The sequences that I am mentioning, which are found at high frequency aside from bacteria, are derived from Blastocystis spp. It’s just a point of reference: the first set (43) had no issue classifying that one, while the other set failed to classify it along with many others. Taxa composition will vary, of course.

I am certain that I used the exact same classifier. I guess I should double check that. Is there any other information that would be helpful in figuring this out? It seems like it is not a matter of the sequencing, but rather the samples, some mistake in the analysis pipeline, or the reference database?

colinbrislawn · September 14, 2023, 4:02pm

It could be anything, which is why I'm asking more broadly about differences. I suppose it makes sense to double check that the methods you expect to be consistent really were (same primers, same database).
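One simple way to double check the classifier is to compare checksums of the files used in each run. Here is a minimal sketch in Python, assuming each run points at a classifier file on disk; the function names and the paths are hypothetical placeholders:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True if both files are byte-for-byte identical (by SHA-256)."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


if __name__ == "__main__":
    # Hypothetical paths: replace with the classifier used for each run.
    run1 = Path("run1/classifier.qza")
    run2 = Path("run2/classifier.qza")
    if run1.exists() and run2.exists():
        print("identical" if same_file_contents(run1, run2) else "different")
    else:
        print("one or both files not found")
```

Matching digests mean both runs used the same copy of the file. Different digests only show that the files are not byte-identical; they do not say what changed, so that would be the point to look at how each classifier was built.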

Is it possible that the second set that showed fewer classified reads did not have the expected microbe at all? What kinds of positive controls did you run?

system · October 15, 2023, 10:03pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.