SourceTracker2 with QIIME2 result

Hi, community,

I want to use SourceTracker2 to track sources for my sequenced impaired stream water samples (paired-end, 250 bp). I also downloaded data from the EBI-ENA website that were generated with the same primers I used. The only difference is that the downloaded data are from the Earth Microbiome Project (cat feces, dog feces, human feces, etc.) and are single-end sequences of unequal length (100~150 bp).

I used deblur to trim the sequences to the same length (130bp/100bp) for all datasets and produce the feature table, and then ran SourceTracker on it. However, 90% of the sources cannot be detected. I am so confused.

Then I tried to match these sources to wastewater. Still, 80% of the sink microbiome is unknown, and only <10% is attributed to human feces. I think this must be wrong. When I went back to check the feature table, I found only a few overlapping features.

Therefore, I am wondering how QIIME2 deblur handles this. It aims to find sequence variants. What if the reads I sequenced myself differ by one base from the downloaded reads? For example:

My A bacteria ASV: ATGCTGC
Downloaded A bacteria ASV: ATGCTG

Will these two sequences be classified as two different ASVs, even though they are from the same bacterium? That may be why SourceTracker cannot detect the sources.

Could you please help me solve this problem? Will sequences of different lengths, amplified with the same primers, be classified as different ASVs? Or would it be possible to use traditional OTU-clustering methods instead?

Tons of thanks.

That is correct. ASVs that are substrings of another ASV will remain as separate ASVs, and that is usually how you would want this (with 1 nt difference maybe not, but we can imagine with 20 nt difference that those remaining 20 nt could diverge from existing ASVs; what if there are 2 equally good matches? you get into knotty territory if you start thinking about collapsing similar ASVs).
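A quick way to check whether length is what hides the overlap is to compare the feature sets exactly and then again after truncating both to a common length. This is only a minimal sketch in plain Python: `shared_features` and the toy sequences are hypothetical, and it assumes both datasets start at the same position (same forward primer), so a shorter ASV is a prefix of the longer one.

```python
def shared_features(sink_asvs, source_asvs, length=None):
    """Return the ASV sequences present in both sets.

    If `length` is given, every sequence is first truncated to that
    length, and sequences shorter than `length` are dropped.
    (Hypothetical helper for illustration only.)
    """
    if length is not None:
        sink_asvs = {s[:length] for s in sink_asvs if len(s) >= length}
        source_asvs = {s[:length] for s in source_asvs if len(s) >= length}
    return set(sink_asvs) & set(source_asvs)


# Toy data based on the example in the question.
sink = {"ATGCTGC", "GGCATTA"}    # my own ASVs
source = {"ATGCTG", "TTTACG"}    # downloaded ASVs

exact = shared_features(sink, source)               # exact string match
trimmed = shared_features(sink, source, length=6)   # match after truncation

print(len(exact), len(trimmed))
```

If the number of shared features jumps after truncation, the missing overlap is most likely a length artifact rather than a biological difference.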

You could use OTU picking to collapse similar ASVs, though you will be losing some sensitivity there.

I would recommend trimming all sequences to the same exact length and then dereplicating those sequences. If that means you need to toss some sequences because they are too short, or need to trim longer sequences to 100 nt, so be it — that’s what you will need to do here because otherwise ASVs of different lengths will automatically be unique.
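As a rough illustration of "trim to the same exact length, then dereplicate", here is a sketch in plain Python. It is not the QIIME 2 or deblur implementation: `trim_and_dereplicate`, the nested-dict table layout, and the sample names are all assumptions made for the example.

```python
from collections import defaultdict


def trim_and_dereplicate(tables, length):
    """Trim every ASV to `length` and sum the counts of identical results.

    `tables` is a list of dicts mapping ASV sequence -> {sample_id: count}
    (an assumed layout, not a QIIME 2 artifact). Sequences shorter than
    `length` are discarded.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for seq, counts in table.items():
            if len(seq) < length:
                continue  # too short to compare at this length, toss it
            key = seq[:length]
            for sample, n in counts.items():
                merged[key][sample] += n
    # Convert nested defaultdicts back to plain dicts for easy inspection.
    return {seq: dict(counts) for seq, counts in merged.items()}


# Toy tables based on the example in the question.
mine = {"ATGCTGC": {"stream_1": 12}}
downloaded = {"ATGCTG": {"human_feces_1": 30}, "ATGC": {"dog_feces_1": 5}}

combined = trim_and_dereplicate([mine, downloaded], length=6)
print(combined)
```

After trimming, the two variants from the question collapse into one feature that carries counts from both the sink and the source sample, which is what SourceTracker needs in order to see the overlap.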

You may not be training on enough source types; the OTUs may not be source specific; or they are truly unknown. The fact that 10% of sources could be identified clearly means something was working. If this is not related to the length issues (my money is on length differences), this is probably a question for the sourcetracker developers.

I hope that helps!

@Nicholas_Bokulich Amazing answers! Thanks so much for the help. :wink:

