Concern about merging feature table artifacts in the tutorial of Fecal microbiota transplant (FMT) study

“We’re therefore ready to merge the artifacts generated by those two commands. First we’ll merge the two FeatureTable[Frequency] artifacts, and then we’ll merge the two FeatureData[Sequence] artifacts. This is possible because the feature ids generated in each run of denoise-single are directly comparable (in this case, the feature id is the md5 hash of the sequence defining the feature).” (copied from the tutorial)

I have a concern about the merging: as described in the tutorial, dataset A and dataset B are analyzed separately by DADA2 and then merged. However, I think the resulting OTUs won’t be exactly the same as they would be if the sequences from datasets A and B were merged first and then analyzed together by DADA2, because of the OTU clustering. In other words, “OTU-a” in dataset A and “OTU-b” in dataset B might belong to the same “OTU-ab” if the sequence data were merged in the first place, even though OTU-a, OTU-b, and OTU-ab are almost identical sequences.

So we may get an inflated number of OTUs if we do the merging as the tutorial describes. The inflated OTU count won’t affect the results much after collapsing, but it will at least harm the diversity analyses, won’t it? I’m not sure whether I understand this correctly.

Thank you so much.

Cheng

Hi @gc26762524,

That is a completely correct understanding of de novo OTU clustering and the perils of merging such OTUs. However, DADA2 is a denoising algorithm, so no clustering actually occurs (you could, for instance, perform clustering with vsearch afterwards if you were interested in that). Instead we get amplicon sequence variants, which are error-corrected “true” sequences. DADA2 does some very fancy math to work out what the true sequence should be. Because each feature is an exact sequence rather than a cluster, the same sequence gets the same feature id in every run, which is what makes the tables directly comparable.
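To make the feature id point concrete, here is a minimal Python sketch (not QIIME 2 code). It assumes, as the tutorial states, that a feature id is the md5 hash of the sequence. The sequences, sample names, and the `feature_id` and `to_table` helpers are made up for illustration.

```python
import hashlib


def feature_id(sequence: str) -> str:
    """Return the md5 hex digest of a sequence, used here as its feature id."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def to_table(run):
    """Convert {sample: {sequence: count}} into {sample: {feature_id: count}}."""
    return {
        sample: {feature_id(seq): count for seq, count in counts.items()}
        for sample, counts in run.items()
    }


# Hypothetical denoised output from two independent runs.
run_a = {"sample-A1": {"ACGTACGTAC": 10, "TTGACCGTAA": 4}}
run_b = {"sample-B1": {"ACGTACGTAC": 7, "GGCATTCAGT": 2}}

# The runs contain different samples, so merging is a union of the rows.
merged = {**to_table(run_a), **to_table(run_b)}

# The sequence shared by both runs maps to one feature id, not two.
features = set()
for row in merged.values():
    features.update(row)

shared = feature_id("ACGTACGTAC")
print(len(features))
print(merged["sample-A1"][shared], merged["sample-B1"][shared])
```

The shared sequence ends up as a single column in the merged table, so nothing is inflated as long as both runs produce exactly the same sequence for the same variant.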

What’s important for merging with respect to DADA2 is making sure DADA2 has the same opportunity to correct to the same sequence. This means that the lengths and starting positions need to be the same between DADA2 runs. This is easier with paired-end data, as the 3’ ends overlap (ideally), so between runs you’ll always have the same primer pairs (ideally) and there’s a bit less to coordinate. But it’s not hard to do this for single-end data either.
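A small Python sketch of why the lengths and starting positions matter, again assuming md5-based feature ids as in the tutorial. The sequence and the cut points are hypothetical; in QIIME 2 the corresponding settings would, I believe, be the `--p-trunc-len` and `--p-trim-left` parameters of `denoise-single`.

```python
import hashlib


def feature_id(sequence: str) -> str:
    """Return the md5 hex digest of a sequence, used here as its feature id."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Hypothetical true amplicon sequence present in both datasets.
true_sequence = "ACGTACGTACGGTTCAGGCATT"

# Run A: reads start at position 0 and are truncated to 20 bases.
run_a = true_sequence[:20]

# Run B with the same settings, a shorter truncation, and a shifted start.
run_b_same = true_sequence[:20]
run_b_shorter = true_sequence[:18]
run_b_shifted = true_sequence[2:22]

same_settings_match = feature_id(run_a) == feature_id(run_b_same)
shorter_matches = feature_id(run_a) == feature_id(run_b_shorter)
shifted_matches = feature_id(run_a) == feature_id(run_b_shifted)

print(same_settings_match, shorter_matches, shifted_matches)
```

Only the run with identical settings yields the same feature id. With a different length or start, the same organism would show up as two separate features after merging, which is the inflation to avoid.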

Hope that helps!

Hi @ebolyen, thanks for the help, which is always very informative. I had taken OTU clustering for granted. I have a better understanding of the entire tutorial after reading your post and the DADA2 paper. Thanks again.

Cheng
