dada2 merging and comparing different data sets

Alya_Heirali · February 1, 2021, 5:38pm

Hello everyone,

I would like to compare two different data sets (ROMA1 and ROMA2) looking at the gut microbiome of patients with cancers undergoing various treatment regimens.

ROMA1 was sequenced on an Illumina MiSeq platform with 150 bp paired-end (PE) reads. The V4 region was amplified with barcoded 515F and 806R primers.

ROMA2 was also sequenced on an Illumina MiSeq, with 300 bp PE reads. The V4 region was amplified with the 515F (barcoded) and 806R primers.

I ran the two data sets separately in dada2.

For ROMA1 I used the following filtering parameters

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(142, 140), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE) # the primers were not in the reads, so I didn't need to remove them

Then I proceeded with the steps in the dada2 tutorial, removed chimeras, and built an OTU table.

For ROMA2 I used the following filtering parameters

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=28, trimRight=20, truncLen=c(260, 220), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

I then made my OTU table for ROMA2.

Next I merged the OTU tables using mergeSequenceTables and then assigned taxonomy using the SILVA v138 database.

My questions/comments:

Is it appropriate to compare the two data sets given these differences?

I tried truncating the ROMA2 reads to 142 and 140 (the ROMA1 settings) to make the two data sets more comparable, but then had difficulty merging the forward and reverse reads.

I think it is appropriate, as I am rarefying/normalizing my data to account for the different sequencing depths of the two data sets, but I wanted to confirm.

Is it appropriate to merge ASVs belonging to the same species?

Any thoughts/tips would be much appreciated.

Thanks,
Alya

colinbrislawn · March 7, 2021, 8:39pm

Hello Alya,

Welcome to the forums! Sorry to keep you waiting on an answer. It's been quite the February!

Yes, it is appropriate to compare the two data sets! See this great discussion on GitHub about processing data from multiple runs.

As long as the region sequenced is the same after merging, you can pick the truncLen settings that work best for your data! See this comment from the dev:

It's fine to use different truncLen parameters as long as the read pairs still overlap and are mergable at the end. The truncLen setting doesn't affect the merged amplicon region, it just affects the amount of overlap between the two reads.
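The overlap arithmetic behind that comment can be sketched in a few lines of base R. This is a rough illustration and not something from the thread: it assumes a V4 amplicon of about 253 nt between the primers, primer lengths of 19 nt (515F) and 20 nt (806R), and the mergePairs default minOverlap of 12. The helper name expected_overlap is made up for this example.

```r
# Rough overlap estimate for paired-end reads (illustrative assumptions only).
v4_insert   <- 253  # assumed typical V4 length between 515F and 806R
primer_fwd  <- 19   # length of 515F (GTGCCAGCMGCCGCGGTAA)
primer_rev  <- 20   # length of 806R (GGACTACHVGGGTWTCTAAT)
min_overlap <- 12   # default minOverlap in dada2::mergePairs

# Hypothetical helper: bases shared by the two reads after truncation.
# A negative value means the reads do not reach each other.
expected_overlap <- function(len_fwd, len_rev, amplicon_len) {
  len_fwd + len_rev - amplicon_len
}

# ROMA1: primers absent from the reads, truncLen = c(142, 140)
roma1 <- expected_overlap(142, 140, v4_insert)

# ROMA2 truncated to c(142, 140) while the primers are still on the reads
roma2_short <- expected_overlap(142, 140, v4_insert + primer_fwd + primer_rev)

roma1                        # 29
roma2_short                  # -10
roma1 >= min_overlap         # TRUE
roma2_short >= min_overlap   # FALSE
```

Under these assumptions, truncating ROMA2 to 142/140 leaves no overlap if the primers are still on the reads, which may be why that attempt failed to merge. Any barcode or spacer bases at the start of the reads would shrink the overlap further.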

I think the method you describe here should work great!
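One quick sanity check before relying on the merged table is to confirm that both runs really end up covering the same region. The sketch below uses base R on two toy tables; the object names seqtab_roma1 and seqtab_roma2 and the short sequences are invented for illustration, and real V4 ASVs would be far longer.

```r
# Toy stand-ins for the two per-run sequence tables
# (rows = samples, columns = ASV sequences). All values are invented.
seqtab_roma1 <- matrix(c(12L, 3L, 7L, 0L), nrow = 2,
  dimnames = list(c("R1_a", "R1_b"), c("ACGTACGTAC", "ACGTTCGTAC")))
seqtab_roma2 <- matrix(c(5L, 9L, 1L, 4L), nrow = 2,
  dimnames = list(c("R2_a", "R2_b"), c("ACGTACGTAC", "GGACGTTCGTACTT")))

# ASV length distribution per run: the same region should give matching peaks.
len_roma1 <- table(nchar(colnames(seqtab_roma1)))
len_roma2 <- table(nchar(colnames(seqtab_roma2)))
len_roma1
len_roma2

# ASVs found in both runs, matched by exact sequence.
shared <- intersect(colnames(seqtab_roma1), colnames(seqtab_roma2))
shared
```

ASVs are matched by exact sequence, so if the two runs show different length peaks, or share almost no ASVs, the trimming probably left different flanking bases on each data set and the same organism will appear as separate ASVs in the merged table. In the toy data, the second ROMA2 sequence contains a ROMA1 sequence plus extra bases at both ends, so it does not count as shared.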

Whether to merge ASVs belonging to the same species is a good question. I think the consensus on the QIIME 2 forums is to keep your ASVs separate for most downstream analysis so that you can make use of their subspecies resolution. However, it can be convenient to talk about 'species' instead of 'ASVs of the species' in the discussion section, and I like to merge features by taxonomy when I make stacked bar plots. Your call!
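For the stacked-bar-plot case, collapsing an ASV table by taxonomy can be sketched in base R as follows. This is a minimal toy example, not the workflow used in the thread: the counts, sample names, and species labels are invented, and the table is assumed to be in the dada2 orientation (samples in rows, ASVs in columns).

```r
# Toy ASV table in dada2 orientation: rows = samples, columns = ASVs.
# Counts and species labels are invented for illustration.
seqtab <- matrix(
  c(10, 5, 0,
     2, 8, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"), c("ASV1", "ASV2", "ASV3"))
)

# One species label per ASV column, in the same order; NA = no species call.
species <- c("Bacteroides fragilis", "Bacteroides fragilis", NA)
species[is.na(species)] <- "Unassigned"

# rowsum() sums rows by group, so transpose to put ASVs on rows, then flip back.
by_species <- t(rowsum(t(seqtab), group = species))
by_species
```

The collapsed table keeps every read (row sums are unchanged), so it is safe for plotting, while the original ASV-level table stays available for diversity and differential-abundance analyses.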

Let us know if you have any other questions!
Colin

P.S. Microbiome Datasets Are Compositional: And This Is Not Optional (Gloor et al., 2017)