Merging two runs of the same region, one sequenced 2x150 and the other 2x250

Cotissima · March 1, 2020, 3:13am

Hi!
I am analysing faecal samples that were collected and sequenced in two consecutive years. All the samples are from different individuals. The problem is that the sequencing method changed unexpectedly between years. The first-year samples were sequenced with Illumina 2x250 and the second batch with 2x150. Both cover the same region when paired. I am now a bit confused about how to manage my data.
Initially, I ran my data by combining both data sets from the very beginning and trimming all the reads down to 150. Then I doubted whether that was the best approach, as I thought I could lose valuable information. I must say this may be more psychological than logical, since I know that after pairing they both cover the same region.
Just to see what the outcome would be, I decided to re-run the samples, this time denoising each batch separately and trimming accordingly. Then I merged them and, to my surprise, I got around 200 more features than when running the samples combined, but the core microbiome was greatly affected (in a way that doesn't make sense).

I was thinking of staying with my first approach, but was wondering whether you have any suggestions? I am especially concerned about the trimming part (should I, should I not?).

I am assuming that if I end up with length differences between the two groups, then once they are merged, identical sequences that differ only in length will be catalogued as different features, right?

SOS!!

jwdebelius · March 1, 2020, 10:10am

Hi @Cotissima,

My experience (and I'm trying to remember if/where there's a benchmark) is that your life will be better if all your reads are the same length or in the same length family. You get more resolution with longer reads (pro), but length also changes how the denoised reads behave when they are aligned or clustered. So, I guess my advice would be to see what read length you can reasonably get out of the 2x150 reads, and then match your 2x250 reads to it. It's unfortunate that they are different lengths, but at least you're not adding additional bias.
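To make the length point concrete, here is a minimal sketch on invented toy sequences. It assumes features are labelled by the MD5 hash of their sequence (as far as I know that is the default in QIIME 2's DADA2 plugin, but treat it as an assumption); the helper `feature_id` is hypothetical and only there for illustration.

```python
import hashlib

def feature_id(seq: str) -> str:
    """MD5 hex digest of a sequence (assumed ID scheme, see the note above)."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()

# Toy sequences, invented for illustration only.
asv_short = "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAG"
asv_long = asv_short + "GTGGACAGTTAAGTCAGTTG"

# Same start, different length: the IDs do not match.
ids_differ = feature_id(asv_short) != feature_id(asv_long)

# Cut both to a common length: the IDs match.
common_len = min(len(asv_short), len(asv_long))
ids_match = feature_id(asv_short[:common_len]) == feature_id(asv_long[:common_len])

print(ids_differ, ids_match)  # True True
```

In a real analysis the length is set on the reads before denoising rather than on the finished sequences, so this only illustrates why matching lengths matters, not how to do the trimming.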

Best,
Justine

Cotissima · March 1, 2020, 12:08pm

Thank you so much. I have done what you suggested and reprocessed the 2x250 data following the 2x150 settings. Now I want to merge both the rep-seqs and the feature tables. But when I tried for the first time, it seemed that the features were just being added together in the same file without really being combined (if you know what I mean). This leaves my merged rep-seqs and feature table inflated. Maybe I am using the incorrect argument (sum). I would like identical sequences from the two groups to be identified as one feature rather than as duplicates. How could I get around this?

jwdebelius · March 1, 2020, 4:12pm

Hi @Cotissima,

I think this is either because you have different sequences in the two datasets, or because your sequence length is different. Would you be willing to post or PM me the rep set so I could take a look?
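One rough way to tell those two causes apart is sketched below on made-up tables. It stores each table as a plain `{sample: {sequence: count}}` dict, merges by summing, and then looks for features in one dataset that are a proper prefix of a feature in the other, which would point to a length mismatch. The names `merge_sum`, `features` and `prefix_pairs` are hypothetical helpers, not part of any QIIME 2 API, and the sequences are invented.

```python
from collections import defaultdict

def merge_sum(*tables):
    """Merge {sample: {sequence: count}} tables, summing shared cells."""
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for sample, counts in table.items():
            for seq, n in counts.items():
                merged[sample][seq] += n
    return {s: dict(c) for s, c in merged.items()}

def features(table):
    """Set of all sequences observed in a table."""
    return {seq for counts in table.values() for seq in counts}

def prefix_pairs(table_a, table_b):
    """(short, long) pairs where a feature of one table is a proper
    prefix of a feature of the other."""
    pairs = set()
    for x in features(table_a):
        for y in features(table_b):
            short, long_ = sorted((x, y), key=len)
            if len(short) < len(long_) and long_.startswith(short):
                pairs.add((short, long_))
    return pairs

# Toy data: year 2 denoised to a shorter length, then to a matching length.
year1 = {"s1": {"ACGTACGTAA": 10, "TTGCATGCAT": 4}}
year2_mismatched = {"s2": {"ACGTACGT": 7, "TTGCATGC": 3}}
year2_matched = {"s2": {"ACGTACGTAA": 7, "TTGCATGCAT": 3}}

inflated = merge_sum(year1, year2_mismatched)
fixed = merge_sum(year1, year2_matched)

print(len(features(inflated)), len(features(fixed)))  # 4 2
print(len(prefix_pairs(year1, year2_mismatched)))     # 2
```

If the prefix check finds many pairs, the inflation is probably a length problem; if it finds none, the two datasets may simply contain different sequences.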

Best,
Justine

Cotissima · March 2, 2020, 3:47am

Thanks for your quick response. I think it works now; I've got something that makes sense. I have attached the merged rep-seqs from both before (rep-seq-merged-1.qza, 74.0 KB) and after (rep-seq-merged.qza, 60.4 KB) fixing the length issue.

jwdebelius · March 3, 2020, 7:25pm

Hi @Cotissima,

The second looks better to me! I'm glad you found a solution.

Best,
Justine
