Analysing with multiple public datasets

Shinthotrang · June 5, 2025, 11:06am

Greetings everyone!

I have some questions about analysing multiple public datasets downloaded from the SRA database.

These datasets target the same region (V3-V4) and have the same length (600).

Based on what I have read on: re-analysis different dataset with single end or paired end reads

Is it correct that with 2+ datasets we can use different trunc-len parameters, as long as the other parameters are the same?

Example:
Dataset 1: 30 samples, V3-V4, 600 amplicon.
qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trim-left-f 5 --p-trim-left-r 5 --p-trunc-len-f 287 --p-trunc-len-r 210 --p-n-threads 0 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats stats-dada2.qza --verbose

Dataset 2: 12 samples, V3-V4, 600 amplicon
qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trim-left-f 5 --p-trim-left-r 5 --p-trunc-len-f 275 --p-trunc-len-r 230 --p-n-threads 0 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats stats-dada2.qza --verbose

Additionally, if I have one dataset covering the V4 region and another covering the V3-V4 region, it is not recommended to combine the two after denoising them separately, right?

And I have 1 last question:
With the same target region (V3-V4) but different amplicon lengths (let's say 500 nt and 600 nt), is it possible to combine the DADA2 results?

timanix · June 5, 2025, 12:07pm

Hello!

That is correct! You can experiment with the "--p-trunc-len" parameters if all the other settings are identical. I would also remove the primers with cutadapt before dada2 instead of trimming.
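As a rough illustration of why the two truncation settings above can still give comparable sequences, here is a back-of-the-envelope sketch in plain Python (not a QIIME 2 command). It assumes the reads start at the amplicon ends and that the amplicon, primers included, is about 460 nt long; that number is an assumption, so substitute the value for your own primers. The helper names are made up for this example.

```python
# Back-of-the-envelope overlap check for paired-end truncation settings.
# ASSUMPTION: reads start at the amplicon ends and the amplicon (primers
# included) is about 460 nt long; replace this with your own value.

def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Number of bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_len


def merged_length(amplicon_len: int, trim_left_f: int, trim_left_r: int) -> int:
    """Length of the merged sequence; note that trunc-len does not appear."""
    return amplicon_len - trim_left_f - trim_left_r


AMPLICON_LEN = 460  # assumed, not measured from the datasets in this thread

settings = {
    "dataset 1": (287, 210),  # --p-trunc-len-f, --p-trunc-len-r
    "dataset 2": (275, 230),
}

for name, (trunc_f, trunc_r) in settings.items():
    overlap = expected_overlap(trunc_f, trunc_r, AMPLICON_LEN)
    length = merged_length(AMPLICON_LEN, 5, 5)  # trim-left 5 on both reads
    print(f"{name}: overlap ~{overlap} nt, merged length ~{length} nt")
```

Under these assumptions both settings leave a few dozen bases of overlap and the merged sequences span the same positions, because the ends are set by the trim-left values rather than by trunc-len. Real V3-V4 amplicons vary in length between taxa, so treat the overlap as an estimate and keep some margin above the minimum overlap DADA2 requires for merging (12 nt by default, as far as I know).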

That is correct again! You can still process both datasets in parallel to see if results from one dataset follow the same trend as the second one.
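To see why the tables would not line up, remember that ASVs are compared as exact sequences. A minimal sketch in plain Python, assuming feature IDs are MD5 hashes of the sequence (which, to my knowledge, is the default in q2-dada2) and using invented toy sequences:

```python
import hashlib


def feature_id(sequence: str) -> str:
    """MD5 hex digest of the sequence, used here as the ASV identifier."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Toy sequences, invented for illustration only.
v3 = "ACGTTGCAAGGCTTACGGAT"
v4 = "TTGACCGGTAACTGGCATCA"

v3v4_asv = v3 + v4  # what a V3-V4 dataset would report for an organism
v4_asv = v4         # what a V4-only dataset would report for the same organism

print("V3-V4 feature ID:", feature_id(v3v4_asv))
print("V4 feature ID:   ", feature_id(v4_asv))
print("same feature?    ", feature_id(v3v4_asv) == feature_id(v4_asv))
```

Even though the V4 sequence is contained in the V3-V4 one, the two get unrelated IDs, so a merged table would list the same organism as two separate features.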

How did you calculate the length? Different primers? Or sequencing technology (250x2 vs 300x2)?
If primers are different, but target the same region, then you need to use the primers that are present in sequences of both datasets, remove them and process separately, and then combine.
If it is 250x2 vs 300x2, then it should cause no issues if your reads are overlapping and merging in both datasets.

Best,
Timur

Shinthotrang · June 5, 2025, 12:48pm

Thank you for the confirmation and the detailed help, Timanix.

The primers are different, but yes, they target the same region (V3-V4). The length difference comes from the sequencing technology (250x2 and 300x2).

From what I understand, as long as I remove the primers and denoise the datasets separately, and the reads in dataset 1 and in dataset 2 each merge properly (that is, each has a large enough overlapping region), then I can combine them using "qiime feature-table merge" and conduct further analysis from there. Is that correct?

timanix · June 5, 2025, 1:12pm

That is correct, if the following assumptions are true:

Sequences after primer removal are identical for the same bacteria in both datasets. This is important for merging the datasets, since otherwise even the same bacteria will produce different ASVs. If the primers differ only in their degenerate nucleotides, but not in their start/end positions, then you can use the primers that were used for amplification of each dataset.
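Here is a small sketch of that point in plain Python (it only mimics primer removal; it is not what cutadapt does internally). The read, the non-degenerate primer variant and the shortened primer are hypothetical and exist only for this example.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def strip_primer(read: str, primer: str) -> str:
    """Remove `primer` from the start of `read`; raise if it is not there."""
    pattern = "".join(IUPAC[base] for base in primer)
    match = re.match(pattern, read)
    if match is None:
        raise ValueError("primer not found at the start of the read")
    return read[match.end():]


# Toy read: a primer site followed by an invented insert.
insert = "TGGGGAATATTGCACAATGG"
read = "CCTACGGGAGGCAGCAG" + insert

primer_a = "CCTACGGGNGGCWGCAG"  # degenerate primer
primer_b = "CCTACGGGAGGCAGCAG"  # hypothetical variant without degenerate bases
primer_c = "CCTACGGGNGGCWG"     # hypothetical primer ending 3 nt earlier

print(strip_primer(read, primer_a))  # insert only
print(strip_primer(read, primer_b))  # same insert: only degenerate bases differ
print(strip_primer(read, primer_c))  # 3 nt longer: the end position differs
```

The first two primers leave exactly the same sequence behind, so the resulting ASVs can match across datasets. The third leaves three extra bases at the start, which is enough to make every ASV differ.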

Pay attention to the orientation of reverse primer!
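In paired-end data the reverse primer normally sits at the start of the reverse read in the orientation it was ordered in, whereas in merged reads (or when a forward read runs through it) it shows up reverse-complemented. A minimal helper in plain Python for converting between the two; the primer below is just an example sequence, so use the ones from your own protocol.

```python
# Complement table that also covers the IUPAC degenerate codes.
COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a nucleotide sequence, IUPAC codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


reverse_primer = "GACTACHVGGGTATCTAATCC"  # example primer, for illustration

print("as ordered (5'->3'):", reverse_primer)
print("reverse complement: ", reverse_complement(reverse_primer))
```

Which orientation you pass to the trimming tool depends on where in the read you expect the primer, so it is worth checking a few raw reads by eye first.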

Make sure that the percentage of retained sequences in both datasets is as high as possible. The V3-V4 region sequenced with 250x2 may fail to capture the group of bacteria with the longest V3-V4 region, while the same region sequenced with 300x2 usually suffers from quality issues.
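A quick way to keep an eye on this is to compute the retained percentage per sample from the read counts before and after denoising. The sketch below is plain Python with invented counts and an arbitrary 50% flagging threshold; in practice the numbers would come from your DADA2 denoising stats.

```python
# Per-sample read counts: (reads going in, reads left after denoising).
# All values are invented for illustration.
stats = {
    "run250_S1": (50000, 31000),
    "run250_S2": (42000, 15500),
    "run300_S1": (60000, 45000),
}


def percent_retained(reads_in: int, reads_out: int) -> float:
    """Percentage of input reads that survived; 0.0 for an empty sample."""
    return 100.0 * reads_out / reads_in if reads_in else 0.0


THRESHOLD = 50.0  # arbitrary cut-off for flagging a sample

for sample, (reads_in, reads_out) in stats.items():
    pct = percent_retained(reads_in, reads_out)
    flag = "  <-- check merging / quality" if pct < THRESHOLD else ""
    print(f"{sample}: {pct:.1f}% retained{flag}")
```

If one run loses far more reads than the other at the merging step, that suggests the truncation settings (or the read length itself) are cutting off the overlap for some taxa.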

Hope that helps.

SoilRotifer · June 5, 2025, 4:49pm

Hi @Shinthotrang,

Just a few things I'd like to add to @timanix's great comments...

You should be aware of these confounding issues:

1. PCR reaction biases
- PCR primers introduce amplification biases that will not necessarily be fixed by trimming to the same length. Often the composition and diversity of samples are associated primarily with the PCR primers used rather than with the biological signal you are investigating.
2. Truncation / trimming parameters
- These should be identical for the same primer pairs. That is, if you are using the same amplicon across different runs, and truncate the reverse read by just a few base pairs for one run and not the other, you'll generate many non-overlapping ASVs between the runs. This is why it is often recommended to use the same parameters for all runs that use the same primers.
- In your case, when trying to combine sequencing runs generated from different primer sets, even if these primers target the same region and only differ by a few bases, you'll observe even more exaggerated problems from #1 above, with the additional confounding problem of requiring different trimming / truncation values.
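To make the non-overlapping ASV problem concrete, here is a toy model in plain Python. It assumes the simplest case, where the truncation setting directly determines the final sequence length, and it uses invented sequences and counts. `merge_tables` only imitates the idea of a feature-table merge; it is not the QIIME 2 implementation.

```python
from collections import Counter


def truncate(seq: str, length: int) -> str:
    """Keep the first `length` bases of a sequence."""
    return seq[:length]


def merge_tables(*tables: dict) -> dict:
    """Combine sample -> {sequence: count} tables into one table."""
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            merged.setdefault(sample, Counter()).update(counts)
    return merged


def shared_features(table_a: dict, table_b: dict) -> set:
    """Sequences that occur in both tables."""
    feats_a = {f for counts in table_a.values() for f in counts}
    feats_b = {f for counts in table_b.values() for f in counts}
    return feats_a & feats_b


# Two toy "organisms" (invented sequences) present in both runs.
organisms = ["ACGTACGTTGCAAGGCTTAC", "TTGACCGGTAACTGGCATCA"]

run1 = {"run1_s1": Counter({truncate(o, 18): 100 for o in organisms})}
run2_same = {"run2_s1": Counter({truncate(o, 18): 80 for o in organisms})}
run2_shorter = {"run2_s1": Counter({truncate(o, 16): 80 for o in organisms})}

print("shared, same truncation:   ", len(shared_features(run1, run2_same)))
print("shared, 2 nt shorter in run2:", len(shared_features(run1, run2_shorter)))

merged = merge_tables(run1, run2_shorter)
total = len({f for counts in merged.values() for f in counts})
print("features in merged table:  ", total)
```

With matching settings the two runs share both features. With a 2 nt difference they share none, and the merged table reports four features for two organisms, which would look like a run effect in any downstream diversity analysis.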

The TL;DR is that using different trimming / truncation parameters along with different primer pairs, across different runs, can lead to lots of spurious interpretations.

Check out this post.