Merging across multiple batches gives an exorbitant feature count

Version: qiime2-2022.8
Installation: conda

I have 408 samples that were split into 18 batches, each containing ~23 multiplexed samples. After importing, demultiplexing, and denoising all of them, I then wanted to merge all samples together. When I do this I get a crazy high feature count, around 600,000. Looking at the individual DADA2 outputs, I have nowhere near this many features. Given that the samples come from the same environments, I would expect to see many similar features across them, but this does not appear to be the case when looking at shared features.
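For reference, the merge step in question looks something like this (a sketch; the per-run artifact names are placeholders, and the `--i-tables` / `--i-data` options would be repeated for all 18 batches):

```shell
# Merge the per-run DADA2 feature tables. Identical ASVs (same sequence)
# are summed across runs; any sequence difference, including length,
# produces a separate feature in the merged table.
qiime feature-table merge \
  --i-tables run-01-table.qza \
  --i-tables run-02-table.qza \
  --o-merged-table merged-table.qza

# Merge the matching representative sequences.
qiime feature-table merge-seqs \
  --i-data run-01-rep-seqs.qza \
  --i-data run-02-rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza
```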

Is this normal/expected or is something fishy?

I suspect it is because DADA2 is producing outputs of many different lengths, and therefore similar features are (rightly) seen as different when merging.

Is an acceptable solution to this for me to trim all reads to a given length and discard those below that length prior to analysis? (This will of course remove a lot of potentially real data, which also sucks.)

Hi @Lamm-a,

Wow that sounds a bit high for unique features! :scream:

A few questions:

  1. Have you used the exact same DADA2 parameters for each of your separate runs? If not, there is potential to generate many, ever-so-slightly different features that are not identical to features in other runs.

  2. How many of the features are of very low abundance? Often, with large data sets, many features are low-count and can be removed while you still retain most of your reads. For example, in one of my data sets I had 637 features with a total frequency of 1,679,804 reads; after some filtering I ended up with 196 features with a total frequency of 1,606,440 reads. So, the ~440 dropped features accounted for only ~73,000 reads. Even though I dropped 2/3 of my features, I still kept ~95% of my sequenced data.
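The kind of abundance filtering described above can be done with `qiime feature-table filter-features`; a sketch, with illustrative thresholds rather than recommendations:

```shell
# Drop features with fewer than 100 reads across the whole study, and
# (optionally) features observed in fewer than 2 samples. File names
# and thresholds are placeholders for illustration.
qiime feature-table filter-features \
  --i-table merged-table.qza \
  --p-min-frequency 100 \
  --p-min-samples 2 \
  --o-filtered-table filtered-table.qza
```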


Hi @SoilRotifer

It is indeed crazy high!

I just changed my preprocessing to remove the primers with cutadapt (along with --p-discard-untrimmed) instead of hardcoding them out with DADA2's --p-trim-left options. (I had assumed discarding untrimmed reads was somewhat the default, given that the language of --o-trimmed-sequences implies only trimmed sequences are saved?)
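A sketch of that cutadapt step (file names are placeholders, and PRIMER_F / PRIMER_R stand in for the actual primer sequences):

```shell
# Remove primers from both reads, and discard any pair where a primer
# was not found (likely junk or off-target sequence).
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f PRIMER_F \
  --p-front-r PRIMER_R \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed.qza
```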

This has reduced my feature count to ~160,000 before filtering, and then to 35,548 after (removing singletons and using decontam). Regarding your second point: I do indeed seem to have many low-abundance features. When removing features with fewer than 100 reads, I retain 96% of my sequencing data and reduce the feature count to 16,057. This is still somewhat high though?

Regarding my DADA2 options, everything is identical for each batch:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs $trimmed \
  --p-trim-left-f 10 \
  --p-trim-left-r 10 \
  --p-trunc-len-f 220 \
  --p-trunc-len-r 220 \
  --o-representative-sequences $repSeqs \
  --o-table $table \
  --o-denoising-stats $stats \
  --p-n-threads 16

However, the proportion of shared features is very small. Below is a barplot of how frequently the features are shared (as in the number of samples that are non-zero for a given feature, divided by the total sample count). The vast majority are still shared very little. Previous experiments have shown there to be at least a core of shared features.
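The prevalence values behind such a barplot can be computed from an exported feature table; a sketch, with a tiny hand-made TSV standing in for the real export:

```shell
# The TSV normally comes from exporting the merged table (illustrative,
# not run here):
#   qiime tools export --input-path merged-table.qza --output-path exported/
#   biom convert -i exported/feature-table.biom -o feature-table.tsv --to-tsv
# A tiny hand-made table stands in for the real export below.
printf '# Constructed from biom file\n#OTU ID\ts1\ts2\ts3\ts4\nfeat1\t5\t0\t0\t0\nfeat2\t3\t1\t2\t9\n' > feature-table.tsv

# Prevalence = (samples with a non-zero count) / (total samples), per feature.
awk -F'\t' 'NR > 2 {                # skip the two header lines
    nonzero = 0
    for (i = 2; i <= NF; i++) if ($i + 0 > 0) nonzero++
    printf "%s\t%.3f\n", $1, nonzero / (NF - 1)
}' feature-table.tsv
# feat1 -> 0.250 (seen in 1 of 4 samples), feat2 -> 1.000 (4 of 4)
```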

Hi @Lamm-a,

Thank you for providing the command used. As far as I can tell you are doing everything properly. It's also nice to see that you were able to reduce the number of features!

I always use cutadapt, as I think it is a nice form of quality control. I figure that if I am unable to find the primer in the sequence... then what else is wrong with the sequence? :thinking:

It is a bit odd that there seems to be a lack of overlapping features. My only other thought would be: were there any differences in processing for each batch of samples? For example:

  1. Were all samples collected / stored similarly?
  2. Was the DNA extracted from all samples similarly?
  3. Was the same sequencing preparation and sequencing facility used?
  4. Were the samples randomized across the different runs? If not, run-to-run biases can inflate differences, especially if the various sample types / treatment groups were run separately on their own runs.

Every step along the way can have an impact on the sequencing results. Barring any of the above-mentioned potential issues, I'm not sure what else could be driving the lack of overlapping features. :confused:

You can probably mitigate this further by using some of the other filtering methods within qiime feature-table ..., especially qiime feature-table filter-features-conditionally ... This may help to keep only the features that are shared among your samples.
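For reference, a sketch of that conditional filter (file names and both thresholds are illustrative assumptions, not recommendations):

```shell
# Keep only features that reach at least 0.1% relative abundance in at
# least 10% of samples; rare, sporadic features are dropped.
qiime feature-table filter-features-conditionally \
  --i-table merged-table.qza \
  --p-abundance 0.001 \
  --p-prevalence 0.10 \
  --o-filtered-table conditionally-filtered-table.qza
```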

I assume the lack of shared features was not an issue previously?

Hi @SoilRotifer

Thanks for the continued help :slight_smile:

I always use cutadapt, as I think it is a nice form of quality control. I figure that if I am unable to find the primer in the sequence... then what else is wrong with the sequence?

Yes, it does make sense to use cutadapt over hardcoding cut-offs with DADA2.

Regarding each of these:

  1. Were all samples collected / stored similarly?
    
  2. Was the DNA extracted from all samples similarly?
    
  3. Was the same sequencing preparation and sequencing facility used?
    
  4. Were the samples randomized across the different runs? If not, run-to-run biases can inflate differences, especially if the various sample types / treatment groups were run separately on their own runs.
    
  1. Yes, to the best of our ability this was the case, given that we collect samples in (very) rural Ivory Coast.
  2. Yes, this was all done identically.
  3. Again, all identical.
  4. I did not randomize across runs. However, some groups of samples did by chance cross over runs. To this point, when looking at ordination plots of the data we do see strong (expected) biological clustering:

I believe the ordination plots can still be trusted, as even if a feature is, for example, split 20 times due to potentially differing lengths, that group of features will still be present in the same biological samples, if that makes sense?

Looking at the forum, notably here, I see that I can use q2-vsearch to mimic the function of DADA2's collapseNoMismatch(), which might alleviate, if not perfectly fix, some of this issue?
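A sketch of that collapseNoMismatch-style step, i.e. de novo clustering at 100% identity so that reads identical over their shared length collapse into one feature (file names are placeholders):

```shell
# De novo cluster at 100% identity; sequences that differ only in
# length should merge into a single feature.
qiime vsearch cluster-features-de-novo \
  --i-table merged-table.qza \
  --i-sequences merged-rep-seqs.qza \
  --p-perc-identity 1.00 \
  --o-clustered-table collapsed-table.qza \
  --o-clustered-sequences collapsed-rep-seqs.qza
```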

EDIT: I just tried this clustering and it barely reduces the feature count, and the post-decontam feature count is actually ever so slightly higher. So it would seem it is not a DADA2 thing?

One other option is to cluster at 99%, or to collapse by genus using the taxonomy data?
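Sketches of both options (file names, and the taxonomy level, are assumptions):

```shell
# Option A: de novo OTU clustering at 99% identity.
qiime vsearch cluster-features-de-novo \
  --i-table merged-table.qza \
  --i-sequences merged-rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-99.qza \
  --o-clustered-sequences rep-seqs-99.qza

# Option B: collapse the table to genus (level 6 in the usual 7-rank
# taxonomy), using an existing taxonomy artifact.
qiime taxa collapse \
  --i-table merged-table.qza \
  --i-taxonomy taxonomy.qza \
  --p-level 6 \
  --o-collapsed-table genus-table.qza
```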

Yep. You can also try the vsearch / deblur approach and see if this helps.

I was just going to suggest this too. Great minds think alike! :slight_smile:

Yep. You can also try the vsearch / deblur approach and see if this helps.

What will this achieve that DADA2 does not? Or just try using it in the hope it is better?

The vsearch / deblur approach is simply another denoising pipeline, like DADA2. Though how it denoises is different, e.g. by using a premade error model.

My understanding is that DADA2 denoises the forward and reverse reads separately prior to merging them. Meaning, if one of the two reads is considered poor quality, then the pair is discarded prior to merging. With the vsearch / deblur approach you have a chance of "rescuing" those pairs, as the poor-quality portion of one read can be corrected by the better-quality bases of the opposite read when merging. You can read up on how vsearch merging works; you'll find some threads on the forum where visualizations of the vsearch-merged reads show that the quality can increase in the region of overlap. Deblur will then denoise the merged reads. You cannot do this with DADA2, as it would violate its error model.
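Roughly, that pipeline looks like this (a sketch; the trim length and file names are placeholders, and exact option names may differ slightly between QIIME 2 releases):

```shell
# 1. Join pairs first, so the overlap can correct low-quality bases.
qiime vsearch merge-pairs \
  --i-demultiplexed-seqs demux.qza \
  --o-merged-sequences joined.qza \
  --o-unmerged-sequences unjoined.qza

# 2. Basic quality filtering on the joined reads.
qiime quality-filter q-score \
  --i-demux joined.qza \
  --o-filtered-sequences joined-filtered.qza \
  --o-filter-stats joined-filter-stats.qza

# 3. Deblur the joined reads, truncating to a fixed length (placeholder).
qiime deblur denoise-16S \
  --i-demultiplexed-seqs joined-filtered.qza \
  --p-trim-length 400 \
  --p-sample-stats \
  --o-representative-sequences deblur-rep-seqs.qza \
  --o-table deblur-table.qza \
  --o-stats deblur-stats.qza
```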

Depending on the quality of the forward and reverse reads, sometimes deblur provides better results, other times DADA2 is better. So yeah, it may or may not be better. :man_shrugging:

Also, with deblur, you have to truncate to a fixed length, whereas DADA2 allows length variation.

On another note, I often set the following for DADA2:

--p-pooling-method 'pseudo' \
--p-chimera-method 'pooled'

-Mike

