Too many elements for DADA2 - denoise in batches?

Hi,

Like some other previous posters, I've run into this error whilst running DADA2:

“Error in table(pairdf$forward, pairdf$reverse) : attempt to make a table with >= 2^31 elements”

All primers etc. have been removed and the quality plots look good, so I think it may reflect true biological diversity. I'm trying denoising with Deblur, as was suggested in the linked post, but I'm also wondering whether it would be possible to split up my samples and denoise them in batches. I'm aware this isn't the recommendation, since error profiles estimated across the whole run are useful, but I'm mostly thinking through what the least bad way to come at this would be.

The experimental design tests the influence of industrial discharge on river bacterial communities, so there are samples upstream, downstream, and at the discharge point for 20 different sites. I wondered about splitting them into sample-type batches, on the assumption that up/down/effluent would be the dominating characteristic, so this should 'limit' the diversity in each sub-batch, versus splitting by site or randomising, which may run into the same too-many-elements issue.

I confess I'm not the most informed about the inner workings or assumptions of denoising methods, so I wondered if anyone could share thoughts on why this approach may not be ideal, beyond 'it's best to denoise the whole sequencing lane together'.

Many thanks!


Hello!

Sorry for the very late reply. I saw your post, but I waited to see whether other QIIME 2 members with more relevant experience would weigh in, since I'm not sure my suggestion will work. But I can probably start, and maybe someone will jump in.

Indeed, it looks like you are dealing with a very high number of unique ASVs. I also encountered high numbers of unique ASVs in river/fish samples, but it didn't trigger this error; my datasets were probably smaller.
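For what it's worth, my reading of that error message is that it comes from DADA2's read-merging step: R's `table(pairdf$forward, pairdf$reverse)` allocates one cell per (unique forward, unique reverse) pair, and R cannot build a table with 2^31 or more cells. A tiny sketch of that limit in plain Python (the threshold is taken from the error message; the sequence counts are invented for illustration):

```python
# The R call table(pairdf$forward, pairdf$reverse) needs one cell for every
# (unique forward, unique reverse) combination, so the cell count is the
# product of the two unique-sequence counts. R errors out at >= 2^31 cells.
R_TABLE_LIMIT = 2**31  # element limit quoted in the DADA2 error message

def would_overflow(n_unique_forward: int, n_unique_reverse: int) -> bool:
    """True if the merge table would hit R's >= 2^31 element limit."""
    return n_unique_forward * n_unique_reverse >= R_TABLE_LIMIT

# Hypothetical counts, for illustration only:
would_overflow(40_000, 40_000)  # 1.6e9 cells -> fits
would_overflow(50_000, 50_000)  # 2.5e9 cells -> triggers the error
```

So it is the *product* of unique forward and reverse sequences that matters, which is why batching (fewer unique sequences per denoising run) can dodge the error.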

Did you discard sequences in which no primers were detected at the Cutadapt step? If you lose too many sequences with this parameter enabled, it means the primers are not being detected and are probably still in the sequences, or that they were already removed by the sequencing centre.

You can split your dataset into batches and denoise them separately, but make sure each batch has approximately 1M reads (I aim for at least 500K, but sometimes we don't have much choice). Ideally, split samples by a metadata factor that is not central to your comparisons. For example, if you have treatments and sample types, and you are mostly interested in testing the effect of the treatment rather than the differences between sample types (you already know they differ and are not setting out to prove it), then it makes sense to split by sample type so that every treatment is present within each batch. If that is not applicable, split randomly.
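To make that concrete, here is a small sketch (plain Python; the sample IDs, types, and read counts are invented) of grouping samples into batches by a metadata field and then checking each batch against a minimum read count:

```python
from collections import defaultdict

# Hypothetical metadata: sample ID -> (sample type, read count).
samples = {
    "S01": ("upstream", 400_000),
    "S02": ("upstream", 350_000),
    "S03": ("downstream", 500_000),
    "S04": ("downstream", 450_000),
    "S05": ("effluent", 600_000),
    "S06": ("effluent", 300_000),
}

MIN_BATCH_READS = 500_000  # rough floor suggested above

def batch_by_type(samples):
    """Group sample IDs into batches keyed by sample type."""
    batches = defaultdict(list)
    for sample_id, (sample_type, _reads) in samples.items():
        batches[sample_type].append(sample_id)
    return dict(batches)

def batch_read_counts(samples, batches):
    """Total reads per batch, to sanity-check batch sizes."""
    return {name: sum(samples[s][1] for s in ids)
            for name, ids in batches.items()}

batches = batch_by_type(samples)
counts = batch_read_counts(samples, batches)
too_small = [name for name, n in counts.items() if n < MIN_BATCH_READS]
```

With the invented numbers above, every batch clears the floor (`too_small` is empty); in practice you would pull the type and read counts from your metadata file and demux summary.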

Between-batch comparisons are still possible if you run Cutadapt and DADA2 with identical parameters, so that identical sequences yield exactly the same ASVs across runs. This should minimise the batch effect.

Best,


@benjjneb or @jordenrabasco - do either of you have suggestions here?

This error has cropped up every year or two, but rarely, and we have not been able to nail it down. One time it was someone trying to process a shotgun metagenomics dataset through DADA2.

Processing in batches is a valid choice with ASV methods like DADA2.


Thanks @benjjneb! Do you see any issue with partitioning ASV data based on samples? For example, if there are 100 samples from one sequencing run in the full data set, splitting it into four batches of 25 samples each? If there are no issues, that should be possible using qiime demux partition-samples-paired (in my example, num_partitions would be set to 4).
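I can't speak for the plugin's internals, but the even split I have in mind can be sketched in plain Python (`partition_samples` is an invented stand-in for illustration, not the plugin's API):

```python
def partition_samples(sample_ids, num_partitions):
    """Split sample IDs into num_partitions batches of near-equal size,
    a plain-Python stand-in for the kind of split described above."""
    ids = sorted(sample_ids)
    base, remainder = divmod(len(ids), num_partitions)
    batches, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` batches absorb one extra sample each.
        size = base + (1 if i < remainder else 0)
        batches.append(ids[start:start + size])
        start += size
    return batches

# 100 samples into 4 batches of 25, matching the example above.
batches = partition_samples([f"S{i:03d}" for i in range(100)], 4)
```

Each batch would then go through DADA2 separately with identical parameters, and the resulting feature tables and representative sequences merged afterwards.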


Yes, that is an appropriate approach.


Thanks for the input @benjjneb!