Can removal of RNA-seq human reads expedite the analysis? (Especially DADA2)

I am analyzing tumor samples (32 samples, 110 GB of data in total) to identify the microbiome population. We have limited resources on our cluster (1 node, 48 CPUs, 4 GB RAM per CPU). My analysis got stuck at the DADA2 step: it takes too long and cannot complete within the 3-day slot on the cluster. Sometimes I get error codes 9 and 11, which are associated with running out of memory.
My question is: if I first remove all the human reads with HISAT2/STAR and then start my analysis, would that be fine? Or is there a way to analyze each sample individually and merge all the feature tables later?

Kindly help to resolve this issue. Thanks

Hello @deepak,

It is fine to filter host reads before denoising with DADA2; in fact, it is desirable. This can significantly reduce the size of your input, since in some cases the majority of reads are host reads. This could resolve your time/memory problem.
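As a rough illustration of what host filtering amounts to, here is a minimal Python sketch. It assumes you already have a set of read IDs that your aligner assigned to the human genome; how you obtain that set depends on the tool. The function names `read_fastq`, `read_id`, and `drop_host_reads` are hypothetical helpers written for this example, not part of HISAT2, STAR, or DADA2.

```python
from typing import Iterable, Iterator, Set, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, plus line, quality


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records; a truncated final record is silently skipped."""
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def read_id(header: str) -> str:
    """Turn '@r1/1 some comment' into 'r1' so both mates share one ID."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def drop_host_reads(records: Iterable[FastqRecord],
                    host_ids: Set[str]) -> Iterator[FastqRecord]:
    """Keep only records whose read ID is NOT in the host ID set."""
    for rec in records:
        if read_id(rec[0]) not in host_ids:
            yield rec


# Toy input: three reads, one of which (r2) is assumed to have aligned to human.
fastq_lines = [
    "@r1/1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2/1\n", "GGCC\n", "+\n", "IIII\n",
    "@r3/1\n", "TTAA\n", "+\n", "IIII\n",
]
host_ids = {"r2"}  # hypothetical IDs taken from the host alignment

kept = list(drop_host_reads(read_fastq(fastq_lines), host_ids))
kept_ids = [read_id(rec[0]) for rec in kept]
print(kept_ids)
```

Because the records are streamed one at a time, memory use depends on the size of the host ID set rather than on the size of the FASTQ file. In practice the aligner can usually write the unaligned reads for you, so a script like this is mainly useful for checking how many reads were removed.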

Regarding analyzing your samples separately, my understanding is that this is not desirable if your samples are from the same sequencing run, because DADA2 constructs an overall error model for the entire run, and the more information it has, the more accurate that model will be. If you have sets of samples that were sequenced independently, however, then you can (and probably should) analyze those with separate DADA2 runs.
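If you do end up denoising per sequencing run, the bookkeeping could look roughly like the sketch below. It is only an illustration in plain Python: the sample-to-run mapping, the toy counts, and the helpers `group_by_run` and `merge_tables` are all hypothetical, and it assumes every run was processed with the same trimming parameters so that identical sequences end up as identical features.

```python
from collections import defaultdict
from typing import Dict, List

# One table per run: sample ID -> {feature sequence -> read count}
FeatureTable = Dict[str, Dict[str, int]]


def group_by_run(sample_to_run: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sample IDs by the sequencing run they came from."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sample, run in sample_to_run.items():
        groups[run].append(sample)
    return dict(groups)


def merge_tables(tables: List[FeatureTable]) -> FeatureTable:
    """Combine per-run tables, filling features absent from a sample with 0."""
    features = sorted({f for t in tables for counts in t.values() for f in counts})
    merged: FeatureTable = {}
    for table in tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"sample {sample!r} appears in more than one run")
            merged[sample] = {f: counts.get(f, 0) for f in features}
    return merged


# Hypothetical metadata: which run each sample was sequenced in.
runs = group_by_run({"s1": "runA", "s2": "runB", "s3": "runA"})

# Hypothetical per-run denoising results (toy counts, not real data).
table_run_a = {"s1": {"ACGT": 5}, "s3": {"ACGT": 1}}
table_run_b = {"s2": {"ACGT": 2, "GGCC": 7}}

merged = merge_tables([table_run_a, table_run_b])
print(runs)
print(merged)
```

The split happens at the level of runs, not individual samples, so each denoising job still sees all the samples that share an error profile.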


Hi @colinvwood
Thanks for the information. Are there any precautions I should take when removing human reads with STAR/HISAT2?


Hello @deepak,

I'm unfamiliar with those tools specifically. Essentially, though, host alignment is a trade-off between sensitivity and specificity (removing all human reads versus removing only human reads). I'm sure these tools will allow you to parameterize where on this continuum you want to be.
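To make the trade-off concrete, here is a toy Python sketch. It assumes the parameter you tune is a minimum mapping quality (MAPQ) above which a read is called human and removed; that is just one possible knob, and the reads, their labels, and the function `host_filter_metrics` are invented for illustration rather than taken from either aligner.

```python
from typing import List, Tuple


def host_filter_metrics(reads: List[Tuple[bool, int]],
                        min_mapq: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of host removal at a MAPQ cutoff.

    Each read is (is_truly_host, mapq_against_human). A read is removed
    as host when its MAPQ is >= min_mapq.
    """
    tp = sum(1 for host, q in reads if host and q >= min_mapq)      # host removed
    fn = sum(1 for host, q in reads if host and q < min_mapq)       # host kept
    tn = sum(1 for host, q in reads if not host and q < min_mapq)   # microbe kept
    fp = sum(1 for host, q in reads if not host and q >= min_mapq)  # microbe removed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Invented example: four true human reads and four true microbial reads.
toy_reads = [
    (True, 60), (True, 40), (True, 10), (True, 2),
    (False, 0), (False, 0), (False, 5), (False, 30),
]

for cutoff in (1, 20, 50):
    sens, spec = host_filter_metrics(toy_reads, cutoff)
    print(f"min MAPQ {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A permissive cutoff removes every human read but also discards microbial reads that happen to align, while a strict cutoff keeps all microbial reads and lets some human reads through. Which end you prefer depends on whether leftover host reads or lost microbial reads are the bigger problem for your analysis.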

