Using DADA2 with .fastq files downloaded from NCBI SRA - sequencing runs unknown

anna-schrecengost · November 17, 2020, 11:06pm

Hi everyone,

I am working on a meta-analysis of public 18S rRNA amplicon datasets, mostly downloading the raw sequences from NCBI SRA. I want to use DADA2 within QIIME2 for denoising and I understand that the error model necessitates that reads from different sequencing runs are run through DADA2 separately. However, for many of the datasets I am working with there is no information about which samples come from the same sequencing run. My question is, is there any chance I can get away with using DADA2 with all of the .fastq files from a single SRA study/publication together? Otherwise I am not sure I will be able to use this approach. Thank you!

Mehrbod_Estaki · November 18, 2020, 2:10am

Hi @anna-schrecengost,
Welcome to the forum!
Great question. While I know that the recommendation is, as you mentioned, to run DADA2 separately on each run, I don’t actually know how badly the error models will perform if you don’t. That being said, the standalone DADA2 package in R lets you evaluate the error models in a bit more detail than its QIIME 2 plugin does.

But before you go down that rabbit hole, if these are Illumina runs, you can actually use q2-deblur to denoise your reads all at once, since Deblur uses a pre-trained static error model rather than building one on the fly. In fact, allowing various runs to be processed together was one of the motivating points behind Deblur’s design.

anna-schrecengost · November 18, 2020, 5:39pm

Hi @Mehrbod_Estaki,
Thank you so much for your reply! Yes, you are totally right, I could use Deblur as well. The only issue is that I also have some 454 datasets that I would like to include, and as far as I understand, DADA2 is the only denoising algorithm that can handle non-Illumina sequencing technologies? (Please correct me if I’m wrong!) I think I may end up going down that rabbit hole, or I will need to exclude those datasets.

Mehrbod_Estaki · November 18, 2020, 9:46pm

Hi @anna-schrecengost,
Ah, yes you wouldn’t be able to use deblur with the 454 data.
And I’m not too sure about other modern denoising algorithms for 454, perhaps MED can do it also? You’d have to check there.
That is certainly an obstacle with public data. Perhaps you could try reaching out to the authors to see if they have any information on those runs? Sorry I couldn’t be of more help!

I’m also going to ping @benjjneb here; perhaps he can comment on the reliability of DADA2 denoising when the run information is not available.

anna-schrecengost · November 19, 2020, 12:01am

Yes, contacting the authors is definitely another option that I neglected to mention, thanks for your help!

benjjneb · November 19, 2020, 3:41pm

In many cases there won't be a problem, particularly when the runs were performed using similar (or even the same) instruments that therefore have nearly identical error models. If the instruments used had quite different error models, it will lead to some spurious diversity being reported in the high-error-rate runs, and some reduced sensitivity in the low-error-rate runs, mostly among low-frequency variants in both cases.

DanielSprockett · November 24, 2020, 5:00pm

Just qiiming in here (ha!), but I have also run into this issue when re-analyzing public datasets, and I was able to ascertain which samples were run on the same sequencing run based on their FASTQ read IDs:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

See here for more details. I have also encountered samples that were sequenced multiple times, and then those runs were apparently merged into a single FASTQ file before being deposited. So if anything looks fishy, it might also be worth checking to confirm that all of the reads in a sample come from the same run.
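For anyone who wants to automate that check, here is a minimal Python sketch. It assumes the original Illumina-style read IDs are still present in the file and that each FASTQ record spans exactly four lines; the function names `run_key` and `runs_in_fastq` are made up for this example and are not part of QIIME 2 or DADA2. It groups reads by the first three ID fields (instrument, run number, flowcell ID).

```python
import gzip
from collections import Counter
from pathlib import Path


def run_key(header):
    """Return (instrument, run number, flowcell ID) from a read ID, or None.

    Looks for a whitespace-separated token with at least seven
    colon-separated fields, i.e. the layout shown above.
    """
    for token in header.lstrip("@").split():
        fields = token.split(":")
        if len(fields) >= 7:
            return tuple(fields[:3])
    return None  # e.g. reads that were simply renumbered on deposit


def runs_in_fastq(path, max_reads=None):
    """Count reads per run key in one FASTQ file (plain or gzipped).

    Assumes four lines per record, so header lines are lines 0, 4, 8, ...
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 != 0:
                continue
            if max_reads is not None and i // 4 >= max_reads:
                break
            counts[run_key(line.strip())] += 1
    return counts
```

Calling `runs_in_fastq("sample.fastq.gz")` on each sample (the file name is a placeholder) gives one counter per file. More than one key in a single file suggests that several runs were merged before deposit, and a `None` key means the original read IDs were not retained for those reads.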

anna-schrecengost · November 24, 2020, 5:28pm

Hi @DanielSprockett, thank you for sharing this with me, and for the pun! I tried to recover the original fastq deflines by using fastq-dump --origfmt, but did not find any info there (they were just numbered). After digging around a bit (https://github.com/ncbi/sra-tools/issues/130) I found out that SRA stopped retaining this information with newer runs. Did you obtain the original read IDs a different way?

anna-schrecengost · November 24, 2020, 5:28pm

Hi @benjjneb, thank you for your reply! I see, in terms of the same instrument do you mean the same model or the same exact instrument? Thank you for this information

DanielSprockett · November 24, 2020, 8:36pm

@anna-schrecengost Oh wow, this is news to me! I completed the bulk of the analysis using data from ENA, which appears to retain the read IDs. Most of it was done quite a few years ago now, so SRA might have stopped storing this information in the meantime. What a shame!

anna-schrecengost · November 24, 2020, 8:49pm

@DanielSprockett it really is a shame! As far as I understand ENA mirrors what is stored in SRA, so I am not sure that the read IDs would be there, but I will check

benjjneb · November 25, 2020, 4:23pm

Same model, same chemistry is typically fine. Same instrument is even better of course, but perfect shouldn't be the enemy of the good.
