Effect of not preparing data before DADA2

Hello,

This is my first time working with this type of data and I’m very happy that this forum exists. Amazing community you guys have here!

I’m trying to run a “meta-analysis” by downloading 16S sequences from NCBI. Because I’m running the analysis on a lot of information it’s hard to know or check if the information I’m using is homogeneous.
Obviously, I’m filtering the data to get the same Instrument, to make sure they are 16S, etc.

My data pipeline is working but I want to make sure I’m getting the right results.

Specifically for denoising with DADA2:

  • What happens if non-biological sequences are present?
  • What happens if I decide not to trim the sequences? I assume that the sequences are discarded if they have sub-optimal quality. I’ve tried using the p-trunc-q parameter but anything >10 yields almost no sequences passing the test. 2 (default) and 5 seems to work.

After all the process I get around 5000 sequences assigned to taxa (about 290 unique groups)… is that a normal number? I’m analysing soil data, using Greengenes_97 as a reference. I ask just to have a reference.

Thanks.

Hi @spadarian,

Welcome to the community! All great questions here.

Don't worry about it too much — dada2 will remove any chimeric sequences, and you can do other things (e.g., run qiime quality-control exclude-seqs or use q2-feature-insertion) to filter out sequences that do not resemble your reference sequences. The one problem you should worry about is whether your reads contain partial biological partial non-biological, e.g., if you have issues with read-through or you have adapters in your reads for other reasons. You can use q2-cutadapt to trim your primers and any adapters from your reads to prevent this from occurring.

If Q=10 flops, it sounds like your sequences are quite low quality and get trimmed too much to overlap (sounds like paired-end reads?). Not to worry — you can use the default Q=2 and dada2 will toss out any noisy sequences.

But to answer your question: you will want to trim all sequences to the same position if your data are single-end reads. Otherwise, sequences from different runs will have different lengths and hence have different ASVs (since each run would have its own set of unique sequences!).

You can use closed-reference OTU picking after dada2 to control for this, or use q2-fragment-insertion to splice these sequences into the same tree to minimize the effect of run-to-run variation on phylogenetic diversity estimates, but controlling sequence length would be easier if these are from the same 16S region (note that q2-fragment-insertion does allow you to do metaanalyses across different 16S regions!).

If your data are all paired-end reads, then you can worry less about standardizing trimming or even not trimming at all — you will just want to ensure that you are getting good read joining yields across runs and that some runs are not being systematically biases because a larger proportion of reads are not joining in that run (e.g., due to lower quality causing the reads to be trimmed more).

That sounds reasonable — I do not do too much soil work, though, and ultimately the number of unique sequences/taxa will depend on sequencing depth, soil type, etc.

I hope that helps!

1 Like

Thanks for your reply @Nicholas_Bokulich.

Sorry, that is what I meant with my first question. I inspected some of the fastq files and the sequences look different, so I assume that they don't have any primers, etc. Am I right? Probably a requirement to upload files to NCBI?
If by any chance some sequences contain partial biological partial non-biological, would that sequence pass the filtering?

I'm processing both paired and single reads. Here is an example of a paired read:

I had that on my pipeline at some point but I removed it after checking the tutorials. They showed examples for paired reads so I guess I will add it just to process single reads.

When I used closed-reference picking after dada2 on paired reads, I ran a feature-classifier classify-consensus-vsearch to assign taxonomy to the sequences and I got fewer sequences than when not using closed-reference picking (I don't remember the exact numbers now but something like 200 unique groups vs 290). Are those extra sequences (without the closed-reference picking) real?

The final goal is to have as many samples as possible.

  • Is it possible to mix paired and single reads? If it is too complicated, I guess I could exclude the single reads.
  • According to your comments, with paired reads I need to make sure that the joining yields are good. Should I worry about something else?
  • For single reads, I need to worry about 16S regions and sequencing depth... will closed-reference picking after dada2 help with that? Should I worry about something else?

Thanks again and sorry for the zillion questions. This is very interesting!

Probably — but I cannot say for certain. Sounds like a safe assumption.

Yes it would pass, e.g., if you have amplicon read-through into the adapter on the 5' end that sequence would pass as far as I know. That's why using q2-cutadapt can be useful if you suspect adapter sequences may be present in your data.

Those "extra" sequences probably are real (and I would recommend looking at the different # of ASVs, not the different number of unique taxa after classification, since that is collapsing the total amount of variation present). But they might also include non-target sequences. You are looking at soil, which is more likely to contain uncharacterized species than a well-characterized sample type. So many of the sequences that are lost following closed-reference OTU clustering may simply be uncharacterized taxa that do not resemble the reference dataset. That is sometimes the point of using closed-reference OTU clustering if your samples are well characterized, but in soil it would cause you to miss some real sequences. I would recommend using qiime quality-control exclude-seqs instead and/or just through out sequences that receive no classification by the vsearch-LCA taxonomy classifier in q2-feature-classifier; you can set a similarity threshold so that you are throwing out any sequences that clearly are non-target, but not clustering into OTUs.

Yes; same advice as above, you will need to use closed-ref OTU clustering or q2-fragment-insertion to effectively graft these together. See the q2-fragment-insertion tutorial for more information.

Sounds like a trick question! With paired-end reads you should still care about quality yields overall, but in my opinion joining yield is the most important thing to assess so that you do not introduce systematic bias. So after running dada2: 1) check the overall yield; 2) confirm that the % of reads failing to join is acceptable (some will be lost! so a small fraction is okay), acceptable here meaning unlikely to introduce a systemic bias; 3) confirm that the % being lost during the filtering step is not unacceptably high, acceptable here meaning you still have enough reads to work with. If you are losing heaps of reads during pre-filtering you may want to adjust your trimming or dada2 pre-filtering parameters to preserve more reads for denoising... but how much is okay to lose depends more on sequencing depth and is unlikely to be related to a systemic bias (since sequencing errors should be stochastic, unlike read joining which will be based on total amplicon length).

If you have different 16S regions then yes closed-reference OTU clustering against full-length 16S will help you compare across regions, but so will q2-fragment-insertion. See the tutorial for that plugin to learn more. That will not address issues with sequencing depth — adjust your dada2 trimming/filtering parameters as discussed above to improve yields if you are losing too many sequences during denoising.

Hello again @Nicholas_Bokulich

So I've been playing with the paired data a little bit.

The merging process actually excludes many sequences. Here is the proportion of sequences that pass each step:

Of course, that is only the proportion of the initial number of sequences. If I check the number of final non-chimeric sequences per sample, it's not that bad. These are the quantiles:

  0%      25%      50%      75%     100% 
 4.0   1890.5   7339.0  15929.0 327346.0

So most of the samples have a decent number of non-chimeric sequences. Right? I assume around 2000 is not that bad... but looking at the maximum number of non-chimeric (327,346!) makes me doubt. Any advice here?

The ones with really low yield (4 non-chimeric for example) have a quality plot like this:

Most of the sequences have a quality of 2 in the middle, so they get trimmed by DADA2 and probably they are not long enough to merge them. I can just discard those I think. If I really have to, can I just use the reverse reads?

Cheers!

1 Like

Yeah, that merge yield looks pretty bad... I would recommend troubleshooting that before proceeding. The chimeric yield does look good, you are right.

Yeah that quality plot looks really bizarre. What happened there? dada2 would most likely filter those reads before denoising and well before joining, so I am not sure if that is what is causing sequences to fail read joining. 100 nt paired-end reads are unlikely to join, depending on what 16S domain you are looking at (they would not for V4). So yes, you may just want to proceed with the reverse reads unless if you know that this should join.

No idea what happened there. Here is the link to NCBI: SRA Archive: NCBI. Actually, most of the samples with low yield look like that. Any idea why? I don't have much experience with this kind of data... but it does look weird.

Cheers.

Just to complement the previous posts:
If I run dada2 on paired or just the reverse reads (the same data as the previous quality plot) I get these denoising stats:

      sample.id  input filtered denoised merged non.chimeric
1     SRS825768 120261   117107   117107      4            4
2 SRS825768_rev 120261   117729   117729     NA       105413

From those 105,413 sequences, 103,175 get assigned to a taxa if I run feature-classifier classify-consensus-vsearch against GG_97. Unless I'm completely wrong, I assume that if they have a match in a reference database is because they are real biological sequences.

This advice seems to apply when I want to do something like beta-diversity analysis. What if I'm just interested on alpha-diversity?

Thanks again for your help!

Sounds reasonable. Depends on what the match is, though — e.g., if a read is only classified to phylum, it could still represent non-target DNA.

Comparing alpha diversity between a range of different primer sets/datasets would probably not yield a meaningful comparison, since it is well known that different marker genes and different primer sets can impact alpha diversity (e.g., they have different biases, coverage rates, etc).