How to denoise samples so less is lost with chimera removal?

emmzee · October 21, 2019, 8:43pm

I used cutadapt to search a set of single-ended demultiplexed sequences for my specified adapters using the forward primer sequence (--p-front), the reverse complement of the reverse primer sequence (--p-adapter), while adding a few other parameters to allow for IUPAC nucleotide reading (--p-match-read-wildcards), and with default options for chimera checking. I also discarded all untrimmed sequences (i.e. reads in which the specified adapter was not found).

Based on the sequence quality scores above, I used dada2 denoise-single to filter and dereplicate my sequences, and remove chimeras from them. I first used a 235 threshold for --p-trunc-len, which made me lose a very large number of sequences in the first step, so I decided to trim more sequences by lowering that threshold to 225, and I ended up losing significantly less sequences in the filtration step. However, I still end up with less than half of my sequences after chimeras checking and removal. So, finally, I adjusted the expected error rate from 2 to 5, but this did only a little to improve my results.

One of the suggested explanations on the forums (loss of reads after DADA2 as chimeras) for this issue is DADA2's struggle to deal with non-biological sequences (e.g. adapters/barcodes). In my amplicons, the barcode sequence proceeded the linker primer sequence. Will the barcode sequence be removed from my sequences with cutadapt with the parameters I specified (see below)? If not, how do I remove it after the cutadapt step?

qiime cutadapt trim-single --i-demultiplexed-sequences single-end-demux.qza --p-front [NGS-adapter+LinkerPrimerSequence] --p-adapter [ReversePrimerSequence+NGS-adapter] --p-match-read-wildcards --p-match-adapter-wildcards --p-discard-untrimmed --o-trimmed-sequences single-end-demux-cutadapt-trimmed.qza

Is the --p-trunc-left necessary in my case, based on the graph results?
If I were to experiment with --p-min-fold-parent-over-abundance, would using a value of 2 or 3 be plausible?

I appreciate your time and support

Mehrbod_Estaki · October 22, 2019, 5:55pm

Hi @emmzee,
Thanks for your detailed description of your problem!
Probably the easiest way to check this is to just look at some of your raw sequences and see what is at the beginning of your reads. Do they match your used barcodes? With your set-up I'm guessing not and you're ok.

Consider this, a typical setup looks like so:

PCR II: [flow-cell-adapter] + [barcode] + [linker]
PCR I: ____________________________[linker] + [spacer* + gene primer]

*heterogeneity spacers are optional

Usually demultiplexers remove the barcodes and everything preceding it. Then one would use cutadapt to remove the remaining non-biological linker and anything else attached to it, like spacers etc, or maybe even if your gene primer.

But these situations can get a bit tricky since there are a lot of variations of these set ups. For example:
When you said

Then this would mean the set up is:
[flow-cell-adapter] + [linker] + [barcode] + [gene primer]

In other words, your barcodes and primers were attached during PCR I? If that's the case then after demultiplexing you would have -assumingly- already removed your linker + adapters.
You also go on to use cutadapt to remove your forward primers and everything preceeding it next anyways, so I don't think you have a problem with non-biological sequences in your reads. But always good to do a sanity check.

Yes! In fact, I would argue fine-tuning this step is what is going to help you retain more reads. From your quality plots I see a couple of dips around the 120 and 150 bp marks. It is possible you are losing many reads around these points. If you can afford the resources (and for your own learning experience) I would run dada2 a few more times, let's say once at 110, 140, 170, then 210. This will really give you a good idea where you are losing most of your reads, then you can choose based on those results which one works best for your dataset. It's hard for me to speculate anymore without your dada2-stats summary viz.

3 is very safe. If you are suspecting you are losing too many reads because they are falsely being flagged as chimeric, you can try raising that to say 5 or 8 and compare. But I personally find that this step is not really an issue in most cases. It is also check this in q2-dada2 since the discarded sequences are not retained, in R, you can grab those reads and blast them to see if they were indeed marked as chimeric correctly or not (kind of). It most often comes down to fine-tuning the truncating parameters imo.

Hope this helps!

emmzee · October 29, 2019, 12:37am

Thank you @Mehrbod_Estaki, I have a few questions about your comments before I proceed to experiment with this plugin. But first, I need to make a small correction:

What I meant to say is that the barcode sequence preceded the linker primer sequence and gene primer. Therefore, my set-up would look like what you suggested in your post - although I have no idea what a flow-cell-adapter is, or if it's applicable to ion torrent sequencing:

PCR II: [flow-cell-adapter] (?) + [barcode] + [linker] + [gene primer]

I have also checked my demultiplexed fastq files and the barcode sequences were already removed, but not the linker sequence.

I would run dada2 a few more times, let’s say once at 110, 140, 170, then 210.

Correct me if I'm wrong, but if I use something like 210 for --p-trunch-left in addition to the 225 --p-trunc-len I have, wouldn't that leave me with sequences of 15 bp length? I'm assuming that with --p-trunc-len set to 225, I'm cutting around the 225 mark on the graph.

Mehrbod_Estaki · October 29, 2019, 1:06am

Hi @emmzee,

Aha, sorry I didn't realize this was ion torrent data. To be honest I am not very familiar with this technology at all so I can't give you any definite recommendations. You may want to look up ion torrent on the forum in previous posts as there have certainly been some special considerations with this type of data.

With Illumina runs this is what binds to the flow-cell (little chip) that your sequences bind to before being read. Again, I'm really not sure how this goes with Ion Torrent.

If these are not removed you should certainly remove them using cutadapt again first before dada2 and adjust your truncating parameter accordingly.

The --p-trunc-len-f parameter is only available when you are running dada2 denoise-paired as a way to differentiate it with --p-truc-len-r, in your case since you are running dada2 denoise-single there is only --p-trunc-len. since there is only one direction.

emmzee · October 29, 2019, 1:12am

The --p-trunc-len-f parameter is only available when you are running dada2 denoise-paired as a way to differentiate it with --p-truc-len-r , in your case since you are running dada2 denoise-single there is only --p-trunc-len . since there is only one direction.

I made a mistake reading the dada2 paramters - it seems it's not 'trunc' but 'trim'. So the parameter I was asking about was --p-trim-left. Sorry about that. Would the discussion still be the same in that case? In other words, would it still be helpful to use it in my case?

I will make adjustments to my previous posts to prevent confusion for readers.

Mehrbod_Estaki · October 29, 2019, 1:29am

Hi @emmzee,

The trim cuts sequences from the 5' direction while truncate trims from the 3'. The values you set refer to the position in the sequences.
Given your barplots you would really only need to worry about the truncating value. The 5' values look good enough. Although, the DADA2 tutorials does recommend a 15 bp trim from the 5' with Ion Torrent data, so that might be worth considering.

emmzee · October 29, 2019, 2:25am

Thank you @Mehrbod_Estaki! I have a much better understanding of how dada2 works now, and to use it efficiently.

Edit: @Mehrbod_Estaki, I just realized there's an inconsistency in the total bp size found in the graph and size I expected. The original size total size of the fragment was 310 bp. After removal of adapters and barcode sequences, it should be much less than that. Does this mean quality score plot show the total sequence size prior to trimming the adapters and barcode sequences?

Mehrbod_Estaki · October 30, 2019, 3:40am

Hi @emmzee,
If you hover over the last few bps in your quality plots does it say that those points were plotted using some low number of reads? By default cutadapt retains all reads even if some of them didn't have the barcode/primer/adapter etc in them. This means that those reads that didn't have anything trimmed will be the same size and thus show up on the quality plot.
I like to get rid of those reads by setting the --p-discard-untrimmed parameter. That should show you the appropriate size plot.

emmzee · October 30, 2019, 3:36pm

Hi @Mehrbod_Estaki, I did use the --p-discard-untrimmed parameter in my case. At position 236, it drops from, say a subsample of ~9k to around 700 and continually decreases from there to even lower subsamples. I believe this is also supposed to help guide me with the other parameters? How do I know if these are naturally longer fragments or if for some reason they weren't trimmed for non-biological sequences (i.e. adapters)? I also checked with one of my other primers, which has a similar target fragment size, and it seems to do the exact same thing. Cool stuff!

Mehrbod_Estaki · October 30, 2019, 7:46pm

Hi @emmzee,

That is interesting indeed, it could be that they are real biological signals (though I doubt it, purely from my own experience) or they could be chimeric? Let's see if they come out on the other end of DADA2. You can always blast a few of those longer ones to see what they are too..hard to say at this point.

emmzee · October 30, 2019, 7:51pm

Is there a way I could generate this same plot after dada2 sequence filtering is complete? Or a way to get fragment size summary, such as maximum fragment length detected or mean fragment length?

Mehrbod_Estaki · October 30, 2019, 7:55pm

Hi @emmzee,

You can't get a quality plot like before DADA2 because the new output don't have quality scores. But you can produce a new visualization on your rep-seqs which gives some insight into the length of these, as well as ready-to-click hyperlinks to BLAST

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

emmzee · October 31, 2019, 2:35am

Thanks for sharing that visualization script, I wasn't aware you could visualize the rep-seqs output from dada2 and it was really useful. Now I know more and I have more concerns.

So the BLAST hyperlinks make sense. But the summary itself, which shows the sequences after the cutadapt process, presents problems. After using that trimming command (--p-trunc-len) at 225, all of my features are exactly 225 bp. This makes no sense. What about features that are less than 225?

Sequence Count	Min Length	Max Length	Mean Length	Range	Standard Deviation
445	225	225	225.0	0	0.0

I tried re-running with the trimming parameter set to 0, and the trunc parameter the same (0). Now I get something like this:

Sequence Count	Min Length	Max Length	Mean Length	Range	Standard Deviation
616	25	307	195.27	282	63.33

Still, in both scenarios I can CTRL+F my --p-front sequence, and it's searchable in many features (~100). This confuses me, as I'd expect them to either be completely absent if cutadapt worked, or always be present if it didn't cut them out.

Mehrbod_Estaki · October 31, 2019, 3:47am

Hi @emmzee,
Glad you found it useful!

This is actually the expected behavior. From the dada2 denoise-single help file:

--p-trunc-len INTEGER  Position at which sequences should be truncated due
                         to decrease in quality. This truncates the 3' end of
                         the of the input sequences, which will be the bases
                         that were sequenced in the last cycles. Reads that
                         are shorter than this value will be discarded. If 0
                         is provided, no truncation or length filtering will
                         be performed                               [required]

Note this line: Reads that are shorter than this value will be discarded

When you disable trim/truncating by setting it to zero you are essentially saying include all length reads so again the variable length in your second run makes sense.

As for why some of your sequenced didn't get the trim, that is curious. Could you provide us with the full commands you are running, or better yet share your rep-seqs.qza file? We can look through the provenance that way.

emmzee · October 31, 2019, 8:13pm

Note this line: Reads that are shorter than this value will be discarded

Ah yes. This explains everything. Reads that are shorter than that value get discarded, and reads that are longer get trimmed to that length.

As for why some of your sequenced didn’t get the trim, that is curious. Could you provide us with the full commands you are running, or better yet share your rep-seqs.qza file? We can look through the provenance that way.

Here is the rep-seqs from my second run, when I set all the dada2 parameters to zero. If I have to remove the attached file for some reason, I will replace it with the commands I used instead.

Edit: Is it possible that because some sequences are chimeric, the forward primer sequence/ adapters are present twice so cutadapt only cuts the first time?

Mehrbod_Estaki · November 12, 2019, 7:03am

Hi @emmzee,
Excuse my delayed response on this.

Theoretically yes, but quite unlikely.
I noticed that the sequences you are trying to remove are quite long:
CGAATRAAYAAYATRAGYTTYTGGCTCGGTGGCGT and contain many wildcard nucleotides. what exactly are you typing in your "Ctrl F" search to look for these ~100 reads that supposedly have the sequences in them still?
If this is only happening in 100 or so cases it may be that these get dropped after DADA2 anyways.
Could you also provide us with the --verbose output of your cutadapt run, this may give us a clue as well. Again, this may be a non-issue, but a mystery is always worth solving!