Fragmented shotgun amplicons and dada2

Hi, I am exploring using q2 to analyse MiSeq data from fragmented rRNA amplicons (shotgun amplicons). The amplicons are 16S/18S products of varying sizes (300-1500 bp), which are fragmented to around 300 bp and then sequenced with 2 x 250 or 2 x 150 kits. We are taking this approach to use a method already established in the lab and to generate longer amplicons (I am also assembling ASVs outside of q2).

I have used qiime 'cutadapt trim-paired' to remove primers, then 'dada2 denoise-paired' with --p-trunc-q 10, as this avoids the problem of different length sequences. I'm only including paired reads, to maximise length. I've then assigned taxonomy using the SILVA database.

My question: is there any reason dada2 would not be appropriate for this kind of data? The thread "Denoising of Paired End Reads" (post #14 by colinbrislawn) mentions that dada2 is 'making a rather strong assumption that the reads all begin in the same place', but as dada2 can be used for ITS regions, which are highly variable, I wonder how important this really is. Any suggestions would be gratefully received.

Hello Sarah,

I'm glad you found my old thread, and I should probably explain this better.

I'm leery of 'shotgun amplicons' because they are a very different animal than typical amplicons, which can be a trap for newcomers in the field. For example, they require some sort of assembly pipeline and there's not a Qiime2 plugin for this (yet! :stuck_out_tongue_winking_eye:).

But this might not be a problem for you and your team! :fireworks:

Sounds cool! (You can promote the paper or the GitHub repo, if you would like.)

Who's Afraid of Shotgun Amplicons?

Me! :scream_cat:

Amplicon sequencing has one big advantage: all the reads come from the same part of the same gene.

Modern methods for handling amplicon data make great use of this.

  • Dereplication is possible because the most common microbes will produce identical reads, allowing for massive reduction in data size for faster processing.
  • Denoising is possible, because biological variance looks different than instrument variance, and these sequencing errors can be modeled and reduced.
  • There are still lots of biases from the primers, gene copy number, extraction method, etc. But all this bias is consistent sample-to-sample because it's always the same region of the same gene.
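To make the dereplication point concrete, here is a toy sketch. The template sequence, read length, and offsets are all made up for illustration (this is not real 16S data and not how DADA2 is implemented). It only shows why reads that all start in the same place collapse into one unique sequence, while fragments of the same template that start at different positions do not.

```python
from collections import Counter

# Hypothetical 40 bp template standing in for one organism's amplicon.
TEMPLATE = "ACGTTGCAAGTCCGATAGCTTAGGCATCGATCGGATTACA"
READ_LEN = 20  # assumed read length for this toy example


def dereplicate(reads):
    """Collapse identical reads into a mapping of sequence -> count."""
    return Counter(reads)


# Typical amplicon: every read starts right after the primer (offset 0).
amplicon_reads = [TEMPLATE[0:READ_LEN] for _ in range(10)]

# Fragmented amplicon: reads start at varying offsets along the template.
offsets = [0, 2, 5, 7, 9, 11, 13, 15, 18, 20]
shotgun_reads = [TEMPLATE[o:o + READ_LEN] for o in offsets]

amplicon_unique = dereplicate(amplicon_reads)
shotgun_unique = dereplicate(shotgun_reads)

print(len(amplicon_unique), "unique sequence(s) from 10 same-start reads")
print(len(shotgun_unique), "unique sequence(s) from 10 shifted-start reads")
```

With same-start reads, one abundant sequence accumulates a high count, which is what dereplication and error modelling rely on. With shifted starts, the same organism is spread across many low-count sequences.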

Shotgun sequencing does not have these benefits.

Also, it introduces some new challenges:

  • Assembling hypervariable regions is computationally expensive, but not impossible
  • Assembling across conserved regions is impossible, unless you have mate-pair reads that span that region (cite) or longer reads to act as a guide
  • Quantification of abundance, even comparative sample-to-sample abundance, is tricky. If a read exactly matches different full length 16S genes, which gene do you count it towards?
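The quantification problem can be sketched with a toy example. The two reference sequences and the reads below are invented, and splitting a read evenly between all references it matches is just one possible convention chosen for illustration, not what any particular tool does.

```python
# Hypothetical references: two "full-length genes" that differ at the ends
# but share an identical conserved region in the middle.
CONSERVED = "CCGTACGGATTC"
REFERENCES = {
    "taxon_A": "AAGGCT" + CONSERVED + "TTGACA",
    "taxon_B": "GGTTCA" + CONSERVED + "ACCGTT",
}


def matching_references(read, references):
    """Return the names of all references containing the read exactly."""
    return sorted(name for name, seq in references.items() if read in seq)


def fractional_counts(reads, references):
    """Split each read evenly among every reference it matches (assumed convention)."""
    counts = {name: 0.0 for name in references}
    for read in reads:
        hits = matching_references(read, references)
        for name in hits:
            counts[name] += 1.0 / len(hits)
    return counts


reads = [
    "GGCTCCGT",  # spans the variable start of taxon_A
    "CGTACGGA",  # lies entirely inside the conserved region
    "TTCACCGT",  # spans a variable part of taxon_B
]

for read in reads:
    print(read, "->", matching_references(read, REFERENCES))
print(fractional_counts(reads, REFERENCES))
```

The middle read matches both references equally well, so any count it contributes depends on the convention you pick rather than on the data.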

These are all core problems inherent to shotgun sequencing, and people are using long read sequencing on amplicons to avoid them.

It sounds like your team is well aware of these problems, and might even have a solution to some of them. If you have a new method or solution, let us know!

Yes. Shotgun sequencing has different assumptions than targeted sequencing.

'Shotgun amplicons' are strange beasts.

:unicorn: :dragon:



Thank you Colin, this is really useful!

