Processing Illumina fragmented reads

Hello everyone,

While searching the forum for ways to improve my taxonomic results, I found this topic, which describes something similar to our procedure. We amplified the full 16S gene, randomly fragmented it to 150 bp, and sequenced it on the Illumina MiSeq platform. After that, the Illumina BaseSpace suite was used for primer and barcode trimming and for demultiplexing. Initially, we followed the "Moving Pictures tutorial" pipeline with some changes, such as using dada2-paired instead of dada2-single and adding orient-seqs. Our taxonomic results don't match the expected abundances (we use a mock community to check that), although the observed bacterial genera are what they should be.

I have now read that DADA2 cannot process this kind of read because of the assumptions it makes. How should we proceed? The topic linked above suggests merging the R1 and R2 files, but I am not sure how this is done. How do you get the Feature Table and Representative Sequences files?

On the other hand, for shotgun analysis, some forum topics recommend using q2-shogun or q2-metaphlan2 for taxonomic classification, but if we are interested in exploring alpha diversity, we still need a pre-processing step to obtain the Feature Table and Representative Sequences.

What procedures should we follow to analyze our samples? Do you have any other suggestions?

Thank you!


Hello Sergio,

This sounds like a cool approach. Sequencing untargeted fragments of the 16S gene would cover parts not included by a specific hypervariable region, and that could help improve your taxonomic resolution.

I should clarify this. While your sequencing does not target a specific region of the 16S gene, shotgun/untargeted sequencing does not target any specific gene at all, so you get lots of functional and marker genes in addition to the 16S ribosomal genes. But your data set only includes the 16S gene, so I'm not sure these tools are appropriate either.

The good news is that there is still a path forward. You can still merge paired ends if your fragments are shorter than your MiSeq reads, then dereplicate and closed-ref cluster against a database to unify all these reads from different regions.
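To make the dereplication step more concrete, here is a minimal pure-Python sketch of the idea. It is not how vsearch implements it, only an illustration of how identical merged reads collapse into a per-sample count table plus a list of unique sequences. The sample IDs and sequences are made up, and the function name `dereplicate` is hypothetical.

```python
from collections import Counter


def dereplicate(reads_by_sample):
    """Collapse identical sequences and count them per sample.

    reads_by_sample: dict mapping sample ID -> list of merged read sequences.
    Returns (feature_table, rep_seqs): feature_table maps
    sample ID -> {sequence: count}, and rep_seqs is the sorted list of
    unique sequences seen across all samples.
    """
    feature_table = {}
    unique = set()
    for sample, reads in reads_by_sample.items():
        # Upper-case so that case differences do not split identical reads.
        counts = Counter(read.upper() for read in reads)
        feature_table[sample] = dict(counts)
        unique.update(counts)
    return feature_table, sorted(unique)


# Hypothetical toy input: two samples with already-merged reads.
toy_reads = {
    "mock-1": ["ACGTACGT", "ACGTACGT", "TTGACCGA"],
    "mock-2": ["ttgaccga", "GGCATCAA"],
}
table, rep_seqs = dereplicate(toy_reads)
```

In a real run, closed-reference clustering would then map each unique sequence to its best reference hit, so that fragments from different regions of the same 16S gene end up counted under one reference ID.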

And the best news is that you can optimize the settings in this pipeline because you have a positive control! :bar_chart: :petri_dish: :straight_ruler:
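One way to score each parameter setting against the positive control is to compute a dissimilarity between the observed and expected mock composition and keep the setting that minimizes it. Below is a small sketch using Bray-Curtis dissimilarity; the genus names, counts, and expected proportions are hypothetical placeholders, not values from this thread.

```python
def relative_abundance(counts):
    """Convert raw counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no counts to normalize")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(observed, expected):
    """Bray-Curtis dissimilarity between two abundance profiles.

    0 means identical profiles, 1 means no shared taxa.
    Taxa missing from one profile are treated as abundance 0.
    """
    taxa = set(observed) | set(expected)
    diff = sum(abs(observed.get(t, 0.0) - expected.get(t, 0.0)) for t in taxa)
    total = sum(observed.get(t, 0.0) + expected.get(t, 0.0) for t in taxa)
    return diff / total


# Hypothetical mock community with four genera at equal abundance.
expected_mock = {
    "Bacillus": 0.25,
    "Escherichia": 0.25,
    "Listeria": 0.25,
    "Pseudomonas": 0.25,
}
# Hypothetical genus-level counts from one pipeline run.
observed_counts = {
    "Bacillus": 400,
    "Escherichia": 100,
    "Listeria": 300,
    "Pseudomonas": 200,
}
score = bray_curtis(relative_abundance(observed_counts), expected_mock)
```

Repeating this for each combination of merge and clustering settings gives one number per run, which makes the comparison against the mock community less subjective than eyeballing bar plots.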

Steps like read joining can be biased against longer fragments, causing fewer of them to join and removing them from downstream analysis. This could be changing the taxonomic composition you are seeing. The hypervariable region sequenced can also change observed taxonomy (link, link), and because you are sequencing fragments from all regions, I'm not sure what biases to expect compared to the known results.
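The length bias in read joining comes down to simple arithmetic: with paired reads of length R, a fragment of length F leaves roughly 2R - F overlapping bases, and a merger needs some minimum overlap to join the pair. The sketch below shows this; the read length of 150 and the minimum overlap of 10 are illustrative assumptions, so check the actual read length of your run and the settings of your merging tool.

```python
def expected_overlap(fragment_len, read_len):
    """Number of bases covered by both the forward and the reverse read.

    For fragments shorter than one read, the whole fragment is covered
    by both reads, so the overlap is capped at the fragment length.
    """
    overlap = 2 * read_len - fragment_len
    return max(0, min(fragment_len, overlap))


def can_merge(fragment_len, read_len, min_overlap):
    """True if the pair overlaps enough to be joined."""
    return expected_overlap(fragment_len, read_len) >= min_overlap


# Assumed values for illustration only.
READ_LEN = 150
MIN_OVERLAP = 10

for frag in (150, 250, 295, 400):
    ok = can_merge(frag, READ_LEN, MIN_OVERLAP)
    print(frag, expected_overlap(frag, READ_LEN), ok)
```

Under these assumed values, fragments longer than 290 bases would fail to join and drop out of the analysis, which is one way the merge step can shift the observed composition.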

Let me know what you find, and if you have any other questions about how to work with your untargeted amplicons.



Thank you Colin!

We will test your recommendations and report back with feedback on this pipeline.

This is important to keep in mind. We have tried the BaseSpace Kraken 2 plugin without any prior processing of the data, and the results are close to what we expect, but it is a very closed procedure that does not allow changing parameters, so we are interested in working with QIIME 2. Perhaps less restrictive processing than DADA2 will improve the taxonomic composition.

Let's see what we find!


PS: In the end, we tried a new workflow using vsearch to merge paired ends, dereplicate, and perform closed-reference clustering (at 99% similarity), followed by taxonomic classification with classify-sklearn. The results are closer to the expected values than before but are still imperfect. We decided to use Kraken 2 for taxonomic classification and import the results into QIIME 2 in BIOM format. Thank you for your help; it will be useful in future projects.

