About joining paired reads and quality control

juandsal · August 8, 2020, 4:06am

Hi, I'm trying to analyze for the first time a set of 16S rRNA samples obtained from human faeces.

Short description

V3-V4 amplicons, paired-end sequencing (2x300 bp). The sequences were provided already demultiplexed, so I created a manifest file and imported them into QIIME 2 (conda). There are 68 samples, each with ~100,000 sequences of 301 bp length.

So far, I have managed to import the sequences and generate a .qzv file.
paired-end-demux.qzv (312.6 KB)

1. At this point, should I trim the first 16 bp of the FORWARD read (median Q score < 28) in order to ensure better quality downstream? How do I do this?

Regarding joining of paired reads

According to this tutorial (demultiplexing flowchart), next step is vsearch join-pairs.

2. Should I use the parameter --p-truncqual at this point? If I don't trim my forward reads beforehand (as described in question 1) and set, for example, a Q of 30, would this mess up the sequences and the process, given the quality of my sequences? Is it better not to use this parameter?

Should I keep the default --p-minlen value (1)?

3. What is the expected overlap of the sequences? How does this affect the default value of --p-minovlen (10)? When and why should I change this parameter?

Regarding Quality Control

4. Why is the quality-filter q-score-joined method said to be deprecated? Why is it still recommended in the tutorial? What would be the alternative? Or is it OK to still use it?

Regarding Denoising

I'm aiming to use Deblur and to generate ASVs. Any advice prior to reaching this step?

Thanks in advance for your time and help.

colinbrislawn · August 12, 2020, 3:31am

Hello @juandsal,

Welcome to the QIIME 2 forums! I appreciate your detailed description of your data type and clear questions.

I would like to address your questions in reverse order.

Regarding Denoising

There are many good options for denoising! Deblur is a good option, and DADA2 is another. This choice is based on your input data (read length, region sequenced, overlap, etc), and is up to you.

Regarding Quality Control

The method used in quality-filter q-score-joined is deprecated because there might be even better options available, depending on the data you have.

Modern methods often perform quality control, joining, and denoising as a single step, or use the process of joining reads to increase the confidence (q-score) of the output data.

If you only have joined reads, then this method could be your best option. But you have unjoined reads, so you may be able to make use of these better methods.

That section is more of a 'Conceptual overview of QIIME 2'...
Compare that to the Atacama soil tutorial, which also uses paired-end Illumina reads from 16S rRNA samples, just like you have! The main difference is that the samples are Atacama Desert soil instead of human feces, but the steps for this data type are the same.

The Atacama soil tutorial uses DADA2, which is one of the methods that performs joining and denoising as a single step.

Regarding joining paired reads

Using the Atacama soil tutorial as an example, the goal is to trim off the low quality start and end of your reads, so that more reads can join and help to error correct each other.

In that tutorial, they suggest:

  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \

Based on your quality score plots, I would suggest starting with

  --p-trim-left-f 25

because I see some low quality scores at the start of your forward read.
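
To make the effect of these position-based settings concrete, here is a minimal Python sketch (not from the tutorial). It assumes the behaviour DADA2 documents for its filtering step: a read is cut at the truncation position, reads shorter than that position are discarded, and the trim-left bases are then removed from the start. The function name and the toy read are hypothetical, and trunc_len=300 in the second call is only an illustrative way of keeping the whole toy read.

```python
from typing import Optional


def trim_and_truncate(read: str, trim_left: int, trunc_len: int) -> Optional[str]:
    """Mimic position-based trimming of a single read.

    Reads shorter than trunc_len are discarded (returned as None); otherwise
    the read is cut at trunc_len and trim_left bases are removed from the start.
    """
    if len(read) < trunc_len:
        return None
    return read[trim_left:trunc_len]


# Toy 300-base read, not real data.
read = "ACGT" * 75

# Values from the Atacama tutorial flags quoted above.
tutorial_style = trim_and_truncate(read, trim_left=13, trunc_len=150)

# ASSUMPTION: trim_left=25 as suggested above, with no effective truncation.
suggested_start = trim_and_truncate(read, trim_left=25, trunc_len=300)

print(len(tutorial_style), len(suggested_start))
```

The point of the sketch is that the retained length is the truncation position minus the trim-left value, so every base removed here also shortens the region available for joining.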

Regarding joining of paired reads

That depends on your forward and reverse read length (300f + 300r) and the total length of the V3-V4 region you sequenced. What primers did you use?
(Here is a short example of how read overlapping works.)
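
As a rough sketch of that arithmetic: when the forward and reverse reads together are longer than the amplicon, the excess is the overlap. The 460 bp amplicon length below is only a placeholder assumption, since the primers have not been stated yet, and the function name and truncation values are hypothetical.

```python
def expected_overlap(fwd_len: int, rev_len: int, amplicon_len: int) -> int:
    """Return the number of bases where forward and reverse reads overlap.

    A negative result means the reads leave a gap and cannot be joined.
    """
    return fwd_len + rev_len - amplicon_len


# ASSUMPTION: placeholder amplicon length, not a value from this thread.
ASSUMED_AMPLICON_LEN = 460

# Full-length 2x300 reads.
full_length = expected_overlap(300, 300, ASSUMED_AMPLICON_LEN)

# Hypothetical truncation lengths, to show how truncating shrinks the overlap.
after_truncation = expected_overlap(270, 220, ASSUMED_AMPLICON_LEN)

print(full_length, after_truncation)
```

Whatever the real amplicon length turns out to be, the overlap left after truncation needs to stay above the minimum the joining method requires (for example the --p-minovlen default of 10 in vsearch join-pairs).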

DADA2 does this a little differently than vsearch join-pairs, but knowing the expected area of overlap is still very important!

I think you are off to a good start, and are asking all of the right questions. My first question for you is about your primers and their area of overlap. With that answered, we can select a denoising method (dada2, deblur, something else) and start choosing the best settings for your data.

Keep in touch,
Colin
