Which position is best for trimming in a paired-end approach?

Hello qiime community!

I'm bringing some questions about the pre-denoising step: inspecting the quality of my sequences and choosing the best position to trim my reads.

First: I'm using qiime2-amplicon-2024.10 and analyzing an Illumina paired-end run (2x250 bp).

Second: I have already run two analyses with a single-end approach, because the quality of my reverse reads was bad. So this time I'm trying to go further with a paired-end analysis.

Here I have my paired-end data imported:
paired-end-demux.qzv (320.6 KB)
And the same file after applying cutadapt:
trimmed-seq.qzv (326.3 KB)

Questions:

After applying cutadapt, my guess for a good denoising (keeping a Phred score of >25) is to retain my forward reads at full length and truncate my reverse reads at 173 nt.

  • Is this too stringent? Will I run into problems merging during the DADA2 step? Which read length is sufficient for a good merge?

  • Are there any other issues with my reads that I'm not noticing? Or is this a good run, with a sufficient number of reads?

Thanks in advance!

Hello João,

I just took a quick look at the read quality plot after importing, and R2 quality is rough!

I don't know of a perfect fix when the sequencing run itself went poorly. It may be easiest to also process this run with only R1.

You can always move on to the next step, like DADA2. Issues with read quality will appear there as reads that do not pass the quality filter or reads that fail to join. Because DADA2 does its own quality control, you can try it and see how well it works! It will tell you about the problems.

Maybe I will get some problems to merge during the dada2 step?

Probably... so try it and see for yourself!

Let us know what you try next!


Hi @colinbrislawn,

Thanks for your insights! I've just tried three different DADA2 runs:

Test 1: This gave me the best percentage of input passing the filter, although I'm not satisfied with the values: some samples were around 15-30%, and the highest was around 60%.
denoising-stats.qzv (1.2 MB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-seq.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 173 \
  --o-table dada2_data/table-paired.qza \
  --o-representative-sequences dada2_data/rep-seqs-paired.qza \
  --o-denoising-stats dada2_data/denoising-stats.qza

Test 2: I tried truncating my forward reads a little, but it gave me worse results.
denoising-stats_2.qzv (1.2 MB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-seq.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 245 \
  --p-trunc-len-r 173 \
  --o-table dada2_data/table-paired_2.qza \
  --o-representative-sequences dada2_data/rep-seqs-paired_2.qza \
  --o-denoising-stats dada2_data/denoising-stats_2.qza 

Test 3: Finally, in this test I reduced how much I truncated my reverse reads. The percentage of input passing the filter improved a bit compared with Test 2, but it is still not good enough. However, this gave me the best merging results, which were full of zeros in the other tests.
denoising-stats_3.qzv (1.2 MB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-seq.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 218 \
  --o-table dada2_data/table-paired_3.qza \
  --o-representative-sequences dada2_data/rep-seqs-paired_3.qza \
  --o-denoising-stats dada2_data/denoising-stats_3.qza

What are your thoughts about these results? Are there any other suggestions for proceeding with the paired-end analysis?


This is the key observation!

If you trim the reads very short, quality will improve, but if they are too short there won't be enough overlap for the reads to merge.

If you leave the reads a little longer, a few reads will no longer pass the filter, but more will merge overall.

If you leave the reads much, much longer, even more will be removed by the quality filter, and errors can also prevent merging!

So it's a tradeoff. One simple thing to do is to look for settings that maximize the total reads merged. We have to optimize for something, and this seems reasonable to me.
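One way to run that comparison systematically is a small parameter sweep. This is only a sketch: the `echo` keeps it a dry run, the 200 nt value is an assumption added for illustration, and file names follow the commands earlier in the thread.

```shell
# Dry-run sketch of a sweep over --p-trunc-len-r values.
# Remove 'echo' to actually run DADA2; then compare the
# "percentage of input merged" column across the stats files.
for r in 173 200 218; do
  echo qiime dada2 denoise-paired \
    --i-demultiplexed-seqs trimmed-seq.qza \
    --p-trunc-len-f 0 \
    --p-trunc-len-r "$r" \
    --o-table "dada2_data/table-r${r}.qza" \
    --o-representative-sequences "dada2_data/rep-seqs-r${r}.qza" \
    --o-denoising-stats "dada2_data/denoising-stats-r${r}.qza"
done
```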

I like your approach of trying many settings and comparing the results!

Have you calculated your expected amplicon length and expected read overlap?
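As a sketch of that calculation (the 465 bp amplicon length is an assumption, roughly a 16S V3-V4 amplicon; substitute your own primer pair's value):

```shell
# Expected overlap = forward kept length + reverse kept length - amplicon length.
amplicon=465   # ASSUMED amplicon length (e.g. ~16S V3-V4); use your own
trunc_f=250    # forward reads kept at full length
trunc_r=173    # reverse truncation from Test 1
overlap=$((trunc_f + trunc_r - amplicon))
echo "expected overlap: ${overlap} nt"   # DADA2 needs roughly 12 nt or more to merge
```

A negative or tiny result here would explain merged-read counts near zero.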


Hi @colinbrislawn,

So I've checked the mean length of my reads here:
rep-seqs-paired_3.qzv (531.5 KB)

But looking at my feature table, it seems I have fewer ASVs than I would like: in my past analyses I had between 4,000-5,000 ASVs, while here I have just 1,433.
table-paired_3.qzv (471.8 KB)

Maybe a single-end approach would be better, but I'm confused about that. Is it common for reverse reads to show a decrease in quality? Or is this a condition of this particular sequencing run? In my last two sequencing runs, my analyses were conducted with a single-end approach due to the bad quality of the reverse reads.

So I've checked the mean length of my reads here:

Great!

Next, how long is the amplicon you sequenced?

Then calculate the overlap as shown here!

Is it common for reverse reads to show a decrease in quality?

Yes, though it's usually not this bad. Take a look at this example from the Atacama soils tutorial.

In my last two sequencing runs, my analyses were conducted with a single-end approach due to the bad quality of the reverse reads.

Maybe a problem with the PCR, or the sequencing core, or something spooky! :ghost:

Thanks Colin!

However, my percentages of input passing the filter are too low, along with a low number of ASVs.

So I decided to test a single-end approach, but to my surprise I hit a new problem:

Lots of zeros: almost nothing in my data passed the filter or was denoised:
denoising-stats.qzv (1.2 MB)

I ran cutadapt and then truncated my reads at 245 nt during DADA2.

qiime dada2 denoise-single \
  --i-demultiplexed-seqs trimmed_seq_single.qza \
  --p-trim-left 0 \
  --p-trunc-len 245 \
  --o-table dada2_data/dada2_single/table-single.qza \
  --o-representative-sequences dada2_data/dada2_single/rep-seqs-single.qza \
  --o-denoising-stats dada2_data/dada2_single/denoising-stats.qza

What could have happened? :thinking:

Wow, that removed all your data!

--p-trunc-len 245

DADA2 throws out any sequences shorter than this length, so any reads that are already shorter due to trimming with cutadapt will be tossed.
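The effect is easy to check against your read-length distribution. A toy sketch (the three lengths below are fabricated for illustration; the commented `zcat` line shows how you would extract real lengths from a FASTQ):

```shell
# Reads shorter than --p-trunc-len are discarded entirely by DADA2.
# With real data: zcat sample.fastq.gz | awk 'NR%4==2 {print length($0)}' > read_lengths.txt
printf '237\n237\n250\n' > read_lengths.txt   # toy read lengths (fabricated)
trunc=245
awk -v t="$trunc" '$1 >= t {n++} END {print n+0, "of", NR, "reads survive trunc-len", t}' read_lengths.txt
```

Here only one of three toy reads survives a 245 nt threshold; if every read is 237 nt, none do.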

Keep exploring and report back! I can take another look later today.

I found the issue! After cutadapt, my sequences had already been trimmed down to 237 nt:
trimmed_seq_single.qzv (300.1 KB)

So it makes sense that truncating at 245 nt removed all my data.

With that observation, I reran my denoising, and I think the results are good enough to move forward with my analysis (a reasonable-to-good percentage of input passing the filter, plus a good number of ASVs, ~4,500):
denoising-stats.qzv (1.2 MB)
table-single.qzv (597.9 KB)

qiime dada2 denoise-single \
  --i-demultiplexed-seqs trimmed_seq_single.qza \
  --p-trim-left 0 \
  --p-trunc-len 228 \
  --o-table dada2_data/dada2_single/table-single.qza \
  --o-representative-sequences dada2_data/dada2_single/rep-seqs-single.qza \
  --o-denoising-stats dada2_data/dada2_single/denoising-stats.qza

My last question is: what are the main impacts on my data of using a single-end approach instead of paired-end?


Longer reads lead to longer ASVs. This helps distinguish similar microbes and improves taxonomic resolution.

Maybe you need this, or maybe you don't.

Early Illumina sequencing was under 100 bp long, and it still got published. :person_shrugging:


Thanks so much for your insights, Colin!