Trimming and joining short reads resulted in few merged reads for the V4 region

Dear Qiime team,

I am trying to merge paired-end reads, but only a few of them actually merge, and I don't know whether it is due to my script or because my reads are too short. My problem is similar to this post ('joining-2x150-bp-v4-paired-end-reads-from-iseq100'), but I also have a question about the trimming process.
I have used the 515F and 806R primers for the V4 region, with 2x150 reads.
QIIME 2 2020.2 installed in VirtualBox.

Here is what I am doing

qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path metadata.csv \
--output-path paired-end-demux.qza \
--input-format PairedEndFastqManifestPhred33V2

# to reduce computing time, I have subsampled 10% of my sequences
qiime demux subsample-paired \
--i-sequences paired-end-demux.qza \
--p-fraction 0.10 \
--o-subsampled-sequences sub-paired-demux.qza \
--verbose

sub-paired-demux.qzv (291.6 KB)

qiime cutadapt trim-paired \
--i-demultiplexed-sequences sub-paired-demux.qza \
--p-front-f GTGCCAGCMGCCGCGGTA \
--p-front-r GACTACHVGGGTATCTAATCC \
--o-trimmed-sequences trim-sub-paired-demux.qza \
--verbose

(only the last sample is shown; all present similar results)
=== Summary ===

Total read pairs processed: 93,436
Read 1 with adapter: 91,279 (97.7%)
Read 2 with adapter: 90,113 (96.4%)
Pairs that were too short: 0 (0.0%)
Pairs written (passing filters): 93,436 (100.0%)

Total basepairs processed: 28,030,800 bp
Read 1: 14,015,400 bp
Read 2: 14,015,400 bp
Total written (filtered): 23,960,704 bp (85.5%)
Read 1: 12,100,395 bp
Read 2: 11,860,309 bp

=== First read: Adapter 1 ===

Sequence: GTGCCAGCMGCCGCGGTA; Type: regular 5'; Length: 18; Trimmed: 91279 times; Reverse-complemented: 0 times

No. of allowed errors:
0-9 bp: 0; 10-18 bp: 1

Overview of removed sequences
length count expect max.err error counts
3 42 1459.9 0 42
10 1 0.1 1 0 1
11 31 0.0 1 30 1
14 4 0.0 1 2 2
15 16 0.0 1 12 4
16 16 0.0 1 5 11
17 197 0.0 1 168 29
18 304 0.0 1 241 63
19 1250 0.0 1 902 348
20 6432 0.0 1 3587 2845
21 79589 0.0 1 77967 1622
22 3332 0.0 1 18 3314
23 1 0.0 1 0 1
25 2 0.0 1 2
56 1 0.0 1 1
120 2 0.0 1 2
121 7 0.0 1 7
122 5 0.0 1 3 2
123 10 0.0 1 10
124 16 0.0 1 16
141 4 0.0 1 4
142 1 0.0 1 1
146 6 0.0 1 6
147 10 0.0 1 10

=== Second read: Adapter 2 ===

Sequence: GACTACHVGGGTATCTAATCC; Type: regular 5'; Length: 21; Trimmed: 90113 times; Reverse-complemented: 0 times

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2

Overview of removed sequences
length count expect max.err error counts
3 47 1459.9 0 47
4 2 365.0 0 2
12 1 0.0 1 1
15 1 0.0 1 1
16 4 0.0 1 2 2
17 1 0.0 1 0 1
18 6 0.0 1 1 5
19 14 0.0 1 3 9 2
20 111 0.0 2 24 61 26
21 214 0.0 2 42 81 91
22 763 0.0 2 139 270 354
23 3960 0.0 2 579 1873 1508
24 84848 0.0 2 31532 50828 2488
25 139 0.0 2 19 49 71
26 1 0.0 2 0 0 1
27 1 0.0 2 0 0 1

trim-sub-paired-demux.qzv (296.8 KB)

At this step it seems to me that the primers are correctly trimmed from the forward and reverse reads; is that right? However, the quality-score plot warns me that the minimal sequence length for the forward reads is 3 bp… is something wrong?

For the merging, I use a minimal overlap length of 7, because my reads, once the primers are removed, barely overlap.
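As a back-of-the-envelope sketch of why the overlap is so tight (assuming the canonical ~253 bp V4 amplicon between the 515F/806R primers; the exact length varies slightly by taxon):

```python
# Expected overlap for 2x150 V4 reads that still contain the primers.
READ_LEN = 150
FWD_PRIMER = len("GTGCCAGCMGCCGCGGTA")     # 515F, 18 bp
REV_PRIMER = len("GACTACHVGGGTATCTAATCC")  # 806R, 21 bp
AMPLICON = 253  # typical V4 length between the primers

fwd = READ_LEN - FWD_PRIMER  # biological bases left on R1 after trimming
rev = READ_LEN - REV_PRIMER  # biological bases left on R2 after trimming
overlap = fwd + rev - AMPLICON
print(fwd, rev, overlap)  # 132 129 8
```

So on paper only ~8 bp of overlap remain after primer trimming, which is why a small --p-minovlen is needed at all.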

qiime vsearch join-pairs \
--i-demultiplexed-seqs trim-sub-paired-demux.qza \
--p-minovlen 7 \
--p-maxdiffs 25 \
--o-joined-sequences joined-paired-demux.qza \
--verbose

(for the last sample)
Merging reads 100%
93436 Pairs
217 Merged (0.2%)
93219 Not merged (99.8%)

Pairs that failed merging due to various reasons:
56626 too few kmers found on same diagonal
9 potential tandem repeat
36535 alignment score too low, or score drop to high
49 staggered read pairs

Statistics of all reads:
128.22 Mean read length

Statistics of merged reads:
239.70 Mean fragment length
36.02 Standard deviation of fragment length
0.32 Mean expected error in forward sequences
0.45 Mean expected error in reverse sequences
0.50 Mean expected error in merged sequences
0.15 Mean observed errors in merged region of forward sequences
0.28 Mean observed errors in merged region of reverse sequences
0.43 Mean observed errors in merged region

What do "too few kmers found on same diagonal" and "alignment score too low, or score drop to high" mean?
Besides, using DADA2 after the trimming also resulted in only a few merged reads.

Thanks a lot for your help

Hi @simon,

If you are indeed sequencing via a 2x150 kit with the EMP V4 primers (515F/806R), and those primers are actually contained within the reads… then sadly these reads will either be too short, or not have enough high-quality bases, for merging.

The 2x150 sequencing approach works best with the sequencing strategy outlined here, and the references therein. That is, when using a 2x150 run for the V4 region, the sequencing strategy should NOT read through the primers, meaning there would not be any primers in the reads, so there is no need for cutadapt. This saves ~40 bp, which lets you sequence further into the amplicon, allowing for longer overlapping segments and better merging of the reads. It saves costs too, as you can get away with a 2x150 sequencing run.
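That ~40 bp saving can be made concrete with a quick calculation (again assuming the canonical ~253 bp V4 amplicon):

```python
# Overlap available when the primers are NOT sequenced: the full 2x150
# read length goes toward covering the ~253 bp V4 amplicon.
READ_LEN, AMPLICON = 150, 253
PRIMERS = 18 + 21  # 515F + 806R: the ~40 bp saved by not sequencing them

overlap = 2 * READ_LEN - AMPLICON
print(overlap)            # 47 bp of overlap, comfortable for merging
print(overlap - PRIMERS)  # 8 bp left if the primers had been sequenced
```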

Otherwise, if the primers are contained within your reads (which have to be trimmed via cutadapt), you’ll often have to use a 2x250 sequencing run to guarantee longer reads with enough overlap in order to merge.

At least these are common factors I’ve run into when using 2x150 vs 2x250 for sequencing the V4 region. Hopefully, others will have recommendations / comments. :slight_smile:


Thank you for your answer @SoilRotifer,

Based on the cutadapt verbose output, it seems that my primers are still in my 2x150 reads… right? It may be an obvious question, but I'm not familiar with these pre-processing steps :slightly_smiling_face:

No problem @simon. :slight_smile:

That would be my assessment too, especially considering that cutadapt was able to find the primer sequences in the majority of the R1 & R2 reads.

-Mike


Thank you Mike,

Just to let you know (or anyone else who is interested): I have used fastq-join with a 3 bp minimum overlap and was able to merge ~80% of my reads. However, I am still unsure whether that is the right thing to do…

Hi @simon,

That is indeed very little overlap. You should be able to at least "sanity check" the sequences.

I'd suggest running deblur and seeing how many sequences you retain. Make sure the sequence lengths make sense; they should be ~254 bp, give or take. Then assign taxonomy to these sequence variants and see if the assignments make sense. I would compare these to running deblur on only the forward reads. The taxonomy should be similar between the two, except that you may be able to obtain lower-level taxonomies (i.e. family, genus) more often than with the forward reads alone.
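A minimal sketch of that length check (the sequences and names below are made-up placeholders; in practice you would load your representative sequences exported from deblur or DADA2):

```python
# Flag sequence variants whose length is far from the expected ~254 bp.
# Very short variants often come from spurious merges at tiny overlaps.
seqs = {
    "asv1": "A" * 253,
    "asv2": "A" * 254,
    "asv3": "A" * 120,  # suspiciously short -- likely a bad merge
}
EXPECTED, TOL = 254, 10

ok = {name for name, s in seqs.items() if abs(len(s) - EXPECTED) <= TOL}
print(sorted(ok))  # ['asv1', 'asv2']
```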

Finally, a good old-fashioned BLAST check would help too. You can do this by running qiime quality-control evaluate-seqs ... to see how well your sequence variants align to the SILVA database.

-Mike