Need to trim primers, but will this prevent merging sequences?

I have a 16S dataset consisting of demultiplexed Illumina paired-end reads, each 250 bp, generated with primers 341F CCTACGGGNGGCWGCAG (17 nt) and 805R GACTACHVGGGTATCTAATCC (21 nt). I successfully imported the data manifest and was able to visualize quality scores for the samples.

Initially I planned to use DADA2 without any truncation, but after reading further about the truncation parameters and looking at my data, I thought truncating might be a good idea. However, with the reads only 250 bp each and the total amplicon length approximately 464 bp, the combined reads exceed the amplicon by only 36 bp; assuming a minimum of 20 bp is needed for overlap, that leaves only 16 bp to truncate across both reads.
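In other words, writing the arithmetic out (the 20 bp minimum overlap is my own assumption):

```shell
# Rough merge-overlap arithmetic for 2 x 250 bp reads over a ~464 bp amplicon
reads_total=$(( 250 + 250 ))          # combined forward + reverse read length
amplicon=464                          # approximate amplicon length
overlap=$(( reads_total - amplicon )) # bases available for overlap
echo "overlap: ${overlap} bp"         # 36
min_overlap=20                        # assumed minimum overlap for merging
echo "trimmable: $(( overlap - min_overlap )) bp"  # 16
```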

I also realized that the sequencing service didn't trim the primers from the 5' ends of the reads: the 22 nt forward sequence (the 17 bp primer plus what seems to be a 5 bp adapter before it) and the 21 nt reverse primer were included in the 250 bp total read length. I decided to trim these with the following cutadapt command anyway:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences G1_pairedend.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-match-adapter-wildcards \
  --o-trimmed-sequences G1_cut_pairedend.qza

But I'm wondering if I can even use these cut sequences at all, since there will now be a gap between the forward and reverse reads when merging. It also seems that running cutadapt reduced sequence quality somewhat, based on the interactive quality plot, but maybe I am misinterpreting this.

Here are some images of data quality scores:


So how much wiggle room do I have? I don't know if I can remove the adapters with cutadapt or do much truncation in dada2 without hindering sample merging across the board. Some advice would be appreciated!

Removing the primers will not affect the overlap, as the primers are located on the 5' ends of the forward and reverse reads. It is the truncation of the 3' ends that is the important factor in determining how much overlap you require.

Only 12 bp of overlap are required. You can see the default settings by running qiime dada2 denoise-paired --help.

To better determine how much overlap you'd need I refer you to the following thread:

This is also a nice overview:

-Cheers!
-Mike


Hi Mike,

Thanks for the quick reply, and the explanation. I read the "best of" post you linked earlier in the day, but I didn't catch that 3' ends were of primary concern when it comes to truncation, I just assumed it was the total sequence length that mattered. Thanks for the clarification and the help!


Hi Mike,

I tried a couple of things yesterday after I got your input but I am still stuck, so I could use some more guidance.

I went ahead and ran DADA2 on my post-cutadapt data, but the results were very poor: most samples had zero sequences pass the filter, so I had to scrap that output.

I found out from the sequencing service that the primers and adapters HAD been trimmed from the samples, so I re-ran DADA2 on the original data, truncating at 240 bp (forward) and 236 bp (reverse). This time about 60-80% of reads passed the filter, but only around 2-10% of the sequences were determined to be non-chimeric.

I definitely see the primer sequences at the beginning of each read in the fastq files, but the sequencing service says that has all been trimmed out and that no further cutting should be done. It seems the chimera results indicate otherwise though? Maybe I am not reading this right.

I ran a more complete version of this dataset earlier this week using --p-trim-left-f and -r of 13 and leaving --p-trunc-len-f and -r at 250, and far more samples passed the filter, with maybe 40-60% of reads non-chimeric. I think this result is better, but I'm not sure whether I'm leaving in too many low-quality read ends.

What do you think would be the best course of action?

It's been my experience that most sequencing facilities do not understand what we mean when asking about primer removal. That is, they are likely removing the Illumina sequencing primers and adapters... not necessarily your primer sequence. It helps to be very explicit when asking your sequencing facility what they've done. Make sure they are very explicit with you too.

I think this sounds reasonable. However, after looking at your quality score plots again, they look a little suspicious. Are you sure the data was not quality filtered prior to getting into your hands? Can you share the QZV file of the DADA2 stats output? I'd like to see where the data is failing: is it failing due to quality control, or due to failed merges? If failed merges, then you'll have to play with the truncation settings a bit more.

If trying multiple truncation / DADA2 settings does not help, you can always analyze your data using only the forward reads. I've had to do this myself on occasion, when my reverse reads were really bad and I had no hope of merging the reads.

I see what you mean. This is my first time doing 16S analysis on my own, so I will keep that in mind for next time.

I also had this thought. The reads are all exactly 250 bp, so I figured they had already been trimmed and maybe QC'd. The interactive quality plot still showed some lower-quality bases past ~240 bp, though, so I wasn't sure if these needed additional filtering. The email I got from the sequencing service says these are the "raw" files, but as I said, it seems the sequences have already been filtered. I just tried another DADA2 run with this limited set (using --p-trim-left of 12 for both and --p-trunc-len of 250 again), but the chimera rates are still pretty high. Most sequences seem to pass the filtering stage fine; it's the merging step that seems to fail.

I will DM you the QZV files I have.

Thanks for sending the info via DM @MicrobeManager!

I find it odd that the files trimmed with cutadapt have so few reads. One way to check how and where cutadapt is trimming is to use the --verbose option, which prints all of the cutadapt processing for each sample to the screen. Also, keep in mind you'll want to truncate the low-quality ends; that is, you'd probably want to truncate the forward reads somewhere between 200-210. I think part of the reason so many of your reads are being filtered is that you are retaining the low-quality bases at the 3' end.

Also, this reminds me to ask: did you view the quality plot after running cutadapt? I am guessing you are still setting the truncation value assuming the original full length of your raw reads. After running cutadapt, your reads are likely to be ~20 bp shorter due to primer removal, so most of them will fail because they are shorter than the 230-240 bp truncation length you are setting. That is, DADA2 and other tools discard reads that are shorter than the specified truncation value. :scissors:
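A tiny sketch of why those reads get discarded (lengths taken from this thread):

```shell
read_len_after_cutadapt=$(( 250 - 17 ))  # forward read length after primer removal (233)
trunc_len=240                            # truncation length chosen assuming raw 250 bp reads
# DADA2 discards any read shorter than the truncation length:
if [ "$read_len_after_cutadapt" -lt "$trunc_len" ]; then
  echo "read discarded: ${read_len_after_cutadapt} < ${trunc_len}"
fi
```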

In brief, I often import my reads, then run cutadapt, then demux (or not if they are imported as already demuxed data). The order for cutadapt or demux doesn't really matter here. Either way, I like to view the quality scores at some point after cutadapt, but prior to denoising. Then decide my trim / truncation values prior to denoising.

If you look through the tutorials for example quality score plots, e.g. this Atacama example, or elsewhere on this forum, you'll see why I think your quality score plot is suspicious.

Also, I forgot to mention that you should add the --p-discard-untrimmed flag to ensure you only keep reads in which the primer sequences can be found and trimmed. Otherwise you run the risk of keeping sequence data in which the primers are still present but have too many errors / mismatches for cutadapt to "find and remove". This can potentially have a negative effect on how DADA2 decides which sequences are chimeras. My assumption here, which could be wrong, is that you may have many sequences that were trimmed properly and others that were not, essentially confusing DADA2 into calling more chimeras than there should be, as these sequences are basically identical except for the primer portions. :thinking:
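As a sketch, the command posted earlier in this thread would then become something like the following (same input/output names as before; double-check the flags against your QIIME 2 version):

```shell
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences G1_pairedend.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-match-adapter-wildcards \
  --p-discard-untrimmed \
  --verbose \
  --o-trimmed-sequences G1_cut_pairedend.qza
```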

Hopefully this helps. Otherwise, I'm not sure what else to suggest, other than continuing on with just the forward reads. At least you can use that as a sanity check for your data.

I checked the quality plots on the post-cutadapt data (this was included in my first post, second image), and I see that the lengths are 228-229 nt for the forward reads and 229-230 nt for the reverse. The reverse reads also look worse than before.

I think this would indicate that all primer sequences were removed, since there are no values close to the original 250. I suppose I could try truncating at 228 and 229 nt, but this would prevent the reads from overlapping, which is why you suggest using forward reads only, correct?

This is why I asked my original question, since I figured trimming the primers would affect the sequence length and therefore truncation and merging. The "best of" post also mentions that trimming on the 5' end will affect overlap (the -2 for the 5' trim is included in the equation).

From the post:
216 forward + (218 - 2) reverse - (16S-341F and 16S-805R) amplicon = overlap
216 forward + 216 reverse - (805 - 341) amplicon = overlap
432 - 464 = overlap
-32 = overlap

I used trim lengths of 12 for f and r in my most recent DADA2 run, leaving a 12 bp overlap with no truncation, but again this might be leaving in too many low-quality sequences. I realize cutadapt is more appropriate for removing primers than p-trim-left, but in the end does it make a difference since I am still losing too many bp to merge?

I have definitely seen that my quality plots look quite different from others on the site. It seems these data were trimmed down before I got them... I wonder if the quality dropped off significantly above 250 bp, so they trimmed around that mark. In that case, could I just use the whole sequence given, trim off the primers, and call it a day? I don't have much leeway to trim the sequences: only 12 bp or so before losing the 12 bp overlap needed for merging (or I'll just use forward reads).

Correct.

Presence of the primer at the 5' end only affects the overlap calculation itself, not the actual process of merging reads, because the merging process only cares about the bases in the region of overlap (aside from all the other quality control stuff). That is, you'd simply update the calculation to take into account primer removal, by subtracting out the primer lengths from both your sequence data and that of the expected amplicon length. If your expected amplicon is 464 bp, and your primers are 38 bp (i.e. 17 + 21) then your new expected amplicon length is 426 bp (assuming the amplicon length was originally calculated containing the primers). Thus, your overlap calculation would be based on this value. Does this make sense?
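As a quick sketch of that arithmetic (using the 17 + 21 bp primer lengths from this thread, and ignoring the possible extra 5 bp adapter on the forward read):

```shell
# Updated overlap arithmetic after primer removal
amplicon_with_primers=464
primers=$(( 17 + 21 ))                            # 341F + 805R primer lengths
amplicon=$(( amplicon_with_primers - primers ))   # 426 bp expected insert
fwd=$(( 250 - 17 ))                               # forward read after trimming
rev=$(( 250 - 21 ))                               # reverse read after trimming
echo "overlap: $(( fwd + rev - amplicon )) bp"    # 36
```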

Hard to say, but you can proceed with the DADA2 / deblur trimming options. I've had to do this when the sequencing company would not provide the primer sequences (proprietary), but they did give me the trimming positions. In that case I had no choice but to use the trimming features of DADA2 / deblur.

I am thinking it may be safer to simply use the forward reads; I feel you may be losing too much data by trying to merge them. Again, many users, including myself, have processed data using only the forward reads. I'd strongly suggest that you process the data two ways: forward reads only, and merged. From there you can make a judgment call as to which best fits your expectations.


This finally makes way more sense. I didn't realize I should subtract the primer length from the expected amplicon; I thought it was only subtracted from the read length. The forward primer has an additional TGCGA at the beginning, making it 22 bp, so my expected amplicon would be 464 - 43 = 421 bp. Each read is 250 bp, so the new forward length is 228 and the new reverse is 229 after cutting. So: (228 + 229) - 421 = 36 bp of overlap.

I just ran DADA2 again with the cut sequences and set trunc-len-f to 228 and trunc-len-r to 229 and I got these results:

I'm not sure if some of the percentages are too low (I honestly don't know what a good filtered / non-chimeric percentage is relative to the number of starting reads), but it looks much better than before for the cutadapt-modified data. I might try trimming them down more and see how that goes, and also do a set with forward reads only.

Oh wow, this is a great improvement! Nice work! :tada:

It appears that you are averaging ~ 70%, which is what I've hit with a few data sets. I think this might be sufficient for you to proceed. Of course, you can always play around a little more, but as long as the data is sufficient to address your research questions you might be good to go.

Sounds like a good plan. Keep us posted! :building_construction:


Yes, this is much better! Thank you for your time and help, I have learned a lot from this discussion!
