How to remove primer with different length

Lei · May 12, 2019, 11:30pm

Hello everyone,
I have a questions regarding how to remove primer before running data2 (or use the trim command to remove primer). The primer set I used is 515f/806r The sequencing center told me that :

"the primers will vary a bit in size, that is, by 1, 2 or 3 bases. The reason is that Illumina's software does not do very good when the first few bases are the same, which is the case when we use the same universal primers for all the samples. So we (and others) have approached this by adding 1, 2, or 3 Ns at the beginning of the primer. That helps the software make the right base call and helps us obtaining more real data. "

So the primers are look like the following image:

So as a result, the primer F4 has three additional N compared to F1, F3 has two additional N compared to F1, and F2 has one additional N compared to F1. The final sequence length also varied from 291-294 bp. The sequencing center told me that they have already remove barcode. So what I need to remove is the primer (showing the letter after dash in the image).

I saw in the forum that people use --p-trim-left to remove primer. However, in my case, the position of my forward primer varied from 19 to 22 and the reverse primer varied from 20 to 23. If I using the following command : --p-trim-left-f-22 and --p-trim-left-r-23. They can remove all the primer but for some samples, additional Oligonucleotide can also be removed. If I did this, all the sequence after trim can also be different length since their original length is different and the length being remove is the same.

Do you think this is a good way (using Data2) to remove primer in my situations? Will this affect the other following analysis. Can you please give me some suggestions regarding how to remove the primers in my cases (either using Data 2 or other primer removal program).

Thank you for your help:smiley:.

Mehrbod_Estaki · May 13, 2019, 1:35am

Hi @Lei,
These heterogeneity spacers a great idea that are unfortunately underused in my opinion in the field. As so, there isn't a way to demultiplex these variable length barcodes in Qiime2 yet, though it has certainly been brought up before and is somewhere in the developers' radar, though I wouldn't hold my breath for this, I don't think its too high on the priority list.
That being said, you can demultiplex these outside of qiime2 with other tools like mothur and bcl2fastq and @nounou's custom code which may be of help. You can even hack something in Qiime2 by demultiplexing your samples based on the barcodes, then separate the samples into groups based on the # of extra Ns they have, then use cutadapt 3 times separately to remove the 3 variations of the the primers+spacers.
There is likely more options out there too but those should get you started. Ultimately, it is important that you do remove these before running DADA2 as otherwise you are going to call alot of incorrect ASVs since:

   GGTTCCAA
  NGGTTCCAA
 NNGGTTCCAA
NNNGGTTCCAA

Will be called 4 different features even though they should all be the same.
Good luck!

Nicholas_Bokulich · May 13, 2019, 2:44am

Since the barcodes are removed, the concerns about demultiplexing that @Mehrbod_Estaki raised are no longer a concern.

You can absolutely use QIIME 2 to accomplish your goal of removing the primers and variable spacers. Just use cutadapt once; since the variable spacer is at the 5' end of the primer, just use the primer as the "front" adapter with cutadapt trim-single and the primers (and all preceding bases, including the variable spacer) will be removed. See the tutorial here and use the --help flag to see the full usage details for cutadapt:

Lei · May 13, 2019, 2:45am

Thank you for your explanation @Mehrbod_Estaki.
I have two additional follow up.

So the commend --p-trim-left-f-22 and --p-trim-left-r-23 in Data 2 which can removed all the primers will not work in your opinion (since it will result in different length of the processed sequence)? I asked to one of my colleague and he told me that it is fine if you are OK to sacrifice some sequences. And Illumina sequencing itself cannot do good job at the beginning so the quality might be low and it is necessary to trim the the beginning part of the data. How do you think about this?
I also thought about the barcode method as you suggested. Do you think I can just consider those (extra spacer + primer) as barcode and create a text file to link them to specific samples and follow Demultiplexing steps in Qiime 2? My concern here is that then the barcodes will have three different length. Will the command --qiime demux can handle different length of barcodes.

Thank you for your help!

Mehrbod_Estaki · May 13, 2019, 2:55am

Hi @Lei,
I jumped the gun there a bit, I didn't read your initial post carefully enough and missed the part that your reads are already demultiplexed. Thanks @Nicholas_Bokulich for being on top of these as always. See his much much easier solution above.
But to answer your question, you should not trim them all the same length if they have variable primer length for the same reason I mentioned above. ASV methods are sensitive to even a single Nt difference so you want to avoid that. It might be less of an issue if you were OTU clustering but not with ASVs.

Lei · May 13, 2019, 5:13am

Thank you for the explanation @Nicholas_Bokulich @Mehrbod_Estaki
I have read through the tutorial but still confused:

If I used the following cutadapt command which written in the tutorial. What should I write for the --p-front . Should I include all the N in it. In my case, can I use --p-front NNNGTGCCAGMGCCGCGGTAA ? But I am confused here. Actually those N is not N, it represents any of the A,G,C,T in actual fastq file. If I just type N in the command --p-front. I do not think it will work. Can you please let me know the way more specifically.
$ qiime cutadapt trim-single
--i-demultiplexed-sequences demultiplexed-seqs.qza
--p-front GCTACGGGGGG
--p-error-rate 0
--o-trimmed-sequences trimmed-seqs.qza
--verbose
Or if I am wrong in the above, should I separate my samples into 4 groups based on the length of the primers and use the cutadapt command 4 times separately to remove primers?

Thank you for your help.

Best Regards,
Lei

Mehrbod_Estaki · May 13, 2019, 5:33am

Hi @Lei,

I think what @Nicholas_Bokulich's solution suggests is that you don't worry about those spacers at all and simply put in your actual primer sequences which is the same in all your samples, since cutadapt will just remove everything (regardless of length) that preceeds that sequence.
For example if your primer is GTGCCAGCMGCCGCGGTAA (please properly double check this sequence from your files) and the spacers are N, NN, NNN before it, then cutadapt will look for GTGCCAGCMGCCGCGGTAA and remove that and everything else on the 5' of that, regardless of the length, so all of your Ns, (regardless of their true value) will be removed.

Lei · May 13, 2019, 12:19pm

Thank you all for your help. I really appreciate it.
I will have a try and keep the forum posting on my progress.

Lei · May 14, 2019, 4:00pm

Hi,
I am here again to ask more questions. I am new to Qiime2. Thank you for your patience to answer my questions.

I have used the following cutadapt to trim in the (primer+extra spacer N) in my sequence. The primer I used is 515f/806r.
cutadapt trim-paired
--i-demultiplexed-sequences demux-paired-end.qza
--p-front-f GTGCCAGCMGCCGCGGTAA
--p-front-r GGACTACHVGGGTWTCTAAT
--verbose
--o-trimmed-sequences trimmed_remove_primers.qza
I compared the interactive quality plot of demux-paired-end.qzv (before remove the primers using cutadapt) with trim_remove_primers.qzv (after remove the primers using cutadapt). I am wondering why there are so much differences between the two plots. Did I do the remove primers steps correctly? More specifically, my questions are:

Why after I remove primers, there are red box appeared in the plot. But before removing, all the box are blue (if I zoom in). I have read several post in the forum and I am be able to understand "Red box plots (and the associated red warning text) should only appear when your input sequences have different lengths".

Question: The sequencing facility told me that "The products (primers included) should be around 291-294 bp" . But why I saw all blue box in the "demux-paired-end" plot (means equal length?) Why there are red box appeared after I used cutadapt? Did I remove primer correctly?

Questions: The quality plot of "demux-paired-end" looked strange compared to other quality plot I saw in the forum. @thermokarst mentioned in this thread that this kind of quality plot with narrow band of high-quality scores seems already be filter? Am I right? But I did not do any filtering steps. The "demux-paired-end" file is the file which I imported in Qiime 2 using commend qiime tools import. Since our sequencing facility told me that they already remove barcode so I did not go through the demultiplexing step.

Questions: If the red box make sense to appeared after removing primers. Do I need to get rid of all those red box by using --p-trunc-len when doing DADA2 filtering to make the sequence equal length (I am sorry if my questions is so basic to you. I am really trying hard to understood how to select the correct parameter for the DADA2 and understand the interactive quality plot. I have read many posts in the forum but still confused. )

If the trim_remove_primers.qzv looks fine, what parameter I should select? The following is the parameters I am thinking.
--p-trim-left-f 0\ (Seems no need to trim since I ready trim the primer in the cutadapt steps)
--p-trim-left-r 0\ (Seems no need to trim since I ready trim the primer in the cutadapt steps)
--p-trunc-len-f 0\ (I did not consider remove the red box)
--p-trunc-len-r 230\ (I did not consider remove the all the red box)
--p-n-threads 0\ (all available core will be used which can reduce the running time)
--p-n-reads-learn (keep default)

image1792×599 96 KB
Though I saw some thread in the forum mentioned these,I still want to confirm my understanding:
A. DADA2 will automatically do the jion (merge) front and reverse sequence for us and the output should be the joined sequence.
B. In my case in the primer set is 515f/806r which should cover around 291 bp. Based on the Qiime 2 forum, I need 30bp overlap. So I need 291+30~320 bp after truncation. That is trunc-len-f + trunc-len-r>320. In my case, I do not need to worry about I will not have enough overlap sequence since both the reverse and front reads are larger than 240bp?
C. Do I need to keep all the sequence the same length before running DADA2? Based on my understanding, different sequence length can affect the ASV clustering processes??

Attached file:
demux-paired-end.qzv (290.3 KB)
trimmed_remove_primers.qzv (295.9 KB)

Thank you so much for your help. I really appreciate that!

Mehrbod_Estaki · May 15, 2019, 7:38pm

Hi @Lei,

No problem, patience is is the foundation of progress!
Your cutadapt command looks ok to me. We can tell because if you look into the length summary of those plots you will see that originally all your reads were about 251 nts long and after they are about 229-231 nts. That 20 nts difference is the length of your primers you trimmed. All good so far.

Those red boxes you speak of only appear in the tails of your quality plots and when you hover over them the description below changes so you can read what it is referring to. Those quality plots are produced by subsampling 10,000 reads out of all the reads you have. Should any of those reads not be as long as the rest/what the plot is showing you will see that warning informing you that some of your reads were not that long. In your forward reads this begins happening almost exactly at the same place where you would trim your adapters which makes sense. A very small percentage of your reads didn't have your primers in them so they were never trimmed. What those reads are is not exactly known, some sort of contamination or chimera perhaps but it shouldn't matter as those will likely be discarded downstream anyways. They also have much lower quality score as well so again I wouldn't worry about these since they'll be dealt with later. Unless something weird happens during your merging steps of DADA2 I'd say you're good.

I would agree that this plot seems like it has gone through some sort of filtering already. You can always ask your facility if they have any sort of filtering in their pipeline. It is also possible that the reads are just that good in quality in which case , awesome.

This is always hard to say, as it is data-dependent and might take a couple of tries. But given your plots I would certainly avoid those poor quality tails we just talked about so truncate forwards to 220 and just before 200 on your reverse. See how that goes.

As for your other questions. Yes, DADA2 does join your reads after denoising. It does require a minimum of 20nt overlap, but more is always better if you can afford it.
you have 2x250 runs with a an amplicon size of 291. So 500-291=209 overlap. So as long as what you truncate is not over 209-20=189 nt combined you're good. So you shouldn't worry about overlaps at all.
You don't need to keep the sequences the same length before DADA2 as that's what the truncate function is for and also with paired-end you will also end up with some slightly different length sizes due to the natural variation in that region. Though this should be pretty minimal.
Hope this clarifies things a bit, good luck!

Lei · May 17, 2019, 2:41pm

Hello Mehrbod,

Thank you so much for your extensive explaination @Mehrbod_Estaki ! It solved most of my queries! I did run the DADA 2 analysis and have some issues with the RAM as I post here. After increasing the RAM，I did go through the DADA 2 successfully and finished the some basic analysis followed the "moving picture" tutorial used my own data.
I did try two different run:
Run 1:–p-trunc-len-f 0
–p-trunc-len-r 230
Run 2:–p-trunc-len-f 220
–p-trunc-len-r 200 (As you suggested)

I did not get good results for run 1. After run 1, around half of my samples had less than 500 reads!
I did run 2 followed the parameters you suggested. I got reasonable results after DADA 2 denoised. All the read number make sense to me now. I want to say that always make sure to remove all the bad quality data. Don't try to keep them. If you do not remove those, you will lose more reads after denoised step .

Now I still have some follow up questions want to ask:

When I ran DADA2 using the parameter in run 1, there are message "all the sequence are not the same length" continuously appeared. But I think they are not error message and they won't affect the process? So my questions here is that Is this message just trying to inform me that I have different length in the sequence. This different length won’t hurt the denoised step??
The second questions is related to the sequence read. One of my sample had very low sequence read around 2000 (all other samples had >10000) in the raw fastq data. In my run, I still keep this sample and put it into DADA 2 analysis. I understand that DADA2 seems to use some machine learning algorithm to train the data. So I am wondering whether there are differences keep this sample with low reads to run DADA2 and remove this sample from the entire dataset before running DADA2? I think if DADA2 process everything individually, this sample with low read won't affect the results??
Regarding the sequence read, in your opinion, what should be threshold ? My previous colleague told me that he usually discard the samples with sequence read < 5000. Do you think this is a reasonable value?

I saw Mark in this post compared the difference among DADA2, Deblur, and Vsearch method using Qiime2. I also plan to this comparison using my own data in the next and post my follow up.
Many thanks

Mehrbod_Estaki · May 17, 2019, 6:42pm

Hi @Lei,

A valuable lesson learnt! And I couldn't agree more. Quality over quantity. I'm willing to bet you might even get a bit more reads if you were to truncate even a bit more. But probably just a little, not too much.

Correct. This is more of a warning than an error, which started popping in the most recent version of dada2. I'm not sure if this message comes before truncation or after. If before its probably because the primers removed were variable in length. If after it might be referring to existnece of some reads that were just shorter in length for whatever reason (it happens..). Eitherway since you have PE reads you are bound to end up with variable length amplicons anyways due to natural variability and if there were any problematic reads that were not real would have been removed when they failed to merge anyways. So, no need to worry about that.

This is a very hard question to answer because it depends on your samples and the expected community diversity. The topic has been covered before on the forum so you could search for that and read but I would say you are totally fine with 10,000 sequence per samples. Even the sampe with 2,000 reads might be ok to include if your samples are not from a super diver environment like saline waters or soil. If these are say mouse or human samples that should be ok, but if you have good n values for your groups then removing it is justified too. up to you.

Great! Always useful for the rest of the community. Looking forward to reading it.

system · June 18, 2019, 12:42am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.