Finding Primers In Raw Files and Quality Control

Pauline_Trinh · October 26, 2017, 1:41pm

I'm just switching over to QIIME2 and excited about using DADA2 but I'm a little confused about two things.

1. I've inherited some demultiplexed paired-end fastq files and I'm uncertain if the primers are still in these raw files. Is there a way for me to identify them in the raw files or do I need to ask the sequencing center if the primers were sequenced?
2. I'm still uncertain as to where to trim the forward and reverse reads. The quality scores of my forward reads look pretty consistent, whereas my reverse reads start to drop off a bit at 240. There also seems to be a weird little blip at the beginning, so I was thinking of doing...
--p-trim-left-f 10
--p-trim-left-r 10
--p-trunc-len-f 260
--p-trunc-len-r 240

Is that reasonable? What should I be considering about my data that would help me determine where to trim?

Thank you!

Nicholas_Bokulich · October 26, 2017, 2:15pm

You will need to just check the files to be sure, though asking the sequencing center may be more straightforward (especially if your primers contain degenerate bases). If you have the raw .fastq files (before importing into QIIME2) and your primer does not contain degenerate bases, you could type the following command into your terminal (replacing ACGTACGT with your actual primer sequence):

grep 'ACGTACGT' path-to-your-fastq-file.fastq | wc -l

That will print the number of lines in which your primer sequence is detected, which should give a pretty good idea: if the number is very large, or close to 1/4 of the total number of lines in the file (each FASTQ record spans four lines, one of which is the sequence), then your primer(s) are still in the reads. If you do have degenerate bases, you could use BLAST to search for your primers in your sequences (we still don't have a method to do this in QIIME2 on raw sequences, just FeatureData[Sequence] data, but may support this in the near future). The easiest/quickest way to do this would be to just BLAST the first few sequences in the file (unless you can think of a reason why you'd need to BLAST them all). Pull out the first 5 sequences with this command:

head -n 20 path-to-your-fastq-file.fastq | awk 'NR % 4 == 2'
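
For primers with degenerate bases, a lighter-weight alternative to BLAST may be to expand the IUPAC codes into a regular expression and count matching reads with grep. The following is only a rough sketch, not a QIIME2 feature: the function names `iupac_to_regex` and `count_primer_reads` are made up for illustration, and it assumes an uncompressed FASTQ file with exactly four lines per record.

```shell
# Sketch only: helper names are hypothetical, not part of QIIME2 or DADA2.

# Expand IUPAC degenerate codes in a primer into a grep -E pattern,
# e.g. R -> [AG], Y -> [CT]. Plain A/C/G/T are left unchanged.
iupac_to_regex() {
  printf '%s\n' "$1" | tr 'a-z' 'A-Z' | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGTN]/g'
}

# Count reads whose sequence line contains the primer anywhere.
# $1 = primer (IUPAC codes allowed), $2 = path to an uncompressed FASTQ file.
# Assumes 4-line FASTQ records, so the sequence is line 2 of every 4.
count_primer_reads() {
  awk 'NR % 4 == 2' "$2" | grep -c -E "$(iupac_to_regex "$1")"
}

# Example call (placeholder primer, replace with your own):
#   count_primer_reads 'ACGTRYACGT' path-to-your-fastq-file.fastq
```

Comparing the result with the total number of reads (the file's line count divided by 4) gives the fraction of reads that contain the primer. Prefixing the pattern with `^` would restrict the match to the start of each read.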

Your parameters look perfect. These quality profiles look very good (and it is normal for the reverse reads to have slightly worse quality and for that little blip to appear at the start of the sequences). You can check out the DADA2 documentation for a little more detail on trimming decisions, but in a nutshell you already grasp the point: truncate the sequences where quality starts to drop off substantially (I usually look out for quality score = 20 as a rule of thumb, so you may even be able to truncate around 280 in your forward sequences), and if you have a little "blip" at the start of the sequence you can trim that too (your "blip" looks practically non-existent compared to some; you could probably just leave it in and see what happens).
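
If you would rather get a number than eyeball the plot, the quality = 20 rule of thumb can be roughly approximated on the raw file. This is a minimal sketch under several assumptions: the function name `first_low_quality_position` is made up, the FASTQ file is uncompressed with four lines per record, and quality scores use Phred+33 encoding. It uses the per-position mean, which is a simplification and will not match the plots exactly.

```shell
# Sketch only: hypothetical helper, not part of QIIME2 or DADA2.
# Prints the first read position whose mean Phred score is below a threshold,
# or "none" if no position falls below it.
# $1 = path to an uncompressed FASTQ file (Phred+33 assumed)
# $2 = quality threshold (e.g. 20)
# $3 = first position to consider (optional, default 1; use it to skip an initial "blip")
first_low_quality_position() {
  awk -v threshold="$2" -v start="${3:-1}" '
    # Build a character -> ASCII code lookup table.
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    # Every 4th line of a FASTQ record is the quality string.
    NR % 4 == 0 {
      len = length($0)
      if (len > maxlen) maxlen = len
      for (i = 1; i <= len; i++) {
        sum[i] += ord[substr($0, i, 1)] - 33
        n[i]++
      }
    }
    END {
      for (i = start + 0; i <= maxlen; i++) {
        if (sum[i] / n[i] < threshold + 0) { print i; exit }
      }
      print "none"
    }
  ' "$1"
}

# Example call, skipping the first 10 positions:
#   first_low_quality_position path-to-your-fastq-file.fastq 20 11
```

Running it separately on the forward and reverse files gives a starting point for the two truncation values, which is still worth checking against the quality plots.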

I hope that helps! Good luck!

fstudart · October 27, 2017, 5:33pm

Hi Everyone,

Considering the quality control plots above, I'm curious what the best --p-trunc-len-r value would be. I know from the tutorial that we do not have to use the same value for the forward and reverse reads. But does it matter, when merging the reads, if the two values are very different (for example, 280 for the forward reads and 240 for the reverse reads)?

Thanks very much,
Fernando Studart

ebolyen · October 27, 2017, 5:57pm

Hey @fstudart,

That's a great question. The merging is really only looking for 20 nucleotides of overlap, so as long as your truncation parameters on the forward and reverse reads still leave that much overlap (and leave enough room for biological variation in amplicon length) there shouldn't be a problem.
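
As a quick sanity check, the expected overlap can be estimated from the amplicon length and the two truncation lengths. This is a back-of-the-envelope sketch: `expected_overlap` is a made-up helper, and the amplicon length of 460 in the example is a hypothetical value that you would replace with the length of your own amplicon as sequenced (including primers if they are still in the reads).

```shell
# Sketch only: hypothetical helper for a rough overlap estimate.
# $1 = amplicon length in bp, as sequenced (assumed value, depends on your primers)
# $2 = forward truncation length (--p-trunc-len-f)
# $3 = reverse truncation length (--p-trunc-len-r)
expected_overlap() {
  echo $(( $2 + $3 - $1 ))
}

# Example calls with a hypothetical 460 bp amplicon:
#   expected_overlap 460 260 240
#   expected_overlap 460 280 240
```

With the hypothetical 460 bp amplicon, truncating at 260 and 240 leaves 40 nucleotides of overlap, and 280 and 240 leaves 60, both above the 20 mentioned above. The 5' trimming values (--p-trim-left-f and --p-trim-left-r) do not enter the calculation, because the reads overlap at their 3' ends.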

fstudart · October 27, 2017, 9:55pm

Hi, thanks very much for the explanation.

FS