Difference in sequence length: is it a problem?

Hello guys, hope I am at the right category.
I am new to metagenomic analysis (and qiime) and I am trying to understand the whole quality scores/trimming/truncating/filtering procedures.

This is a print of my interactive quality plots:

There is a 100 nts difference between forward and reverse reads. Is it a problem? Should I trim those extra 100 nts from the forward reads?
I believe I have good quality scores, except for the last position of the forward reads, so should I only trim that last one?

Also, I know I should trim out my primers, but how can I check if there are primers to trim and their length?

I am sorry for the rudimentary questions. I have searched for the answers in the forum and a lot of my questions were answered, but now I got to this point where I am very confused and tangled in all those concepts and procedures.

I would highly appreciate your patience and help.

Hi,

The read quality looks good. It's not common that the reverse reads are much shorter than the forward reads. Do you know what happened during the sequencing? Did the sequencing stop earlier than it should have been?

Based on your quality plot, if you trim your primers and truncate the last 10 nt of both reads you should end up with 430-440 nt after merging forward and reverse reads. If the expected amplicon size of your study + 20 nt (the recommended minimum overlapping length) is less than 430-440, you should be fine. If not, you can discard the reverse reads and use the forward reads only at the cost of lower taxonomic resolution.

No. You just need to trim the last few nucleotides, say 10 nt.

You can ask your sequencing service provider for the primer sequences. To check if the primer sequence is present in your reads, which I suspect is, you can check this forum post:

1 Like

Hi @yanxianl Thank you for the response!

The sequencing was done by a sequencing company that used to do our analyses, but we no longer work with them. I only have the fastq files they provided us and no information about the sequencing procedure itself.

How is the expected amplicon size defined? Sorry, I am just learning about all this now, since I was left with raw data to analyse and there are lots of details coming up that I do not know about.

Is there a way to do this without knowing the primer sequences? Which kinds of problems could my analyses have if the primers were not trimmed?

The amplicon size depends on the primer set used in the studies. The expected amplicon size can be calculated by running a digital PCR (see this forum post). For example, if you're using V4 primers the expected amplicon size should be ~252 nt; if using V3-4, the amplicon size could be ~450 nt.

The easiest way would be asking the sequencing company for it. They shall have good answers. Alternatively, you can grab all the commonly used universal primers for the 16S rRNA gene and blast them against your raw fastq files following the method I linked in the previous post.

The DADA2 won't work properly if primers are not removed from your reads, e.g., losing the majority of reads during the chimera removal. See the DADA2 tutorial for more details.

3 Likes

I found the primer sequences and ran the command from the post you suggested. This is what I got:

(qiime2-2022.2) [email protected]:~/bovine-reads$ grep 'CCTACGGGRSGCAGCAG' 210511155290-1-2-1_S366_L001_R1_001.fastq.gz | wc -l
0

Does this "0" mean the primer is no longer in the sequence? Could it be a problem due to the file being zip?

So I should use --p-trim-left-f and --p-trim-left-r for this, right? The tutorial “Atacama soil microbiome” tutorial — QIIME 2 2022.2.0 documentation says they are not going to trim the ends of the sequences and uses --p-trunc-len- 150 (which is the full length of the reads). But I have seen on the forum researchers who were not truncating either using --p-trunc-len- 0. Should the integer refer to the number of nts being trimmed, like this:

--p-trim-left-f 10
--p-trim-left-r 10

or should it refer to the position where the cut is supposed to be made:

--p-trim-left-f 295
--p-trim-left-r 195

You need to unzip the fastq file. If the primer sequence, 'CCTACGGGRSGCAGCAG', is for the forward reads, you can try running:

gunzip 210511155290-1-2-1_S366_L001_R1_001.fastq.gz
head -n 20 210511155290-1-2-1_S366_L001_R1_001.fastq | grep -x '[CCTACGGGRSGCAGCAG]+'

If the primer has not been trimmed from your raw data, you shall see sequence printed on the terminal.

No. You should use --p-trunc-len-f/--p-trunc-len-r to trim sequences at the tail, or the 3' end. For example, the following code trims your forward reads at position 295 and reverse reads at position 195:

--p-trunc-len-f 295
--p-trunc-len-r 195

You can use the read length here if you don't want to trim the 3' end sequence.

Use --p-trim-left-f/--p-trim-left-r when you want to trim your primer sequence at the head, or the 5' end. For example, the following code trims the first 17 and 21 nt of your forward and reverse reads respectively:

--p-trim-len-f 17
--p-trim-len-r 21

2 Likes

Do I need to unzip the fastq files to create the qiime artifacts as well and to do all the analyses? Or just to look for the primers?

No. You can run your analysis using zipped fastq files. This is just for checking the primers.

1 Like
(qiime2-2022.2) [email protected]:~/bovine-reads$ gunzip 210511155292-1-2-1_S160_L001_R1_001.fastq.gz head -n
 20 210511155292-1-2-1_S160_L001_R1_001.fastq | grep -x '[CCTACGGGRSGCAGCAG]+'
gzip: head.gz: No such file or directory
gzip: 20.gz: No such file or directory
gzip: 210511155292-1-2-1_S160_L001_R1_001.fastq: unknown suffix -- ignored

I tried typing the code with \ and in different lines, but it didn't work either. What am I doing wrong?

I also tried to unzip the file manually and then run the code:

(qiime2-2022.2) [email protected]:~/bovine-reads$ grep 'CCTACGGGRSGCAGCAG' 210511155290-1-2-1_S366_L001_R1_001.fastq | wc -l
0

And I got the same "0"

If at all possible I think this is the thing to do. They may have already been removed and you just don't know it yet, which saves you some work if it is the case :slightly_smiling_face:. They may also have documentation available on their website. Even if you do not use them currently it is worth trying to contact them because without you having additional documentation, they are the only ones who will know exactly what was done to your data.

Have you asked around your lab? It is possible someone has this information recorded in their lab notebook, this is a bit of long shot but worth trying.

While this may seem like a giant pain, it is worth doing, especially if you are going to use q2-dada2 to denoise your data as it will not work if you do not remove all of the non-biological material from your sequences :disappointed_relieved: Manually looking for standard primers is a good backup plan, but is a pain and definitely not as reliable as just finding out from the sequencing center the exact details of what is and is not included in your data.

2 Likes

Hello @Keegan-Evans , thank you for your help!

I have just found documents the company sent along with the fastqs in which they detail Sentinel Methods they used. It says they used Sentinel pipeline to assess Phred quality scores of the fastq files and then they trimmed the primers and low quality sequences (Phred < 20). Clusters with abundance < 2 were removed from the analyses due to this parameter being related to chimera sequences.

I am not familiar with the procedure, should I consider they sent me the fastq files after trimming or is it possible they sent me the original fastqs with the primers in it, even if they sent this method documentation? Is it costumary to just explain the method used for the analyses but sent the fastqs without trimming?

Considering the good quality of the reads, I am inclined to believe they sent me the fastqs after trimming.

@microbiotaphyto, it sounds like they are already removed! If 95% of your sequences get removed during denoising due to chimera detection, we may want to revisit the subject but I think you are good to move forward at this point :grinning:

Here are a few videos from a recent workshop that discuss denoising and appropriate trimming and truncation: Lecture and Tutorial.

2 Likes

Hello

I got in touch with the sequencing company and they told me they sent me the raw files, no trimming or truncation done.

I will check out the links you have suggested. Thank you very much!

1 Like

@microbiotaphyto,

I am glad they responded to you! It is really hard to know what exactly has been done to your data without the sequencing center telling you. Why more centers do not include this information with the data is very confusing to me but it seems to be the case :disappointed_relieved: