Losing a LOT of representative sequences after denoising

bkramer · November 21, 2021, 7:35pm

Hello!

I recently got 16S data from Mr DNA which I'm trying to use in Q2 (version 2021.4) for metagenomic analysis. I was able to successfully import my data into Q2 and removed the updated 16S primers (515F/806R) using the following command line:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences /home/qiime2/Desktop/16SPostImported_111921/demux-paired-end.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --o-trimmed-sequences /home/qiime2/Desktop/16SPostImported_111921/reads-cutadapt.qza \
  --verbose > cutadapt_log.txt

The resulting reads-cutadapt.qza file (see qzv version attached) doesn't look bad. The median quality score for the forward reads never went below 20 at any position, though position 250 of the reverse reads had a median quality score of 11, so I used the following command for denoising:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /home/qiime2/Desktop/16SPostImported_111921/reads-cutadapt.qza \
  --p-trunc-len-f 251 \
  --p-trunc-len-r 249 \
  --o-table /home/qiime2/Desktop/16SPostImported_111921/table.qza \
  --o-representative-sequences /home/qiime2/Desktop/16SPostImported_111921/rep-seqs_112021.qza \
  --o-denoising-stats /home/qiime2/Desktop/16SPostImported_111921/denoising-stats.qza

However, the resulting table.qza (see .qzv version attached) states that only 56 representative sequences passed the filters, meaning that all of the sequences from a large portion of my samples did not pass the filters.

Am I missing something? Is the truncation command not correct? Could I have used the wrong primer inputs? Any help would be greatly appreciated.

Thanks!

Ben

16S_reads-cutadapt.qzv (319.8 KB) 16S_table.qzv (411.6 KB)

timanix · November 22, 2021, 8:12am

Hello!

It is quite likely that you lost most of the reads because of the truncation parameters you set: all reads shorter than those values were filtered out. Since you are working with the 515F/806R primers, you can safely truncate more (around 200-220 should work fine, for example).

bkramer · November 22, 2021, 1:44pm

Thank you! I had actually heard that if the median quality of the bases is <20, then that's a good place to truncate.

However, most of the bases for either forward or reverse reads are well above 20, even at the ends of the reads, so I'm not sure where to truncate now...

timanix · November 22, 2021, 2:19pm

You are right, but your current issue is the loss of reads during the filtering step, and most probably you are facing it because you set DADA2 to truncate the forward and reverse reads at positions 251 and 249, while the plugin description states:

Reads that are shorter than this value will be discarded.

So the simple reason why you lost a lot of reads is that most of them are shorter than those values.

In addition, your amplicons are quite short since they were amplified with the 515F/806R primers, so even if you provide 200-220 (as an example) as the truncation value, it will not affect the length of the merged pairs, but it will improve the overall output by:

removing low-quality bases at the ends (do not worry about truncating bases with good quality: they will be in the overlapping region anyway, and you will still have enough bases to overlap)
allowing more reads to pass the filter, since most of your reads are longer than that value.

So I would suggest trying it like this for both forward and reverse reads to check whether it solves your issue. Please let us know if the issue remains.
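A rerun along these lines could be assembled as in the minimal sketch below. It only builds and prints the command; it does not run QIIME 2. The flags are the ones already used in the original post, the truncation values 220 and 200 are just one choice from the suggested 200-220 range, and the `rerun` output directory and the output file names are hypothetical.

```python
import shlex


def denoise_paired_command(input_qza, out_dir, trunc_len_f, trunc_len_r):
    """Build the argument list for `qiime dada2 denoise-paired`.

    Only flags that appear in the original post are used.
    """
    return [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", input_qza,
        "--p-trunc-len-f", str(trunc_len_f),
        "--p-trunc-len-r", str(trunc_len_r),
        "--o-table", f"{out_dir}/table.qza",
        "--o-representative-sequences", f"{out_dir}/rep-seqs.qza",
        "--o-denoising-stats", f"{out_dir}/denoising-stats.qza",
    ]


base = "/home/qiime2/Desktop/16SPostImported_111921"
cmd = denoise_paired_command(
    input_qza=f"{base}/reads-cutadapt.qza",
    out_dir=f"{base}/rerun",  # hypothetical directory, so earlier outputs are kept
    trunc_len_f=220,          # example value from the suggested range
    trunc_len_r=200,          # example value from the suggested range
)

# Print a copy-pasteable shell command
print(shlex.join(cmd))
```

Writing to a separate directory keeps the first run's outputs available for comparing how many reads pass the filter under each setting.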

bkramer · November 23, 2021, 2:22pm

Thanks again for the quick response!

I guess I always thought of "truncate" as meaning that the sequences kept are those that are less than or equal to the length I select...that's what I was originally told at least. I've had some difficulty understanding the difference between trim and truncate, to be honest. When the median quality at a given position is <20, I truncate to the position immediately preceding it that has a quality >20...is this wrong? And if so, what is a good metric for determining where exactly to truncate?

Regardless, I will truncate to 200-220 and let you know what happened. Is there a reason why you chose that range for my particular set of reads so that way I know what to do moving forward?

Thanks again!

timanix · November 23, 2021, 3:56pm

I can understand your confusion because I was confused about it as well at some point. Here is a quote from the plugin description:

--p-trunc-len-f INTEGER
                         Position at which forward read sequences should be
                         truncated due to decrease in quality. This truncates
                         the 3' end of the of the input sequences, which will
                         be the bases that were sequenced in the last cycles.
                         Reads that are shorter than this value will be
                         discarded. After this parameter is applied there must
                         still be at least a 12 nucleotide overlap between the
                         forward and reverse reads. If 0 is provided, no
                         truncation or length filtering will be performed
                                                                    [required]

So, it will discard any sequence that is shorter than the truncation value and truncate longer sequences.
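The rule can be illustrated with a small Python sketch. This is not DADA2's code, only a toy model of the behaviour described in the help text, and the read lengths are made up for illustration.

```python
from typing import Optional


def truncate_read(seq: str, trunc_len: int) -> Optional[str]:
    """Toy model of the truncation rule from the plugin description.

    Returns None when the read would be discarded.
    """
    if trunc_len == 0:
        return seq            # 0 disables truncation and length filtering
    if len(seq) < trunc_len:
        return None           # shorter than the truncation value: discarded
    return seq[:trunc_len]    # longer or equal: cut at the 3' end


# Hypothetical read lengths, for illustration only
reads = ["A" * n for n in (232, 231, 251, 228, 230)]

kept_251 = [r for r in (truncate_read(s, 251) for s in reads) if r is not None]
kept_220 = [r for r in (truncate_read(s, 220) for s in reads) if r is not None]

print(len(kept_251), len(kept_220))  # prints: 1 5
```

With a truncation value of 251 only the single 251-base read survives, while a value of 220 keeps all five reads, each cut to 220 bases.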

I also had difficulty with the difference between trim and truncate!

--p-trim-left-f INTEGER
                         Position at which forward read sequences should be
                         trimmed due to low quality. This trims the 5' end of
                         the input sequences, which will be the bases that
                         were sequenced in the first cycles.

Basically, we are trimming at the beginning of the sequence and truncating at the end!
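As a rough sketch of the difference, the two parameters can be pictured as a slice on each read. The function name is hypothetical, and the assumption that truncation positions are counted on the original read (so the final length is the truncation value minus the trim value) follows my understanding of DADA2's filtering rather than anything stated in this thread.

```python
from typing import Optional


def trim_and_truncate(seq: str, trim_left: int, trunc_len: int) -> Optional[str]:
    """Toy model: trim removes bases at the 5' end, truncate cuts the 3' end.

    Assumes trunc_len is counted on the original read. Returns None when
    the read is shorter than trunc_len and would be discarded.
    """
    if trunc_len and len(seq) < trunc_len:
        return None
    end = trunc_len if trunc_len else len(seq)
    return seq[trim_left:end]


read = "AAACCCGGGTTT"  # made-up 12-base read
print(trim_and_truncate(read, trim_left=3, trunc_len=9))  # prints: CCCGGG
```

The first three bases (5' end) are trimmed and everything after position 9 (3' end) is truncated.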

Sorry, but I had some difficulty understanding these questions completely. I will try my best to clarify; please feel free to ask again if my answer is not the one you are looking for.

First of all, you need to take a look at the quality plots of the demultiplexed sequences (preferably with the primers already removed). The goal is to choose a truncation value that removes the low-quality bases at the ends of the reads while keeping the reads long enough to overlap for merging. Based on your plots, I would, for example, truncate forward sequences at position 227 and reverse sequences at position 217 or 230. In your case, you used higher values (--p-trunc-len-f 251 and --p-trunc-len-r 249).

So you not only failed to remove the low-quality bases, but also discarded all reads shorter than 251 (forward) or 249 (reverse). In fact, only pairs in which the forward read is at least 251 bases long and the reverse read is at least 249 bases long were retained. That is why I recommended setting lower truncation values.

Another concern is the overlapping region. You do not want to truncate paired reads too much, since that can shrink the overlapping region and pairs may fail to merge. But since you used the V4 region for your 16S library, your amplicons should be relatively short and you can truncate more bases at the ends. That is why it is safe to truncate forward reads at 200-220 and reverse reads at 200-210 without thinking about it too much, but you can try other values as well just to compare.
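A quick way to sanity-check candidate truncation values is to estimate the remaining overlap. The sketch below assumes a V4 amplicon of about 253 bases after primer removal, which is a commonly cited figure and not something measured from this dataset; the 12-base minimum comes from the plugin description quoted earlier, and the function name is hypothetical.

```python
MIN_OVERLAP = 12    # minimum overlap stated in the plugin description
AMPLICON_LEN = 253  # assumption: typical V4 amplicon length after primer removal


def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Rough estimate of how many bases the truncated reads share."""
    return trunc_len_f + trunc_len_r - amplicon_len


# Two pairs from the suggested ranges, plus one deliberately too aggressive
for f, r in [(220, 200), (200, 200), (150, 110)]:
    overlap = expected_overlap(f, r, AMPLICON_LEN)
    status = "ok" if overlap >= MIN_OVERLAP else "too short to merge"
    print(f"forward={f} reverse={r} overlap={overlap} ({status})")
```

Under that assumption, both pairs from the suggested ranges leave well over 100 bases of overlap, while truncating to 150 and 110 leaves only 7 and would fall below the minimum.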

I hope it works!
