I am working on amplicon sequencing data of some markers, including 18s (with 1391F and EukBrR primers), using qiime2-2022.8. One of the problem I found is readthrough, in which target lengths are less than the read length, which is similar to ITS. This lead to the problem that there might be (reverse complements of) primers and adapter (Nextera) left in the reads.
Similar to this topic, I have tried to first trim the reverse complements of primers (trim reverse complement of reverse primer for forward reads and vice versa), then trim the primers and filtered out those without primers. Here are the commands and the quality plots of trimmed sequences.
However, if I only use the plot to determine DADA2 parameters (--p-trunc-len-f,--p-trunc-len-r), most of the reads will not pass the filters (too shorts). Therefore I am thinking of using read lengths for DADA2, in this case for example 2% 93 nts.
My questions are
Is it okay to use read length statistics to determine DADA2 parameters?
I have tried with different parameters and the best seems to be --p-trunc-len-f 93, --p-trunc-len-r 93. However, the percentages of non-chimeric results are still low ( 20.5-86.3%). Most of the reads were discarded because they cannot merge (20.6-87.1 % merged)
I wonder if it is possible to trim only the primers, using DADA2 to merge the reads, then trim the reverse primers that are found towards the end of the representative sequences. Is there any tools that can be used?
Another alternative would be joining the reads first, then denoise using Deblur (deblur denoise-other). I am wondering if it makes sense to use silva-138-99-seqs.qza for --i-reference-seqs.
I also have to analyse other markers and there seems to be some readthrough as well, although to a much lesser extent. I would like to know if in the first step I should trim the reverse complements of the primers, or trimming just Nextera adapter (CTGTCTCTTATACACATCT) would be fine.
Thank you for your advices in advance. Please let me know if you have any questions.
According to the 18S rRNA EMP protocol, you should expect to see ~260 bp amplicons. Setting your truncation lengths to 93 bp is too short, and explains why many of your reads are not merging. You should be able to truncate out to ~150-200 bp in each direction, to ensure overlap of the paired-end reads, almost like you would a 16S rRNA V4 region.
Your trimming approach looks sane to me, I'd just increase your truncation lengths and re-run on your cutadapt output. Also, do not worry too much if some these sequences "sneak through" this part of the pipeline. DADA2 would likely detect some of these and discard them as "chimeras". Also, post DADA2, you can use the quality-control plugin like so:
# Remove poor quality sequence that do not match SILVA.
qiime quality-control exclude-seqs \
--i-query-sequences repseqs.qza \
--i-reference-sequences silva-138-99-seqs.qza \
--p-method vsearch \
--p-perc-identity 0.90 \
--p-perc-query-aligned 0.90 \
--o-sequence-hits hits.qza \
# Filter DADA2 table to match "hits" (sequences/features we want to keep):
qiime feature-table filter-features \
--i-table table.qza \
--m-metadata-file hits.qza \
Thank you @SoilRotifer very much for your recommendations. I did try increasing the -p-trunc-len-f and -p-trunc-len-r to be 190 and 162 respectively. However, the percentage of non-chimeric seems to be lower than when I use the truncation length of 93 in most of the samples (3.9-52.1 % as compared to 20.5-86.3%). A lots of sequences were not pass the filter as they were too short. Here are the stats when using truncation length of 93-93 and and 190,162: 18s-stats-93-93.qzv (1.2 MB) 18s-stats-190-162.qzv (1.2 MB)
The lengths of representatives sequences when using truncation of 93,93 are shown below.
Would you say that be cause of the representative sequence lengths, it is more reasonable to select the 190-162 one although the percentages of non-chimeric decreases?
This is the quality plots after trimming in case this is useful. 18s-trimmed-all.qzv (327.8 KB)
For the expected length of amplicon, I was wondering if the length of 260 bp includes forward and reverse primers (40 bp)? If not then, as both primers have been trimmed, the expected length of the amplicon would be around 210 bp. Is my understanding correct?
Regarding the quality-control plugin, I planned to use BLAST for taxonomic assignment and filter out those without assignments. Would you recommend using the quality-control plugin instead?
Yes. Typically the amplicon length includes the primer sequence. So, I would also surmise that ~210 bp would be a good target. The extra read-through is likely causing mismatches with the merging. You can likely set the forward truncation anywhere from 120-140bp, and the reverse truncation from 80-110 bp. As long as you have ~12 bp overlap between the forward and reverse reads, you should be okay.
The read through mismatching is less of an issue when you use vsearch to merge. If I remember correctly, vsearch will auto-trim the overhanging mismatching parts, within reason. From here you'd use deblur. Give it a try and see if it works any better.
I'd recommend the quality-control for filtering. But you can certainly filter based on taxonomy too. I prefer the naïve-bayes approach for classifying reads, but any taxonomy method should work.
I've just notice that I miscalculated the target length without primers. It should be ~ 260-40 = 220 bp (+/- 50 bp according to 18s EMP protocol). However, the concept still is the same. I did try using different paramerers for truncation and it seems to give overall lower percentages of non-chimeric for some reasons. Here is an example: 18s-stats-116-116.qzv (1.2 MB)
I'm not sure if I understand this correctly, but the primers at the 3' end should be removed by now and thus not resulting in overhanging. I think the problem for my data is that the reads after trimming are really shorts. Could you please elaborate on this or did you mean that I only trim the primers at the 5' end of forward and reverses reads, then merge using vsearch?
I did try merging with vsearch and deblur with the current trimmed data though but I'm still figuring out the stats to compare with DADA2 results.
Here is the deblur stats qzv. 18s-stats-3-170.qzv (260.5 KB)
For deblur I only trimmed the 5' end of the reads and used the Silva 138 SSURef NR99 full-length sequences as a reference. It seems that a lot of sequences pass the trimming length of 170 (low fraction-artifact-with-minsize) but fail to align with the reference (high fraction-missed-reference). For this reason, I'm not sure if this would be the appropriate approach.
For now, I consider using DADA2 for this data. I trimmed the reads from both ends and used DADA2 with no truncation (--p-trunc-len-f,--p-trunc-len-r =0). Then, as filtering based on length doesn't seems to be supported yet, I filtered out the sequences that might be too short (100 bp) outside of qiime2. A lot of sequences still have no taxonomic assignments but I think the result is better than that from deblur.
What do the results look like if you only processed the R1 (forward) reads through dada2 denoise-single? It might be worth checking how these compare to your surviving merged reads. Both in terms of how much sequence data you can retain, as well as the taxonomic distribution. There have been cases in which I was only able to use the forward reads, and my reverse reads were really poor.