I encountered a problem while processing 803f-1392r amplicon sequences using QIIME 2. The full length of the 803f-1392r amplicon is 590 bp, but the sequencing was done using PE300, with 301 bp per read. After removing the primers and adapters, the forward reads are 276 bp and the reverse reads are 280 bp. Even if neither the forward nor the reverse reads are truncated, together they cover only 556 bp, so they cannot overlap. Therefore, it is possible to denoise, but merging is not feasible.
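The arithmetic behind that conclusion can be checked with a short shell sketch. It uses the lengths quoted above and assumes that the 590 bp figure describes the same primer-trimmed region as the reads. The 12 bp minimum overlap is an assumed threshold for illustration, so check it against your own DADA2 settings.

```shell
# Lengths in bp, taken from the post above.
amplicon_len=590
fwd_len=276
rev_len=280
min_overlap=12   # assumed threshold, not taken from the post

# Positive value = overlapping bases; negative value = uncovered gap.
overlap=$(( fwd_len + rev_len - amplicon_len ))

if [ "$overlap" -ge "$min_overlap" ]; then
  verdict="overlap of ${overlap} bp: merging is feasible"
elif [ "$overlap" -ge 0 ]; then
  verdict="overlap of ${overlap} bp is below the ${min_overlap} bp minimum"
else
  verdict="gap of $(( 0 - overlap )) bp: merging is not feasible"
fi
echo "$verdict"
```

With these numbers the reads fall short of each other, so paired-end merging has nothing to join on.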
These sequences are supplemental samples. My previous samples were amplified using the 308f-806r primers, and after denoising, the forward and reverse reads were merged.
I am concerned that if the 803f-1392r primer sequences are processed as single ends for ASV generation, there would be significant information loss. Therefore, I would like to ask if it is possible to generate an ASV file using two non-merged single-end sequences (forward reads and reverse reads).
In this case, is it possible to cluster and generate OTUs? If I use 99% clustering to generate OTUs, will there be a significant difference in species annotation results compared to those generated by denoising to produce ASVs?
I don't think I can confidently answer whether OTU clustering will generate significantly different taxonomic assignments. What I do know is that OTU clustering is a cruder, older method that has fallen out of favor as more sophisticated approaches like DADA2 have taken over. What you can still do is analyze the forward and reverse reads separately with DADA2 (importing both as single-end sequences), and then merge the resulting feature tables. You could also use the same approach upstream of OTU clustering.
As you suggested, I have denoised the forward and reverse sequences separately.
time qiime dada2 denoise-single \
  --i-demultiplexed-seqs forward-trim-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 276 \
  --o-representative-sequences dada2-forward-rep-seqs.qza \
  --o-table dada2-forward-table.qza \
  --o-denoising-stats dada2-forward-denoising-stats.qza \
  --p-n-threads 8 \
  --verbose
Saved FeatureTable[Frequency] to: dada2-forward-table.qza
Saved FeatureData[Sequence] to: dada2-forward-rep-seqs.qza
Saved SampleData[DADA2Stats] to: dada2-forward-denoising-stats.qza
time qiime dada2 denoise-single \
  --i-demultiplexed-seqs reverse-trim-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 280 \
  --o-representative-sequences dada2-reverse-rep-seqs.qza \
  --o-table dada2-reverse-table.qza \
  --o-denoising-stats dada2-reverse-denoising-stats.qza \
  --p-n-threads 8 \
  --verbose
Saved FeatureTable[Frequency] to: dada2-reverse-table.qza
Saved FeatureData[Sequence] to: dada2-reverse-rep-seqs.qza
Saved SampleData[DADA2Stats] to: dada2-reverse-denoising-stats.qza
I would like to ask how to merge the forward and reverse table.qza, rep-seqs.qza, and denoising-stats.qza files?
And how should I handle inflated or redundant features and representative sequences after merging? Is it possible that denoising the forward and reverse reads separately might lead to the same real biological sequence being represented as two different ASVs, one from the forward read and another from the reverse read? These ASVs would not be automatically recognized as the same ASV during merging but would instead be retained as two separate ASVs in the feature table. This redundancy could lead to an overestimation of the number of ASVs in subsequent analyses and make the data needlessly complex.
Do I need to perform denoising again or clustering on the merged files?
After discussing with some other QIIME 2 team members, I realized that I've steered you in the wrong direction. Denoising each read direction separately and merging the resulting ASV tables is going to cause a lot of interpretation issues. As you pointed out, the first obvious issue is an immediate overestimate of observed features. There is a narrow set of questions that you could answer better with this approach, e.g. "is taxon t present in my data?", but in general it is not a good idea.
Thus I would suggest proceeding with only the forward reads. If you really want to extract all information possible from your data you could run two parallel pipelines with the different read directions, but you shouldn't merge them. Sorry for the confusion.
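As a rough sketch of what "two parallel pipelines" could look like, the function below reuses the file names, truncation lengths, and thread count from the commands earlier in this thread and keeps every output separate per direction. The `denoise_direction` helper and the `DRY_RUN` switch are hypothetical conveniences for illustration and are not part of QIIME 2. By default the sketch only prints each command so it can be reviewed first.

```shell
# Set DRY_RUN=0 to actually run QIIME 2; the default only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# denoise_direction <forward|reverse> <trunc-len>
# Builds one denoise-single call whose inputs and outputs are all
# named after the read direction, so the two analyses never mix.
denoise_direction() {
  direction="$1"
  trunc_len="$2"
  set -- qiime dada2 denoise-single \
    --i-demultiplexed-seqs "${direction}-trim-demux.qza" \
    --p-trim-left 0 \
    --p-trunc-len "${trunc_len}" \
    --o-representative-sequences "dada2-${direction}-rep-seqs.qza" \
    --o-table "dada2-${direction}-table.qza" \
    --o-denoising-stats "dada2-${direction}-denoising-stats.qza" \
    --p-n-threads 8
  if [ "${DRY_RUN}" = "1" ]; then
    echo "$*"    # print the command instead of executing it
  else
    "$@"         # execute the command as built above
  fi
}

denoise_direction forward 276
denoise_direction reverse 280
```

Each direction then continues through its own downstream steps, and the two sets of results are compared side by side instead of being combined into one table.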
I think using forward reads only is probably your easiest answer, and it certainly has precedent elsewhere.
However, you might also be able to use a tool like Sidle, which allows the scaffolding of multiple disjoint amplicons across regions. It's a little more avant-garde than the use of forward reads and has some restrictions, but might be an option.