I'm trying to denoise an amplicon dataset (not 16S or 18S) for downstream analysis and am after some advice / discussion on 'best practice' for dealing with 'other' amplicon datasets. The amplicon itself is a CRISPR array and can be of variable length.
Currently, I've merged my reads, imported them into Qiime2 and done some quality filtering. I now need to denoise the dataset before I look at interesting biological questions (diversity / overrepresented sequences etc).
I believe that Deblur requires sequences to be the same length, so I can't use that as length difference is biologically meaningful. I also believe I can't use DADA2 as I have already merged my sequences. Currently I am simply de-replicating sequences with the vsearch plug-in but would like to make some attempt at removing sequencing error.
Any advice or thoughts would be very welcome, and apologies for the broad scope of this question.
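For readers unfamiliar with the dereplication step mentioned above, here is a minimal Python sketch of the idea. It is not the vsearch implementation, and the reads are made-up toy sequences. It only illustrates why exact-match dereplication preserves length variants but does nothing about sequencing error.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical sequences and count them, most abundant first."""
    counts = Counter(seq.upper() for seq in reads)
    return counts.most_common()


# Hypothetical toy reads, not real data.
reads = [
    "ACGTACGT",
    "ACGTACGT",
    "acgtacgt",    # same sequence, different case
    "ACGTACGTTT",  # a genuine length variant
    "ACGAACGT",    # one-base difference, possibly a sequencing error
]

unique = dereplicate(reads)
for seq, n in unique:
    print(f"{seq}\t{n}")
```

Both the length variant and the one-base variant survive as separate sequences, so any error removal has to come from a different step.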
Ideally you would not merge the sequences and would instead run DADA2 with trunc-len set to 0. However, you might still want to trim the low-quality ends, and there isn't a method in quality-filter, such as a q-score-paired, that can do that (quality-filter is usually used with Deblur, so it only handles single and joined reads).
It depends on how you merged them. If the quality scores "make sense" in the merged locus, then DADA2 should be able to manage it, but it's certainly not ideal. I don't know whether this option is better or worse in practice than not performing any quality truncation.
It occurs to me that ITS probably has this problem as well, so I think quality-filter should probably be adapted to support paired-end reads.
Another bad option would be to look only at the forward reads, which avoids the paired-end problem with quality-filter, but that's no good.
Thanks for the quick replies. Posting my approach in case others find this useful in future. In the end I've gone with no denoising steps and just dereplicated and clustered sequences (at 99% similarity) with vsearch. Prior to importing to Qiime2 I used an error correction program (Spades) and merged reads with Flash; the resulting visualizations of read quality in Qiime2 looked pretty good.
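As a rough illustration of what clustering at 99% does to a variable-length amplicon, here is a toy greedy, abundance-sorted clustering sketch in Python. It is an assumption-laden stand-in and not vsearch's algorithm: identity is approximated here as 1 minus edit distance over the longer length, whereas vsearch computes identity from its own pairwise alignment. The sequences and counts are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def identity(a, b):
    # Assumption: a simple identity definition for illustration only;
    # vsearch's alignment-based identity may give different values.
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def greedy_cluster(abundances, threshold=0.99):
    """Assign each sequence, most abundant first, to the first centroid
    it matches at or above the threshold; otherwise it becomes a centroid."""
    clusters = {}
    for seq, n in sorted(abundances.items(), key=lambda kv: -kv[1]):
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid] += n
                break
        else:
            clusters[seq] = n
    return clusters


# Hypothetical toy sequences, not real CRISPR arrays.
base = "ACGTTGCA" * 13                      # 104 nt
one_error = base[:50] + "A" + base[51:]     # single substitution (G -> A)
longer = base + "GGATCCAT"                  # 8 nt longer, e.g. extra array content

abund = {base: 100, one_error: 2, longer: 40}
clusters = greedy_cluster(abund)
for centroid, total in clusters.items():
    print(len(centroid), total)
```

In this toy case the single-substitution read is absorbed into the abundant centroid while the length variant stays separate. How well that holds on real data depends on amplicon length and on how vsearch scores gaps.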
Using the alpha diversity analysis in Q2, the results match our expectations (comparing our treatment / control groups) and the data from the wet lab, qualitatively at least.
I presume there is a danger of inflating diversity estimates without the denoising steps, but I'm more interested in relative differences than absolute numbers here. Also, I appreciate that I am using Qiime2 for a purpose somewhat outside its scope.
That's probably the case, but you should be fine if the goal is relative difference. OTU picking causes a similar inflation, but it doesn't seem to really impact the conclusions you might draw from diversity in practice.
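To make the point about relative differences concrete, here is a small Python sketch with entirely hypothetical counts. It assumes sequencing error adds the same number of spurious singletons to each sample, which is a simplification and will not hold exactly in real data, particularly when sequencing depth differs between samples.

```python
import math


def shannon(counts):
    """Shannon diversity (natural log) from a list of feature counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def observed(counts):
    """Number of features with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def add_noise(counts, n_singletons):
    # Assumption: errors show up as extra features seen exactly once.
    return counts + [1] * n_singletons


# Hypothetical feature counts, not real data.
control = [50, 30, 20]
treatment = [40, 20, 15, 10, 10, 5]

noisy_control = add_noise(control, 5)
noisy_treatment = add_noise(treatment, 5)

for label, clean, noisy in [
    ("control", control, noisy_control),
    ("treatment", treatment, noisy_treatment),
]:
    print(label, observed(clean), observed(noisy),
          round(shannon(clean), 3), round(shannon(noisy), 3))
```

Both metrics go up in both groups once the spurious singletons are added, but the treatment group still comes out more diverse than the control. That is the sense in which the absolute numbers are inflated while the comparison survives.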