Hi @gmdouglas and @Nicholas_Bokulich - The sequencing depth was not toooo far off among all runs (on average 70-75k before denoising, with one run that is of not the best quality - it has 55k). Just to make sure I am doing everything roughly correctly (in your minds) - mind if I run a few things by you, with a few more questions?
For the single-indexed reads, I receive the data in the "normal" EMP way of 3 files: forward.fastq.gz, reverse.fastq.gz, and barcodes.fastq.gz. I then use the following (for example) to demultiplex the reads.
qiime tools import \
--type EMPPairedEndSequences \
--input-path reads \
This brings me to (new) question 1:
If I am correct, in this process, all non-biological sequences are stripped from the data, correct? (running cutadapt with the primer sequences would suggest as much). However, this yields reads that are on average 297 bp long, which to me suggests that the primers still need to be trimmed (for a 2x300 run), but if I run cutadapt, most reads get thrown out because they don't have the primer.
Regardless, here is what the quality plots look like after importing. This run was a bit overclustered, so the turnaround is not great, as you can see.
visualization.qzv (301.8 KB)
With the dual-indexed runs, they were demultiplexed by IMR (saving me a step; yay!). As I mentioned above, I import them using the following:
qiime tools import \
--type SampleData[PairedEndSequencesWithQuality] \
--input-path demux/ \
--output-path reads_qza/hakai_16S_IMR4sequences.qza \
At this point the reads are on average 300 bp. I then use cutadapt to trim primers, which yields reads that are about ~281-282 bp, on average. Quality plot example here:
hakai_16S_IMR4reads_trimmed_summary.qzv (315.1 KB)
So, then, if I attempt to use the same settings in dada2 for each run, say:
--p-trim-left-f 30 \
--p-trunc-len-f 270 \
--p-trim-left-r 30 \
--p-trunc-len-r 210 \
--p-max-ee-f 3 \
--p-max-ee-r 8 \
I guess I would think that the merged reads will end up with different length reads because the starting reads were different lengths, but I guess this is not the case....thanks to truncating at the same basepair in each.
It does seem like the average lengths are roughly the same at ~312, but they are still shorter than i would expect (411 bp -60 bp from trimming should be 352....unless i am misunderstanding something, which is entirely possible). Are these shorter than they should be?
And, I guess the dual-indexed runs have longer max-lengths because the R2 read quality is generally better so then fewer reads get denoised away?
Anyway - moving on to feature frequencies.
You can see that the single-indexed run has fewer rare ASVs than the dual-indexed run, with a higher frequency for the median.
For the dual-index library example:
So, I suppose this could be the source of some of the discrepancies in ASV# between the two, as per your suggestion, @gmdouglas. Turns out, if I filter out more of the rare reads (say, --p-min-frequency 10 and
--p-min-samples 2), then the different types of libraries start to converge on each other when I do a quick PCoA. PHEW. I had only filtered --p-min-frequency 2 at the start prior to writing this post, which perhaps wasn't enough to get rid of the higher diversity of the rare ASVs in the dual-indexed runs.
This brings me my final question: It is my understanding that with dada2, one should denoise each run separately but using the same settings, and then merge. Or, is it okay to use different settings for different runs, so long as they yield the same length merged reads? I ask because one of my runs was not so great, and in order to merge the reads, i need to do a bit more trimming and truncating than the other 4 runs. Is this okay? Or do I need to resequence this amplicon pool (with the hopes of a more successful run?)?