In the "Demultiplexed sequence counts summary" and "Demultiplexed sequence length summary" of the .qzv file produced by the demux step, why are the forward and reverse read summaries identical? (I'm using EMPPairedEndSequences.)
I noticed this in the Atacama Soils Tutorial as well (in the .qzv file here), so I assumed this was not an error, but I'm having trouble understanding why this is the case.
I may be misunderstanding your question, @Dot, but I'll give it my best shot. If anything here doesn't make sense, please let me know!
The sequence counts summary:
Here you have counts of the number of forward and reverse reads per sample. At least for the paired-end sequences I've worked with, identical counts are what you want to see. You likely have a problem if you have forward reads without matching reverse reads, because that would indicate your "paired-end" reads aren't all paired.
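If you ever want to sanity-check this outside of the visualization, here is a rough sketch of the comparison. It assumes plain FASTQ with four lines per record, and the two in-memory "files" are made-up stand-ins for one sample's forward and reverse reads, not anything taken from your data:

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming 4 lines per record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4


# Hypothetical stand-ins for one sample's forward and reverse FASTQ files.
forward = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n")
reverse = io.StringIO("@r1/2\nACGT\n+\nIIII\n@r2/2\nTCAA\n+\nIIII\n")

n_forward = count_fastq_records(forward)
n_reverse = count_fastq_records(reverse)

# For properly paired data the two counts should match.
pairs_ok = n_forward == n_reverse
print(n_forward, n_reverse, pairs_ok)  # prints: 2 2 True
```

For real data you would open the (possibly gzipped) files instead of the StringIO objects; a mismatch between the two counts is the situation described above.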
The sequence length summary - some background and an educated guess:
Some amplicons are very consistent in length. The 16S 515-806 primers, for example, usually produce amplicons that vary by at most a few nucleotides in length. Others (ITS, for example) are highly variable in length. The Atacama Tutorial data set is a small subset taken from relatively low-biomass samples - as such, it may be less likely to show variable length. Depending on your amplicon and samples, there could be something similar going on with your data.
Quality plots:
I don't think quality scores are coupled to number of reads or read length in this context. Barring unusual circumstances, I wouldn't expect them to be correlated.
For the sequence counts summary:
I see, thanks for clarifying. I think I just wasn't quite understanding how paired-end sequencing worked (but I think I know what's going on now)! So if I'm thinking about the total number of sequences, e.g., if "sample A" has 100,000 forward sequence reads and 100,000 reverse sequence reads, would it be correct to say it has a total of 100,000 paired-end reads? Or would it be 200,000 total paired-end reads?
For the sequence length summary:
That makes sense. We are using the 16S 515F-806R primers. I think what is surprising to me is that there isn't at least some variation, even by a few nucleotides. (For all of our sequencing runs so far, using the exact same primers, I have yet to see any variation in this.)
I'm probably not the best person to ask on this one, @Dot! I spend most of my time with code rather than data. I'd probably say "Sample A has 100,000 reads" because I'm lazy, but you could probably say "100k read pairs" if you needed to be specific. I haven't thought too much about this, probably because the interesting part of the analysis - the part I'm more likely to talk about with others - generally happens after the forward and reverse reads have been joined. At that point you really do have 100k (ish) reads, and it doesn't matter as much what shape they were in when you sequenced them.
Re: sequence length - it feels kinda crazy at first, but I've often seen the same lack of variability. Remember that a seven-number summary isn't telling you there is no variability - just that most of your many reads are the same length. For the data you screenshotted, fewer than 2% of all sequences are shorter than 251 nts, and fewer than 2% are longer. You could still have up to 294894 reads shorter than 251 nts, and the same for longer.
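To make that concrete, here is a small sketch with invented numbers. It assumes the seven-number summary reports the 2nd, 9th, 25th, 50th, 75th, 91st, and 98th percentiles, and it uses a simple nearest-rank percentile, which may not be exactly what the visualization does internally:

```python
import math

# Assumed percentiles for the seven-number summary.
FRACTIONS = (0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98)


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted list."""
    index = max(0, math.ceil(fraction * len(sorted_values)) - 1)
    return sorted_values[index]


def seven_number_summary(values):
    ordered = sorted(values)
    return {fraction: percentile(ordered, fraction) for fraction in FRACTIONS}


# Hypothetical read lengths: 10,000 reads, 150 of them shorter than 251 nts.
lengths = [240] * 50 + [250] * 100 + [251] * 9850

summary = seven_number_summary(lengths)
n_short = sum(1 for length in lengths if length < 251)

print(summary)  # every one of the seven values is 251
print(n_short)  # prints: 150
```

All seven numbers come out as 251 even though 150 reads are shorter, because those 150 reads make up less than 2% of the total and so never reach the lowest percentile reported.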
Side note - this visualization is actually subsampling your data for the length summary (in your case to 10k f/r reads). Technically, fewer than 200 of the 10k subsampled reads are shorter than 251 nts.
This is identical in concept to my explanation above, but does introduce a little potential for variability - if the subsampling were different when you re-ran this viz, your results could vary a little.
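Here is a toy illustration of that subsampling effect. The population below is invented (200,000 reads, 1.5% of them shorter than 251 nts), and drawing with Python's random module is only a stand-in for whatever the visualization actually does when it subsamples:

```python
import random

# Hypothetical population: 200,000 reads, 3,000 of them shorter than 251 nts.
population = [250] * 3_000 + [251] * 197_000


def short_reads_in_subsample(lengths, n, seed):
    """Draw n reads without replacement and count those shorter than 251 nts."""
    rng = random.Random(seed)
    subsample = rng.sample(lengths, n)
    return sum(1 for length in subsample if length < 251)


# Three different random subsamples of 10,000 reads each.
counts = [short_reads_in_subsample(population, 10_000, seed) for seed in (1, 2, 3)]
print(counts)
```

Each count should land somewhere near 150 but will generally differ a little from seed to seed. Since all of them stay below the 200 reads that make up 2% of 10k, the summary itself would look the same in every case.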