I'm interested in calculating the expected amplicon sizes of a primer set targetting V1-2 region of the 16S rRNA. Are the reference sequences extracted from the database by qiime feature-classifier extract-reads suitable for that purpose? The default minimum combined primer match identity threshold (0.8) is quite lenient compared to what's usually used in the in silico evaluation of universal primer sets, which allows no or a single non-3′-mismatch.
Yes, within Qiime 2, I think that the extract-reads command is great place to start. Once you run this, you could try changing the identity threshold. I also feel like 0.8 is low if you are not using degenerate primers, but you can always try several values.
You have probably found this already, but there are other tools outside of qiime. I haven’t used these, but they seem popular:
Thanks for pointing to the relevant resources to do that.
I wanted to denoise my pair-ended sequences using DADA2 with pooling=TURE, so I did it in R. One of the checkpoints in the DADA2 tutorial was to inspect the length of the merged sequences, excluding ones that deviate too much from the expected size range. This is why I wanted to know the expected amplicon sizes for the universal primer set I used.
While I was reading through the Github repo you linked, I also found a solution to my question. I'm posting it here in case someone else may find it useful.
Run digital PCR in TestPrimer. The SILVA database provides services on testing your primers. For the database, I chose SSU r132, non-redundant reference sequence collection (RefNR). The primer mismatches were set at two levels of stringency, allowing zero or one mismatch (length of 0-mismatch zone at 3' end: 5). When the searching is done, download the sequences by clicking Add sequences to cart.
Extract amplicon sequences in QIIME2. The downloaded sequences are full-length rRNA sequences. We can extract the amplicon sequences using the extract-reads command in QIIME2. For simplicity, I'll only show commands for the search with no mismatch.
2.1 Convert rRNA sequences to DNA by replacing all of the uraciles with thymines
sed -i 's/U/T/g' no_mismatch_silva132_fullLen.fasta
Interesting! So we expect the average amplicon to have a length just over 300.
Are you using published primers? If so, you may have good look seeing what lengths other folks get with these same primers. The 805R and 533F 16S primers in the EMP protocol would imply 272 bp long reads, but if you are using the full EMP protocol these will consistently produce 250 and 251 bp sequences.
I think the biological observation will be more valuable than the primer search tools.
We're using the 27F(AGAGTTTGATCMTGGCTCAG) and 338R(GCWGCCWCCCGTAGGWGT) primer set, targetting the V1-2 region of the 16S rRNA. So, yes, we expect the average amplicon size to be somewhere around 310.
The majority of merged sequences are between 270 bp and 346 bp, with a long tail on the right reaching 488 bp at the end. This was also observed in the expected amplicon size distribution allowing no primer mismatch, so I think it's safe to not do any sequence length based filtering.
p.s.
The expected amplicon sizes calculated by allowing no mismatch, one non-3'-end mismatch or setting minimum combined primer match identity threshold (0.8) in QIIME2 are largely overlapped. So using the extracted-reads to calculate expected amplicon sizes is reliable.