Can we use extracted reference reads to calculate expected amplicon sizes?

yanxianl · August 13, 2019, 9:53am

Hi Colin,

Thanks for pointing to the relevant resources to do that.

I wanted to denoise my pair-ended sequences using DADA2 with pooling=TURE, so I did it in R. One of the checkpoints in the DADA2 tutorial was to inspect the length of the merged sequences, excluding ones that deviate too much from the expected size range. This is why I wanted to know the expected amplicon sizes for the universal primer set I used.

While I was reading through the Github repo you linked, I also found a solution to my question. I'm posting it here in case someone else may find it useful.

Run digital PCR in TestPrimer. The SILVA database provides services on testing your primers. For the database, I chose SSU r132, non-redundant reference sequence collection (RefNR). The primer mismatches were set at two levels of stringency, allowing zero or one mismatch (length of 0-mismatch zone at 3' end: 5). When the searching is done, download the sequences by clicking Add sequences to cart.
Extract amplicon sequences in QIIME2. The downloaded sequences are full-length rRNA sequences. We can extract the amplicon sequences using the extract-reads command in QIIME2. For simplicity, I'll only show commands for the search with no mismatch.

2.1 Convert rRNA sequences to DNA by replacing all of the uraciles with thymines

sed -i 's/U/T/g' no_mismatch_silva132_fullLen.fasta

2.2 Import sequences

qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path no_mismatch_silva132_fullLen.fasta \
--output-path no_mismatch_silva132_fullLen.qza

2.3 Extract amplicon sequences

qiime feature-classifier extract-reads \
--i-sequences no_mismatch_silva132_fullLen.qza \
--p-f-primer AGAGTTTGATCMTGGCTCAG \
--p-r-primer GCWGCCWCCCGTAGGWGT \
--o-reads no_mismatch_silva132_amplicon.qza

2.4 Export extracted sequences

qiime tools export \
--input-path no_mismatch_silva132_amplicon.qza \
--output-path exported_no_mismatch
Plot expected amplicon sizes in R. We can use R package seqinr to import and do the length calculatiion for our fasta files.

library("seqinr")

hits_0mm <- read.fasta(file = "exported_no_mismatch/dna-sequences.fasta")

length(hits_0mm) # number of sequences in the fasta file

length_0mm <- table(getLength(hits_0mm)) # get the length of all the sequences in the fasta file

barplot(length_0mm, main = "allow no mismatch", xlab="Amplicon lengths", ylab="Frequency")

Here's what I got: