Possible Analysis Pipeline for Ion Torrent 16S Metagenomics Kit Data in QIIME2?

Nicholas_Bokulich · April 8, 2020, 6:37pm

Yes. Normally (when read orientation is consistent) truncation can be used with extract-reads to match the exact same site and length as in the query. For example:

imaginary query:
ACGTACGTACGTACGTACGTACGTACGTcccccccccccccccc

imaginary reference:
gggggggggggggggACGTACGTACGTACGTACGTACGTACGTccccccccccccccccgggggggggggggg

(lowercase "g"s == primer sites)
(lowercase "c"s == part of amplicon sequence that corresponds to region that was truncated from query by dada2)
(uppercase == part of amplicon that appears in truncated forward reads/reference sequences)

Using extract-reads on that reference would yield an exact match (or, in real life, would just match the same region/conditions to enable exact or similar matches between query and ref):
truncated query: ACGTACGTACGTACGTACGTACGTACGT
truncated refseq: ACGTACGTACGTACGTACGTACGTACGT
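The toy example above can be sketched in plain Python. This is only an illustration of the idea, not how extract-reads is implemented: the helper `extract_and_truncate` is hypothetical, it uses exact string matching for the primer sites, and the sequence lengths are made up to resemble the imaginary query and reference.

```python
def extract_and_truncate(reference, fwd_site, rev_site, trunc_len):
    """Return the region between two primer sites, cut to trunc_len.

    Hypothetical helper for illustration only. Returns None if either
    primer site is not found (exact, case-sensitive matching).
    """
    start = reference.find(fwd_site)
    if start == -1:
        return None
    start += len(fwd_site)  # skip past the forward primer site
    end = reference.find(rev_site, start)
    if end == -1:
        return None
    amplicon = reference[start:end]
    return amplicon[:trunc_len]


# Toy sequences modeled on the example above (lengths are assumptions).
fwd_site = "g" * 15          # lowercase g: primer site
rev_site = "g" * 14          # lowercase g: primer site
core = "ACGT" * 7            # region kept after truncation
tail = "c" * 16              # region truncated from the query by dada2
reference = fwd_site + core + tail + rev_site
query = core + tail

truncated_query = query[:len(core)]
truncated_ref = extract_and_truncate(reference, fwd_site, rev_site, len(core))
print(truncated_query == truncated_ref)  # True
```

Because both sequences start at the same site and are cut to the same length, they cover the same region and can match exactly.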

However, when orientations are mixed, you don't know which end of the amplicon your query sequences are on... you will have a mixture of:

ACGTACGTACGTACGTACGTACGTACGTcccccccccccccccc

and its reverse complement:

gggggggggggggggACGTACGTACGTACGTACGTACGTACGT

Which, when truncated (at the 3' end of each read), will yield:

ACGTACGTACGTACGTACGTACGTACGT

and

gggggggggggggggACGTACGTACGTAC

So you are covering different parts of the complete amplicon. Hence, the reference sequences should be untruncated to cover the full amplicon so that you can hit any part.
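A small Python sketch of the mixed-orientation case, assuming a made-up 40 nt amplicon and a truncation length of 25 (both are placeholders, and `covered_by` is a hypothetical helper that checks for an exact substring match in either orientation):

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def covered_by(read, reference):
    """True if the read, in either orientation, is a substring of reference."""
    return read in reference or reverse_complement(read) in reference


amplicon = "ATGGCTTACGATCCGTTAGCAAGTCTGGACCTTGAACGTA"  # made-up sequence
trunc_len = 25  # placeholder truncation length

# Reads in both orientations, each truncated at its own 3' end.
forward_read = amplicon[:trunc_len]
reverse_read = reverse_complement(amplicon)[:trunc_len]

truncated_ref = amplicon[:trunc_len]  # reference trimmed like a forward read
full_ref = amplicon                   # untruncated reference

for name, read in [("forward", forward_read), ("reverse", reverse_read)]:
    print(name, covered_by(read, truncated_ref), covered_by(read, full_ref))
# forward True True
# reverse False True
```

The reverse-oriented read comes from the far end of the amplicon, so only the untruncated reference contains it.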

Correct

Yes, remove

Keep both, but adjust to expected ranges (this could also cause unassignments if the ranges are not set correctly). Check the literature for the expected amplicon length ranges. Setting broad limits probably does not hurt; these are really just used as safeguards, since occasionally some primer set and reference database combinations can yield spurious hits that cause issues during classification. Unusually short or long amplicons are a good indicator of spurious hits.
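As a rough illustration of length limits acting as a safeguard, here is a sketch in plain Python. The function `filter_by_length`, the sequence IDs, and the bounds of 100 and 400 are all hypothetical placeholders, not recommended values; take the real range from the literature for your primer set.

```python
def filter_by_length(sequences, min_length, max_length):
    """Split extracted reference amplicons into kept and dropped by length.

    Hypothetical helper: `sequences` maps sequence ID to sequence string.
    """
    kept, dropped = {}, {}
    for seq_id, seq in sequences.items():
        in_range = min_length <= len(seq) <= max_length
        (kept if in_range else dropped)[seq_id] = seq
    return kept, dropped


# Made-up extracted amplicons; only the lengths matter here.
extracted = {
    "ref_ok": "A" * 250,
    "ref_short": "A" * 40,    # suspiciously short: likely a spurious hit
    "ref_long": "A" * 900,    # suspiciously long: likely a spurious hit
}

kept, dropped = filter_by_length(extracted, min_length=100, max_length=400)
print(sorted(kept))     # ['ref_ok']
print(sorted(dropped))  # ['ref_long', 'ref_short']
```

Limits that are too tight would also drop genuine amplicons, which is one way an incorrect range can lead to unassigned reads.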

What region was good and what region was bad? Sometimes it's not luck; sometimes it does depend on region, primer, reference database, etc. For example, I've seen issues with V1 primers on some databases before, because some of the reference sequences might not have the correct forward primer included in the sequence. The default settings were designed based on benchmarks of different 16S domains as well as ITS... sort of general "catch all" settings... but for unusual amplicons (maybe?) or for other marker genes (not you, since you have 16S, but just saying) these settings may need to be tweaked. Having a mock community to re-optimize for novel primer sets is ideal!