Quick question: before I train my Greengenes classifier, I extract the fragments I need using the same primers I used for library prep before sequencing (V3-V4). The reverse reads for this particular run are of awful quality (none of the bases are above Q20), so I've decided to go ahead and denoise with dada2 denoise-single using only the forward reads.
Since the sequenced amplicon is 400-500 base pairs and I now have only forward reads of 200-220 base pairs, does it really make sense to train my classifier on the whole amplicon? Is there a way to extract only the first 200-300 bp from the FORWARD primer and train the classifier on that?
Part 2 of my problem:
Whenever I use the classifier trained on the whole V3-V4 amplicon on my quality-truncated reads (forward only, 170-200 bases max), I get a handful of reads assigned to genus and even species level, but about half of the total is Unassigned and the other half is assigned only to k__Bacteria, with high confidence (>0.99) but no deeper than kingdom level. Any idea why that is?
If the majority of your reads are being assigned as "Unassigned" and "k__Bacteria", it could mean that your reads are in a mixed orientation with respect to the classifier. That is, both forward and reverse reads appear in the R1 files, and likewise in the R2 files. You can read many forum threads about this here.
One way to sanity-check whether this is the case is to run qiime feature-classifier classify-consensus-vsearch ... as vsearch does not care about read orientation. You can download the reference sequence and taxonomy files from here for use with classify-consensus-vsearch.
If you obtain reasonable taxonomic classification, then your reads are in mixed orientation for the naive bayes classifier. If you get the same result, then there is likely another issue...
Some tricks to get around this issue are outlined in the following posts:
I'll provide a third option below, which is inspired by the above two posts. But the second post is likely the way to go.
The approach outlined below assumes two things:
You will be using cutadapt to remove your primer sequences.
...and second manifest that swaps file names, such that R2 is under the forward-absolute-filepath column, and R1 is under the reverse-absolute-filepath:
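As a rough illustration of the two manifests, here is a minimal Python sketch that builds both from one sample table. The function name build_manifests, the sample ID, and the file paths are all hypothetical, and it assumes the tab-separated layout with the sample-id, forward-absolute-filepath and reverse-absolute-filepath columns mentioned above. Adjust to whatever manifest format your QIIME 2 version expects.

```python
# Hypothetical helper: builds a "normal" and a "swapped" paired-end manifest.
# Assumes a tab-separated manifest with the three columns named below.
HEADER = ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]


def build_manifests(samples):
    """samples maps sample ID -> (R1 path, R2 path).

    Returns (normal_text, swapped_text). In the swapped manifest the R2 file
    sits in the forward column and the R1 file in the reverse column.
    """
    normal = ["\t".join(HEADER)]
    swapped = ["\t".join(HEADER)]
    for sample_id, (r1, r2) in sorted(samples.items()):
        normal.append("\t".join([sample_id, r1, r2]))
        swapped.append("\t".join([sample_id, r2, r1]))
    return "\n".join(normal) + "\n", "\n".join(swapped) + "\n"


# Example with made-up paths.
example = {"sampleA": ("/data/sampleA_R1.fastq.gz", "/data/sampleA_R2.fastq.gz")}
normal_text, swapped_text = build_manifests(example)
```

Writing normal_text and swapped_text to two separate files gives the two manifests to import.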
Once you've imported both of these into QIIME 2, you can run cutadapt as you normally would on each of these two versions of the imported data.
Be sure that you set --p-discard-untrimmed! This way we'll remove all the read pairs in the wrong orientation, where the expected orientation is the one set by the manifest file. You should expect only about half the data to be trimmed. Repeat this command on the second batch of data, imported with the other manifest file.
Additionally, the reason for doing it this way is to avoid duplicate sequence names in one file.
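To make the effect of discarding untrimmed reads concrete, here is a small Python sketch of the idea only; it is not how cutadapt is implemented. It keeps a read only if it starts with the expected primer (exact match, allowing IUPAC degenerate codes) and removes the primer from it. The function names and the toy reads are made up, and the two primers are the locus-specific parts of the V3-V4 primers quoted later in this thread. Real cutadapt also tolerates mismatches and handles read pairs, which this sketch ignores.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def starts_with_primer(read, primer):
    """True if the read begins with the (possibly degenerate) primer."""
    if len(read) < len(primer):
        return False
    return all(base in IUPAC[code] for base, code in zip(read, primer))


def trim_and_discard(reads, primer):
    """Keep only reads that start with the primer, with the primer removed."""
    return [r[len(primer):] for r in reads if starts_with_primer(r, primer)]


FWD_PRIMER = "CCTACGGGNGGCWGCAG"      # locus-specific forward primer
REV_PRIMER = "GACTACHVGGGTATCTAATCC"  # locus-specific reverse primer

# Toy "R1 file" in mixed orientation: one read from each primer end.
mixed_r1 = [
    "CCTACGGGAGGCAGCAG" + "TGGGGAATATT",
    "GACTACAGGGGTATCTAATCC" + "TGTTTGCTCC",
]

kept_forward = trim_and_discard(mixed_r1, FWD_PRIMER)  # only the first read survives
kept_reverse = trim_and_discard(mixed_r1, REV_PRIMER)  # only the second read survives
```

Each pass keeps roughly the half of the data that is in the orientation it was asked for, which is why running it once per manifest recovers both halves without duplicating any read.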
At this point, assuming no other issues are involved, you should be able to merge the two sets of data. Currently, there is no way to merge demuxed data in QIIME 2, so you'd have to export the resulting demuxed fastq files, manually concatenate them for each sample, and then re-import via a manifest. I suppose you could instead denoise each set separately and then merge the features, but I'm not sure you'd be able to use the same trimming and truncation options.
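The manual concatenation step could look something like the following Python sketch. It assumes the two exported directories contain .fastq.gz files with identical names for the same sample and read direction, which you should verify for your own exports. The function name concat_per_sample and the directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def concat_per_sample(dir_a, dir_b, out_dir):
    """Concatenate same-named .fastq.gz files from dir_a and dir_b into out_dir.

    gzip files can be joined byte-wise, because a gzip stream may consist of
    several members. A file present in only one directory is simply copied.
    Returns the list of files written.
    """
    dir_a, dir_b, out_dir = Path(dir_a), Path(dir_b), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = sorted(
        {p.name for p in dir_a.glob("*.fastq.gz")}
        | {p.name for p in dir_b.glob("*.fastq.gz")}
    )
    written = []
    for name in names:
        target = out_dir / name
        with open(target, "wb") as out_handle:
            for source in (dir_a / name, dir_b / name):
                if source.exists():
                    with open(source, "rb") as in_handle:
                        shutil.copyfileobj(in_handle, out_handle)
        written.append(target)
    return written
```

The files in out_dir can then be listed in a new manifest and re-imported.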
Others on the forum might have additional solutions too.
Thank you very much for your detailed answer @SoilRotifer!
I just ran vsearch classification "quickly" to check, and the results did not improve much, but I've realised I still got the primers in the sequences. (I read somewhere that the primers, being part of the sequence itself, could be left in). So now I'm restarting the whole pipeline from cutadapt trim-paired first, as you suggested to @Nisha. I am a bit baffled about all the options that cutadapt trim-paired allows for searching an adapter sequence (--p-adapter, --p-front and --p-anywhere, for both -f- and -r-). My reads were obtained after library prep with Illumina's 16S Metagenomic Sequencing Library Preparation kit so:
The full length primer sequences, using standard IUPAC nucleotide nomenclature, to follow the protocol targeting this region are:
16S Amplicon PCR Forward Primer = 5' TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG
16S Amplicon PCR Reverse Primer = 5' GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC
• This method can also be utilized to target other regions on the genome (either for 16S with other sets of primer pairs, or non-16S regions throughout the genome; ie any amplicon). The overhang adapter sequence must be added to the locus-specific primer for the region to be targeted (Figure 1). The Illumina overhang adapter sequences to be added to locus-specific sequences are:
Forward overhang: 5' TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[locus-specific sequence]
Reverse overhang: 5' GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[locus-specific sequence]
Which cutadapt option would you advise using here? And should I input the full overhang+primer sequences or the primer sequences only?