Hi all,
I have a question regarding feature-classifier classify-sklearn.
Let me briefly introduce the background that leads to my question.
In an upcoming project, I want to sequence metazoa in bulk soil. Most people would recommend COI as the marker gene, but I decided on 18S rRNA: in most previous studies that sequenced bulk soil with COI, metazoan reads often made up less than 5% of the total (and where they exceeded 5%, a high number of touchdown cycles was used during PCR, which I want to avoid). My guess is that COI is too variable to design metazoa-specific primers, although that variability is good for taxonomic resolution. I found a primer set that amplifies a ~600 bp amplicon and appears to be > 99% specific for soil metazoa. The forward primer sits upstream of the V4 region and the reverse primer downstream of the V5 region. I did a few alignments: the reverse primer is the one that is really specific for metazoa, whereas the forward primer looks more like a universal eukaryote primer. I'm also working on ways to shorten the amplicon (e.g. nested PCR), but for now let's just say that I have no choice but to sequence a 600 bp amplicon.
Now, a 600 bp amplicon isn't ideal because the forward and reverse reads will never overlap if I sequence on a MiSeq with 2×300 bp. However, I want to use DADA2 as well as feature-classifier fit-classifier-sklearn for my data. I know that DADA2 can concatenate non-overlapping reads using the justConcatenate option. In a number of forum posts that @Mehrbod_Estaki handled, the advice was not to use the justConcatenate option in DADA2 and to go with the forward reads only instead. I generally agree with that, but given the relatively low taxonomic resolution of 18S rRNA, I would really like to have the V4 region covered by my forward reads and the V5 region by my reverse reads, so that I get higher taxonomic resolution than with forward or reverse reads alone.
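For anyone unfamiliar with the option, here is a rough Python sketch of the sequence layout I understand justConcatenate to produce: the forward read, a spacer of 10 Ns, then the reverse-complemented reverse read. The helper names and the toy reads are made up for illustration; this is not DADA2 code, just a picture of the result.

```python
# Illustrative only: mimics the layout of a concatenated (non-merged) read pair.
# The helper names and toy reads below are hypothetical, not part of DADA2.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def concatenate_pair(forward: str, reverse: str, spacer_len: int = 10) -> str:
    """Join the forward read, an N spacer, and the reverse-complemented reverse read."""
    return forward + "N" * spacer_len + reverse_complement(reverse)


if __name__ == "__main__":
    fwd = "ACGTAC"  # toy forward read
    rev = "GGATCA"  # toy reverse read, as it comes off the sequencer
    print(concatenate_pair(fwd, rev))
```

With real data the two halves would be the truncated forward and reverse reads, so the spacer stands in for a stretch of the amplicon that was never sequenced.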
Today I played around with a very small subset of data. Thanks to this nice forum post, I was able to use the justConcatenate option in DADA2 in R and import the data into QIIME 2. Here are my test sequences, both as rep-seqs.qza (5.3 KB) and rep.qzv (190.8 KB), and the corresponding feature table, table.qza (6.9 KB). As expected, the forward and reverse reads are separated by 10 Ns. And finally, here is my question: do the 10 Ns affect the classification success of feature-classifier classify-sklearn in any way? Is there anything that I need to be careful about or aware of?
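To make the question more concrete: if the classifier breaks each sequence into overlapping k-mers, then a 10 N spacer would touch only a small fraction of them. I am assuming k = 7 here purely for illustration, and I have not checked how classify-sklearn actually treats Ns internally, so please take this as a sketch of my reasoning rather than a description of the classifier.

```python
def kmers_with_n(seq: str, k: int = 7) -> tuple[int, int]:
    """Return (number of k-mers containing at least one N, total number of k-mers)."""
    total = max(len(seq) - k + 1, 0)
    with_n = sum("N" in seq[i:i + k] for i in range(total))
    return with_n, total


if __name__ == "__main__":
    # Toy stand-in for a concatenated 2x300 bp read pair with a 10 N spacer.
    seq = "A" * 300 + "N" * 10 + "C" * 300
    affected, total = kmers_with_n(seq)
    print(f"{affected} of {total} 7-mers contain an N")
```

In this toy case 16 of the 604 7-mers contain an N. Whether those few k-mers matter for the classification, or whether the reference sequences would need the same spacer, is exactly what I am unsure about.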
Thanks for your help
Lukas