classification of non-overlapping reads treated with justConcatenate

Hi all,

I do have a question regarding the feature-classifier classify-sklearn.
Let me briefly introduce the background that leads to my question.
In an upcoming project, I want to sequence metazoa in bulk soil. Everyone would recommend to use COI as a marker gene but I decided for 18S rRNA because in most previous studies that conducted sequencing from bulk soil using COI, the proportion of metazoan reads was often less than 5% (if more than 5% were metazoan reads, a high number of touchdown cycles was used during PCR, which is not what I want to do). I guess COI is too variable (which isn't bad in terms of taxonomic resolution though) to design primers specific for metazoa. I found a primer set that amplifies a ~600 bp amplicon that appears > 99% specific for soil metazoa. The forward primer is located before the V4 region and the reverse primer after the V5 region. I did a few alignments and the reverse primer is the one that is really specific for metazoa, the forward primer appears to be more like a universal eukaryote primer. I'm also working on ways to shorten the amplicon (e.g. nested PCR) but for now let's just say that I have no choice but to sequence a 600 bp amplicon.

Now, a 600 bp amplicon isn't ideal because I will never have any overlap between forward and reverse reads if I sequence on a MiSeq with 2×300 bp. However, I want to use DADA2 as well as feature-classifier fit-classifier-sklearn for my data. I know that there is a way to concatenate reads that are not overlapping in DADA2 using the justConcatenate-option. In a bunch of forum posts that @Mehrbod_Estaki handled, it was advised to not use the justConcatenate-option in DADA2 and rather go with the forward reads only. I generally do agree with that, but given the relatively low taxonomic resolution of 18S rRNA, I would really like to have the V4 covered by my forward reads and the V5 region by my reverse reads so that I have higher taxonomic resolution than using just forward or reverse.

Today I played around with a very small subset of data. Thanks to this nice forum post, I was able to use the justConcatenate-option in DADA2 in R and import the data in QIIME 2. Here is my test sequence both as rep-seqs.qza (5.3 KB) and rep.qzv (190.8 KB). Here is also the corresponding feature table table.qza (6.9 KB). As expected, the forward and reverse read are spaced by 10 Ns. And finally, here is my question: do the 10 Ns affect the classification success of the feature-classifier classify-sklearn in any way? Is there anything that I need to be careful about or aware of?

Thanks for your help :beetle:


1 Like

Hi @lukasbeule,

What COI primer sets have you tried? I find this surprising and has not been my experience.

This approach work quite well, especially if your touch down portion of the protocol is no more than ~10 cycles, give or take. I’ve had success using both Folmer and my own clade specific primers, and can also be combined with nested PCR too. See Robeson et al. 2011 for more details.

RE: 18S & Ns
I think it would be advisable to only use the forward reads. The addition of Ns may result in as bad, or worse, taxonomic specificity as compared to simply using the forward reads. 18S can vary quite a bit in length, even within the variable regions, and the addition of 10 bp can either be too little or too long for many clades. The impact of these Ns on classifiers can be wonky. I would think that classification approaches similar to vsearch/usearch might be less negatively impacted compared to sklearn. But I’ve not tested this. You could try making a mock community and test your classifactions using vsearch and sklearn. If you do, keep us posted!



Thanks for your response @SoilRotifer,

I haven’t tried any primers and haven’t done any sequencing yet. It’s good to know that you have difference experience with COI primers. I’m afraid though to receive low % of metazoan reads from soil samples as for example here Marquina et al. 2019; I quote:

The soil samples present a very different picture […]. Most COI sequences belong to Fungi and other phyla and kingdoms (the unidentified and the non‐Arthropoda sequences sum up to 96.37% of the COI reads).

That’s true, touch-down cycles will of course make your PCR more specific. I use touch-down cycles a lot for real-time PCR; I also use touch-up cycles for tailed primers during library preparation. However, I believe that touch-down cycles in combination with highly degenerate primers may not be ideal for sequencing as thermocycles with high annealing temperature will favour GC-rich primer variants and, thus, introduce bias. In theory, I would expect less OTUs with touch-down cycles because the first cycles select for only a subset of the taxa, these taxa then become sooner amplified and taxa selected by AT-rich primer variants will be suppressed during PCR. It is anecdotal but this agrees with experimental findings of Clarke et al. 2017 (see Table 2). Anyway, I probably overdramatize the impact of touch-down cycles but this is something that I could argue about for days :smile:.

Thanks for sharing your opinion regarding the classifiers! As you now know, I don’t have my own dataset yet. I will for sure play around with the different classifiers that QIIME 2 offers using published datasets and let you know the outcome.

If anyone else has experience with the performance of the difference classifiers and the introduction of Ns, I would be happy if they share it here :slight_smile:.

– Lukas


Keep in mind this only pertains to the primer sets they used. Obviously you know this, :wink:.

Totally true! But will it affect the question you are asking of your data? Also, this highlights the issue of adequate primer design… that is, many researchers just make primers ever-more degenerate which is not always needed. As we know… primers can amplify off-targets to which they have minimal matches, bioinformatically speaking. So, I always try to avoid over degenerization of my primers. Oh so many caveats, eh?! :weary:

Believe me I hear you! It took me a while to figure out the best balance between being able to generate usable data and minimizing biases. Sometimes you simply have to go with what is best given your research question. :slight_smile:

Also, you can try out mixing primers. This is what we did in Robeson et al. 2017, we compared a set of existing primer sets and chose the ones (the forward and reverse primers across these primer sets) that hit our target organisms of interest. So, you are not strictly limited to existing primer-pair combinations.

I forgot to reference some of Terri Porter’s excellent work. There is a large list of primers you can check out there too.

Perhaps @devonorourke will have some helpful insights too.


Hi @SoilRotifer,

I just ran a test using the 18S rRNA sequence of Folsomia candida :bug: (see F_candida_sklearn.qzv (1.2 MB) for the classification and F_candida_seqs.qzv (192.2 KB) for the sequences) using sklearn (brief details: Silva 132 99 18S; feature-classifier fit-classifier-naive-bayes; --p-feat-ext--ngram-range [6,6]; feature-classifier classify-sklearn; majority_taxonomy_all_levels.txt; --p-confidence 0.8).

For the test, I used the full 18S rRNA sequence of Folsomia candida (F_candida_full), which was classified as D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;D_3__Metazoa (Animalia);D_4__Eumetazoa;D_5__Bilateria;D_6__Arthropoda;D_7__Hexapoda;D_8__Ellipura;D_9__Collembola at high confidence (0.9999994126501317).

Then, I did subsets of the sequence using the first 80 bp (F_candida_subset_F) and the last 161 bp (F_candida_subset_R) of the sequence:

F_candida_subset_F was classified as D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;D_3__Metazoa (Animalia);D_4__Eumetazoa;D_5__Bilateria (conf.: 0.8100814708932081)

F_candida_subset_R was classified as
D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;D_3__Metazoa (Animalia);D_4__Eumetazoa;D_5__Bilateria;D_6__Arthropoda;D_7__Hexapoda (conf.: 0.8297322568444817).

This was expected since less bps will at some point yield in lower taxonomic resolution because we discard information.

Next, I concatenated the two subsets (F_candida_subset_F and F_candida_subset_R) using 0 - 12 Ns. The concatenated subsets that were successfully classified as D_9__Collembola had 0, 1, 2, 3, 4, 5, 6, 7 or 10 Ns with 4 Ns having the highest confidence (0.8944802320115267) and 10 Ns the lowest confidence (0.8153876278129408). This tells me that the classifier accounted for both F_candida_subset_F and F_candida_subset_R when concatenated using these numbers of Ns.

Interestingly, the subset with 9 Ns had the worst overall performance: D_0__Eukaryota;D_1__Opisthokonta;D_2__Holozoa;D_3__Metazoa (Animalia);D_4__Eumetazoa (conf.: 0.8044761967865404).

Next, I classified the same sequences using VSEARCH using the default settings: F_candida_vsearch.qzv (1.2 MB). This tells me that VSEARCH cannot handle Ns because all subsets that were concatenated by at least 1 N were classified as Unassigned.

Do you have any thoughts on this?

-- Lukas

Wow this is great @lukasbeule, thank you for sharing this!

I agree that the initial results you’ve obtained with sklearn are what we expected. I hope the N length you have chosen works reasonable well across most taxa. However, as per:

I’ve discussed this previously with some of the other forum moderators, and I had forgotten to clarify in my reply to you, that vsearch may only work if the N region is a reasonable length (otherwise the global alignment would score really badly).

One minor suggestion, we have updated SILVA 138 classifiers available on the Data Resources page. Both pre-made classifiers and the files used to make them are there. If you’d like to curate your own SILVA database, you can give RESCRIPt a try:



Hi again @SoilRotifer,

yes, sklearn worked pretty good for this one sequence. However, one sequence is an anecdote… but hey, the plural of anecdote is data: I have a friend that has a bunch of non-overlapping fungal ITS reads and I will give that a try soon and keep you posted.

Alright, now it all makes sense to me, thanks for the clarification!

Thanks, I am using 138 already but this week I am on holidays (don’t tell my girlfriend that I do bioinformatics during my holidays :shushing_face:) so I can only use my private laptop for this. I will definitely try RESCRIPt soon!

– Lukas


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.