Training classifier with MaarjAM

Hi @mmcintosh,

Is this with NCBI BLAST or with the classify-consensus-blast method in q2-feature-classifier? I am assuming the former.

It looks like MaarjAM only contains AMF sequences, so the classifier may be struggling with classification of non-target sequences. Are contaminant sequences receiving rather deep classifications (e.g., genus or species level) or are these receiving kingdom or phylum-level classification? Could you provide a couple examples (sequence and classification) for contaminant sequences?

It is potentially concerning if non-target sequences are receiving deep classifications, but we need to consider a few issues:

  1. A restricted database like MaarjAM (i.e., one restricted specifically to AMF) is potentially problematic if non-target DNA is expected to be present, e.g., if your primer sets are not specific to the sequences in the database. This is because you are only training the classifier on a small subset of data, so the classifier overfits to those sequences. It will then classify non-target sequences poorly, because it has never seen anything outside that narrow reference and can only assign the AMF labels it was trained on.
  2. The 18S is much less variable among fungi (and possibly eukarya in general) than the 16S is for bacteria.

So other eukarya that are hit by the primers will appear to be AMF, both because they will have relatively similar 18S sequences and because you are training on a small reference database without any outgroups or coverage of all potential primer targets.

I would recommend two things:

  1. Use classify-consensus-vsearch with a high perc-identity setting (e.g., 99%), so that only sequences with a high degree of alignment similarity receive a classification (see the sketch after this list).
  2. Try using the SILVA reference database instead (a pre-trained classifier containing 18S + 16S is here, but you may want to train just on 18S sequences if bacteria are not hit by your primers).
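
If it helps, here is a rough sketch of what those two recommendations could look like on the command line. The artifact names (rep-seqs.qza, maarjam-seqs.qza, silva-seqs.qza, etc.) and the primer sequences are placeholders for your own files and primers, and exact parameters/outputs can vary a bit between QIIME 2 versions:

```
# Recommendation 1: consensus VSEARCH classification against MaarjAM,
# keeping only hits with at least 99% alignment identity.
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads maarjam-seqs.qza \
  --i-reference-taxonomy maarjam-taxonomy.qza \
  --p-perc-identity 0.99 \
  --o-classification maarjam-taxonomy-vsearch.qza

# Recommendation 2 (if training your own SILVA classifier): first trim the
# SILVA 18S reference sequences to your amplicon region using your primers.
# YOUR_FWD_PRIMER / YOUR_REV_PRIMER are placeholders for your actual primers.
qiime feature-classifier extract-reads \
  --i-sequences silva-seqs.qza \
  --p-f-primer YOUR_FWD_PRIMER \
  --p-r-primer YOUR_REV_PRIMER \
  --o-reads silva-18s-amplicon-seqs.qza
```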

You could either combine these steps (use vsearch with SILVA) or take a two-step approach: first use vsearch against MaarjAM with a high similarity threshold, then pull out the sequences that fail to classify and run them through a second pass against the SILVA database (possibly with a lower percent identity or a different classifier).
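
A sketch of that two-step idea, assuming the consensus vsearch classifier labels sequences that fail to classify as "Unassigned" (artifact names are again placeholders):

```
# Pass 1: strict classification against MaarjAM.
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads maarjam-seqs.qza \
  --i-reference-taxonomy maarjam-taxonomy.qza \
  --p-perc-identity 0.99 \
  --o-classification maarjam-pass1-taxonomy.qza

# Pull out the sequences that were left Unassigned in pass 1...
qiime taxa filter-seqs \
  --i-sequences rep-seqs.qza \
  --i-taxonomy maarjam-pass1-taxonomy.qza \
  --p-include Unassigned \
  --o-filtered-sequences unassigned-seqs.qza

# ...and give them a second, more permissive pass against SILVA.
qiime feature-classifier classify-consensus-vsearch \
  --i-query unassigned-seqs.qza \
  --i-reference-reads silva-seqs.qza \
  --i-reference-taxonomy silva-taxonomy.qza \
  --p-perc-identity 0.90 \
  --o-classification silva-pass2-taxonomy.qza
```

The two taxonomy artifacts could then be combined for downstream steps (e.g., with qiime feature-table merge-taxa); just note that the pass-1 file will still contain "Unassigned" entries for the features you re-classified, so check how your QIIME 2 version handles duplicated feature IDs (input order can matter) before merging.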

Now you have me a little confused: it sounds like you are training a classifier for use with classify-sklearn (this is the assumption in my answer above), but that method does not have a % similarity parameter. It now sounds like you are using classify-consensus-blast or classify-consensus-vsearch, which do not have a training step. Could you please clarify the exact commands you are using for training and classification?
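
To make sure we are talking about the same thing, these are the two distinct workflows I have in mind; only the first involves training, and only the second has a percent-identity parameter (artifact names are placeholders):

```
# Workflow A: train a Naive Bayes classifier, then use classify-sklearn.
# There is no perc-identity option anywhere in this workflow.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads maarjam-seqs.qza \
  --i-reference-taxonomy maarjam-taxonomy.qza \
  --o-classifier maarjam-nb-classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier maarjam-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification sklearn-taxonomy.qza

# Workflow B: alignment-based consensus classification (BLAST shown here;
# vsearch is analogous). No training step; this is where --p-perc-identity lives.
qiime feature-classifier classify-consensus-blast \
  --i-query rep-seqs.qza \
  --i-reference-reads maarjam-seqs.qza \
  --i-reference-taxonomy maarjam-taxonomy.qza \
  --p-perc-identity 0.97 \
  --o-classification blast-taxonomy.qza
```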

Other users have asked about MaarjAM (here and here), but that's all I know; I can't say whether it worked out for them. You could get in touch with those users via direct message to compare notes!

I hope that helps! Please let us know if any of this works and/or give us more details on the commands that you are using. Thanks!