I am trying to train a feature classifier using the 18S AMF database MaarjAM. I followed the tutorial for training the classifier, but it is not working correctly. When I BLAST features from my feature table, the top features all assign to various copepods (half of my samples are from coastal sand dunes, so this is not surprising). I also expect other contaminants. However, my classifier classifies ALL of my features as AMF. I tried changing the percent similarity threshold from 80% (preset) to 95% and above, but this didn't change the number of features classified as AMF.
I haven't been able to find anyone else who has trained the feature classifier using MaarjAM. Does anyone have any ideas about what is causing this? My alternative is using the BLAST taxonomic assignment, but I seem to be having trouble with that as well.
Thanks! Let me know if I can provide any more info.
Is this with NCBI BLAST or with the classify-consensus-blast method in q2-feature-classifier? I am assuming the former.
It looks like MaarjAM only contains AMF sequences, so the classifier may be struggling with classification of non-target sequences. Are contaminant sequences receiving rather deep classifications (e.g., genus or species level) or are these receiving kingdom or phylum-level classification? Could you provide a couple examples (sequence and classification) for contaminant sequences?
It is potentially concerning if non-target sequences are receiving deep classifications, but we need to consider a few issues:
A restricted database like MaarjAM (i.e., to specifically AMF) is potentially problematic if non-target DNA is expected to be present, e.g., if primer sets are not specific to the sequences in the database. This is because you are only training the classifier on a small subset of data, so the classifier is overfitting to these sequences. It will then classify non-target sequences poorly because
The 18S is much less variable among fungi (and possibly eukarya in general) than the 16S is for bacteria.
So other eukarya that are hit by the primers will appear to be AMF, since they will have relatively similar 18S sequences and because you are training on a small reference database without any outgroups or coverage of all potential primer targets.
I would recommend two things:
Use classify-consensus-vsearch with a high perc-identity setting (e.g., 99%). Thus, only sequences with a high degree of alignment similarity will classify.
Try using the SILVA reference database instead (a pre-trained classifier containing 18S + 16S is here but you may want to train just on 18S sequences if bacteria are not hit by your primers).
You could either combine these steps (use vsearch with SILVA) or do a two-step approach, i.e., first use vsearch against MaarjAM with a high similarity threshold, then seqs that fail to classify you can pull out and classify in a second pass against the SILVA database (possibly with a lower precent ID or a different classifier).
Now you have me a little confused — it sounds like you are training a classifier for using classify-sklearn (this is my assumption in my answer above), but that method does not have a % similarity parameter. It sounds now like you are using classify-consensus-blast or vsearch, which do not have a training step. Could you please clarify the exact commands you are using for training and classification?
Other users have asked about MaarjAM (here and here) — but that's all I know, not whether it worked! You could get in touch with those users via direct message to compare notes!
I hope that helps! Please let us know if any of this works and/or give us more details on the commands that you are using. Thanks!
So, it looks like I am not receiving a particularly deep classification. My collaborator @Lorinda has shared an updated database with me which includes some non-target sequences that they run into a lot, which I think overlap some but not completely with my non-target sequences. She suggested adding my contaminants and seeing if that helped.
I am referring to extracting reads from my sequences. This is what I used:
qiime feature-classifier extract-reads
--i-sequences MaarjAM-MPG.qza
--p-f-primer CAGCCGCGGTAATTCCAGCT
--p-r-primer TGAACCCAAACACTTTGGTTTCC
--p-trunc-len 280
--p-identity 1.0
--o-reads 18Sref-seqs.qza
If I change --p-identity from the present 0.8 to anything else above that, I did not change the number of features classified by my classifier.
I'll include the rest of the commands I used for training my classifier just for future reference:
Thanks @Nicholas_Bokulich for the response on this. I've been able to get the BLAST classifier to work, so I may end up using that instead if I can't get MaarjAM to work out. I'll play around with the options you suggested.
Indeed... I think that this is consistent with my point that the lack of non-target sequences in the database is causing the classifier to overfit. Their 18S genes still bear enough resemblance to Glomeromycota in the database that the classifier is still classifying them.
That should definitely help! But maybe not 100% if there are other non-target sequences present. I would still recommend comparing your results to classification with SILVA, and classification with the blast/vsearch classifiers in feature-classifier. MaarjAM would probably give more satisfying results (e.g., deeper classifications) than SILVA, but would likely overclassify non-target and other sequences not in the training set. The large size of SILVA would introduce a lot of noise into the database, but the classifications might be more reliable. Worth a try at least.
Thanks for clarifying! The identity parameter in extract-reads just sets the % similarity for the primer alignment. This should not have any effect (unless if your primers don't match some seqs in the maarjam database). Toggling this parameter would make more sense, e.g., if you had AMF-specific primers and were trying to train a SILVA classifier just to include these fungi.
classify-consensus-blast and classify-consensus-vsearch should both have similar behavior, and playing around with the identity and other parameters should give you a little more control to prevent overfitting with the maarjam database. Give the two-step approach that I described a try — it may give you the best of both worlds: very specific classifications to sequences that match the maarjam database closely, but reliable classification to non-target sequences and other fungal hits that are not in maarjam.