Training classifier with MaarjAM

mmcintosh · January 10, 2018, 9:24pm

Hello!

I am trying to train a feature classifier using the 18S AMF database MaarjAM. I followed the tutorial for training the classifier, but it is not working correctly. When I BLAST features from my feature table, the top features all assign to various copepods (half of my samples are from coastal sand dunes, so this is not surprising). I also expect other contaminants. However, my classifier classifies ALL of my features as AMF. I tried changing the percent similarity threshold from 80% (preset) to 95% and above, but this didn't change the number of features classified as AMF.

I haven't been able to find anyone else who has trained the feature classifier using MaarjAM. Does anyone have any ideas about what is causing this? My alternative is using the BLAST taxonomic assignment, but I seem to be having trouble with that as well.

Thanks! Let me know if I can provide any more info.

Mariah

Nicholas_Bokulich · January 10, 2018, 9:55pm

Hi @mmcintosh,

Is this with NCBI BLAST or with the classify-consensus-blast method in q2-feature-classifier? I am assuming the former.

It looks like MaarjAM only contains AMF sequences, so the classifier may be struggling with classification of non-target sequences. Are contaminant sequences receiving rather deep classifications (e.g., genus or species level) or are these receiving kingdom or phylum-level classification? Could you provide a couple examples (sequence and classification) for contaminant sequences?

It is potentially concerning if non-target sequences are receiving deep classifications, but we need to consider a few issues:

A restricted database like MaarjAM (i.e., to specifically AMF) is potentially problematic if non-target DNA is expected to be present, e.g., if primer sets are not specific to the sequences in the database. This is because you are only training the classifier on a small subset of data, so the classifier is overfitting to these sequences. It will then classify non-target sequences poorly because
The 18S is much less variable among fungi (and possibly eukarya in general) than the 16S is for bacteria.

So other eukarya that are hit by the primers will appear to be AMF, since they will have relatively similar 18S sequences and because you are training on a small reference database without any outgroups or coverage of all potential primer targets.

I would recommend two things:

Use classify-consensus-vsearch with a high perc-identity setting (e.g., 99%). Thus, only sequences with a high degree of alignment similarity will classify.
Try using the SILVA reference database instead (a pre-trained classifier containing 18S + 16S is here but you may want to train just on 18S sequences if bacteria are not hit by your primers).

You could either combine these steps (use vsearch with SILVA) or do a two-step approach, i.e., first use vsearch against MaarjAM with a high similarity threshold, then seqs that fail to classify you can pull out and classify in a second pass against the SILVA database (possibly with a lower precent ID or a different classifier).

Now you have me a little confused — it sounds like you are training a classifier for using classify-sklearn (this is my assumption in my answer above), but that method does not have a % similarity parameter. It sounds now like you are using classify-consensus-blast or vsearch, which do not have a training step. Could you please clarify the exact commands you are using for training and classification?

Other users have asked about MaarjAM (here and here) — but that's all I know, not whether it worked! You could get in touch with those users via direct message to compare notes!

I hope that helps! Please let us know if any of this works and/or give us more details on the commands that you are using. Thanks!

mmcintosh · January 10, 2018, 11:16pm

Yes, I am using the NCBI BLAST associated with the rep seqs visualization.

Here are the classifications for my three most common features (non-target) along with the NCBI BLAST link:

b631e7dda8b6f474246df6eb69c64985 Fungi;Glomeromycota;Glomeromycetes
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GCGGTTAAAAAGCTCGTAGTTGGATCCGGGCAGTCGACGCGCGGTGCACCTTACGGTGTTACTGCGTGCTTCGGCGCCACTATCCGGCCATGTCTTGTATGTTCTTTATTGAGCGTACTTGGCGACCGGGACGTTTACTTTGAAAAAATTAGAGTGCTTAAAGCAGGCATATGGTCTTGCATAATGGTGCATGGAATAATGGAATAGGACCTCGGTTCTATTTTGTTGGTTT

63b408700fd3f60d9fb0d0af5a64ed7f Fungi;Glomeromycota
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GCGGTTAAAAAGCTCGTAGTTGGATCTCGGCGGGTAGGGGACGGTCCGCTGTTTGGCGCGACTGTCCCATTCGCCCGTGAATGGTGTGGCGGAGGCGTCAGGGTGCTCTTAACCGAGTGTCCTGGGGCGCCGCCAGGTTTACTTTGAAAAAATTAGAGTGCTCAAAGCAGGCTTTACACGCTTGGATATTCGTGCATGGAATAATGGAATAGGACGTCGTTTCTATTTTGTT

c5feef5a1bd2a1e25082f2d3ac9e98f5 Fungi;Glomeromycota;Glomeromycetes
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?ALIGNMENT_VIEW=Pairwise&PROGRAM=blastn&DATABASE=nt&CMD=Put&QUERY=GCGGTTAAAACGCCCGTAGTTGGATCAGCGGATTTGGAGAACGGTCACCCGATGGGCTGTACTGTTCTCCTGTCCTAATTTCTCTAGTTCGCTCTAGGTGCTCTTCACTGAGTGTCTAGAGTGGCTGGAACTTTTACTTTGAAAAAATTAGAGTGCTCAAAGCAGGCGTCTCAGCCTGAATAGTGGTGCATGGAATAATGGAATAGGACCTCAGTTCTATTTTGTTGGTTTT

So, it looks like I am not receiving a particularly deep classification. My collaborator @Lorinda has shared an updated database with me which includes some non-target sequences that they run into a lot, which I think overlap some but not completely with my non-target sequences. She suggested adding my contaminants and seeing if that helped.

I am referring to extracting reads from my sequences. This is what I used:
qiime feature-classifier extract-reads
--i-sequences MaarjAM-MPG.qza
--p-f-primer CAGCCGCGGTAATTCCAGCT
--p-r-primer TGAACCCAAACACTTTGGTTTCC
--p-trunc-len 280
--p-identity 1.0
--o-reads 18Sref-seqs.qza

If I change --p-identity from the present 0.8 to anything else above that, I did not change the number of features classified by my classifier.

I'll include the rest of the commands I used for training my classifier just for future reference:

qiime tools import
--type 'FeatureData[Sequence]'
--input-path MaarjAM-MPG.fasta
--output-path MaarjAM-MPG.qza

qiime tools import
--type 'FeatureData[Taxonomy]'
--source-format HeaderlessTSVTaxonomyFormat
--input-path MaarjAM-MPG.txt
--output-path ref-taxonomy.qza

qiime feature-classifier extract-reads
--i-sequences MaarjAM-MPG.qza
--p-f-primer CAGCCGCGGTAATTCCAGCT
--p-r-primer TGAACCCAAACACTTTGGTTTCC
--p-trunc-len 280
--p-identity 1.0
--o-reads 18Sref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads 18Sref-seqs.qza
--i-reference-taxonomy ref-taxonomy.qza
--o-classifier 18Sclassifier.qza

qiime feature-classifier classify-sklearn
--i-classifier 18Sclassifier.qza
--i-reads 18Srep-seqs.qza
--o-classification 18Staxonomy.qza

qiime metadata tabulate
--m-input-file 18Staxonomy.qza
--o-visualization 18Staxonomy.qzv

Thanks @Nicholas_Bokulich for the response on this. I've been able to get the BLAST classifier to work, so I may end up using that instead if I can't get MaarjAM to work out. I'll play around with the options you suggested.

Mariah

Nicholas_Bokulich · January 10, 2018, 11:32pm

Hi @mmcintosh,
Thanks for the details!

Indeed... I think that this is consistent with my point that the lack of non-target sequences in the database is causing the classifier to overfit. Their 18S genes still bear enough resemblance to Glomeromycota in the database that the classifier is still classifying them.

That should definitely help! But maybe not 100% if there are other non-target sequences present. I would still recommend comparing your results to classification with SILVA, and classification with the blast/vsearch classifiers in feature-classifier. MaarjAM would probably give more satisfying results (e.g., deeper classifications) than SILVA, but would likely overclassify non-target and other sequences not in the training set. The large size of SILVA would introduce a lot of noise into the database, but the classifications might be more reliable. Worth a try at least.

Thanks for clarifying! The identity parameter in extract-reads just sets the % similarity for the primer alignment. This should not have any effect (unless if your primers don't match some seqs in the maarjam database). Toggling this parameter would make more sense, e.g., if you had AMF-specific primers and were trying to train a SILVA classifier just to include these fungi.

classify-consensus-blast and classify-consensus-vsearch should both have similar behavior, and playing around with the identity and other parameters should give you a little more control to prevent overfitting with the maarjam database. Give the two-step approach that I described a try — it may give you the best of both worlds: very specific classifications to sequences that match the maarjam database closely, but reliable classification to non-target sequences and other fungal hits that are not in maarjam.

I hope that helps! Happy QIIMEing!

system · February 11, 2018, 5:32am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.