Understanding output from classify-hybrid-vsearch-sklearn

bkayser · April 23, 2020, 4:11pm

I have run the classify-hybrid-vsearch-sklearn feature classifier and, based on the output, I want to make sure that it produced what it is supposed to.

The consensus column in the output reports "1.0" and the method column reports "sklearn" for every feature. Should these values differ depending upon whether the taxonomy was determined by exact match or the feature classifier? In addition to making sure the program ran correctly, it would be very handy to know which annotations were produced from exact sequence matching.

I've copied my code below, which was run in QIIME 2 2019.10:

# Import the Greengenes data into QIIME 2 artifacts

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path 99_otus.qza

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --p-min-length 100 \
  --p-max-length 400 \
  --o-reads ref-seqs-gg99.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path ref-taxonomy-gg99.qza

# Run the classifier - GG

qiime feature-classifier classify-hybrid-vsearch-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-query rep-seqs-dada.qza \
  --i-reference-reads ref-seqs-gg99.qza \
  --i-reference-taxonomy ref-taxonomy-gg99.qza \
  --p-threads 3 \
  --o-classification taxonomy-gg-hybrid-dada.qza

# Create a summary object for taxonomy table - GG

qiime metadata tabulate \
  --m-input-file taxonomy-gg-hybrid-dada.qza \
  --o-visualization taxonomy-gg-hybrid-dada.qzv

Nicholas_Bokulich · April 23, 2020, 4:36pm

Hi @bkayser,
Thanks for testing out this method. Note the warning in the help docs: this particular method is still an experimental "alpha release", so it has not been fully benchmarked.

Yes, these columns report the method used for classification (sklearn is used only if no exact match is found) and the associated score: the consensus (for LCA following vsearch) or the confidence (for sklearn).

Evidently no exact matches were found, if the "method" column reports sklearn for every feature.
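To see at a glance which features were classified by which method, the exported taxonomy table can be tallied by its method column. This is only a minimal sketch: the table contents, the header names ("Feature ID", "Taxon", "Consensus", "Method") and the "vsearch" label are assumptions for illustration, so check them against your own exported file.

```python
import csv
import io
from collections import Counter

# Hypothetical exported taxonomy table. The header names and the "vsearch"
# label are assumptions; only "sklearn" is confirmed by the thread.
SAMPLE_TSV = (
    "Feature ID\tTaxon\tConsensus\tMethod\n"
    "f1\tk__Bacteria; p__Firmicutes\t1.0\tsklearn\n"
    "f2\tk__Bacteria; p__Bacteroidetes\t1.0\tsklearn\n"
    "f3\tk__Bacteria; p__Proteobacteria\t1.0\tvsearch\n"
)


def count_methods(handle, method_column="Method"):
    """Count how many features were assigned by each classification method."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[method_column] for row in reader)


method_counts = count_methods(io.StringIO(SAMPLE_TSV))
print(method_counts)
```

If every feature lands under "sklearn", as in the original post, no exact matches were found.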

Exact matches are end-to-end exact matches, so the reference sequences and query sequences must be identical. It looks like you are trimming the reference sequences, but unless the query sequences cover exactly the same site (i.e., paired-end V4 seqs with primers removed), you will not get exact matches.
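The difference between an end-to-end match and a query that is merely contained in a longer reference can be sketched as follows. The sequences are toy strings made up for illustration, and the function names are hypothetical, not part of QIIME 2 or VSEARCH.

```python
def is_end_to_end_match(query, reference):
    """True only if query and reference are identical over their full length."""
    return query.upper() == reference.upper()


def is_contained_match(query, reference):
    """True if the query occurs anywhere inside the reference."""
    return query.upper() in reference.upper()


# Toy data (assumption): an "amplicon" and a longer reference that contains it.
query = "GTGCCAGC"
reference_trimmed = "GTGCCAGC"
reference_full = "AAAC" + "GTGCCAGC" + "TTTG"

print(is_end_to_end_match(query, reference_trimmed))
print(is_end_to_end_match(query, reference_full))
print(is_contained_match(query, reference_full))
```

Under end-to-end matching, the untrimmed reference fails even though the query sits inside it without a single mismatch.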

I hope that helps!

bkayser · April 23, 2020, 4:54pm

Thanks for the quick response @Nicholas_Bokulich.

I actually tried trimming the reference sequences because I had the same result (no exact matches) when I used the full length sequences (the 99_otus.fasta file from the data resources page).

I am able to get plenty of exact matches when I use the dada2::addSpecies function in R on the same rep-seqs file, so I must be doing something wrong in QIIME2.

Could you clarify the exact match for me? Assuming representative sequence A from my samples is in the GreenGenes database, will an exact match be found if sequence A is 250 nt in length but all the reference sequences contain the full length 16S sequences? Or do I need to trim all of my representative sequences and all of the reference sequences to 250 nt?

I am learning an awful lot about bioinformatics using QIIME2 and your great support, so thanks for everything!

Nicholas_Bokulich · April 23, 2020, 5:03pm

An exact match in dada2::addSpecies might not be an end-to-end exact match, just 100% coverage of the query when aligning to the ref.

No, the ref seq must be trimmed to exactly the same length as the query. There can be many reasons, both technical and biological, why no exact matches are found even once you trim to the same site; e.g., even a slight mismatch during read joining will cause the exact match to fail.

The reason I set up the method this way is that VSEARCH has an exact match method that is blazing fast, since it actually skips alignment and does a dictionary lookup instead (I think). But that dict lookup means that even a 1 nt mismatch fails. I have considered switching to looking for an exact match via alignment because getting that exact match can be so difficult in practice, as you've found, since it requires trimming the reference sequences. But then the alignment step would be super slow...
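To illustrate why a dictionary lookup is fast but unforgiving, here is a rough sketch of the general idea. It is not VSEARCH's actual implementation (which I have not verified); the function names and toy sequences are assumptions.

```python
from collections import defaultdict


def build_exact_index(references):
    """Map each full reference sequence to the IDs that carry it."""
    index = defaultdict(list)
    for ref_id, seq in references.items():
        index[seq.upper()].append(ref_id)
    return index


def lookup_exact(index, query):
    """Return reference IDs whose sequence is identical to the query."""
    # .get() avoids creating a new empty entry for a missing key.
    return index.get(query.upper(), [])


# Toy reference set (assumption), already trimmed to the amplicon region.
references = {"ref1": "ACGTACGT", "ref2": "ACGTTCGT", "ref3": "ACGTACGT"}
index = build_exact_index(references)

hits = lookup_exact(index, "ACGTACGT")      # identical to ref1 and ref3
misses = lookup_exact(index, "ACGTACGA")    # 1 nt differs from ref1/ref3
print(hits, misses)
```

Each query costs a single hash lookup regardless of how many references there are, but there is no notion of "close": a one-base difference or a length difference simply returns nothing.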

system · May 24, 2020, 11:04pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.