Taxonomy Assignment to Cultured Isolates with Classifier

Hi there,

I am working on a project where we have Illumina 16S sequences (V4 and V1-V3) and also have cultured isolates.

Our goal is to determine where our cultured isolates fall in our sequencing data, so in order to keep the taxonomy assignment consistent, we are trying to use a trained classifier.

I trained the classifier using the primers from our Sanger sequencing, and I tried both --p-identity 0.9 and --p-identity 0.8 in the extract-reads step. I used the 7-level majority taxonomy from SILVA.

Regardless, the majority of reads are assigned to D0_Bacteria, with no lower taxonomic levels. When we take these same sequences and BLAST them, or run them through the SINA aligner, they get assignments down to the genus/species level.

The same feature classifier parameters worked well when used with our Illumina data (different primer sets, though).
Our Sanger sequencing reads fall between 27F and 1492R (we sequenced in either the forward or reverse direction and reverse-complemented as needed).

Any help on why our feature classifier may be behaving this way, or alternative suggestions on how to compare the taxonomy between these sequencing sets would be greatly appreciated!

Thanks,
Claire

Hi @clairewill22,
Sorry to hear you’re running into issues!

How are you training your own classifier? Can you share your command? Make sure you use the min and max length parameters with appropriate thresholds… A common issue with training SILVA classifiers specifically is that junk sequences left in SILVA (e.g., with lots of ambiguous bases) can cause hits to disparate kingdoms, which confuses the classifier (search “hot spring metagenome” in the forum archive for some examples!).
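To make the "junk sequences" point concrete, here is a minimal sketch in plain Python (not part of QIIME 2) of pre-filtering a reference FASTA by length and by number of ambiguous bases. The function names and the thresholds (1200, 1600, and 5) are hypothetical placeholders chosen for illustration, not recommended values; pick thresholds that suit your own target region.

```python
from typing import Dict, Iterable, Iterator, Tuple

# IUPAC nucleotide codes other than the four unambiguous bases.
AMBIGUOUS = set("NRYKMSWBDHV")


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_references(
    records: Iterable[Tuple[str, str]],
    min_len: int = 1200,        # hypothetical threshold
    max_len: int = 1600,        # hypothetical threshold
    max_ambiguous: int = 5,     # hypothetical threshold
) -> Dict[str, str]:
    """Keep records inside the length window that have few ambiguous bases."""
    kept = {}
    for header, seq in records:
        n_ambiguous = sum(base in AMBIGUOUS for base in seq.upper())
        if min_len <= len(seq) <= max_len and n_ambiguous <= max_ambiguous:
            kept[header] = seq
    return kept
```

Running this on an exported reference FASTA and counting how many records are dropped gives a quick sense of how much low-quality material is in the training set.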

Also keep your eyes peeled for when the latest QIIME 2 release comes out this week, and follow-ups… the newest pre-trained classifiers may relieve some of these headaches.

Finally, are the 27F/1492R primers included in your query (Sanger) sequences? I’d be surprised if their inclusion is causing this issue, but it’s worth looking if the above advice does not lead to a diagnosis.
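One way to check for leftover primers is sketched below in plain Python. It is only an illustration: the helper names are hypothetical, it assumes the read is already in the forward orientation, and it does exact matching (IUPAC codes are expanded, but no mismatches are tolerated), so a negative result does not prove that the primers are absent.

```python
import re

# Regex character classes for IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (IUPAC-aware)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_pattern(primer: str) -> str:
    """Translate a primer with IUPAC codes into a regex pattern string."""
    return "".join(IUPAC[base] for base in primer.upper())


def primers_present(read: str, f_primer: str, r_primer: str) -> dict:
    """Report whether the read starts with the forward primer and ends with
    the reverse complement of the reverse primer (exact matches only)."""
    read = read.upper()
    starts = re.match(primer_pattern(f_primer), read) is not None
    tail = primer_pattern(reverse_complement(r_primer)) + "$"
    ends = re.search(tail, read) is not None
    return {"forward": starts, "reverse": ends}
```

Calling `primers_present(read, f_primer, r_primer)` on a handful of query sequences with the primers used for sequencing is usually enough to see whether trimming has already been done.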

Hi @Nicholas_Bokulich,
Thanks for your response!

Here are the commands for the classifier training:

qiime feature-classifier extract-reads \
  --i-sequences 99_16S_repset.qza \
  --p-f-primer AGAGTTTGATCCTGGCTCAG \
  --p-r-primer GGTTACCTTGTTAGGACTT \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy-16S-silva-7levels.qza \
  --o-classifier classifier-isos-cwtrained.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier-isos-cwtrained.qza \
  --i-reads …/rep-seqs.qza \
  --o-classification cw-trained-taxonomy-isolates.qza

I did not use the min and max length parameters. That’s a good suggestion as I did end up with a bunch of short ref-seqs and didn’t know how to fix this!

The parameters apply to the SILVA database sequences, right? If so, what would be a good minimum length? A bit below the min length for the target section?

Our Sanger sequences have been trimmed to remove primers – thanks for the note, though!

Bingo! The defaults are something like 100 min and 400 max (these defaults are set with short amplicon seqs in mind), so that spells bad news for a full-length classifier.
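To illustrate why a short-amplicon length window is a problem here, the sketch below applies a min/max filter to an approximate 27F to 1492R amplicon length. This is plain Python, not QIIME 2 code: the function is hypothetical, treating 0 as "bound disabled" is an assumption of this sketch, and the 100/400 window simply reuses the approximate figures quoted above. The real defaults for your release are worth confirming with `qiime feature-classifier extract-reads --help`.

```python
def passes_length_window(length: int, min_length: int, max_length: int) -> bool:
    """Return True if an amplicon of `length` bases survives the window.

    A bound of 0 is treated as disabled (an assumption of this sketch).
    """
    if min_length and length < min_length:
        return False
    if max_length and length > max_length:
        return False
    return True


# 27F and 1492R are named after E. coli 16S positions, so the amplicon
# spans very roughly this many bases (real sequences vary).
approx_full_length = 1492 - 27 + 1

# Window sized for short amplicons: the near-full-length amplicon is dropped.
short_window_ok = passes_length_window(approx_full_length, 100, 400)

# Hypothetical window sized for near-full-length 16S: the amplicon is kept.
wide_window_ok = passes_length_window(approx_full_length, 1200, 1600)

print(approx_full_length, short_window_ok, wide_window_ok)
```

With a window like the first one, only truncated or spurious extractions would remain in the reference set, which fits the short ref-seqs described earlier in the thread.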

Instead of re-doing this, I recommend waiting a day or so for the new release of QIIME 2 to come out… the pre-trained full length classifier should work for you, and we prepared this database in a new way (keep your eyes peeled for the release notes for more details).

The SILVA 138 rep seqs and taxonomy formatted for QIIME 2 will also be released alongside those in case you want to train on those sequences (the SILVA website only has the 132 release in QIIME-compatible formats), so keep an eye out.