Classify-sklearn low classification depth

ErikaGanda · December 20, 2017, 6:17pm

Hello, I am having a similar issue using sklearn.

I have added the code I ran and the QZV file. I got very poor classification when I used Greengenes. I am also trying SILVA, but it has been running for a while.

(qiime2-2017.11) dhcp-morris-1937:Merged ErikaGanda$ qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification gg-taxonomy.qza

qiime metadata tabulate \
  --m-input-file gg-taxonomy.qza \
  --o-visualization gg-taxonomy.qzv

qiime taxa barplot \
  --i-table table.qza \
  --i-taxonomy gg-taxonomy.qza \
  --m-metadata-file sample-metadata.txt \
  --o-visualization gg-taxa-bar-plots.qzv

gg-taxa-bar-plots.qzv (1009.1 KB)

Your comments are greatly appreciated!

Nicholas_Bokulich · December 20, 2017, 6:56pm

Hi @ErikaGanda,
Your problem is a little different: the other post concerned vsearch specifically, and those sequences were not being assigned any taxonomy, whereas yours are assigned but only to a shallow depth. I have therefore split this into its own thread.

Unless your reads are very short or low quality, you should be getting much deeper classification with this classifier! So let's take a step back and examine how these reads were put together: you may be using the wrong classifier for your reads, depending on which primers you used.

Some other users have reported similar problems, e.g., here. This issue is usually caused by:

The wrong reference sequences are being used (or were extracted improperly).
The query sequences are very short or low quality.

So, some questions about your data:

What primers are you using? I notice that for classification you used the Greengenes classifier trained on the V4 region, extracted with the 515f/806r primers. Is this appropriate for your input sequences?
How long are your query sequences? Did you use QIIME2 for all upstream steps (e.g., dada2 for quality control) or are you importing these reads from elsewhere for taxonomy assignment?
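
One quick way to answer the read-length question is to summarize the lengths of the representative sequences. The following is only a minimal sketch, assuming rep-seqs.qza has already been exported to a plain FASTA file (QIIME 2 names the exported file dna-sequences.fasta). The function name seq_length_summary is made up for illustration.

```shell
# Summarize sequence lengths in a FASTA file (wrapped sequence lines are handled).
# Usage: seq_length_summary dna-sequences.fasta
seq_length_summary() {
  awk '
    # Fold one finished sequence length into the running statistics.
    function record(l) {
      n++; total += l
      if (n == 1 || l < min) min = l
      if (l > max) max = l
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { if (seen) record(len); seen = 1; len = 0; next }
    # Any other line is sequence: strip whitespace and add its length.
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END {
      if (seen) record(len)
      if (n == 0) { print "no sequences found"; exit 1 }
      printf "n=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
    }
  ' "$1"
}
```

Running `seq_length_summary dna-sequences.fasta` prints a single line with the number of sequences and their minimum, maximum, and mean length. If most sequences are far shorter than the amplicon your primers should produce, shallow classification would not be surprising.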

Trying SILVA probably will not help here: if Greengenes is performing poorly, so will a SILVA classifier trained on the same amplicon region. SILVA also takes around 30X as long to run because the database is much bigger.

Let us know if the above helps sort out the issue!

ErikaGanda · December 20, 2017, 7:02pm

Thanks @Nicholas_Bokulich!

I think you have already shed a lot of light on my issue! I am using the wrong primers.

According to the materials and methods that were provided to me, I am using 342F and 806R. They also had ITS primers.
How should I proceed in this case? I have added a section of the methods below:

To create an amplicon suitable for Illumina sequencing, fusion primers were designed that contained Illumina adaptor sequences (TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG) forward, (GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG) reverse, and a gene-specific primer. In this instance, two sets of primers were designed using the V3/V4 region of the 16S rRNA gene, primers 342F (CTACGGGGGGCAGCAG) and 806R (GGACTACCGGGGTATCT) (Mori et al, 2014) and the internal transcribed spacer fungi primers ITS86F (GTGAATCATCGAATCTTTGA) (Turenne et al, 1999) and ITS4 (TCCTCCGCTTATTGATATGC) (White et al, 1990).

Nicholas_Bokulich · December 20, 2017, 7:44pm

Hi @ErikaGanda,
Wow, complicated setup! Thanks for providing the methods description — this makes things a lot clearer.

Do these primers all have the same barcodes? i.e., does sample A use the same barcode on both sets of primers? Or do the different primer sets have unique barcodes? I ask because it would probably make the most sense to demultiplex and analyze these as separate data sets if possible. Having a mixture of different primer regions in the same data set may cause issues when denoising with dada2 (assuming that is what you did). It would also seriously complicate classification, since many of the reads would (should) be unassigned when classifying with a given classifier.
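
If you are unsure whether a given file mixes the two primer sets, a rough check is to count how many forward reads begin with each forward primer. This is only a sketch under several assumptions: the FASTQ file is uncompressed with four lines per record, the gene-specific primers (342F and ITS86F from your methods) are still present at the very start of the forward reads, and only exact matches are counted. The function name count_primer_starts is hypothetical.

```shell
# Count forward reads in a FASTQ file that begin with each forward primer.
# Usage: count_primer_starts forward-reads.fastq
count_primer_starts() {
  awk -v p16s="CTACGGGGGGCAGCAG" -v pits="GTGAATCATCGAATCTTTGA" '
    # The second line of every four-line FASTQ record holds the sequence.
    NR % 4 == 2 {
      if (index($0, p16s) == 1) n16s++        # starts with 342F
      else if (index($0, pits) == 1) nits++   # starts with ITS86F
      else other++                            # neither primer at position 1
    }
    END { printf "16S=%d ITS=%d other=%d\n", n16s, nits, other }
  ' "$1"
}
```

A large "other" count would suggest the primers were already trimmed, or that they are offset from the read start, in which case this check is not informative. Substantial counts for both 16S and ITS in one file would point to a mixed data set that should be separated before denoising.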

You can follow the tutorial for training a feature classifier on a specific amplicon region. For each of your primer pairs, you can generate a tailored classifier with that amplicon region extracted. Just pick the reference database you prefer.
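
For the 16S primer pair, the two tutorial steps (extract the amplicon region, then train) might look roughly like the sketch below. The primer sequences come from the methods quoted above. The file names ref-seqs.qza and ref-taxonomy.qza are placeholders for whichever reference database you import, and the run helper and DRY_RUN variable are conveniences of this sketch, not part of QIIME 2. By default the commands are only printed, since training can take a long time.

```shell
# Print each command; execute it only when DRY_RUN=0.
run() {
  printf '%s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

train_v3v4_classifier() {
  # Placeholder names: reference sequences and taxonomy already imported as .qza
  ref_seqs="ref-seqs.qza"
  ref_tax="ref-taxonomy.qza"

  # Trim the reference sequences to the region between 342F and 806R.
  run qiime feature-classifier extract-reads \
    --i-sequences "$ref_seqs" \
    --p-f-primer CTACGGGGGGCAGCAG \
    --p-r-primer GGACTACCGGGGTATCT \
    --o-reads ref-seqs-342f-806r.qza || return 1

  # Train a naive Bayes classifier on the extracted region.
  run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-seqs-342f-806r.qza \
    --i-reference-taxonomy "$ref_tax" \
    --o-classifier classifier-342f-806r.qza
}
```

Calling `train_v3v4_classifier` prints the two commands for review, and `DRY_RUN=0 train_v3v4_classifier` runs them. The resulting classifier-342f-806r.qza would then take the place of the 515-806 classifier in classify-sklearn. The ITS primer pair would need the same two steps against a fungal reference.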

For 16S rRNA genes, you can simply use the full-length Greengenes classifier provided on the QIIME 2 site instead of training a new one. A classifier trained on your exact sub-domain is slightly more accurate, but in my experience the difference is small enough that training your own is not imperative (which is useful to know if, e.g., your computer lacks the memory to train a large classifier).

The other post that I linked to above has some specific notes you may find useful for training an ITS classifier. Specifically, if you use the UNITE database, make sure to use the "developer" version, which is included in the regular downloads in a separate directory. The "normal" version has the ITS domain itself extracted, which may trim off primers that sit in the rRNA subdomains bracketing the ITS.

I hope that helps! Let me know if the path forward is still unclear.

system · January 21, 2018, 1:44am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.