Train your classifier - taxonomy

Jasmine84 · March 14, 2018, 7:49am

Hello

My taxa bar plot and taxonomy are not as good as I hoped. Most sequences are classified only to order level, and many only to kingdom level. How can I improve my results? Should I train the classifier as in the tutorial? I am not sure which artifacts to use besides rep-seqs.qza.

Best regards
Jasmine

Nicholas_Bokulich · March 14, 2018, 12:52pm

Hi @Jasmine84,

See this post. Depending on the length of your sequences, the primers you are using, etc., classification deeper than order level might be unreliable. On 250 bp 16S rRNA gene sequences, we can usually get genus level and some species with a good degree of reliability. The naive Bayes classifier in QIIME 2 is designed to be cautious and reliable, to avoid false-positive errors, but you can lower the confidence parameter to make it less cautious.
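To make the effect of the confidence parameter concrete, here is a toy sketch of how a threshold truncates a lineage. The lineage and the per-level confidence values are made up for illustration, and the function is a simplified stand-in, not the classifier's actual code. The value 0.7 is used because it is, to my knowledge, the default of --p-confidence in classify-sklearn.

```python
# Illustrative only: the confidence values below are invented, not classifier output.
LINEAGE = [
    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Lactobacillales",
    "f__Lactobacillaceae", "g__Lactobacillus", "s__casei",
]
# Hypothetical confidence that the assignment is correct down to each level.
# Confidence can only stay equal or drop as the level gets deeper.
CONFIDENCES = [1.00, 0.99, 0.97, 0.92, 0.66, 0.58, 0.21]


def truncate_by_confidence(lineage, confidences, threshold):
    """Keep the deepest prefix of the lineage whose confidence meets the threshold."""
    kept = []
    for taxon, conf in zip(lineage, confidences):
        if conf < threshold:
            break
        kept.append(taxon)
    return kept


cautious = truncate_by_confidence(LINEAGE, CONFIDENCES, 0.7)
relaxed = truncate_by_confidence(LINEAGE, CONFIDENCES, 0.5)
print("threshold 0.7:", "; ".join(cautious))
print("threshold 0.5:", "; ".join(relaxed))
```

With these invented numbers, the stricter threshold stops at order level while the lower one reaches genus. The trade-off is that the extra levels are reported with less support, so false positives become more likely.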

If you have:

1. only kingdom-level classification, this is usually the result of human error (some users have reported this, and usually they had used the wrong database). That does not sound like your problem.
2. a mixture of shallow (kingdom) and deep (species) level classification, this is usually caused either by contaminant DNA in the samples or by some sequences being unusually short. See here for ways to diagnose this.
3. mostly order-level classification, this is probably characteristic of:
   a. the marker gene/primers
   b. the database
   c. the length of your sequences
   d. all of the above.
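One quick way to check for unusually short sequences is to look at the length of each representative sequence. The sketch below assumes the sequences have been exported to FASTA (for example with qiime tools export); the 200 bp cutoff and the sequence IDs are arbitrary choices for illustration, not recommended values.

```python
def read_fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def flag_short(lengths, min_len):
    """Return the sorted IDs of sequences shorter than min_len."""
    return sorted(seq_id for seq_id, n in lengths.items() if n < min_len)


# Made-up example records; with a real file, pass an open file handle instead.
example = [
    ">feature_a",
    "ACGT" * 100,   # 400 bp
    ">feature_b",
    "ACGT" * 10,    # 40 bp, suspiciously short
]
lengths = read_fasta_lengths(example)
short_ids = flag_short(lengths, min_len=200)
print("short sequences:", short_ids)
```

To run it on real data, replace the example list with something like open("dna-sequences.fasta"), where the file name is whatever your export produced.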

So, please give us more details. What primers, database, and length of sequences are you using?

How are you currently training your classifier? That could be the issue here. Yes, you should train your classifier following the tutorial (but with the appropriate database) or use one of our pre-trained classifiers for 16S rRNA gene data.

Without the right conditions, taxonomy results can often be disappointing. We all want species, but sometimes our data can only yield so much information. Again, 16S rRNA data with 250 bp reads can usually get genus level reliably and some species, but other marker genes and shorter read lengths can often be much less satisfying. Go through the steps above and let's see if it's possible to improve this in your data.

Nicholas_Bokulich · March 15, 2018, 12:51pm

An off-topic reply has been split into a new topic: Error from vsearch join-reads: exit code 1

Please keep replies on-topic in the future.

Jasmine84 · March 23, 2018, 1:16pm

Hello again

I have 450 bp sequences and am trying to create my own classifier by following the "train your classifier" tutorial. However, I receive this error message (the primers should be correct):

What else should I check to solve this problem? (The commands I used are below.)

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path 99_otus.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --source-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path 99_ref-taxonomy.qza

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG \
  --p-r-primer GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC \
  --p-trunc-len 450 \
  --o-reads 99_450_ref-seqs.qza

Nicholas_Bokulich · March 23, 2018, 4:46pm

See this forum thread for a solution to an identical problem.

Good luck!

Jasmine84 · March 27, 2018, 8:38am

It still does not work.

qiime feature-classifier extract-reads --i-sequences $dir/rep-seqs_450.qza --p-f-primer CCTACGGGNGGCWGCAG --p-r-primer GACTACHVGGGTATCTAATCC --o-reads $dir/ref-seqs_450.qza

Do you have any suggestions as to which primers to use?

I received this information from my lab technician:
5’ CAAGCAGAAGACGGCATACGAGAT[i7]GTCTCGTGGGCTCGG

5’ AATGATACGGCGACCACCGAGATCTACAC[i5]TCGTCGGCAGCGTC

and these are the 16S primers

Forward: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG
Reverse: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC
Insert size is 550 bp and we sequenced 600 bp
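For readers comparing the two sets of primers in this post: the short primers in the extract-reads command are the 3' ends of the full oligos from the lab. The sketch below shows that relationship. The overhang strings are inferred by comparing the sequences in this thread, not copied from a kit manual, so treat them as an assumption and verify them against your own protocol.

```python
# Full oligos as reported by the lab (copied from this thread).
FWD_FULL = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
REV_FULL = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"

# Assumed adapter overhangs, inferred from the difference between the full
# oligos and the shorter primers used in the extract-reads command.
FWD_OVERHANG = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
REV_OVERHANG = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"


def strip_overhang(oligo, overhang):
    """Return the part of the oligo that follows the 5' overhang."""
    if not oligo.startswith(overhang):
        raise ValueError("oligo does not start with the expected overhang")
    return oligo[len(overhang):]


fwd_primer = strip_overhang(FWD_FULL, FWD_OVERHANG)
rev_primer = strip_overhang(REV_FULL, REV_OVERHANG)
print("--p-f-primer", fwd_primer)
print("--p-r-primer", rev_primer)
```

Only the gene-specific part can match a 16S reference sequence, since the overhang is not part of the gene. That is why the full oligos are not useful as search patterns for extract-reads.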

Jasmine84 · March 27, 2018, 9:20am

Now it worked! However, the next error is this one:

Can you help me with this?

thermokarst · March 27, 2018, 4:43pm

Hi @Jasmine84 - I think this error means that your reference reads and your reference taxonomy have no overlapping feature IDs. This can happen if you are accidentally combining files from different reference databases, for example. Have you had a chance to look at this tutorial?

If you need more assistance, please send us links to download the two QZA files used as inputs in that failing command. Thanks!
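A rough way to test for this kind of mismatch outside of QIIME 2 is to compare the IDs in the reference FASTA with the IDs in the first column of the taxonomy file. This is only a sketch: it assumes a headerless, tab-separated taxonomy file like the one imported earlier in this thread, and all IDs in the example are made up.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def taxonomy_ids(lines):
    """Collect IDs from the first column of a headerless TSV taxonomy."""
    ids = set()
    for line in lines:
        if line.strip():
            ids.add(line.rstrip("\n").split("\t")[0].strip())
    return ids


def overlap_report(fasta_lines, taxonomy_lines):
    """Count IDs shared between, and unique to, the two inputs."""
    seq = fasta_ids(fasta_lines)
    tax = taxonomy_ids(taxonomy_lines)
    return {
        "shared": len(seq & tax),
        "only_in_sequences": len(seq - tax),
        "only_in_taxonomy": len(tax - seq),
    }


# Made-up data: taxonomy keyed by numeric database IDs ...
taxonomy = ["1001\tk__Bacteria; p__Firmicutes", "1002\tk__Bacteria; p__Proteobacteria"]
# ... reference sequences from the same database ...
matching_ref = [">1001", "ACGTACGT", ">1002", "TTGACCAA"]
# ... versus sequences that came from a different source.
mismatched_ref = [">read_abc", "ACGTACGT", ">read_def", "TTGACCAA"]

good = overlap_report(matching_ref, taxonomy)
bad = overlap_report(mismatched_ref, taxonomy)
print("matching inputs:", good)
print("mismatched inputs:", bad)
```

A "shared" count of zero would be consistent with the error described above, where the sequences and the taxonomy come from different sources.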

Jasmine84 · April 4, 2018, 7:18am

Yes. I wondered about this when I tried to create my classifier. We downloaded 99_otus.fasta and 99_otu_taxonomy.txt (Greengenes database) and created 99_ref-taxonomy.qza and classifier.qza by following the tutorial you linked to. However, my rep-seqs.qza is directly linked to the NCBI database.

So, how can I solve this problem?

thermokarst · April 4, 2018, 9:40pm

Please see my request from 8 days ago:

Jasmine84 · April 12, 2018, 11:36am

filer.zip (2.8 MB)

Here you are

thermokarst · April 13, 2018, 3:25am

Thanks @Jasmine84 - I used the decentralized provenance to look at two of the files you provided:

rep-seqs_450.qza

ref-seqs_450.qza

It looks like you mixed up your files. Your reference sequences should be the sequences provided as part of the taxonomic database, not your own reads. Double-check the feature classifier tutorial for a solid example: it shows how to import the Greengenes reference reads and then extract the region of interest.

Hope that helps!