My taxa bar plot and taxonomy are not as good as I had hoped. Most sequences are classified only to order level, and many only to kingdom level. How can I improve my results? Should I train the classifier as in the tutorial? I am not sure which artifacts to use besides rep-seqs.qza.
See this post. Depending on the length of your sequences, the primers you are using, etc., classification deeper than order level might be unreliable. On 250 bp 16S rRNA gene sequences, we can usually get genus level and some species with a good degree of reliability. The naive Bayes classifier in QIIME 2 is designed to be cautious and reliable, to avoid false-positive errors, but you can lower the confidence parameter to make it less cautious.
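To make the effect of the confidence parameter a bit more concrete, here is a minimal Python sketch of the general idea: a classification is kept rank by rank only while its confidence stays at or above the threshold. This is a simplified illustration and not the actual QIIME 2 implementation; the function name, the example taxonomy, and the per-rank confidence values are all made up for the example.

```python
def truncate_by_confidence(ranks, confidences, threshold=0.7):
    """Keep leading ranks while their confidence is >= threshold.

    Hypothetical helper for illustration only; the 0.7 default is just
    a placeholder value for this sketch.
    """
    kept = []
    for rank, conf in zip(ranks, confidences):
        if conf < threshold:
            break  # stop at the first rank that is not confident enough
        kept.append(rank)
    return "; ".join(kept) if kept else "Unassigned"


# Made-up example values, not real classifier output.
ranks = ["k__Bacteria", "p__Firmicutes", "c__Clostridia",
         "o__Clostridiales", "f__Lachnospiraceae", "g__Blautia"]
confidences = [1.0, 0.99, 0.95, 0.9, 0.65, 0.5]

print(truncate_by_confidence(ranks, confidences, threshold=0.7))
print(truncate_by_confidence(ranks, confidences, threshold=0.5))
```

With the stricter threshold the label stops at order level; with the lower one it reaches genus, at the cost of a higher risk of false positives.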
If you have:
1. only kingdom-level classification, it is usually a result of human error (some users have reported this, and usually they had used the wrong database). That does not sound like your problem.
2. a mixture of shallow (kingdom) and deep (species) level classification, it is usually an issue either with contaminant DNA present in the samples or with some sequences being unusually short. See here for ways to diagnose this.
3. mostly order-level classification, it is probably characteristic of:
a. the marker gene/primers
b. the database
c. the length of your sequences
d. all of the above.
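To see which of the cases above applies, it can help to count how deep each classification goes. The following is a rough Python sketch, assuming Greengenes-style taxonomy strings (`k__...; p__...; ...`) such as those found in the Taxon column of an exported taxonomy table. The function names are hypothetical, and the example strings are invented.

```python
from collections import Counter

RANK_NAMES = ["kingdom", "phylum", "class", "order",
              "family", "genus", "species"]


def deepest_rank(taxon):
    """Return the deepest named rank in a 'k__X; p__Y; ...' string."""
    if taxon.strip().lower() == "unassigned":
        return "unassigned"
    depth = 0
    for level in taxon.split(";"):
        level = level.strip()
        # A level only counts if a name follows the 'x__' prefix.
        name = level.split("__", 1)[1] if "__" in level else level
        if not name:
            break
        depth += 1
    if depth == 0:
        return "unassigned"
    return RANK_NAMES[min(depth, len(RANK_NAMES)) - 1]


def depth_summary(taxa):
    """Count how many taxonomy strings stop at each rank."""
    return Counter(deepest_rank(t) for t in taxa)


# Invented example strings, for illustration only.
example = [
    "k__Bacteria",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__",
    "Unassigned",
]
print(depth_summary(example))
```

A summary dominated by "kingdom" points to the first case, a split between "kingdom" and "genus"/"species" to the second, and a peak at "order" to the third.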
So, please give us more details. What primers, database, and length of sequences are you using?
How are you currently training your classifier? That could be the issue here. Yes, you should train your classifier following the tutorial (but with the appropriate database) or use one of our pre-trained classifiers for 16S rRNA gene data.
Without the right conditions, taxonomy results can often be disappointing. We all want species, but sometimes our data can only yield so much information. Again, 16S rRNA data with 250 bp reads can usually get genus level reliably and some species, but other marker genes and shorter read lengths can often be much less satisfying. Go through the steps above and let's see if it's possible to improve this in your data.
I have 450 bp long sequences and I am trying to create my own classifier by following the "train your classifier" tutorial. However, I receive this error message: (the primers should be correct)
Forward: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG
Reverse: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC
The insert size is 550 bp and we sequenced 600 bp.
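A side note on the primers quoted above: both appear to start with what look like Illumina overhang adapter sequences, followed by the locus-specific 16S part. When extracting reads from a reference database, only the locus-specific part can be expected to match the reference sequences. Here is a small Python sketch that strips the adapter prefix; the two overhang constants are an assumption based on how the primers above are laid out, and the helper name is hypothetical.

```python
# Assumed Illumina overhang adapter prefixes (an assumption, verify
# against your own library prep protocol).
FWD_OVERHANG = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
REV_OVERHANG = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"

# Full primers as posted in this thread.
fwd_full = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
rev_full = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"


def strip_overhang(primer, overhang):
    """Remove a leading adapter overhang, if present (hypothetical helper)."""
    primer = primer.strip().upper()
    if primer.startswith(overhang):
        return primer[len(overhang):]
    return primer  # no overhang found, return the primer unchanged


fwd_locus = strip_overhang(fwd_full, FWD_OVERHANG)
rev_locus = strip_overhang(rev_full, REV_OVERHANG)
print(fwd_locus)
print(rev_locus)
```

The shorter sequences that remain are the ones that would be passed as forward and reverse primers when trimming the reference database to the amplified region.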
Hi @Jasmine84 - I think this error means that your reference reads and your reference taxonomy have no overlapping feature IDs. This can happen if you are accidentally combining files from different reference databases, for example. Have you had a chance to look at this tutorial?
If you need more assistance, please send us links to download the two QZA files used as inputs in that failing command. Thanks!
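A quick way to check for overlapping feature IDs yourself, before importing anything, is to compare the IDs in the raw files. Below is a minimal Python sketch, assuming a FASTA file of reference sequences and a headerless, tab-separated taxonomy file whose first column is the ID (the layout used by the Greengenes files). The function names and the file paths in the comment are hypothetical.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def taxonomy_ids(lines):
    """Collect IDs from the first tab-separated column of each line."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line:
            ids.add(line.split("\t")[0])
    return ids


def overlap_report(fasta_lines, taxonomy_lines):
    """Summarise how many IDs are shared between the two inputs."""
    seqs = fasta_ids(fasta_lines)
    taxa = taxonomy_ids(taxonomy_lines)
    return {
        "shared": len(seqs & taxa),
        "only_in_sequences": len(seqs - taxa),
        "only_in_taxonomy": len(taxa - seqs),
    }


# With real files (paths are placeholders):
# with open("ref-seqs.fasta") as f, open("ref-taxonomy.txt") as t:
#     print(overlap_report(f, t))

demo_fasta = [">1001 some description", "ACGT", ">1002", "GGCC"]
demo_taxonomy = ["1001\tk__Bacteria; p__Firmicutes", "2000\tk__Archaea"]
print(overlap_report(demo_fasta, demo_taxonomy))
```

If `shared` comes back as 0, the sequences and the taxonomy most likely do not belong to the same reference database, which would be consistent with the error described above.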
Yes. I wondered about this when I tried to create my classifier. We downloaded 99_otus_fasta and 99_otu_taxonomy.txt (Greengenes database) and created 99_ref-taxonomy.qza and classifier.qza through the tutorial you linked to. However, my rep-seqs.qza is directly linked to the NCBI database.
It looks like you mixed up your files: your reference sequences should be the sequences provided as part of the taxonomic database, not your own reads. Double-check the feature classifier tutorial for a solid example. That tutorial shows how to import the Greengenes reference reads and then extract the region of interest.
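For intuition about what "extract the region of interest" means, here is a simplified Python sketch that cuts out the stretch of a reference sequence lying between a forward primer and the reverse complement of a reverse primer, honouring IUPAC degenerate bases. It only does exact matching with no mismatches allowed, so it is a conceptual illustration and not how QIIME 2 performs the extraction; all function names are hypothetical and the reference sequence is synthetic.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement, including degenerate IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_primer(seq, primer, start=0):
    """Index of the first exact (degenerate-aware) primer match, or -1."""
    for i in range(start, len(seq) - len(primer) + 1):
        window = seq[i:i + len(primer)]
        if all(base in IUPAC[p] for base, p in zip(window, primer)):
            return i
    return -1


def extract_region(ref_seq, fwd_primer, rev_primer):
    """Return the sequence between the two primer sites, or None."""
    ref_seq = ref_seq.upper()
    f = find_primer(ref_seq, fwd_primer)
    if f < 0:
        return None
    begin = f + len(fwd_primer)
    r = find_primer(ref_seq, reverse_complement(rev_primer), begin)
    if r < 0:
        return None
    return ref_seq[begin:r]


# Synthetic reference built around the locus-specific primers from this thread.
fwd = "CCTACGGGNGGCWGCAG"
rev = "GACTACHVGGGTATCTAATCC"
ref = ("TTTT" + "CCTACGGGAGGCAGCAG" + "ACGTACGTAC"
       + reverse_complement("GACTACACGGGTATCTAATCC") + "GGGG")
print(extract_region(ref, fwd, rev))
```

The point of the sketch is that the reference sequences are trimmed to the region your primers amplify, while the reference taxonomy keeps the same IDs; your own reads are only used later, at the classification step.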