Taxonomic Classification Questions metatranscriptomic data


I have a rather unique type of data set in which my sequences began as metatranscriptomic data containing rRNA and I was able to extract this 16S rRNA computationally using a program that can use SILVA full 16S sequences and separate out the 16S from the rest of the sequences. I have been able to get through the analyses in the “moving pictures” tutorial all the way up to the taxonomic classification. I got really nice PCoA plots showing what I expected to see in my data but when I classify I get a lot of unclassified organisms. I believe this may be due to my data containing multiple 16S variable regions instead of all being v4 for instance and I can not train a classifier to a certain set of primers since 16S primers were not used to create the dataset. Is there a way to create a classifier that will be able to detect all the different variable regions and would the silva database be more useful since it is the most recently updated? For example, can the classifier detect the entire 16S sequence or would a classifier have to be trained using multiple primer seqs?

Thank you,


Hello Zach,

Great question! My first thought is that qiime 2 comes with multiple classifiers, and you should be able to customize them to match your needs. For example, the vsearch based LCA classifier let’s you change the percent identidy to consider a match (80% by default) and also let’s you search in both directions using --p-strand both. What settings did you use? How could you change them to make them a better fit for your shotgun reads.

My second thought is to avoid OTU picking / denoising / dada2 altogether, and just match your reads against SILVA. After all, we know this will work; that’s how they were identified as 16S genes in the first place! Try this script:
The benefit of closed-ref OTU picking is that taxonomic assignment is not needed. You just use the taxonomy of the database.


1 Like

Hi @Zach_Burcham,
What classifier are you using to predict taxonomy (e.g., a pre-trained classifier artifact from this page)? If you are using one of the classifiers trained to V4, that will not work for your sequences. The SILVA full-length 16S classifier should be the one for you. There would be no need to train a classifier with specific classifiers. This should work on your data, if not, something is probably wrong.

However, if that is already what you are doing and you are still getting lots of unclassified seqs, there are a few other options. This level of inaccuracy would actually be quite surprising — either an inappropriate classifier is being used or your method for separating out 16S reads is too permissive and non-16S reads are getting through (or, e.g., including reads that are only partially 16S and partially flanking regions). Before continuing with next steps of troubleshooting, I suppose we first should ask what does “a lot of unclassified organisms” mean? Would you be able to share some results (e.g., taxonomy plots)?

I would check on those first to make sure there is not an error in the data/pre-trained classifier choice… but if that’s still giving trouble then @colinbrislawn’s suggestion to use classify-consensus-vsearch is a good alternative. I like that classifier and it works well (nearly as well as classify-sklearn on amplicon sequences, and possibly better in cases like yours when we cannot train a classifier on a specific sub-domain). If the classify-sklearn method keeps giving you trouble, it is at least worth a try.

I would personally discourage this approach for many reasons:

  1. dada2/deblur are much better at removing noisy sequences than OTU picking (if that is what you are doing… since your data are from shotgun metagenomes it’s unclear if you are performing the same denoising steps on these). Besides, closed-reference clustering is still OTU picking. If you did some sort of dereplication/OTU picking, this would just be retracing your steps and (likely) losing a bit of data (see #3 below)
  2. closed-ref OTU picking is essentially assigning taxonomy from the top alignment to the reference, which is unreliable on fragmentary data. The classifiers in q2-feature-classifier provide much more reliable classifications, since they either perform LCA or confidence prediction when multiple reference taxa align well to the query.
  3. You may lose some reads if they do not align to the reference database within the defined % identity.

Let us know if any of these steps work! If you are still running into trouble, please provide more details on the version of QIIME2, the commands you are using to denoise (or cluster) and classify reads, and details on the classifiers that you are using. Thanks!


Thank you, @colinbrislawn. I will look into the vsearch classifier. Is there a minimum percent identity that you shouldn’t go below?

Hi @Nicholas_Bokulich. I was trying to use the SILVA full length with the sklearn. Pretty much what is in the moving pictures tutorial and am still getting the high numbers of unclassified. I will try to use the vsearch method and see if that improves it and if not then maybe create more stringent parameters for the 16S separation. What results could I share that would help you the most?

It is unclear how much is indicated by “high numbers”, and whether good classifications also exist in your data.

If you have lots of deep (e.g., species-level) classifications AND high numbers, that is a pretty good indicator that everything is probably working fine… you just have lots of sequence artifact/contamination (e.g., transcripts that are being incorrectly classified to 16S) or truncated sequences that cannot be classified. A “second opinion” from the classify-consensus-vsearch classifier would help confirm this. If so, I would just filter out those unclassified sequences and move on (or start over if it looks like the gene classifier is mucking up your results).

taxa barplots and the output of metadata tabulate on your taxonomy artifact.

1 Like

taxa_bar_plots_97silva.qzv (1.7 MB)
taxon_97silva.qzv (2.3 MB)

@Nicholas_Bokulich, here are my files. I had to use 97 clustering for SILVA because of the memory requirement needed.

Well, it’s --p-perc-identity 0.8 by default, and lowering it could help match stranger microbes. But you probably don’t want to make an assignment based on a 20% match. :man_shrugging:

I start with the defaults, and only started experimenting if the results are unexpected or poor.


1 Like

You are getting good, deep classification of most features. There are just a lot of bad apples. As I indicated above, these are probably non-16S sequences or very very short sequences that cannot be classified (some other users have faced that problem).

Since you are doing a tiered classification — identifying 16S reads from among metatranscriptome reads, then classifying those in QIIME2 — I suspect that the first classifier is either misclassifying some reads, or these reads may only be partial 16S (and read into flanking regions outside of the 16S).

One way or another, check out a handful of these unclassified sequences:

  1. Are they shorter than other reads?
  2. What does an NCBI blast search yield?
  3. Can vsearch classify these more specifically?

If these steps indicate what I suspect, then you can probably just toss out all unclassified sequences and move on. Unfortunately, it looks like some samples in particular have very high counts of unclassified (which makes it worth asking: what makes them so special? could be related to sample type) so you may lose some samples in the process. But if these reads really are not 16S, you don’t want them anyway.



Okay I will look into those points. I really appreciate your help with this! On last question about the vsearch classifier, what do I use for the reference reads and taxonomy? Would it just be the SILVA 97 16S fasta file and the majority taxonomy .txt file? Assuming I import them to Qiime artifacts

Yes. I prefer the majority, but some like consensus — it’s a personal choice, but sounds like you’ve found the correct files. I would recommend the 99% OTUs — vsearch should use less memory so it should work.

Yes — you can find the correct importing commands in here.

Good luck!

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.