Improving taxonomic classification

Hello everyone,

My colleagues and I are analyzing some low-biomass samples with QIIME 2 (v2021.4). This is our first microbiota analysis and we have found some strange results. In our feature table we have a total of 56 samples, 4 commercial mock community replicates, and some negative control samples. We trained the classifier on the SILVA database following the instructions in the "Moving Pictures" and "Parkinson's Mouse" tutorials, and then taxonomically classified all the samples.

We started by analyzing the mock community samples to check whether the analysis went well, and found that approximately 20% of the reads are unassigned and almost 30% are assigned only at the domain level (d_Bacteria), which for practical purposes means they have not been classified either. Thus, about half of the reads are unclassified even at the phylum level. The commercial mock community is not low biomass, so we expected its analysis to be more accurate than that of our samples. We have also reviewed our samples and obtained similar results, with 50% unassigned.

As a cross-check, we ran the Kraken software available in Illumina BaseSpace to find out whether the problem lies in our lab processing or in the bioinformatics analysis. Kraken reported the bacterial proportions described by the manufacturer of the mock community, so we think that we are doing something wrong in QIIME 2. This is the code we used:

qiime feature-classifier extract-reads \
  --i-sequences silva-138-99-seqs.qza \
  --p-min-length 200 \
  --p-max-length 600 \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

How can we improve our analysis to obtain a higher number of assignments?

Thank you very much for your help!

Welcome to the forum, @Sergio_Garcia_Segura!

This often indicates that your sequences are in mixed orientation (both forward and reverse together). Take a look at the RESCRIPt tutorial. The orient-seqs command will align your sequences to a reference, and may take care of the issue for you. Let us know how it goes!
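To illustrate why mixed orientation can hurt classification: a k-mer based classifier compares short subsequences of each read with those of the reference, and a reverse-complemented read shares few k-mers with its forward form. The following is a minimal Python sketch of that idea only. It is not how RESCRIPt or the classifier is implemented, and the toy read and the choice of k = 7 are assumptions made for the example.

```python
# Illustrative sketch only: not RESCRIPt's or the classifier's actual code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int = 7) -> set:
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_overlap(query: str, reference: str, k: int = 7) -> float:
    """Fraction of the query's k-mers that also occur in the reference."""
    q, r = kmers(query, k), kmers(reference, k)
    return len(q & r) / len(q) if q else 0.0


# Hypothetical toy read, made up for this example.
read = "ACGTTGCAAGGCTTACCGATAGGCTTAACGGTCA"
flipped = reverse_complement(read)

print("same orientation:   ", kmer_overlap(read, read))
print("reverse-complemented:", kmer_overlap(flipped, read))
print("after re-orienting: ", kmer_overlap(reverse_complement(flipped), read))
```

Re-orienting the flipped read restores the full k-mer overlap with the reference, which is the effect an orientation step aims for before classification.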



An off-topic reply has been split into a new topic: trouble with rescript

Please keep replies on-topic in the future.

Hi @ChrisKeefe!

We oriented the sequences following the RESCRIPt tutorial you suggested, and the problem has largely diminished. Now unassigned reads are only 1% and those classified only as ‘k_Bacteria’ are approximately 5.5%. It’s a great improvement over the previous situation!

Even so, the results are not exactly as expected for the mock community, especially for some genera such as Lactobacillus, which is heavily underrepresented. We believe these may be classifications that did not reach the genus level: for example, at the class level almost 10% of reads are Bacilli that the classifier could not place any deeper in the taxonomic tree. There are similar cases at the class or order level.

On the other hand, we also wonder: if the sequences really are in mixed orientation, then our diversity analysis (with core-metrics-phylogenetic) has been compromised, right? Do we have to recalculate the diversity metrics?



Glad to hear that helped!

Not sure if there's a question here - if there is, please open a new topic for it.

Are you asking whether your original diversity analyses should be re-run, now that you've re-oriented your rep-seqs?

Don't worry, I have some ideas to try to improve it. If I run into trouble I will open a new topic.

Yes, please. I'm not sure if diversity metrics could be affected by orientation because I don't know how the algorithm works.

I am very grateful for your help!

Your diversity metrics have been impacted, and you should re-orient, dereplicate, and then re-run. Two examples for you to consider:

observed-features is a count of the number of unique features in your data. If a feature is present in your data in both "forward" and "reversed" orientation, it would show up twice, potentially doubling the alpha diversity on this metric.

Any phylogenetic measure (e.g. Unifrac, Faith's PD) will be based on a Phylogeny, almost certainly constructed from your representative sequences. Phylogeny builders won't know some of those sequences are "mis-oriented", and this could result in the construction of a tree that is not meaningful.
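The observed-features example above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not QIIME 2's implementation: the sequences are made up, and choosing the lexicographically smaller of a sequence and its reverse complement is only a stand-in for a real orientation step such as orient-seqs, which orients against a reference.

```python
# Illustrative sketch only; QIIME 2 computes observed-features from the feature table.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Orientation-independent form of seq (stand-in for a reference-based orientation)."""
    return min(seq, reverse_complement(seq))


def observed_features(seqs) -> int:
    """Count the unique sequences, i.e. the number of distinct features."""
    return len(set(seqs))


# Hypothetical sample: two true organisms, each sequenced in both orientations.
taxon_a = "ACGTTGCAAGGCTTAC"
taxon_b = "GGATCCTTAGCAAGTC"
sample = [taxon_a, reverse_complement(taxon_a),
          taxon_b, reverse_complement(taxon_b)]

mixed = observed_features(sample)
oriented = observed_features(canonical(s) for s in sample)
print(f"mixed orientation: {mixed} features, after orienting: {oriented} features")
```

With every organism present in both orientations, the mixed-orientation count is twice the oriented count, which is the worst case for this metric.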

So, after the orient-seqs step and before you re-run, you should also dereplicate your rep-seqs and table. You can do this by clustering your features de novo at 100% identity (the q2-vsearch action cluster-features-de-novo can do this for you). This will ensure that any duplicates in your re-oriented rep-seqs are removed, and their counts are combined in your feature table. Once this is done, you should be able to rebuild a more meaningful phylogeny, and re-run core-metrics-phylogenetic and any downstream analysis.
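What dereplication does to the feature table can be sketched as follows. This is a simplified Python illustration, not what cluster-features-de-novo runs internally: it treats "100% identity" as exact string equality, and the feature IDs, sample IDs, and counts are hypothetical.

```python
from collections import defaultdict


def dereplicate(rep_seqs: dict, table: dict):
    """Collapse features with identical sequences and sum their per-sample counts.

    rep_seqs: feature ID -> sequence (assumed to be already oriented)
    table:    feature ID -> {sample ID -> count}
    Returns (new_rep_seqs, new_table), keyed by the first feature ID seen
    for each distinct sequence.
    """
    first_id_for_seq = {}
    new_rep_seqs = {}
    new_table = defaultdict(lambda: defaultdict(int))
    for feature_id, seq in rep_seqs.items():
        # Reuse the ID of the first feature that carried this exact sequence.
        kept_id = first_id_for_seq.setdefault(seq, feature_id)
        new_rep_seqs[kept_id] = seq
        for sample_id, count in table.get(feature_id, {}).items():
            new_table[kept_id][sample_id] += count
    return new_rep_seqs, {fid: dict(counts) for fid, counts in new_table.items()}


# Hypothetical data: after orientation, f1 and f2 turn out to be the same sequence.
rep_seqs = {"f1": "ACGTTGCA", "f2": "ACGTTGCA", "f3": "GGATCCTT"}
table = {
    "f1": {"s1": 10, "s2": 0},
    "f2": {"s1": 5, "s2": 7},
    "f3": {"s1": 3, "s2": 1},
}

seqs, merged = dereplicate(rep_seqs, table)
print(seqs)
print(merged)
```

The total count per sample is unchanged by the merge; only the number of features drops, which is why the tree and the diversity metrics need to be rebuilt from the dereplicated outputs.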



Your explanation has helped me better understand the process and my mental scheme of how the sequences are in my samples. You have been a great help, thank you very much for everything! We will implement what you just suggested. I feel that now we are closer to having a satisfactory analysis.
