Feature classifier: bad sequence assignment

Hi,
I am working with 45 samples plus a mock community for 16S analysis. My goal is to identify abundance differences at a given taxonomic level between my groups. But when I ran the q2-feature-classifier plugin (after training the classifier) and looked at the final results in a taxa barplot, I realized that I have very poor taxonomic assignments with either Greengenes or SILVA. For example, I have samples that are positive for Listeria by bacteriological detection but negative in 16S. Also, in the mock community I noticed some genera that are not actually present.
Since I did the same analysis in the past with other methods, I am very surprised by these results. Most likely I did something wrong (I am new to this kind of analysis).
I set p-trunc to 240 bp for my paired-end reads when I ran dada2. Do you think this could affect the taxonomic classification?
Here are all my commands:

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus_16S.fasta \
  --output-path 99_otus_16S.qza

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --source-format HeaderlessTSVTaxonomyFormat \
  --input-path consensus_taxonomy_all_levels.txt \
  --output-path ref-taxonomy.qza

qiime feature-classifier extract-reads \
  --i-sequences 99_otus_16S.qza \
  --p-f-primer GTGCCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACHVGGGTWTCTAAT \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

And here is the TSV table:
level-7 (3).tsv (1.2 KB)
Many thanks!

Hi @biotama! A couple of things come to mind:

  • You’re using the primers listed in the q2-feature-classifier tutorial to extract reference reads. Are those the appropriate primers for your mock community sequences? The primers used in the tutorial are the 515F/806R primer pair, which covers V4.

  • Are the representative sequences from the mock communities (i.e. rep-seqs.qza) generated from single-end or paired-end data? If they came from single-end data, you’ll want to use --p-trunc-len with qiime feature-classifier extract-reads in order to have the extracted reference sequences match the length of your mock community sequences. If the representative sequences were generated from paired-end data, avoid using --p-trunc-len.
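To make the primer question concrete, here is a rough Python sketch of what "extracting the region between a primer pair" means. It is only a conceptual illustration, not how q2-feature-classifier is implemented. The primers are the ones from the commands above; the toy reference sequence and the helper names (`revcomp`, `primer_regex`, `extract_amplicon`) are made up for this example.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

FWD = "GTGCCAGCMGCCGCGGTAA"   # forward primer from the thread
REV = "GGACTACHVGGGTWTCTAAT"  # reverse primer from the thread


def revcomp(seq):
    """Reverse complement, including IUPAC degenerate bases."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer)


def extract_amplicon(reference, fwd, rev):
    """Return the region between the primers, or None if either is missing."""
    f = re.search(primer_regex(fwd), reference)
    if f is None:
        return None
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    r = re.search(primer_regex(revcomp(rev)), reference[f.end():])
    if r is None:
        return None
    return reference[f.end():f.end() + r.start()]


# Invented toy reference: flank + forward primer + insert + reverse primer site + flank.
toy_ref = ("AAAA" + "GTGCCAGCAGCCGCGGTAA" + "ACGTACGT"
           + revcomp("GGACTACACGGGTATCTAAT") + "TTTT")
print(extract_amplicon(toy_ref, FWD, REV))  # prints ACGTACGT
```

If the reference reads are cut out with a different primer pair than the one used for sequencing, the classifier ends up trained on a different region than the one in rep-seqs.qza, which is why the primer choice matters.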

Hi jairideout!
Yes, I use these primers for 16S amplification of all my samples, including the mock community DNA, so I used them in the QIIME 2 analysis.
I did paired-end sequencing for all samples (including the mock).
I ran the whole analysis without trimming, from dada2 through extract-reads in q2-feature-classifier, and it seems to work better for the mock community with the SILVA database. Is the problem the ten nucleotides truncated in dada2? Does dada2 do correction with the paired-end reads?
I am also wondering whether we can use the RDP database in QIIME 2 for taxonomy assignment?

Thanks a lot!

Species-level classification is tough with short 16S reads like V4. The classifiers in QIIME2 are tuned to maximize precision (i.e., reduce false-positive errors) at the expense of classification depth where possible. That is, if a reliable classification cannot be achieved at species level, the query sequence will only be classified to genus level. The quality of the reference databases can be a part of this problem if noisy data are present.

There may be options to improve this. If you want to increase classification depth (at the risk of increasing false-positive errors), you can adjust your methods/parameters as described in this preprint. See the parameter settings described for "high-recall" classifiers. Since you have a mock community in your run, you are in a good place to fiddle with your parameters and decide what works best for your data.
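As a rough illustration of the precision versus depth trade-off, here is a small Python sketch of truncating a classification at a confidence threshold. It is a simplified assumption about the general idea, not the actual q2-feature-classifier code, and the per-level confidence values are invented. In the CLI, the corresponding knob is the `--p-confidence` parameter of `classify-sklearn`.

```python
def truncate_taxonomy(levels, confidences, threshold=0.7):
    """Keep taxonomic levels until confidence first drops below the threshold."""
    kept = []
    for name, confidence in zip(levels, confidences):
        if confidence < threshold:
            break
        kept.append(name)
    return "; ".join(kept) if kept else "Unassigned"


# Invented example: confidence decreases as the classification gets deeper.
levels = ["k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales",
          "f__Listeriaceae", "g__Listeria", "s__monocytogenes"]
confidences = [1.00, 0.99, 0.98, 0.95, 0.62, 0.60, 0.31]

# A stricter threshold stops at order level; a looser one reaches genus level.
print(truncate_taxonomy(levels, confidences, threshold=0.7))
print(truncate_taxonomy(levels, confidences, threshold=0.5))
```

With these made-up numbers, the stricter threshold stops at Bacillales while the looser one reaches Listeria, at the cost of accepting less certain calls. A mock community is a good way to check how many false positives a looser threshold introduces.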

This is very common with marker-gene data. Bacteriological detection of Listeria (to my knowledge) commonly relies on some form of enrichment, which will increase sensitivity above what 16S profiling (without enrichment) can provide. Or are you using a non-enrichment method?

Listeria may also be difficult to distinguish from other Bacillales, and hence you will get a lower classification depth (the mock community contains a feature assigned to Bacillales without further classification). See advice above to improve this depth.

This is most likely a contaminant, particularly if you are detecting it with different reference databases/methods. Mock communities are never perfect, and contaminants and/or PCR/sequencing error can greatly skew the results, so you should never expect mock communities to look perfect. False-positive errors are still a threat simply because classification of short 16S reads against potentially noisy reference data is a tough computational problem.

The 10 nucleotides in the dada2 tutorial is just an example — you do not need to and probably should not use the same settings for your own data. The trim-left parameter is mostly used to remove primers from your sequences — make sure these are removed from your sequences! If your primers are already removed, you probably do not need to trim and this setting is not responsible for the issues that you are seeing.
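One quick way to check whether primers are still present is to count how many reads begin with the primer. Below is a minimal Python sketch of that check, assuming the reads are already loaded as plain strings; the toy reads and the helper names are invented for illustration, and the primer is the forward primer from the commands above.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def starts_with_primer(read, primer):
    """True if the read begins with the (possibly degenerate) primer."""
    pattern = "".join(IUPAC[base] for base in primer)
    return re.match(pattern, read) is not None


def fraction_with_primer(reads, primer):
    """Fraction of reads that start with the primer."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(starts_with_primer(r, primer) for r in reads) / len(reads)


FWD = "GTGCCAGCMGCCGCGGTAA"  # forward primer from the thread (19 nt)

# Invented toy reads: two still carry the primer, one does not.
toy_reads = [
    "GTGCCAGCAGCCGCGGTAATACGTAGG",
    "GTGCCAGCCGCCGCGGTAATACGGAGG",
    "TACGTAGGGTGCAAGCGTTAATCGGAA",
]
fraction = fraction_with_primer(toy_reads, FWD)
print(f"{fraction:.2f} of reads start with the forward primer")
print(f"primer length to trim if present: {len(FWD)}")
```

If most reads start with the primer, trimming the primer length from the left makes sense; if almost none do, the primers were probably removed already and no left trimming is needed.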

Yes, dada2 performs error correction on paired-end reads.

You can use any database in the required format. See how the SILVA and Greengenes databases are formatted; RDP may already be in the required format, or you may be able to reformat it following these examples. As far as I know, RDP only provides genus-level information (not species), so it will probably not improve your classification depth beyond what you already have.
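For reference, the headerless taxonomy file imported above is a two-column, tab-separated file: a sequence ID and a semicolon-delimited lineage. Here is a minimal Python sketch for sanity-checking a reformatted file before importing it. The function name and the example IDs are invented, and the lineage style follows the Greengenes convention.

```python
def parse_headerless_taxonomy(lines):
    """Parse 'ID<TAB>lineage' lines into {ID: [level, ...]}; fail on bad rows."""
    taxonomy = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-separated fields, got {len(fields)}"
            )
        feature_id, lineage = fields
        taxonomy[feature_id] = [level.strip() for level in lineage.split(";")]
    return taxonomy


# Invented example rows in Greengenes-style notation.
example = [
    "seq1\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales\n",
    "seq2\tk__Bacteria; p__Proteobacteria\n",
]
parsed = parse_headerless_taxonomy(example)
print(parsed["seq1"][-1])  # deepest level available for seq1
```

The IDs in this file have to match the sequence IDs in the reference FASTA file, so it is worth checking that as well after any reformatting.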

Good luck!

Hi!
Thanks for these clarifications. I would just like to add a few details:

I didn't use the example value; instead, my choice was guided by the barplot in the feature table of my reads.

Yes, thanks for everything!
