over 70 % of reads are assigned as d__Bacteria;__;__;__;__;__;__ qiime feature-classifier classify-sklearn

Topics: over70 % of reads are assigned as d__Bacteria;;;;;; qiime feature-classifier classify-sklearn

Hello Qiime2 team:
I try to analyze sequencing results from several body fluid samples using 16S V1+V2 region (68F:TAACACATGCAAGTCRACTYGA/338R:GCTGCCTCCCGTAGGAGT) using qiime2-amplicon-2024.2 on a Ubuntu server. The paired-end reads were filtered and merged using
qiime dada2 denoise-paired
--i-demultiplexed-seqs demux_paired-end.qza
--o-representative-sequences rep-seqs-dada2.qza
--o-table table-dada2.qza
--o-denoising-stats stats-dada2.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 150
--p-trunc-len-r 150 \

The output was then filtered to remove non-16S seq and reads with lengh < 230bp

remove non-16S

qiime quality-control exclude-seqs
--i-query-sequences rep-seqs-dada2.qza
--i-reference-sequences silva_otus.qza
--p-method blast
--p-perc-identity 0.6
--p-perc-query-aligned 0.9
--p-threads 12
--o-sequence-hits Silva-blast-hits-60-90.qza
--o-sequence-misses Silva-blast-misses-60-90.qza
--verbose

qiime feature-table filter-features
--i-table table-dada2.qza
--m-metadata-file Silva-blast-misses-60-90.qza
--o-filtered-table no-Silva-blast-misses-table-60-90-dada2.qza
--p-exclude-ids

cp Silva-blast-hits-60-90.qza rep-seqs.qza
cp no-Silva-blast-misses-table-60-90-dada2.qza table.qza

#filter rep-seqs.qza based on length

qiime feature-table filter-seqs
--i-data rep-seqs.qza
--m-metadata-file rep-seqs.qza
--p-where 'length(sequence) > 230'
--o-filtered-data rep-seqs-over230.qza

qiime feature-table filter-features
--i-table table-dada2.qza
--m-metadata-file rep-seqs-less230.qza
--o-filtered-table no-Silva-blast-misses-table-60-90-less230-dada2.qza
--p-exclude-ids

cp no-Silva-blast-misses-table-60-90-less230-dada2.qza table.qza

taxonomy assignment

qiime feature-classifier classify-sklearn
--i-classifier sliva-138.1-ssu-nr99-V1-V2-classifier.qza
--i-reads rep-seqs-over230.qza
--p-read-orientation auto
--o-classification silva-taxonomy-138-99-V1-V2-over230.qza

qiime metadata tabulate
--m-input-file silva-taxonomy-138-99-V1-V2-over230.qza
--o-visualization silva-taxonomy-138-99-V1-V2-over230.qzv

qiime feature-table filter-features
--i-table table.qza
--m-metadata-file silva-taxonomy-138-99-V1-V2-over230.qza
--o-filtered-table id-filtered-table.qza

qiime taxa barplot
--i-table id-filtered-table.qza
--i-taxonomy silva-taxonomy-138-99-V1-V2-over230.qza
--m-metadata-file sample-metadata.tsv
--o-visualization silva-taxa-bar-plots-138-99-V1-V2-over230.qzv

The result shows that over 70% of the reads could not be detailed assigned after level_1 and shows d__Bacteria;;;;;;

Is there any possible reason for this? Any input will be highly appreciated.

rep-seqs-over230.qza, table.qza and sample-metadata.tsv file are uploaded in attachment files for evaluation.

Best,

Jrhau Lung
table.qza (243.4 KB)
rep-seqs-over230.qza (57.4 KB)
sample-metadata.tsv (325 Bytes)

2 Likes

Hi @jrhaulung,

Two things,

It could be that the premade curated database does not contain the V1V2 sequences, as they've been excluded during strict curation, due to being at the end of the sequence. Which means many reference sequences may have been removed. You can make your own database as outlined here.

Otherwise I suspect a read orientation issue... if you search the forum, you'll see many solutions to this problem. I'd suggest starting with this approach.

3 Likes

Did you try another classifier?
You can try using MIMt that had every sequence at all levels so if you find a match you will get almost for sure at least at genus level. In addition all sequences on it are full length so it contains all variable regions from 16S and they are curated from Refseq.
Can be found at mimt.bu.biopolis.pt/downloads

Hope it helps.

1 Like

Hi SoilRotifer and A_Munoz Antonio
Thanks for your help. I retrain the 16S V1-V2 classifier using the following commands with dataset download from Silva,

qiime feature-classifier extract-reads
--i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza
--p-f-primer AACACATGCAAGTCRACTYGA
--p-r-primer GCTGCCTCCCGTAGGAGT
--p-n-jobs 10
--p-read-orientation 'forward'
--o-reads silva-138.1-ssu-nr99-seqs-V1-V2.qza

qiime rescript dereplicate
--i-sequences silva-138.1-ssu-nr99-seqs-V1-V2.qza
--i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza
--p-threads 70
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-V1-V2-uniq.qza
--o-dereplicated-taxa silva-138.1-ssu-nr99-tax-V1-V2-derep-uniq.qza

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads silva-138.1-ssu-nr99-seqs-V1-V2-uniq.qza
--i-reference-taxonomy silva-138.1-ssu-nr99-tax-V1-V2-derep-uniq.qza
--o-classifier sliva-138.1-ssu-nr99-V1-V2-classifier.qza

Unfortunately, the taxonomy assignment for the rep-seqs-over230.qza using both full length 16S classifier and V1-V2 classifier obtained similar result and over 70% of sequence still reads are assigned as d__Bacteria;;;;;; .

Read orient issue also checked using

qiime rescript orient-seqs
--i-sequences rep-seqs-over230.qza
--p-threads 10
--o-oriented-seqs rep-seqs-over230-orient.qza
--o-unmatched-seqs rep-seqs-over230-non-orient.qza

qiime feature-classifier classify-sklearn
--i-classifier sliva-138.1-ssu-nr99-V1-V2-classifier.qza
--i-reads rep-seqs-over230-orient.qza
--p-read-orientation auto
--o-classification silva-taxonomy-138-99-V1-V2-over230-orient.qza

qiime feature-table filter-features
--i-table table.qza
--m-metadata-file silva-taxonomy-138-99-V1-V2-over230-orient.qza
--o-filtered-table id-filtered-table-orient.qza

qiime taxa barplot
--i-table id-filtered-table-orient.qza
--i-taxonomy silva-taxonomy-138-99-V1-V2-over230-orient.qza
--m-metadata-file sample-metadata.tsv
--o-visualization silva-taxa-bar-plots-138-99-V1-V2-over230-orient.qzv

qiime tools view silva-taxa-bar-plots-138-99-V1-V2-over230-orient.qzv

But the result has no improvement.

Best,

Jrhau

So in this case, I see two possible scenarios:

  1. your sample contains species very unknown and they are not similar to anything de databases contain
  2. This is the most probable. The V1-V2 region is not good to differentiate individuals and when your sequences are similar to various entries and they belong to different taxa (i.e. genus, families and so on) they get the taxonomy level of the previous where they coincide, and this seems to be just Bacteria.

You can check "Comprehensive Assessment of 16S rRNA Gene Amplicon Sequencing for Microbiome Profiling across Multiple Habitats" from Zheng et al. 2023 to see the efficiency in classifying for different regions and different databases.

Best regards

1 Like

Hi @jrhaulung,

It appears the SILVA does not have many representatives for Paramecia. If you go here and search for Paramecium bursaria you'll see that there are ~ 29 entries. But if you search for Paramecium you'll observe 262 references. Many of them quite short and may not span the V1V2 region.

I downloaded your data and trained the calssifier on the full SILVA 138.2 database, and essentially observed the same results. Given the lack of references I mentioned above and the many hits I obtain with GenBank, perhaps make your own database from GenBank? Many of your reads hit quite well there.

You can modify the above tutorials as needed to pull down the taxonomic groups / gene sequences you are interested in.

-Mike

3 Likes

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.