Lots of unclassified features using 28S SILVA ref database

Hi again,

I have followed the RESCRIPt tutorial to create my own 28S classifier from SILVA 138.1 lsu nr99 database and an amplicon specific classifier. I then used the feature-classifer classify-skylearn and my taxonomy results were not great (about 700 of 1800 features only assigned to Eukaryota).
I have copied my script below:

qiime tools import
–type ‘FeatureData[SILVATaxonomy]’
–input-path tax_slv_lsu_138.1.txt
–output-path taxranks-silva-138.1-lsu-nr99.qza \

qiime tools import
–type ‘FeatureData[SILVATaxidMap]’
–input-path taxmap_slv_lsu_ref_nr_138.1.txt
–output-path taxmap-silva-138.1-lsu-nr99.qza \

qiime tools import
–type ‘Phylogeny[Rooted]’
–input-path tax_slv_lsu_138.1.tre
–output-path taxtree-silva-138.1-nr99.qza \

qiime tools import
–type ‘FeatureData[RNASequence]’
–input-path SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta
–output-path silva-138.1-lsu-nr99-seqs.qza

prepare SILVA taxonomy prior to use

qiime rescript parse-silva-taxonomy
–i-taxonomy-tree taxtree-silva-138.1-nr99.qza
–i-taxonomy-map taxmap-silva-138.1-lsu-nr99.qza
–i-taxonomy-ranks taxranks-silva-138.1-lsu-nr99.qza
–p-include-species-labels
–o-taxonomy silva-138.1-lsu-nr99-tax.qza

Culling - remove sequences that contain 5 or more ambiguous bases

(IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length.
(default parameters) ##

qiime rescript cull-seqs
–i-sequences silva-138.1-lsu-nr99-seqs.qza
–o-clean-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza

dereplication of sequences and taxonomy

qiime rescript dereplicate
–i-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza
–i-taxa silva-138.1-lsu-nr99-tax.qza
–p-rank-handles ‘silva’
–p-mode ‘uniq’
–o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza
–o-dereplicated-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza

make amplicon-region specific classifier

qiime feature-classifier extract-reads
–i-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza
–p-f-primer GCACCCGCTGAAYTTAAG
–p-r-primer CACGCACTGTTTACTCTC
–p-min-length 100 --p-max-length 600
–p-n-jobs 2
–p-read-orientation ‘forward’
–o-reads silva-138.1-lsu-nr99-seqs-U178f-28Sintr.qza

dereplicate amplicon-region

qiime rescript dereplicate
–i-sequences silva-138.1-lsu-nr99-seqs-U178f-28Sintr.qza
–i-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza
–p-rank-handles ‘silva’
–p-mode ‘uniq’
–o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-U178f-28Sintr-derep-uniq.qza
–o-dereplicated-taxa silva-138.1-lsu-nr99-tax-U178f-28Sintr-derep-uniq.qza

train and evaluate classifier

qiime rescript evaluate-fit-classifier
–i-sequences silva-138.1-lsu-nr99-seqs-U178f-28Sintr-derep-uniq.qza
–i-taxonomy silva-138.1-lsu-nr99-tax-U178f-28Sintr-derep-uniq.qza
–o-classifier silva-138.1-lsu-nr99-U178f-28Sintr-classifier.qza
–o-observed-taxonomy silva-138.1-lsu-nr99-U178f-28Sintr-predicted-taxonomy.qza
–o-evaluation silva-138.1-lsu-nr99-U178f-28Sintr-fit-classifier-evaluation.qzv

test classifier on your data

qiime feature-classifier classify-sklearn
–i-classifier silva-138.1-lsu-nr99-U178f-28Sintr-classifier.qza
–i-reads rep-seqs.qza
–o-classification taxonomy.qza

qiime metadata tabulate
–m-input-file taxonomy.qza
–o-visualization taxonomy.qzv

I then tried the consensus blast method using my amplicon specific classifier (a) and then just the full classifier (b) as seen below

a) qiime feature-classifier classify-consensus-blast
–i-query rep-seqs.qza
–i-reference-taxonomy silva-138.1-lsu-nr99-tax-U178f-28Sintr-derep-uniq.qza
–i-reference-reads silva-138.1-lsu-nr99-seqs-U178f-28Sintr-derep-uniq.qza
–p-perc-identity 0.9
–p-maxaccepts 1
–o-classification SILVA-138.1-lsu-nr99-paired-end-classification_blast.qza

b) qiime feature-classifier classify-consensus-blast
–i-query rep-seqs.qza
–i-reference-taxonomy silva-138.1-lsu-nr99-tax-derep-uniq.qza
–i-reference-reads silva-138.1-lsu-nr99-seqs-derep-uniq.qza
–p-perc-identity 0.9
–p-maxaccepts 1
–o-classification SILVA-138.1-lsu-nr99-paired-end-classification_blast.qza

Both of these methods still resulted in almost half of the features labelled ‘unclassified’. There were some other strange things that I noticed:

d__Eukaryota; p__Platyhelminthes; c__Trematoda; o__Digenea; f__Digenea; g__Digenea; s__Alaria_americana

d__Eukaryota; p__Platyhelminthes; c__Trematoda; o__Strigeidida; f__Strigeidida; g__Strigeidida; s__Schistosoma_edwardiense

I am not sure why the order, family and genus are all the same. Shistosomes are also Digenea not Strigeidida so something is not right there.

Any thoughts on this would be appreciated

Thanks Anya

Hi @avtober,

Could you please share the results?

It is quite suspicious that even BLAST is failing — and I would dare say that the issue is with your query sequences, not necessarily the reference, because the settings you are using are quite permissive:

That taxonomy is coming directly from SILVA; I just double-checked, see for yourself.

This, on the other hand, is coming from RESCRIPt. RESCRIPt by default will forward-propagate empty taxomic ranks in the SILVA taxonomy, so that a uniform taxonomy is given. This option can be disabled in parse-silva-taxonomy. You can also choose which ranks you want to include in your taxonomy if you want things like subclass etc to be annotated.

Could you share this file here?

I hope that helps!

Hi @Nicholas_Bokulich,

Thank you for your reply, I have attached my results for the sklearn classifier taxonomy2.qzv (1.4 MB) as well as for the consensus blast silva-138.1-lsu-nr99-U178f-28Sintr-fit-classifier-evaluation.qzv (431.2 KB) (these are both using the amplicon specific classifier).

'That taxonomy is coming directly from SILVA; I just double-checked'

ah that is good to know, is there a reason that it just says Strigeidida not Digenea, Strigeidida ?

I have also attached my classifier evaluation here silva-138.1-lsu-nr99-U178f-28Sintr-fit-classifier-evaluation2.qzv (431.2 KB)

As I am looking for parasites, I would expect a few unknown species but its weird that they are not even classifying past eukaryota.

Thanks
Anya

Sorry attached the wrong file, here are the results from the consensus blast SILVA-138.1-lsu-nr99-paired-end-classification_blast.qzv (1.4 MB)

Anya

Because Digenea is the subclass, which is not used by default... as I mentioned above, you can adjust which ranks you want to use with RESCRIPt.

The classifier performs quite well, provided that the organism in question is found in the database. Since you have an extract-reads step you should check out read counts etc before/after extraction to make sure you are not losing too many reference sequences at that stage, same with cull-seqs.

Again, the BLAST results suggest that something is wrong, since you are using very permissive BLAST parameters (take any top hit that is at least 90% similar to the query). So either:

  1. There really are many unknown parasites that you are detecting, or the primers are hitting other euks that are not in the database. (or maybe something like bacterial LSU that is not in the database? I do not know off-hand)
  2. There is something wrong with the reference, e.g., see comments above
  3. There is something wrong with the query, e.g., these are not classifying because they are not actually euks at all.

I recommend using NCBI BLAST to look at some of these unclassified sequences, that would help answer numbers 1 + 3.

Hi @Nicholas_Bokulich, thank you for taking a look and for your advise, I will do some investigating!

Best
Anya

1 Like

Hi @Nicholas_Bokulich
Just looking at your comment below, is there a special way to check this using qiime2?

Thanks
Anya

this is the evaluation output you shared above

RESCRIPt has an evaluate-seqs visualizer that will display read length distribution and N unique sequences (which should be the same thing as total number if your sequence database is dereplicated). Or you can use qiime feature-table tabulate-seqs to get total read counts and length distribution quartiles.

Hi @Nicholas_Bokulich
thank you for that, I tried the feature table tabulate seqs and this worked for my culled sequences (silva-138.1-lsu-nr99-seqs-derep-uniq.qza) yet when I tried it on my unculled un-dereplicated seqs (silva-138.1-lsu-nr99-seqs.qza) to see how many sequences there were initially I got the following error message:

There was a problem loading silva-138.1-lsu-nr99-seqs.qza as a QIIME 2 Result:

No format: RNASequencesDirectoryFormat

I also noticed looking into the SILVA database that some of the sequences I am looking for are not in the refNR99 database but are in the ref database. Would it be ok to make a classifier from just the ref database or is it better to use the NR99 database?

Many thanks
Anya

Sounds like you are inputting RNA sequence data. RESCRIPt has a reverse-transcribe method to convert to DNA.

Short answer: I think either is fine. RESCRIPt supports either. NR99 is much smaller and easier to work with, also has been further curated by the SILVA team so that’s the one I prefer to use. @SoilRotifer and I have done some comparisons between the two (using RESCRIPt) and as far as I recall NR99 does look better, but the full ref does not look “bad”.

1 Like

Thanks I will have a go!

Best
Anya

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.