Hi again,
I have followed the RESCRIPt tutorial to create my own 28S classifier from SILVA 138.1 lsu nr99 database and an amplicon specific classifier. I then used the feature-classifer classify-skylearn and my taxonomy results were not great (about 700 of 1800 features only assigned to Eukaryota).
I have copied my script below:
qiime tools import
--type 'FeatureData[SILVATaxonomy]'
--input-path tax_slv_lsu_138.1.txt
--output-path taxranks-silva-138.1-lsu-nr99.qza \
qiime tools import
--type 'FeatureData[SILVATaxidMap]'
--input-path taxmap_slv_lsu_ref_nr_138.1.txt
--output-path taxmap-silva-138.1-lsu-nr99.qza \
qiime tools import
--type 'Phylogeny[Rooted]'
--input-path tax_slv_lsu_138.1.tre
--output-path taxtree-silva-138.1-nr99.qza \
qiime tools import
--type 'FeatureData[RNASequence]'
--input-path SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta
--output-path silva-138.1-lsu-nr99-seqs.qza
prepare SILVA taxonomy prior to use
qiime rescript parse-silva-taxonomy
--i-taxonomy-tree taxtree-silva-138.1-nr99.qza
--i-taxonomy-map taxmap-silva-138.1-lsu-nr99.qza
--i-taxonomy-ranks taxranks-silva-138.1-lsu-nr99.qza
--p-include-species-labels
--o-taxonomy silva-138.1-lsu-nr99-tax.qza
Culling - remove sequences that contain 5 or more ambiguous bases
(IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length.
(default parameters) ##
qiime rescript cull-seqs
--i-sequences silva-138.1-lsu-nr99-seqs.qza
--o-clean-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza
dereplication of sequences and taxonomy
qiime rescript dereplicate
--i-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza
--i-taxa silva-138.1-lsu-nr99-tax.qza
--p-rank-handles 'silva'
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza
--o-dereplicated-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza
make amplicon-region specific classifier
qiime feature-classifier extract-reads
--i-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza
--p-f-primer GCACCCGCTGAAYTTAAG
--p-r-primer CACGCACTGTTTACTCTC
--p-min-length 100 --p-max-length 600
--p-n-jobs 2
--p-read-orientation 'forward'
--o-reads silva-138.1-lsu-nr99-seqs-U178f-28Sintr.qza
dereplicate amplicon-region
qiime rescript dereplicate
--i-sequences silva-138.1-lsu-nr99-seqs-U178f-28Sintr.qza
--i-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza
--p-rank-handles 'silva'
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-U178f-28Sintr-derep-uniq.qza
--o-dereplicated-taxa silva-138.1-lsu-nr99-tax-U178f-28Sintr-derep-uniq.qza
train and evaluate classifier
qiime rescript evaluate-fit-classifier
--i-sequences silva-138.1-lsu-nr99-seqs-U178f-28Sintr-derep-uniq.qza
--i-taxonomy silva-138.1-lsu-nr99-tax-U178f-28Sintr-derep-uniq.qza
--o-classifier silva-138.1-lsu-nr99-U178f-28Sintr-classifier.qza
--o-observed-taxonomy silva-138.1-lsu-nr99-U178f-28Sintr-predicted-taxonomy.qza
--o-evaluation silva-138.1-lsu-nr99-U178f-28Sintr-fit-classifier-evaluation.qzv
test classifier on your data
qiime feature-classifier classify-sklearn
--i-classifier silva-138.1-lsu-nr99-U178f-28Sintr-classifier.qza
--i-reads rep-seqs.qza
--o-classification taxonomy.qza
qiime metadata tabulate
--m-input-file taxonomy.qza
--o-visualization taxonomy.qzv
I then tried the consensus blast method using my amplicon specific classifier (a) and then just the full classifier (b) as seen below
a) qiime feature-classifier classify-consensus-blast
--i-query rep-seqs.qza
--i-reference-taxonomy silva-138.1-lsu-nr99-tax-U178f-28Sintr-derep-uniq.qza
--i-reference-reads silva-138.1-lsu-nr99-seqs-U178f-28Sintr-derep-uniq.qza
--p-perc-identity 0.9
--p-maxaccepts 1
--o-classification SILVA-138.1-lsu-nr99-paired-end-classification_blast.qza
b) qiime feature-classifier classify-consensus-blast
--i-query rep-seqs.qza
--i-reference-taxonomy silva-138.1-lsu-nr99-tax-derep-uniq.qza
--i-reference-reads silva-138.1-lsu-nr99-seqs-derep-uniq.qza
--p-perc-identity 0.9
--p-maxaccepts 1
--o-classification SILVA-138.1-lsu-nr99-paired-end-classification_blast.qza
Both of these methods still resulted in almost half of the features labelled 'unclassified'. There were some other strange things that I noticed:
d__Eukaryota; p__Platyhelminthes; c__Trematoda; o__Digenea; f__Digenea; g__Digenea; s__Alaria_americana
d__Eukaryota; p__Platyhelminthes; c__Trematoda; o__Strigeidida; f__Strigeidida; g__Strigeidida; s__Schistosoma_edwardiense
I am not sure why the order, family and genus are all the same. Shistosomes are also Digenea not Strigeidida so something is not right there.
Any thoughts on this would be appreciated
Thanks Anya