Weird Taxonomic classification 28S rRNA

HugoEira · February 15, 2021, 12:37pm

Hi again,

So I had another go at this, using SILVA 138.1.

This is what I did:

Import data:

Import the Taxonomy Rank file

qiime tools import --type 'FeatureData[SILVATaxonomy]' \
  --input-path tax_slv_lsu_138.1.txt \
  --output-path taxranks-silva-138.1-lsu-nr99.qza

Import the Taxonomy Mapping file

qiime tools import --type 'FeatureData[SILVATaxidMap]' \
  --input-path taxmap_slv_lsu_ref_nr_138.1.txt \
  --output-path taxmap-silva-138.1-lsu-nr99.qza

Import the Taxonomy Hierarchy Tree file

qiime tools import --type 'Phylogeny[Rooted]' \
  --input-path tax_slv_lsu_138.1.tre \
  --output-path taxtree-silva-138.1-nr99.qza

Import the sequence file:

qiime tools import --type 'FeatureData[RNASequence]' \
  --input-path SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta \
  --output-path silva-138.1-lsu-nr99-seqs.qza

Prepare the silva taxonomy prior to use

qiime rescript parse-silva-taxonomy \
  --i-taxonomy-tree taxtree-silva-138.1-nr99.qza \
  --i-taxonomy-map taxmap-silva-138.1-lsu-nr99.qza \
  --i-taxonomy-ranks taxranks-silva-138.1-lsu-nr99.qza \
  --p-include-species-labels \
  --p-no-rank-propagation \
  --o-taxonomy silva-138.1-lsu-nr99-tax.qza

Culling low quality sequences

qiime rescript cull-seqs \
  --i-sequences silva-138.1-lsu-nr99-seqs.qza \
  --o-clean-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza

As suggested before, I decided not to run filter-seqs-length-by-taxon at this point and to filter by length after extracting the amplicon region instead.

Dereplication of sequences and taxonomy

qiime rescript dereplicate \
  --i-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza \
  --i-taxa silva-138.1-lsu-nr99-tax.qza \
  --p-rank-handles 'silva' \
  --p-mode 'uniq' \
  --o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza \
  --o-dereplicated-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza
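
For reference, my understanding of the 'uniq' mode is that identical sequences are collapsed only when they also carry the same taxonomy, so one sequence can still appear under two different labels. The toy sketch below illustrates that idea on a plain tab-separated table. It is not how RESCRIPt implements the step, and both the derep_uniq helper and the three-column layout (ID, taxonomy, sequence) are made up for illustration.

```shell
# derep_uniq FILE
# FILE is a hypothetical tab-separated table with three columns:
#   1: sequence ID   2: taxonomy string   3: sequence
# Keeps the first record for every distinct (sequence, taxonomy) pair,
# which is my reading of what 'uniq' dereplication is meant to achieve.
derep_uniq() {
  awk -F '\t' '!seen[$3 FS $2]++' "$1"
}
```

With this rule, two records that share a sequence but differ in taxonomy both survive, while exact duplicates of a (sequence, taxonomy) pair are dropped.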

Extract the amplicon region from reference database

qiime feature-classifier extract-reads \
  --i-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza \
  --p-f-primer GTAACTTCGGGAWAAGGATTGGCT \
  --p-r-primer AGAGTCAARCTCAACAGGGTCTT \
  --p-min-length 250 \
  --p-max-length 600 \
  --p-n-jobs 2 \
  --p-read-orientation 'forward' \
  --o-reads silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza
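
Both primers contain one degenerate IUPAC base (W = A or T in the forward primer, R = A or G in the reverse primer). As a rough sanity check outside QIIME 2, a primer can be expanded into a regular expression and searched for in an exported FASTA file. This is only a minimal sketch: iupac_to_regex is a hypothetical helper, it expects an upper-case DNA primer, and it matches exactly, whereas extract-reads tolerates some mismatches.

```shell
# iupac_to_regex PRIMER
# Expand IUPAC degenerate bases in an upper-case DNA primer into
# bracket expressions usable with grep -E.
# Hypothetical helper for illustration; not part of QIIME 2.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g'  -e 's/M/[AC]/g'  -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}
```

Something like grep -c -E "$(iupac_to_regex GTAACTTCGGGAWAAGGATTGGCT)" on a FASTA file with unwrapped DNA sequences would then count exact forward-primer hits. The number is a lower bound compared with extract-reads, and the reverse primer would have to be reverse-complemented first to match the same strand.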

After this, I visually checked silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza and also ran evaluate-seqs; all sequences are in the 250-600 bp range. Should I assume that there are no outliers and that no further filtering is needed?
That's what I did.
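
The length check above can also be repeated on the exported FASTA file. The following is a minimal sketch, assuming the reads have been exported with qiime tools export; fasta_length_stats is a hypothetical helper, and the exported file name in the comment is what I would expect rather than something I have confirmed for this artifact.

```shell
# fasta_length_stats FILE
# Print "count min max" for the sequences in a FASTA file.
# Handles sequences wrapped over several lines.
# Hypothetical helper; the input would come from something like
#   qiime tools export --input-path silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza \
#     --output-path exported-reads
# which should write exported-reads/dna-sequences.fasta (assumed name).
fasta_length_stats() {
  awk '
    # Close the current record and update the running count, min and max.
    function record() {
      if (seen) {
        n++
        if (n == 1 || len < min) min = len
        if (n == 1 || len > max) max = len
      }
    }
    /^>/ { record(); seen = 1; len = 0; next }      # header starts a new record
    { gsub(/[ \t\r]/, ""); len += length($0) }      # sequence line, whitespace removed
    END { record(); if (n) print n, min, max; else print 0, 0, 0 }
  ' "$1"
}
```

If the minimum and maximum both fall inside 250-600, the length limits given to extract-reads are confirmed, and the count gives a quick idea of how many reference sequences the classifier is trained on.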

Dereplicate extracted region

qiime rescript dereplicate \
  --i-sequences silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza \
  --i-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza \
  --p-rank-handles 'silva' \
  --p-mode 'uniq' \
  --o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-GA20F-RM9R-uniq.qza \
  --o-dereplicated-taxa silva-138.1-lsu-nr99-tax-GA20F-RM9R-derep-uniq.qza

Build amplicon-region specific classifier

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138.1-lsu-nr99-seqs-GA20F-RM9R-uniq.qza \
  --i-reference-taxonomy silva-138.1-lsu-nr99-tax-GA20F-RM9R-derep-uniq.qza \
  --o-classifier silva-138.1-lsu-nr99-GA20F-RM9R-classifier.qza

A few weird things:

The classifier is only 14 MB. The classifiers I have built for 16S and 18S are both around 170 MB. The SILVA LSU database is smaller than the SSU one; does that explain the difference?

I get this kind of classification:
d__Eukaryota; p__Chytridiomycota; c__Chytridiomycetes; o__Rhizophydiales; f__; g__; s__Rhizophlyctis_rosea
According to the tutorial, these empty ranks are normal, right?
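
To see how widespread the empty ranks are, each taxonomy string can be scanned for rank prefixes that have no name attached. Empty ranks like f__ and g__ are what I would expect after running parse-silva-taxonomy with --p-no-rank-propagation, since nothing is copied down from the higher ranks. The sketch below is only an illustration: empty_ranks is a hypothetical helper, and it assumes ranks are separated by semicolons and written as lower-case prefix plus two underscores, as in the example above.

```shell
# empty_ranks "TAXONOMY"
# Print the rank prefixes that have no name attached, separated by spaces.
# Assumes ';'-separated ranks of the form prefix__Name (e.g. f__, g__).
# Hypothetical helper for illustration; not part of QIIME 2 or RESCRIPt.
empty_ranks() {
  printf '%s\n' "$1" | awk -F ';' '{
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
      r = $i
      gsub(/^[ \t]+|[ \t]+$/, "", r)          # trim surrounding whitespace
      if (r ~ /^[a-z]+__$/) {                 # prefix with nothing after it
        out = out sep r
        sep = " "
      }
    }
    print out
  }'
}
```

For the classification shown above, this reports the family and genus prefixes as empty while the species label is present.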

28S_taxonomy.qza (97.1 KB)

I wanted to upload the classifier and the extracted sequences, but I can't seem to do it.

Uff, sorry for the long post.
Cheers,
Hugo