Hi again,
So I had another go at this, using SILVA 138.1.
This is what I did:
Import data:
- Import the Taxonomy Rank file
qiime tools import --type 'FeatureData[SILVATaxonomy]' \
  --input-path tax_slv_lsu_138.1.txt \
  --output-path taxranks-silva-138.1-lsu-nr99.qza
- Import the Taxonomy Mapping file
qiime tools import --type 'FeatureData[SILVATaxidMap]' \
  --input-path taxmap_slv_lsu_ref_nr_138.1.txt \
  --output-path taxmap-silva-138.1-lsu-nr99.qza
- Import the Taxonomy Hierarchy Tree file
qiime tools import --type 'Phylogeny[Rooted]' \
  --input-path tax_slv_lsu_138.1.tre \
  --output-path taxtree-silva-138.1-nr99.qza
- Import the sequence file
qiime tools import --type 'FeatureData[RNASequence]' \
  --input-path SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta \
  --output-path silva-138.1-lsu-nr99-seqs.qza
Prepare the SILVA taxonomy prior to use
qiime rescript parse-silva-taxonomy \
  --i-taxonomy-tree taxtree-silva-138.1-nr99.qza \
  --i-taxonomy-map taxmap-silva-138.1-lsu-nr99.qza \
  --i-taxonomy-ranks taxranks-silva-138.1-lsu-nr99.qza \
  --p-include-species-labels \
  --p-no-rank-propagation \
  --o-taxonomy silva-138.1-lsu-nr99-tax.qza
Cull low-quality sequences
qiime rescript cull-seqs \
  --i-sequences silva-138.1-lsu-nr99-seqs.qza \
  --o-clean-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza
As suggested before, I decided not to run filter-seqs-length-by-taxon at this stage
and to do the length filtering after extracting the amplicon region instead.
Dereplication of sequences and taxonomy
qiime rescript dereplicate \
  --i-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza \
  --i-taxa silva-138.1-lsu-nr99-tax.qza \
  --p-rank-handles 'silva' \
  --p-mode 'uniq' \
  --o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza \
  --o-dereplicated-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza
Extract the amplicon region from reference database
qiime feature-classifier extract-reads \
  --i-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza \
  --p-f-primer GTAACTTCGGGAWAAGGATTGGCT \
  --p-r-primer AGAGTCAARCTCAACAGGGTCTT \
  --p-min-length 250 --p-max-length 600 \
  --p-n-jobs 2 \
  --p-read-orientation 'forward' \
  --o-reads silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza
After this I visually checked silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza
and also tried evaluate-seqs:
all sequences are in the 250-600 bp range. Should I assume that there are no outliers and thus no further filtering is needed?
That's what I did.
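To double-check the length range without relying on a visual check, something like the following could be run on the exported FASTA. This is only a rough sketch: it assumes the artifact has been exported with qiime tools export (which should write a dna-sequences.fasta file for sequence artifacts), the exported-seqs directory name is arbitrary, and fasta_length_summary is a helper name I made up.

```shell
# Hypothetical helper: print the number of sequences plus the shortest and
# longest sequence length in a FASTA file. Sequences wrapped over several
# lines are handled by summing the lines between two headers.
fasta_length_summary() {
  awk '
    function record(l) {
      n++
      if (n == 1 || l < min) min = l
      if (n == 1 || l > max) max = l
    }
    /^>/ { if (seen) record(len); seen = 1; len = 0; next }
         { len += length($0) }
    END  { if (seen) record(len); printf "n=%d min=%d max=%d\n", n, min, max }
  ' "$1"
}

# Assumed usage (paths are illustrative):
#   qiime tools export \
#     --input-path silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza \
#     --output-path exported-seqs
#   fasta_length_summary exported-seqs/dna-sequences.fasta
```

If min and max both fall inside 250-600, that would at least confirm that the --p-min-length and --p-max-length settings were applied to every extracted read.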
Dereplicate extracted region
qiime rescript dereplicate \
  --i-sequences silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza \
  --i-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza \
  --p-rank-handles 'silva' \
  --p-mode 'uniq' \
  --o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-GA20F-RM9R-uniq.qza \
  --o-dereplicated-taxa silva-138.1-lsu-nr99-tax-GA20F-RM9R-derep-uniq.qza
Build amplicon-region specific classifier
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138.1-lsu-nr99-seqs-GA20F-RM9R-uniq.qza \
  --i-reference-taxonomy silva-138.1-lsu-nr99-tax-GA20F-RM9R-derep-uniq.qza \
  --o-classifier silva-138.1-lsu-nr99-GA20F-RM9R-classifier.qza
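In case it is useful, here is a rough sketch of a sanity check for the inputs of this step: it lists reference reads that have no matching row in the taxonomy. It assumes both artifacts were exported with qiime tools export, giving a dna-sequences.fasta and a taxonomy.tsv with a header row and tab-separated columns whose first column is the feature ID. The function name ids_missing_taxonomy is made up for illustration.

```shell
# Hypothetical helper: print the IDs found in the FASTA headers that have
# no row in the taxonomy TSV. Arguments: <fasta> <taxonomy.tsv>.
# Assumes the taxonomy file is non-empty and starts with a header line.
ids_missing_taxonomy() {
  fasta=$1
  taxonomy=$2
  awk -F '\t' '
    # First file (taxonomy): remember every feature ID, skipping the header.
    NR == FNR { if (FNR > 1) known[$1] = 1; next }
    # Second file (FASTA): take the header up to the first space as the ID.
    /^>/ {
      id = substr($1, 2)
      sub(/ .*/, "", id)
      if (!(id in known)) print id
    }
  ' "$taxonomy" "$fasta"
}

# Assumed usage (paths are illustrative):
#   ids_missing_taxonomy exported-seqs/dna-sequences.fasta exported-tax/taxonomy.tsv
```

Empty output would mean every reference read has a taxonomy entry.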
A few weird things:
- The classifier is only 14 MB. I have built classifiers for 16S and 18S and both are around 170 MB. The SILVA LSU database is smaller than the SSU one; does that explain the difference?
- This kind of classification:
d__Eukaryota; p__Chytridiomycota; c__Chytridiomycetes; o__Rhizophydiales; f__; g__; s__Rhizophlyctis_rosea
According to the tutorial, these empty ranks are normal, right?
28S_taxonomy.qza (97.1 KB)
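To put numbers on both points, a sketch like the one below could be run on an exported taxonomy. My understanding is that each distinct taxonomy string is one class for the naive Bayes classifier, so the number of distinct strings may be a better basis for comparing classifier sizes than the number of sequences, but I am not certain that this fully explains the difference. The sketch assumes a taxonomy.tsv produced by qiime tools export, with a header row, the feature ID in column 1 and the taxonomy string in column 2. Both function names are made up.

```shell
# Hypothetical helper: count the distinct taxonomy strings in column 2.
count_unique_taxa() {
  awk -F '\t' 'NR > 1 && !seen[$2]++ { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: for each rank prefix, count the rows in which that
# rank is present but empty (for example "f__" or "g__").
count_empty_ranks() {
  awk -F '\t' '
    NR > 1 {
      n = split($2, ranks, "; *")
      for (i = 1; i <= n; i++)
        if (ranks[i] ~ /^[a-z]+__$/) empty[ranks[i]]++
    }
    END { for (r in empty) print r, empty[r] }
  ' "$1" | sort
}

# Assumed usage (paths are illustrative):
#   qiime tools export --input-path 28S_taxonomy.qza --output-path exported-tax
#   count_unique_taxa exported-tax/taxonomy.tsv
#   count_empty_ranks exported-tax/taxonomy.tsv
```

Running count_unique_taxa on the 16S or 18S taxonomy as well would allow a like-for-like comparison of the number of classes.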
I wanted to upload the classifier and the extracted sequences, but I can't seem to do it.
Uff, sorry for the long post.
Cheers,
Hugo