Weird Taxonomic classification 28S rRNA

HugoEira · February 12, 2021, 12:10pm

Hello,

First and foremost thanks a lot for this amazing forum. This is a great resource.

I am new to qiime2 an I'm playing around with taxonomic classification of bird cloacal microbiota using the 28S rRNA (D4-D6 region; among others).

I created a primmer region specific classifier, using the RESCRIPT plugin and following the tutorial provided here . I used SILVA132.

Results of taxonomic classification seem to make sense, apart from one weird classification:
d__Eukaryota; p__Eukaryota; c__Eukaryota; o__Eukaryota; f__Eukaryota; g__Eukaryota

I was expecting in this case that the classification would be just:
d_Eukaryota

In the results I get both:
d__Eukaryota; p__Eukaryota; c__Eukaryota; o__Eukaryota; f__Eukaryota; g__Eukaryota
and
d_Eukaryota

Suppose this is a problem with the classifier? Is SILVA the best database for this? Any ideas?

This is what I did:

Get SILVA data
qiime rescript get-silva-data --p-version '132' --p-target 'LSURef' --p-include-species-labels --o-silva-sequences silva-132-LSU-nr99-seqs.qza --o-silva-taxonomy silva-132-LSU-nr99-tax.qza
Culling low quality sequences
qiime rescript cull-seqs --i-sequences silva-132-LSU-nr99-seqs.qza --o-clean-sequences silva-132-LSU-nr99-seqs-cleaned.qza
Filtering sequences by length and taxonomy
qiime rescript filter-seqs-length-by-taxon --i-sequences silva-132-LSU-nr99-seqs-cleaned.qza --i-taxonomy silva-132-LSU-nr99-tax.qza --p-labels Archaea Bacteria Eukaryota --p-min-lens 900 1000 1200 --o-filtered-seqs silva-138-ssu-nr99-seqs-filt.qza --o-discarded-seqs silva-138-ssu-nr99-seqs-discard.qza
Dereplication of sequences and taxonomy
qiime rescript dereplicate --i-sequences silva-132-LSU-nr99-seqs-filt.qza --i-taxa silva-132-LSU-nr99-tax.qza --p-rank-handles 'silva' --p-mode 'uniq' --o-dereplicated-sequences silva-132-LSU-nr99-seqs-derep-uniq.qza --o-dereplicated-taxa silva-132-LSU-nr99-tax-derep-uniq.qza
Extract reads on primer region
qiime feature-classifier extract-reads --i-sequences silva-132-LSU-nr99-seqs-derep-uniq.qza --p-f-primer GTAACTTCGGGAWAAGGATTGGCT --p-r-primer AGAGTCAARCTCAACAGGGTCTT --p-min-length 250 --p-max-length 600 --p-n-jobs 2 --p-read-orientation 'forward' --o-reads silva-132-LSU-nr99-seqs-GA20F-RM9R.qza
Dereplicate extracted region
qiime rescript dereplicate --i-sequences silva-132-LSU-nr99-seqs-GA20F-RM9R.qza --i-taxa silva-132-LSU-nr99-tax-derep-uniq.qza --p-rank-handles 'silva' --p-mode 'uniq' --o-dereplicated-sequences silva-132-LSU-nr99-seqs-GA20F-RM9R-uniq.qza --o-dereplicated-taxa silva-132-LSU-nr99-tax-GA20F-RM9R-derep-uniq.qza
Build amplicon-region specific classifier
qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads silva-132-LSU-nr99-seqs-GA20F-RM9R-uniq.qza --i-reference-taxonomy silva-132-LSU-nr99-tax-GA20F-RM9R-derep-uniq.qza --o-classifier silva-132-LSU-nr99-GA20F-RM9R-classifier.qza

Kind regards,
Hugo Pereira

SoilRotifer · February 12, 2021, 3:10pm

Hi @HugoEira, welcome to !

I think you are one of the first to inform us about using RESCRIPt for SILVA LSU data!

If you use your LSU classifier to classify your reads, and observe as the final classification that the upper level taxonomy is propagated downward through all of the ranks, then this means that you scored a hit to a specific sequence in the database that had no information other than 'Eukaryota'. That is, during the initial steps of constructing your database, either through using the pipeline get-silva-data or the action parse-silva-taxonomy, the default setting is to propagate taxonomy with --p-rank-propagation. This ensures there are no empty ranks in the output for some use cases and tools that do not like empty ranks. This can be disabled by using the flag --p-no-rank-propagation to obtain the d__Eukaryota only label. That is, if the resulting classification returns a taxonomy that is not truncated as in the second case, then the classifier is simply returning a specific hit (or hits) to a sequence (or sequences) that all contain the full d__Eukaryota; ... g__Eukaryota string. That is, you may want to consider removing any sequences from your reference database, prior to making the classifier, that do not have at least a phylum or other taxonomy information, as they are not particularly helpful.

So, in these cases you are observing two different classification results. The first is the case that we discussed above. The second, where you only observe d__Eukaryota, is due to the fact that the classifier could not identify, beyond the domain level, what your query sequence was. That is, there could have been several equivalent hits to different Eukaryotes within the database (i.e with different taxonomies), and only the lowest common ancestor taxonomy was returned. That is all the lower ranks are cropped except for the upper most rank that it had reasonable confidence.

Hopefully this makes sense. Anyway you are on the right track!

Here are some additional tips worth considering:

I would recommend playing around with the sequence lengths for this command. The numeric values were tailored to the 16S / 18S rRNA gene data. Since LSU is larger you may want to consider increasing these? I've not benchmarked these, but just something work considering.

You can also pick and choose which ranks you'd like in your classifier too. For more details see the tutorial, and this thread:

Also, do not forget to read our warning about --p-include-species-labels within the Species-labels: caveat emptor! drop menu in the tutorial.

-Good luck and keep us posted!
-Mike

HugoEira · February 12, 2021, 3:54pm

Hi @SoilRotifer,

Thanks so much for the quick reply.

Ok that makes things more clear, and some other results easier to understand I will try it with --p-no-rank-propagation

True, didn't really think about it. Don't really know how to do an informed decision on this.
Taking into account average sizes of this marker in different databases (very empirical knowledge) I probably can increase it to --p-min-lens 1000 1500 2000 or even --p-min-lens 1500 2000 2500.
Any input on this?

At first I wanted to do it with SILVA138 but the "easier way" does not work:
qiime rescript get-silva-data --p-version ‘138’ --p-target ‘LSURef’ --p-include-species-labels --o-silva-sequences silva-138-LSU-nr99-seqs.qza --o-silva-taxonomy silva-138-LSU-nr99-tax.qza

I am still trying to figure it out which docs to download to make it the "hard way".

I'll let you know if these tweaks made it better

Thanks so much,
Hugo

SoilRotifer · February 12, 2021, 4:23pm

I can't really say, it is largely dependent on the sequence lengths present within the SILVA database. You can also simply filter everything to the same length too via filter-seqs-length. Also, one option is to not filter based on length at all until after you've extracted your amplicon region first. That is, you may remove short "full-length" sequences which actually contain your complete amplicon region. We allude to this in the tutorial:

Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt

qiime feature-classifier extract-reads \
    --i-sequences silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --p-n-jobs 2 \
    --p-read-orientation 'forward' \
    --o-reads silva-138-ssu-nr99-seqs-515f-806r.qza
Note / Tip : Depending on your goals, it may also be reasonable to use the raw imported sequences, or output from either cull-seqs or filter-seqs-length[-by-taxon] as input into the above extract-reads command. Be aware that you may need to run reverse-transcribe prior to other steps outlined in this tutorial.

As for your next question...

The SILVA 138 LSU was not available at the time we developed this code. However, I see that SILVA now has provided LSU 138 data for us to pull down. We'll work on updating get-silva-data. Note: this pipeline simply wraps all of the steps outlined in the 'hard way' approach. The steps outline under the "hard way", is essentially our "escape hatch" for cases like this.

It seems like the have two folders: 138_1 and 138.1.
I think you'd want to go to the 138.1 folder and download the following:

These match the file naming schema outlined in the tutorial.

Hopefully, there are no formatting issues with these SILVA 138.1 LSU files. Please let us know how it goes.

-Mike

Nicholas_Bokulich · February 12, 2021, 4:33pm

you could use the evaluate-seqs method to look at the length distribution. This will not of course tell you what length range is "correct" (the literature is probably a better place for that info), but it might give you a sense of any clear outliers.

HugoEira · February 12, 2021, 5:15pm

Thank you so much.
I'll try it out and keep you updated on progress.

Cheers,
Hugo

HugoEira · February 15, 2021, 12:37pm

Hi again,

So I had another go at this. Using SILVA138.1:

This is what I did:
Import data:

Import the Taxonomy Rank file

qiime tools import --type 'FeatureData[SILVATaxonomy]'
--input-path tax_slv_lsu_138.1.txt
--output-path taxranks-silva-138.1-lsu-nr99.qza
Import the Taxonomy Mapping file

qiime tools import --type 'FeatureData[SILVATaxidMap]'
--input-path taxmap_slv_lsu_ref_nr_138.1.txt
--output-path taxmap-silva-138.1-lsu-nr99.qza
Import the Taxonomy Hierarchy Tree file

qiime tools import --type 'Phylogeny[Rooted]'
--input-path tax_slv_lsu_138.1.tre
--output-path taxtree-silva-138.1-nr99.qza
Import the sequence file:

qiime tools import --type 'FeatureData[RNASequence]'
--input-path SILVA_138.1_LSURef_NR99_tax_silva_trunc.fasta
--output-path silva-138-ssu-nr99-seqs.qza

Prepare the silva taxonomy prior to use

qiime rescript parse-silva-taxonomy
--i-taxonomy-tree taxtree-silva-138.1-nr99.qza
--i-taxonomy-map taxmap-silva-138.1-lsu-nr99.qza
--i-taxonomy-ranks taxranks-silva-138.1-lsu-nr99.qza
--p-include-species-labels
--p-no-rank-propagation
--o-taxonomy silva-138.1-lsu-nr99-tax.qza

Culling low quality sequences

qiime rescript cull-seqs
--i-sequences silva-138.1-lsu-nr99-seqs.qza
--o-clean-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza

As suggested before I decided not to filter-seqs-length-by-taxon and do it after extracting the amplicon region.

Dereplication of sequences and taxonomy

qiime rescript dereplicate
--i-sequences silva-138.1-lsu-nr99-seqs-cleaned.qza
--i-taxa silva-138.1-lsu-nr99-tax.qza
--p-rank-handles 'silva'
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza
--o-dereplicated-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza

Extract the amplicon region from reference database

qiime feature-classifier extract-reads
--i-sequences silva-138.1-lsu-nr99-seqs-derep-uniq.qza
--p-f-primer GTAACTTCGGGAWAAGGATTGGCT
--p-r-primer AGAGTCAARCTCAACAGGGTCTT
--p-min-length 250 --p-max-length 600
--p-n-jobs 2
--p-read-orientation 'forward'
--o-reads silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza

After this I visually checked the silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza also tried evaluate-seqs all sequences are in the 250-600bp range. Should I assume that there are no outliers and thus no further filtering is needed?
Thats what I did..

Dereplicate extracted region

qiime rescript dereplicate
--i-sequences silva-138.1-lsu-nr99-seqs-GA20F-RM9R.qza
--i-taxa silva-138.1-lsu-nr99-tax-derep-uniq.qza
--p-rank-handles 'silva'
--p-mode 'uniq'
--o-dereplicated-sequences silva-138.1-lsu-nr99-seqs-GA20F-RM9R-uniq.qza
--o-dereplicated-taxa silva-138.1-lsu-nr99-tax-GA20F-RM9R-derep-uniq.qza

Build amplicon-region specific classifier

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads silva-138.1-lsu-nr99-seqs-GA20F-RM9R-uniq.qza
--i-reference-taxonomy silva-138.1-lsu-nr99-tax-GA20F-RM9R-derep-uniq.qza
--o-classifier silva-138.1-lsu-nr99-GA20F-RM9R-classifier.qza

A few weird things:

The classifier is only 14Mb. I have built classifiers for 16S and 18S and both are around 170Mb. SILVA LSU database is smaller than SSU does that explain the difference?
This kind of classification:
d__Eukaryota; p__Chytridiomycota; c__Chytridiomycetes; o__Rhizophydiales; f__; g__; s__Rhizophlyctis_rosea
According to the tutorial this empty ranks are normal right?

28S_taxonomy.qza (97.1 KB)

I wanted to upload the classifier and extracted sequences but I cant seem to do it.

Uff sorry for the long post
Cheers,
Hugo

SoilRotifer · February 15, 2021, 4:08pm

Hi @HugoEira,

That sounds reasonable to me.

Yeah, the LSU db is much smaller than the SSU db.

If you are not using rank propagation, then yes. I often use rank propagation as it helps with the other downstream use-cases and validation steps that require a rank at each position, as you've noticed.

This all looks sane to me. I guess let us know how well it works out for you.

-Mike

HugoEira · February 15, 2021, 4:26pm

Thanks for the help.

So far it seems to be working, don't have much to compare it to.
Well I guess is time to get back to the lab and get more data

Thanks again.
Cheers,
Hugo

system · March 18, 2021, 10:26pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.