Slightly different taxa with regional and full length taxonomy classifiers

kindergarten · February 4, 2022, 6:47pm

I followed the RESCRIPt tutorial to generate silva-138-ssu-nr99-27f-534r-classifier.qza. The classifier is 99MB, which is smaller than the 515F/806R Silva classifier (141MB) available on QIIME 2 Data Resources. My question: is 99MB a reasonable size for a 27F-534R classifier?
I used this classifier to assign taxonomy to 21 samples that had previously been assigned taxonomy with the Silva v138 full-length classifier (available on Data Resources). The outcomes were not identical, but similar.

SoilRotifer · February 4, 2022, 7:22pm

Hi @kindergarten, welcome to QIIME 2!

That makes sense as the 27f-534r is smaller than 515f-806r. Also, the nucleotide sequence diversity can be quite different between the variable regions. That is, the number of unique sequences left after dereplicating the extracted sequences can differ drastically from one variable region to another.

This is expected.
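As a rough illustration of the dereplication point, here is a minimal Python sketch. The sequences and region coordinates are made up for the example (they are not real 16S data), and the helper names `extract_region` and `dereplicate` are hypothetical, not part of RESCRIPt or QIIME 2.

```python
# Toy reference "database": four made-up full-length sequences.
refs = {
    "ref1": "AAAACCCCGGGGTTTT",
    "ref2": "AAAACCCCGGGGTTAA",
    "ref3": "AAAACCCCGGGATTAA",
    "ref4": "AAATCCCCGGGATTAA",
}


def extract_region(seq, start, end):
    """Slice out a region using 0-based, end-exclusive coordinates."""
    return seq[start:end]


def dereplicate(seqs):
    """Collapse identical sequences so that each unique one is kept once."""
    return set(seqs)


# Two hypothetical "variable regions" cut from the same references.
region_a = [extract_region(s, 0, 8) for s in refs.values()]
region_b = [extract_region(s, 8, 16) for s in refs.values()]

unique_full = dereplicate(refs.values())
unique_a = dereplicate(region_a)
unique_b = dereplicate(region_b)

print(len(unique_full), len(unique_a), len(unique_b))
```

All four full-length sequences are distinct, yet the two regions collapse to different numbers of unique sequences, so a classifier trained on one region can end up with fewer reference sequences than one trained on another.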

-Mike

kindergarten · February 7, 2022, 5:33pm

Hi Mike, thanks for the comments.
Regarding the size, 27F-534R (V1-V3) is about 500 bp, whereas 515F-806R is about 300 bp. So 27F-534R is longer than 515F-806R by about 200 bp and is also about 1/3 of full length. Theoretically, then, the 27F-534R classifier should be around 200MB (515F-806R is 141MB, full-length is about 600MB). I just want to be sure that I did not mess up while creating the 27F-534R classifier. It was my first time building a classifier. Regards.

SoilRotifer · February 7, 2022, 5:52pm

Hi @kindergarten,

Oops, that was a mistake on my part... I got my regions mixed up with the lengths on that one.

I think you should be on the right track. You can always private message me your QZA file, and I can look through your provenance to double-check. But my initial explanation still holds, i.e.:

Also, keep in mind that there are quite a few "nearly full-length" sequences in these reference databases. That is, many reference sequences may not even have the ~27F portion available. Thus, when trimming based on these primer pairs, many reference sequences are lost simply because the 27F primer region is not present in them. This can drastically reduce the number of reference sequences available after primer-based trimming.
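To make that concrete, here is a minimal Python sketch of primer-based extraction. It assumes exact primer matching (real tools tolerate mismatches), and the primers, sequences, and the `extract_amplicon` helper are all invented for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def extract_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the primer sites, or None if a site is missing."""
    start = seq.find(fwd_primer)
    if start == -1:
        return None  # forward primer site absent, e.g. a truncated 5' end
    inner_start = start + len(fwd_primer)
    end = seq.find(reverse_complement(rev_primer), inner_start)
    if end == -1:
        return None  # reverse primer site absent
    return seq[inner_start:end]


# Made-up primers and references (not the real 27F/534R sequences).
FWD = "AGAG"
REV = "TTAC"  # its reverse complement, "GTAA", is what appears in the reference
refs = {
    "full": "CCAGAGTTTGATCGGCGTAACC",
    "truncated": "TTTGATCGGCGTAACC",  # same sequence, but starts after the forward site
}

kept = {}
for name, seq in refs.items():
    amplicon = extract_amplicon(seq, FWD, REV)
    if amplicon is not None:
        kept[name] = amplicon

print(kept)
```

The "truncated" reference contains the whole target region but is dropped anyway, because the forward primer site itself is missing from the sequence.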

There is another approach you can try... qiime rescript trim-alignment ... If you know the primer locations as they'd map to the SILVA alignment, you can simply enter them and keep everything from the original SILVA alignment within those column positions. You'd have to download the alignment from SILVA, import it, and then trim it. After trimming, be sure to run rescript degap-seqs ...
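The idea behind trimming by alignment columns and then degapping can be sketched in plain Python. This is only a toy model of the concept, not the RESCRIPt implementation: the alignment, the column positions, the `trim_alignment` and `degap` helpers, and the minimum-length cutoff are all assumptions made for the example.

```python
# Made-up alignment; "-" and "." both stand for gap/missing columns.
aligned = {
    "refA": "--ACGT-ACGGT--",
    "refB": "TTACGTTAC-GTAA",
    "refC": "------.AC-GTAA",  # 5' end mostly missing
}


def trim_alignment(aln, start, end):
    """Keep only alignment columns start..end (0-based, end-exclusive)."""
    return {name: seq[start:end] for name, seq in aln.items()}


def degap(aln, min_len=1):
    """Remove gap characters and drop sequences shorter than min_len."""
    out = {}
    for name, seq in aln.items():
        ungapped = seq.replace("-", "").replace(".", "")
        if len(ungapped) >= min_len:
            out[name] = ungapped
    return out


# Hypothetical primer positions mapped onto the alignment columns.
trimmed = trim_alignment(aligned, 2, 12)
degapped = degap(trimmed, min_len=5)

print(degapped)
```

Because the cut is made by column position rather than by finding the primer in each sequence, a reference that lacks the primer site still contributes whatever part of the region it has; whether to keep such short fragments is then a separate length-filtering decision.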

Some details / explanation can be found from step 7 of this older pipeline. The rescript trim-alignment command is intended to replace this step.

-Cheers!
-Mike

kindergarten · February 8, 2022, 8:35pm

Mike, I would highly appreciate it if you could inspect my classifier.qza. How do I private message you?

I assumed, naively, that the Silva database has only full-length 16S rRNA sequences. I have a lot to learn.

Thanks again

SoilRotifer · February 8, 2022, 10:25pm

Hi @kindergarten, not a problem. That is why we're here.

All you need to do is click on my name / avatar and you'll see a dialog. Then you simply click "Message".
