Greengenes Versus Silva

Tiago_Bruno_Rezende · January 29, 2020, 3:09am

Hello everyone,

I know, there are at least 10 topics in the forum with the same title. I am posting this one as a last resort to get answers to a few questions about Silva. I should say up front that I am not a heavy user of 16S analysis, so please pardon my naivety.

I am still using Greengenes when I need 16S results, and my attempts to move to Silva have failed. I now see that Greengenes is badly outdated compared to Silva, so I think it's time for a definitive update.

A few questions I couldn't find an explanation for anywhere else:

Why does the SILVA database have uracils instead of thymines in its FASTA sequences? Am I downloading the wrong database file? I understand that this is rRNA, but sequences made with reverse transcriptase are DNA, so I don't understand why Greengenes wasn't like that. I looked for SILVA_138_SSURef_NR99_tax_silva.fasta, which contains a non-redundant set of sequences. Does 16S software like vsearch recognize the Us as equivalent to Ts? (Why not just change the Us in the database to Ts, like everywhere else?)
Does Silva provide a 16S-only database, or is it always mixed with 18S? I find it a bit annoying to have 18S sequences in the database when I am not amplifying 18S. Also, what if I am using 16S primers and somehow get contaminating 18S OTUs due to similar sequences?
In which situations would I prefer the aligned database over the regular FASTA database? What advantages (or downsides) come with the align files? I get overwhelmed just by looking at them; it's a crazy one.

Thanks for all the help and sorry for your trouble.

SoilRotifer · January 29, 2020, 4:43pm

Hi @Tiago_Bruno_Rezende, welcome to QIIME 2!

Great question! The SILVA alignment uses uracils instead of thymines because the curated sequence alignment is informed by secondary structure in order to reduce alignment ambiguity. So we honor the reality of the rRNA molecule when using this secondary structure information to inform the alignment. Also, it is easy enough to simply replace the Us with Ts when needed.
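For the "replace these when needed" step, here is a minimal sketch in plain Python. It assumes an ordinary FASTA layout where header lines start with `>`; the function name `rna_to_dna` and the demo record are made up for illustration, not taken from SILVA or QIIME 2. Only sequence lines are translated, so any U in a header (taxonomy strings, for example) is left alone.

```python
import io

def rna_to_dna(fasta_in, fasta_out):
    """Copy a FASTA stream, replacing U/u with T/t on sequence lines only."""
    table = str.maketrans("Uu", "Tt")
    for line in fasta_in:
        if line.startswith(">"):
            fasta_out.write(line)  # header lines are copied unchanged
        else:
            fasta_out.write(line.translate(table))

# Tiny in-memory demonstration with a made-up record (not real SILVA data)
demo = io.StringIO(">seq1 Bacteria;Uncultured\nAGAGUUUGAUCCUGGCUCAG\n")
out = io.StringIO()
rna_to_dna(demo, out)
converted = out.getvalue()
```

For a real file, the same function should work with two handles from `open(...)`, one opened for reading and one for writing, in place of the `StringIO` objects.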

Nope, though I am sure you can find files generated by third parties. The 16S and 18S rRNA genes are in fact homologues, which is why it is valid to keep them in the same alignment. I personally prefer to have the 18S rRNA gene sequence data present in my reference taxonomy and sequence files. This helps with the identification and removal of off-target (unwanted) 16S and 18S sequences, as these will be classified as such. It is quite common for primers to amplify off-target sequences from host organisms, hence the occasional need for blocking primers or peptide nucleic acid (PNA) clamps.
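If you still want a prokaryote-only reference despite the trade-off above, a rough sketch of filtering by domain could look like the following. It assumes each header has the form `>ID Domain;Phylum;...`, with the taxonomy string following the first whitespace; check your own file before relying on that. The function name `filter_by_domain` and the two demo records are hypothetical.

```python
import io

def filter_by_domain(fasta_in, fasta_out, keep=("Bacteria", "Archaea")):
    """Write only records whose first taxonomy rank is in `keep`.

    Assumes headers look like '>ID Domain;Phylum;...'. Returns the
    number of records written.
    """
    keeping = False
    kept = 0
    for line in fasta_in:
        if line.startswith(">"):
            # Split off the ID; whatever follows is treated as the taxonomy
            parts = line[1:].split(None, 1)
            taxonomy = parts[1] if len(parts) > 1 else ""
            domain = taxonomy.split(";", 1)[0].strip()
            keeping = domain in keep
            if keeping:
                kept += 1
        if keeping:
            fasta_out.write(line)  # header and its sequence lines
    return kept

# Made-up records for illustration only
demo = io.StringIO(
    ">a1 Bacteria;Firmicutes\nACGU\nACGU\n"
    ">e1 Eukaryota;Metazoa\nGGCC\n"
)
out = io.StringIO()
n_kept = filter_by_domain(demo, out)
filtered = out.getvalue()
```

Keep in mind that dropping the eukaryotic records means off-target 18S reads can no longer be classified as such, which is the reason given above for keeping them.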

I refer you again to the topic of reducing alignment ambiguity, and to this SINA tutorial, which is still a work in progress. Historically, tools like PyNAST, Infernal, and SINA use a curated, secondary-structure-informed alignment to guide the alignment of unaligned sequence data. The idea is that this creates a more robust alignment for generating an improved de novo phylogeny. That said, in many cases tools like MAFFT appear to perform well enough without secondary structure information, though your mileage may vary.

-I hope this helps!
-Mike

Tiago_Bruno_Rezende · January 29, 2020, 4:48pm

Exactly the level of detail I was looking for. Thank you!!

SoilRotifer · January 29, 2020, 6:14pm

Glad we were able to help @Tiago_Bruno_Rezende!

-Mike