RESCRIPt, SEPP and database correlation

jwdebelius · January 29, 2021, 8:16pm

Hello all,

I'm working on a somewhat specific application where the leaves on my tree must match the representative database I use for fragment insertion. I tried getting my database from RESCRIPt as well as downloading the one from the Silva db. The Silva 128 from RESCRIPt ends up being ~500,000 sequences after i remove Eukeryotes. The downloaded database has about 300,000 sequences, but the sample IDs are not a subset. For reasons^TM, I'd like to use RESCRIPt to get and format my database, but I'm concerned about the lack of overlap between the SEPP tree and the database.

Anyway, suggestions, ways to make this easier, etc all welcome.

My RESCRIPt database was generated using these steps:

qiime rescript get-silva-data \
 --p-version 128 \
 --p-no-rank-propagation \
 --o-silva-taxonomy silva-128-taxonomy-raw.qza \
 --o-silva-sequences silva-128-seqs-raw.qza

qiime rescript cull-seqs \
 --i-sequences silva-128-seqs-raw.qza \
 --p-num-degenerates 5 \
 --o-clean-sequences silva-128-seqs-degen.qza

qiime taxa filter-seqs \ 
 --i-sequences silva-128-seqs-degen.qza \
 --i-taxonomy  silva-128-taxonomy-raw.qza \
 --p-exclude 'd__Euk' \
 --p-mode 'contains' \
 --o-filtered-sequences silva-128-bacteria-archea.qza

qiime rescript filter-seqs-length-by-taxon \
 --i-sequences silva-128-bacteria-archea.qza \
 --i-taxonomy silva-128-taxonomy-raw.qza \
 --p-labels Archaea Bacteria \
 --p-min-lens 900 1200 \
 --o-filtered-seqs silva-128-bacteria-archea-length-keep.qza \
 --o-discarded-seqs silva-128-bacteria-archea-length-discard.qza

qiime rescript dereplicate \
 --i-sequences silva-128-bacteria-archea-length-keep.qza  \
 --i-taxa silva-128-taxonomy-raw.qza \
 --p-rank-handles 'silva' \
 --p-mode 'uniq' \
 --o-dereplicated-sequences silva-128-seqs.qza \
 --o-dereplicated-taxa silva-128-taxonomy.qza

Nicholas_Bokulich · January 30, 2021, 10:06am

Sounds like an exciting secret project

Does sepp somehow reformat the seqIDs? it is surprising that there is not at least partial overlap! Could you share a sampling of IDs from the seqs and tree?

The ideal of course would be something we've discussed in the past — adding a method to build a sepp reference tree in fragment insertion or elsewhere... then rescript could be used to get/format seqs/ref trees, and plug directly into fragment-insertion. Some relevant discussion here and here:

jwdebelius · February 1, 2021, 4:58pm

Hi @Nicholas_Bokulich,

As far as I know, SEPP doesn't change database IDs. But, the fact that my RESCRIPt database doesn't overlap with the databse from the Silva site (by 200K sequences) and SEPP was built pre-RESCRIPt, I assume there's not a perfect overlap. It's the one case where the insertion tree has to cover the reference database. Otherwise, I have to back fitler the reference database to match the tree. (Which probably means a new, complex method, whoo!)

I've looked at building my own tree before and ended up intimidated by the verification and compute requirements, and I'm not sure it solves the question of why the two databases from the same version and dataset have different sequences.

Best,
Justine

Nicholas_Bokulich · February 1, 2021, 5:32pm

ah got it... a partial overlap then? the SEPP tree was probably built with all seqs from SILVA, and the cull-seqs and other filtering steps probably remove seqs that are in the tree then.

the degree of overlap would be useful to know, though, before finding a solution.

would be a useful method to add to q2-phylogeny

would have other practical uses as well... e.g., if seqs fail to insert into a tree, remove those seqs prior to taxonomy classification or other uses...

jwdebelius · February 2, 2021, 7:11pm

Hi @Nicholas_Bokulich,

ID Set	number of sequences
Silva 128 from site	`347,146`
Silva 128 download from rescript	`645,151`
SEPP silva 128 tips	`395,440`
site & rescript	`289,103`
rescript - site	`356,048`
site - rescript	`58,043`
SEPP & rescript	`327,809`
rescript - SEPP	`317,342`
SEPP - rescript	`67,631`
site - SEPP	`1`

Then, I tried filtering the rescript data to remove Eukeryotes, cull degenerates and homo polymers and do length filtering in the rescript database. I ended up with a filtered database with 612,514 sequences, of which 303,276 were in the tree and 309,238 were not.

The filtered data has 47,121 duplicated sequences, a total of 5,001 appear in the insertion tree.

I'm a little bit overwhelmed with the fact that my SEPP tree is missing more than half of my rescript database. And, maybe it's just easiest to design some kind of function to filter the database based on the SEPP reference set.

But, it's confusing that I have three different databases with three different disjoint ID sets.

Best,
Justine

Nicholas_Bokulich · February 3, 2021, 7:45am

Agreed, somewhat confusing. Some ideas:

I am guessing that some of the differences here might be due to the SSU Ref vs. NR99 versions, and different OTU clustering. "Site" is most likely NR99... is that also what you downloaded with RESCRIPt?

If "site" is the "qiime compatible release", note that this is not what RESCRIPt downloads and reformats... so that is probably a major source of divergence, that different OTU clustering steps are being performed, and different centroids chosen.

SEPP is probably based on the NR99 and doing its own filtering...

Given the very partial overlap, and that this is probably caused by different versions/filtering/clustering steps, I would worry a bit about just taking the intersection. how about this?

filter the seq database to grab the unique/non-overlapping IDs
use fragment insertion to insert those into the reference tree
remove the unique/non-overlapping IDs from the new reference tree

But ultimately I think that building a new reference tree from the reference sequences would be a less convoluted and maybe even faster route!

jwdebelius · February 3, 2021, 6:15pm

My struggle is that I'd prefer not to encourage people to build their own SEPP reference because I don't want to end up trying to support that in the same way that we try and support people building their own classifiers. (I also find it daunting) to do myself, so

I think for now, I'll just use the QIIME compatible and the pre-built SEPP tree .

I will probably need to consider a new insertion tree in the future, once our supercomputer allocation re-ups .

I think in general my conclusion is that more transparency between the 3 datasets would be helpful. And, I know RESCRIPt is intended to do that, which I do totally appreciate. But, the interplay makes edge cases like mine a little bit hard.

Thanks!
Justine

Nicholas_Bokulich · February 3, 2021, 6:28pm

fair point. I do not see this being a routine everyone-run-it sort of thing, though — rather, each SILVA (or other database) release we can run a pipeline on our cluster to prep the seqs, taxonomy, and tree and share them in data resources, just as we do currently with RESCRIPt for seqs + tree.

Then there is no need for everyone to run this every time... they can just grab the pre-formatted data from data resources, trim or filter as needed for their application, and then prune the tree.

Looks like these still differ by around 50k IDs, so if you need exact matches between the two you run into the same issue as with rescript vs. sepp

Anyway, good luck! Sounds like a challenging/interesting edge case!

SoilRotifer · February 3, 2021, 10:23pm

Hi @jwdebelius & @Nicholas_Bokulich,

I think Nick covered all the questions I was going to as as well. I do like this:

I've honestly been considering constructing a de novo phylogeny based on the curated alignment from SILVA and see how that compares to the SEPP tree as well as the full tree you can extract from the ARB file. Perhaps something worth investigating?

My plan for making a de novo phylogeny would be to:

process the unaligned sequence files through RESCRIPt.
use the IDs/Seqs of what remains to filter the aligned sequence file to the same set
entropy filter / remove column only gaps from the filtered alignment
construct a tree with IQ-TREE 2.

We have everything baked in to do this, except for reading in an RNA alignment. I have a PR to read that in as a temporary option. Though this will likely go away based on the ongoing refactoring of some of QIIME 2's sequence types.

-Mike