Getting the correct Silva 128 file for Sidle following the rescript download

Hi @jwdebelius, @elsamdea and @nandreani,

I am having problems reconstructing the phylogenetic tree at this step.

I used the SILVA 128 database to complete the Sidle tutorial with my data. The database was initially downloaded with the following command and processed following the tutorial.

qiime rescript get-silva-data \
    --p-version '128' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences ./silva_128_ssu_nr99_rna_seqs.qza \
    --o-silva-taxonomy ./silva_128_ssu_nr99_tax.qza

When I wanted to reconstruct the phylogenetic tree using qiime sidle reconstruct-fragment-rep-seqs, I downloaded the file Silva_128_release.tgz and passed the FASTA file 99_otus_aligned.fasta, from the rep_set_aligned folder, to the --i-aligned-sequences parameter. I obtained the error in the attached file. It looks as if I were using information from different SILVA versions, but I don't think that is the case.

Maybe I'm not using the correct aligned-sequences file?

Regards,

Andrés

qiime2-q2cli-err-iyyulykt.txt (104.5 KB)

Hi again,

To complete my previous question, I would like to add the summary visualization for the reconstruction map, in case it is useful.

Andrés

database_recons_summ.qzv (6.8 MB)

Hi @andresarroyo,

I apologize for my late reply. I'm struggling a little bit, because the error message is saying that the sequences aren't in your files. That means one of a few things. The first is that you somehow have a database mismatch. (Full disclosure: a lot of my Silva 128 tests were pre-RESCRIPt.) It could also be that a prefixing step got skipped, although those look like normal Silva IDs. So, I'm not sure why there is such a discontinuity between the two sets.

Best,
Justine

Hello @jwdebelius,

Sorry for my late reply; I had other projects and I'm coming back now to try to solve this problem. As we said, there seems to be a mismatch between the SILVA 128 IDs and the 99_otus_aligned.fasta IDs, which causes problems when I try to reconstruct the fragment representative sequences (qiime sidle reconstruct-fragment-rep-seqs) just before tree reconstruction.

I have worked with both the SILVA 128 SSU Ref and SILVA 128 SSU Ref NR99 versions together with 99_otus_aligned.fasta, and I have checked that the problem occurs with both versions.

After the database filtering steps, SILVA 128 SSU Ref and SILVA 128 SSU Ref NR99 had 382,839 and 328,454 sequences, respectively, while the 99_otus_aligned.fasta file has 395,440 sequences. A total of 97,209 SSU Ref IDs and 80,587 SSU Ref NR99 IDs had no match in 99_otus_aligned.fasta and were lost.
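One way to double-check counts like these is to compare the header IDs of the two FASTA files directly. The following is a minimal shell sketch, not part of Sidle or RESCRIPt. The function names (fasta_ids, ids_missing_from, count_shared_ids) are hypothetical helpers, and the sketch assumes that both inputs are plain FASTA files (for example, artifacts exported with qiime tools export) and that the sequence ID is the header text up to the first whitespace.

```shell
# Sketch only: helper names are hypothetical, not from any QIIME 2 plugin.
# Assumption: the ID is the header text after '>' up to the first whitespace.

# Print the sorted, unique sequence IDs of a FASTA file.
fasta_ids() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//' | LC_ALL=C sort -u
}

# Print the IDs found in the first FASTA file but absent from the second.
ids_missing_from() {
    _ids_a=$(mktemp)
    _ids_b=$(mktemp)
    fasta_ids "$1" > "$_ids_a"
    fasta_ids "$2" > "$_ids_b"
    # comm -23 keeps lines unique to the first sorted input.
    LC_ALL=C comm -23 "$_ids_a" "$_ids_b"
    rm -f "$_ids_a" "$_ids_b"
}

# Print the number of IDs present in both FASTA files.
count_shared_ids() {
    _ids_a=$(mktemp)
    _ids_b=$(mktemp)
    fasta_ids "$1" > "$_ids_a"
    fasta_ids "$2" > "$_ids_b"
    # comm -12 keeps only lines common to both sorted inputs.
    LC_ALL=C comm -12 "$_ids_a" "$_ids_b" | wc -l | tr -d ' '
    rm -f "$_ids_a" "$_ids_b"
}
```

For example, `ids_missing_from filtered_silva.fasta 99_otus_aligned.fasta | wc -l` would count the database IDs with no aligned counterpart, where filtered_silva.fasta is a placeholder name for the exported, filtered database. Looking at a few of the missing IDs next to the aligned IDs can also show whether the difference is a prefix or label change rather than a truly absent sequence.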

Additionally, I have checked the SILVA_128_notes.txt file written by @SoilRotifer to clarify how 99_otus_aligned.fasta was obtained. If I'm not wrong, the starting point for obtaining the representative aligned sequences at the different identity levels (80%, 90%, 94%, 97% and 99%) was the SILVA_128_SSURef_tax_silva_full_align_trunc.fasta.gz file. This file has 1,922,213 sequences, and its IDs match perfectly the IDs that remain after filtering both SILVA 128 databases. All IDs in 99_otus_aligned.fasta are also in SILVA_128_SSURef_tax_silva_full_align_trunc.fasta.gz. I don't know why we see the mismatch, but I think the only step in the Silva_128_notes.txt workflow where the labels are manipulated is in the 4th paragraph, with the fix_fasta_labels.py script.

Maybe this information can give you some idea about what is happening :thinking:

Alternatively, how reliable would it be to generate a new alignment from the sequences that remain after filtering SILVA 128 with RESCRIPt, using qiime alignment mafft?

My main objective here is to reconstruct the tree so I can work with qiime picrust2 custom-tree-pipeline. I have read in another post that Sidle results are not compatible with PICRUSt2, but I don't know whether this applies only to the representative sequences, because the output of qiime sidle reconstruct-fragment-rep-seqs is specific to tree reconstruction and is not suitable for PICRUSt2. So, could the Sidle reconstructed table and the Sidle reconstructed tree be used as inputs to PICRUSt2?

If I can't reconstruct the phylogenetic tree, I have read that an alternative is to select one V region and use its feature table and representative sequences in PICRUSt2, but I'm not sure how suitable and justifiable this approach is when I have used the reconstructed results for the previous analyses (diversity, differential abundance analysis...). Could using only one V region cause problems with reviewers later? In addition, I think this is important because if I can't generate the tree, I would switch from SILVA 128 to SILVA 138.

Unfortunately, other alternatives such as Tax4Fun2 are currently out of service for technical reasons.

Best,

Andrés

Hi @andresarroyo,

So I don't have great insight on the tree/sequences, maybe @SoilRotifer can help more with that?
There might need to be an additional cross-filtering step of figuring out which sequences are in which iteration. (I'm sorry, I've also been pulled off Sidle onto other projects).

You'd essentially have to build a new SEPP insertion tree backbone. If you did this, lots of people would be grateful, but I don't know of anyone who has a new one. We've been talking about it for at least 4 years and no one has actually done it.

As far as PICRUSt goes, I have two questions, one more polite than the other. I'll start with the ruder question, which is: why do you need to run PICRUSt in the first place? Do you have a specific hypothesis that you think can be tested via functional inference, or is it because you think that's what you "should" do for a 16S analysis because having a taxonomic description is insufficient for your data?

Either way, Sidle won't play nicely with the way PICRUSt 2 is structured (it might work with PICRUSt 1 if you could get perfect resolution, but I don't think that's recommended). So, if you had a hypothesis in a well-defined environment, I would recommend functional inference on a single region, since that would be the closest recommended solution for PICRUSt.

Best,
Justine

Thanks for your quick reply, @jwdebelius.

I have taken note of the SEPP insertion tree backbone and would like to try it. Do you know of any specific material for learning about it? I have never done anything similar.

Regarding the use of PICRUSt2, the project is focused on cancer microbiomics, so we compare healthy and cancer patients. Based on previous results, we expect specific biomarkers and biological functions/metabolic pathways to be more strongly associated with cancer patients. As we don't have direct functional data (e.g. metatranscriptomics), I thought that PICRUSt2 could be a good alternative, despite its limitations. Do you think its use is appropriate in this context?

So, I will think about selecting a specific V region for PICRUSt2. I hope that selecting a single region for one specific analysis will not be a problem. Which aspects do you consider important when selecting the region? Sequencing coverage and/or diversity indexes in each V region? Maybe the most informative V region in SILVA?

Best,

Andrés

Hi @andresarroyo,

Here's the repository where trees were built:

If you do build one, we would love it if you could share it with the community.

I'm not sure the use of PICRUSt is generally appropriate, but if you're working in a well-defined environment, you can probably get away with it. I would look at interpretation before you run all your data through, though, and understand what the feature table will be. Make sure that you can analyze the data you'll get before you go run the code.

As far as regions go, I'd look for a better defined and characterized region, probably something close to V4.

Remember that PICRUSt2 uses a different tree and configuration, so your prediction is going to be based off that.

Best,
Justine

Thank you so much for all the information, @jwdebelius! I will take your advice into account.

Best,

Andrés
