FOR ALL PEOPLE WHO LIKE ME HAVE NO EXPERIENCE WITH PYTHON:
the ready to use script is (see attached) script.py (284 Bytes) .
I used the phylogenetic tree 99, if you use a different tree you must open the script and replace the number in the text.
To use it in the shell just write:
python script.py xx_otus.tree
the output of this command will automatically write the file in the format you need xx_otus.qza
I have read in the web how to use python and it was very simple using the script example of the link suggested by Nicholas.
Hi @rparadiso,
Bravo! I hope this is a first step in a long and fruitful journey with python
Note, you can use python directly by opening a python interpreter. Just type "python" into your terminal and you are in — I should have clarified before that the example shown at that first link is showing how to fix this directly in the python interpreter, not in bash.
But if you prefer to make a script so you don't need to use the python interpreter, that's fine too! I made some little changes to your script so that you do not need to open the file and rename the input file, you can just write the input filepath on the command line. fix_tree.py (286 Bytes)
Use like this: python /path/to/fix_tree.py /path/to/99_otus.tree
(change to include the true filepaths to the script and tree on your computer)
This will read in the newick-format tree file and output a Phylogeny[Rooted] artifact with the same file name as the input file, but with the .qza extension.
Thanks again for the above fix_tree.py script, @rparadiso and @Nicholas_Bokulich! We were able to complete our analysis with out new proposed pipeline thanks to this.
Now, @Lauren and I are trying to complete the same pipeline again (DADA2 --> closed ref OTU clustering --> core metrics analysis --> taxonomy using VSEARCH), but this time using SILVA as our reference database instead of Greengenes. I was able to download the folder called "SILVA_132_QIIME_release" from the SILVA website (Archive), which has all of the equivalent files to the Greengenes 13.8 release folder we used before.
We were able to complete the closed reference OTU clustering, and we successfully ran the fix_tree.py script on the trees provided in the SILVA folder without error. However, when we used the resulting XX_otus.qza file made from the tree for the core metrics analysis, we received the following error:
It says the tree must be rooted, and therefore I guess the python script didn't work? Is the script designed specifically for a Greengenes tree, and if so, is there a way we could modify it?
Hi @cjone228, yes that script was specifically for greengenes trees, which appear to be missing the root branch length, so does not need to be run on the SILVA reference tree.
I am not totally sure but the SILVA trees might be unrooted, in which case you should import as unrooted and root it with qiime phylogeny midpoint-root.
Note that after using closed-ref OTU clustering there is no need to assign taxonomy. This is because closed-ref is aligning your sequences against a reference sequence, and the closest hit is taken as the representative sequence. Thus, you can just adopt the reference tree and taxonomy instead of building/classifying your own. There is no need to reclassify these sequences, since they consist of the full-length reference sequences at that point and you would just align them back against themselves.
This is a weakness with closed-reference clustering, in my opinion. It means that you must adopt the taxonomy (including any species information) of the closest match, but does not give you any confidence that that is the "correct" taxonomy (e.g., there could be an equally good match in the database that belongs to a different species but closed-ref just takes one top hit so would give you the false impression that the species information is correct).
In your case, the strengths of closed-ref (for coalescing multiple amplicons into a single full-length sequence and tree) outweigh these weaknesses (taxonomy and losing sequences that do not align to the reference)... so all I am recommending is that you continue with your proposed pipeline but (a) don't bother with taxonomy classification and (b) take species IDs with a grain of salt.
You could classify your ASVs taxonomically, collapse at species level, and then compare those results to the results from closed-ref to get a sense of how accurate the closed-ref species-level information is.
Thank you so much for the reply. The qiime phylogeny midpoint-root method fixed the issue! I was wondering, how come we didn't apply this method for the Greengenes tree earlier instead of using the fix_tree.py script?
Also, @Lauren and I were wondering if you could clarify what you mean by adopting the reference taxonomy? Below is an example what we did for our mock community with the Greengenes files:
Because the greengenes tree is already rooted, but the root is missing the length annotation, so it's a different problem entirely.
When you do closed-reference OTU clustering, the output representative sequences are actually the reference sequences that are top hits to your query seqs, and the feature IDs are the reference sequence IDs that correspond to the reference taxonomy for those seqs. So if these are your original query seqs:
id
sequence
s1
AAAAAA
s2
CCCCCC
and these are your ref seqs:
id
sequence
REF1
GTGTACGATGCTGACAAAAAAGTGCTGTAGTCGTGC
REF2
ATGGCTGTCTGCTGCCCCCCCCCGTGCTGTAGTGTC
REF3
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGG
REF4
ACACACACACACACACACACACACACACACACACAC
then the output of closed-ref clustering will yield representative seqs that look like this:
id
sequence
REF1
GTGTACGATGCTGACAAAAAAGTGCTGTAGTCGTGC
REF2
ATGGCTGTCTGCTGCCCCCCCCCGTGCTGTAGTGTC
So these are now the full-length reference sequences, and you can just look up the taxonomy associated with those reference sequences, instead of attempting to reclassify them.
So following your example, if this file is the output of closed-ref clustering:
Then there is no reason to classify with classify-consensus-vsearch, just lookup the taxonomy in this file:
and any QIIME 2 action to which you would pass this artifact (e.g., to make barplots):
can instead just accept the reference taxonomy as input to map the taxonomy associated with your closed-ref OTU representative sequences: