Changing tip labels on rooted-tree.qza from sequence identifiers to SILVA taxa annotations using R

buserj · March 10, 2020, 7:32pm

Hello all, and thank you in advance for your time.

My question relates to the rooted-tree.qza created in QIIME 2 2019.4, and how to change the tip labels from sequence IDs from DADA2 to taxa annotations, using R.

I have exported rooted-tree.qza to newick format:

qiime tools export
--input-path rooted-tree.qza
--output-path exported-tree

...and uploaded it to R using the package 'ape':

phy <- read.tree(file = "tree.nwk")

Additionally, I have exported and modified the taxonomy.qza created with SILVA classifier.

qiime tools export taxonomy.qza --output-dir exp-tree
echo $'LABELS\nSEPARATOR TAB\nDATA' > taxonomy.txt
sed "1d" taxonomy.tsv | cut -f1,2 >> taxonomy.txt

I then uploaded taxonomy.txt as a df in R, as seen below:

head(newtips)
seq.id
1 677dc4d80ebf1de5f0abff920ff79a3e
2 6bf1a63f1d1b1de9479fb487edb3c077
3 1170dd578d8b706ee98d951cc6c6e6b0
4 7059e70b32a6e91148afa4a361a2f2ef
5 96d7fe52dfcf9a158701479bd8af75e0
6 ff232c0997800a1e4b231ac8856f682a
taxa
1 D_0__Archaea;D_1__Euryarchaeota;D_2__Methanomicrobia;D_3__Methanosarcinales;D_4__Methanosarcinaceae;D_5__Methanolobus
2 D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Enterobacteriales;D_4__Enterobacteriaceae;D_5__Escherichia-Shigella
3 D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Chloroplast
4 D_0__Bacteria;D_1__Cyanobacteria;D_2__Oxyphotobacteria;D_3__Chloroplast
5 D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;D_4__Bacteroidetes vadinHA17
6 D_0__Bacteria;D_1__Proteobacteria;D_2__Deltaproteobacteria;D_3__Syntrophobacterales;D_4__Syntrophaceae;D_5__Syntrophus

My goal is to match the sequence IDs on my tree (vector named phy) to the annotations dataframe (newtips). I have attempted to use join(), to replace the IDs, as seen below:

phytips <- phy$tip.label
relabeled.phy <- join(newtips, phytips)

...but this has been unsuccessful. I have also annotated my tree using iTOL, yet my exported newick file is saved as .txt, posing new and additional difficulty. I am open to any suggestions?

Thank you for your patience and help,

Jessica

timanix · March 10, 2020, 8:33pm

Hi!
I don't think that you can just replace sequence ID's by taxonomy annotations since they are not equally abundant - several ASVs may be assigned to the same taxonomy. But you still can combine them.
When I needed a tree with taxonomy annotations I exported a biom table, taxonomy file and rep-seq.qza, modified IDs in the same way, imported back and recreated a tree with those files. Be aware of numerous symbols that are in taxonomy and may be not allowed by the plugin for tree construction.

buserj · March 10, 2020, 9:08pm

Hi Timanix,

This was incredibly helpful. Thank you for addressing my concerns!

buserj · March 10, 2020, 10:24pm

Quick follow-up question:

I have successfully aligned the sequence IDs with the annotations within R, yet I am having difficulty importing my transformed rep-seqs.txt file back into QIIME. I have converted it to a json .biom format file, yet what "importable type" should I be using? I have tried the following:

qiime tools import --input-path rep-seq-transformed.biom --type 'FeatureData[AlignedSequence]' --input-format BIOMV210Format --output-path rep-seqs-transformed.qza

and received the following error:

An unexpected error has occurred:

No transformation from <class 'q2_types.feature_table._format.BIOMV210Format'> to <class 'qiime2.plugin.model.directory_format.AlignedDNASequencesDirectoryFormat'>

I appreciate any suggestions you may have for me?

Thank you.

timanix · March 11, 2020, 5:46am

Hi again!

You don't need to convert it to biom file for rep-seq.qza since it should be in fasta format.

qiime tools import \
    --input-path fasta_file \
    --output-path modified-rep-seqs.qza \
    --type 'FeatureData[Sequence]'

The steps I performed.

Exported table.qza as a biom. converted biom to .tsv. Modified a .tsv file, converted to biom and imported as table.qza
Exported rep-seq.qza. Modified sequence ID's in the same way as in 1 step. Imported back to rep-seq.qza
Exported taxonomy.qza as a .tsv table, modified in the same way as first two, imported back
Builded a tree in qiime2.

buserj · March 11, 2020, 7:03pm

Thank you. This has been very helpful!

Sam_Degregori · March 17, 2022, 6:51pm

Hi all,

Sorry to bump this old thread but

I was wondering if there his now a faster way to convert a qiime rooted-tree.qza into an R friendly tree with taxa annotations rather than feature IDs?
If not, following @timanix 's method, I basically have to replace all seq IDs with taxonomy IDs and then do I simply combine them so there aren't duplicates? And by combining I would have to add up the abundance values right so the counts aren't lost?

Thank you!

timanix · March 17, 2022, 7:28pm

Hello!
Unfortunately, I am not aware of Qiime2 way to rename ids in already constructed tree.

To keep lower resolution, I take last available taxonomy rank and add sequence ID to it, separated by "|".

Sam_Degregori · March 19, 2022, 5:06pm

Hi @timanix,

Thanks for the quick response. So I ended up having to filter my repseqs and make a new taxonomy so that everything lined up. Now I am stuck at the rep seqs step because it is my understanding that I cannot simply edit this in excel....
Do I basically need to write some script to replace every feature ID in the rep seq with the "tax;tax;tax|seqID" format that you mentioned?
Thanks,
Sam

timanix · March 19, 2022, 6:02pm

Hi,
Yeah, it is easier to write a script for it in python, R or other languages to create a new fasta file with modified feature IDs. I used biopython library in python to read original rep-sequences as a fasta file and to write a modified copy of it.

Sam_Degregori · March 23, 2022, 12:28am

After some trial and error I finally got a working biopython script and currently making a tree in qiime2 with the new rep seqs. Thank you for the guidance. Here is the code for anyone interested that combines taxonomy and seqids together and modifies the repseq fasta file (otuids_list is a python list of your seqids and tax_list is is a python list of your taxonomy-- both in the correct order):

# define new_list as dict with keys being sequence ids and values the taxonomy
new_list = {id: tax for id, tax in zip(otuids_list, tax_list)} # you need to provide this somehow
original = [s for s in SeqIO.parse('allmergedrep-seqsf.fasta', 'fasta-2line')]
corrected = []
for s in original:
  # here we put the requested ID format
  # note, that the FASTA ID usually do not contain spaces
  s.id = '{}|{}'.format(new_list[s.id], s.id)
  
  # BioPython sometimes adds IDs also here (and in some cases also to "s.name")
  s.description = ''
  
  corrected.append(s)

SeqIO.write(corrected, 'allmergedrep-seqsf2.fasta', 'fasta-2line')