creating FeatureData[Sequence] to build a phylogenetic tree

kam · August 15, 2023, 1:41pm

Hi!

I am working on a dataset for which I received processed tables (after DADA2 and taxonomic annotation) from a collaborator: an ASV feature table and a taxa table from phyloseq. Both were imported into QIIME 2 fairly easily.

asv_table_q2 = q2.Artifact.import_data(type="FeatureTable[Frequency]", view=asv_table.T)

taxa_master = taxa_master.fillna("")
taxa_master["Taxon"] = taxa_master.apply(
    lambda x: f"k__{x['Kingdom']};p__{x['Phylum']};c__{x['Class']};"
              f"o__{x['Order']};f__{x['Family']};g__{x['Genus']};s__{x['Species']}",
    axis=1)
taxa_master = taxa_master["Taxon"]
taxa_master = taxa_master.rename_axis("Feature ID")
taxa_master_q2 = q2.Artifact.import_data("FeatureData[Taxonomy]", taxa_master)

In addition, I want to create a FeatureData[Sequence] artifact, derived from taxa_master, to build a phylogenetic tree. The taxa_master table has an index holding the 16S sequences from DADA2 and a column with taxonomic information (unnecessary for the tree).
I have tried to create a table in which both the index and a column hold the sequence, as follows:

sequence_table = pd.DataFrame(taxa_master).copy()
sequence_table["Sequence"] = sequence_table.index
sequence_table = sequence_table.drop("Taxon", axis=1)

I then tried to create a FeatureData[Sequence] artifact from either the DataFrame or a TSV file, without success.
This is the error from using the dataframe:

No transformation from <class 'pandas.core.frame.DataFrame'> to <class 'q2_types.feature_data._format.DNASequencesDirectoryFormat'>

And the error for using the tsv file:

First line of file is not a valid description. Descriptions must start with '>'

What would be a workaround here to create the desired artifact?

Thanks!!

SoilRotifer · August 15, 2023, 2:13pm

Hi @kam,

I might be missing something here, but you need the actual ASV sequences in order to construct a phylogeny. If I remember correctly, phyloseq objects do not store sequence information, only the feature table and the taxonomy. Sequences cannot be derived from these data. You'll need to ask your collaborator for the sequence data output from DADA2.

-Mike

kam · August 15, 2023, 2:32pm

Hi Mike,

In my case, I do have the actual ASV sequences, both in the feature table (each feature is an ASV sequence) and in taxa_master (each sequence, which is the index of the dataframe, corresponds to a taxonomic annotation).

My problem is working out what kind of data object I need to create in order to import it as FeatureData[Sequence].

SoilRotifer · August 15, 2023, 3:40pm

Oh, I see. I forgot that the feature labels in R / DADA2 are in fact the sequences themselves, and not the hashes. So, you'd just need to extract those IDs from the dataframe and write out a standard FASTA formatted file. Then you can import like so:

qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-format 'DNAFASTAFormat' \
    --input-path  my-seqs.fasta \
    --output-path my-seqs.qza
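As a rough sketch of the "write out a standard FASTA file" step, something like the following could work. It assumes the feature IDs are the sequences themselves, so each record reuses the sequence as its own header line. The helper name `to_fasta` and the toy sequences are made up for illustration; in practice you would pass in something like `taxa_master.index`.

```python
from typing import Iterable


def to_fasta(sequences: Iterable[str]) -> str:
    """Return FASTA text in which each sequence is also its own record ID."""
    records = []
    for seq in sequences:
        seq = str(seq).strip()
        if seq:  # skip empty labels
            # A FASTA record is a '>' header line followed by the sequence.
            records.append(f">{seq}\n{seq}\n")
    return "".join(records)


if __name__ == "__main__":
    # Toy stand-in for taxa_master.index (hypothetical sequences).
    example_ids = ["ACGTACGT", "GGCCTTAA"]
    with open("my-seqs.fasta", "w") as handle:
        handle.write(to_fasta(example_ids))
```

The resulting my-seqs.fasta can then be passed to the import command above. Note that using full sequences as headers keeps the IDs consistent with the feature table and taxonomy, at the cost of very long tip labels in the tree.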

kam · August 15, 2023, 4:24pm

Thanks!
Is this the only way? I was kinda hoping to avoid having to create a FASTA file.

SoilRotifer · August 16, 2023, 4:36pm

Well, no matter what you do, the data will need to be converted to FASTA format at some stage to be fed to the phylogeny tools. So even if you do not do this explicitly, the code will do it behind the scenes anyway, which means you might as well write out and save the FASTA file to make it easier to reuse.

If you look at the q2-types code, for example here, you'll see that there is a way to convert a pandas Series of sequence data into FASTA format. Once you have done that, you can use the resulting artifact directly in your code.

Below is a simple example to help get you started. We'll assume you've stored your feature-ids and sequences as a pandas series. Just to provide a reproducible example, I'll import a TSV file where the first column is my feature-id and the second column is my sequence data. In your case, both will be the sequences themselves.

import pandas as pd
import qiime2 as q2
from q2_types.feature_data import DNAFASTAFormat
import qiime2.plugins.phylogeny.actions as phylogeny_actions

Read in tsv file:

seqs = pd.read_table('dna-sequences.tsv',
                     sep='\t',
                     header=None,
                     index_col=0)

set index name

seqs.index.name = 'feature-id'

make into series

ss = seqs[1]

import as an artifact:

ss_fasta = q2.Artifact.import_data('FeatureData[Sequence]', ss)

run de novo phylogeny pipeline

Or fragment insertion. We'll use de novo here.

phylo_results = phylogeny_actions.align_to_tree_mafft_fasttree(sequences=ss_fasta)

see what is contained within the results

See the tutorial and select Python API from the drop-down menu.

phylo_results

access specific results:

phylo_results.rooted_tree

save phylogeny to file:

phylo_results.rooted_tree.save('rooted_tree.qza')
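To adapt this to the original question, where the sequences are the index of taxa_master, a minimal sketch could look like the one below. The error quoted earlier in the thread reports no transformation from a DataFrame, while the example above imports a pandas Series, so the sketch builds a Series in which both the index and the values are the sequences. The helper name `index_to_sequence_series` and the toy data are assumptions for illustration, and the QIIME 2 call is left as a comment so the snippet runs with pandas alone.

```python
import pandas as pd


def index_to_sequence_series(index: pd.Index) -> pd.Series:
    """Build a Series whose index (feature IDs) and values are both the ASV sequences."""
    ids = pd.Index(index.astype(str), name="feature-id")
    return pd.Series(ids.to_numpy(), index=ids, name="Sequence")


# Toy stand-in for taxa_master (hypothetical sequences and taxa).
taxa_master = pd.Series(
    ["k__Bacteria;p__Firmicutes", "k__Bacteria;p__Bacteroidota"],
    index=pd.Index(["ACGTACGT", "GGCCTTAA"], name="Feature ID"),
    name="Taxon",
)

ss = index_to_sequence_series(taxa_master.index)

# With QIIME 2 available, the import step from the example above would then be:
# ss_fasta = q2.Artifact.import_data('FeatureData[Sequence]', ss)
```

From there, the rest of the steps above (building the tree and saving it) should apply unchanged.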

system · September 16, 2023, 10:36pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.