Combo. Adding taxonomy names to hashes (feature IDs) for all the analyses in Qiime2. Python 3.

Dear qiimers!
I would like to share the tweak I used in my analyses to add last available taxon to feature IDs (hashes). It works with Silva annotations, maybe you need to adjust a little bit the code for GreenGenes or other databases. Code written in Python 3.
Example outputs of some Qiime plugins, performed after the tweak with ASV tables (not collapsed to taxonomy level):

Biplot

Feature volatility plot

Ancom test

Table from core features output

To perform the tweak, you need to install Jupyter lab and Biopython inside of activated Qiime2 environment. Also you need to launch Jupyter lab in activated Qiime2 environment.
Tweak can be applied after DADA2 (Deblur) and taxonomy assignment.

First cell in Jupyter lab:

#import
import pandas as pd
from Bio import SeqIO

#input
table    = 'table.qza' # input here the path to your feature table 
taxa     = 'taxonomy.qza' # input here the path to your taxonomy file
rep_seq  = 'rep-seq.qza' # input here the path to your representative sequences

#export rep-seqs.qza, table.qza and taxonomy.qza
!mkdir Biom Taxa Rep-seqs

!qiime tools export \
    --input-path $table \
    --output-path Biom/
!biom convert -i Biom/feature-table.biom -o Biom/feature-table.tsv --to-tsv

!qiime tools export \
    --input-path $taxa \
    --output-path Taxa

!qiime tools export \
    --input-path $rep_seq \
    --output-path Rep-seqs/


#replacing hashes with combination of taxonomy and beginings of the hashes
taxa_df = pd.read_csv('Taxa/taxonomy.tsv', sep='\t', skiprows=[1])
biom_df = pd.read_csv('Biom/feature-table.tsv', sep='\t', skiprows=1)

taxa_df['Taxon'] = pd.DataFrame(taxa_df['Taxon'].astype(str).str.replace(';__','').str.replace('[','').str.replace(']','')
                                .str.replace('.','').str.replace('/','_').str.replace("'",'').str.replace(' ','_')
                                .str.replace('uncultured_bacterium','unc_b').str.replace('uncultured_organism','unc_org')
                                .str.replace('uncultured','unc'))
taxa_df['Combo'] =  taxa_df['Taxon'].str.split("__").str[-1].str.split(";").str[-1]
taxa_df.loc[taxa_df['Combo'].str[:]=='unc_b','Combo']=taxa_df['Taxon'].str.split("__").str[-2].str.split(';').str[0] + '_unc_b'
taxa_df.loc[taxa_df['Combo'].str[:]=='unc_org','Combo']=taxa_df['Taxon'].str.split("__").str[-2].str.split(';').str[0] + '_unc_org'
taxa_df.loc[taxa_df['Combo'].str[:]=='unc','Combo'] = taxa_df['Taxon'].str.split("__").str[-2].str.split(';').str[0] + '_unc'
taxa_df.loc[taxa_df['Combo'].str[:]=='unc_marine','Combo'] = taxa_df['Taxon'].str.split("__").str[-2].str.split(';').str[0] + '_unc_mar'

for n in range(3,6):
    taxa_df.loc[taxa_df['Combo'].str[:]=='unc_b_unc_b','Combo']=taxa_df['Taxon'].str.split("__").str[-n].str.split(';').str[0]+'_unc_b'
    taxa_df.loc[taxa_df['Combo'].str[:]=='unc_unc_b','Combo']=taxa_df['Taxon'].str.split("__").str[-n].str.split(';').str[0] + '_unc_b'
    taxa_df.loc[taxa_df['Combo'].str[:]=='unc_b_unc','Combo']=taxa_df['Taxon'].str.split("__").str[-n].str.split(';').str[0] + '_unc'
    taxa_df.loc[taxa_df['Combo'].str[:]=='unc_unc','Combo']=taxa_df['Taxon'].str.split("__").str[-n].str.split(';').str[0] + '_unc'
    taxa_df.loc[taxa_df['Combo'].str[:]=='unc_unc_mar','Combo']=taxa_df['Taxon'].str.split("__").str[-n].str.split(';').str[0] + '_unc_mar'
    
biom_df['#OTU ID'] = taxa_df['Combo']+'|'+taxa_df['Feature ID']
taxa_df['Feature ID'] = biom_df['#OTU ID']
taxa_df = taxa_df[['Feature ID', 'Taxon', 'Confidence']]

### writing all files back
biom_df.to_csv('Biom/feature-table.tsv', sep='\t', index=False)
taxa_df.to_csv('Taxa/taxonomy.tsv', sep='\t', index=False)
fasta_hash  = r"Rep-seqs/dna-sequences.fasta"
fasta_combo = r"Rep-seqs/dna-sequences.fa"
hlist = biom_df['#OTU ID'].tolist()
with open(fasta_hash) as hashes, open(fasta_combo, 'w') as combo:
    for record in SeqIO.parse(fasta_hash, 'fasta'):
        for h in hlist:
            if str(record.id) in h:
                combo.write('>'+h+'\n'+str(record.seq)+'\n')
!rm $fasta_hash
!mv $fasta_combo $fasta_hash

The code above also will replace “uncultured bacterium” with “unc_b” and so on. All " " also replaced by “_”, as well as some symbols, that causing an error in downstream analysis.

Second cell:

### creating new rep-seqs.qza, table.qza and taxonomy.qza with modified hashes
!biom convert -i Biom/feature-table.tsv -o Biom/feature-table.biom --table-type="OTU table" --to-hdf5
!qiime tools import \
    --input-path Biom/feature-table.biom \
    --type 'FeatureTable[Frequency]' \
    --input-format BIOMV210Format \
    --output-path combo_table.qza
!qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-path Taxa/taxonomy.tsv \
    --output-path combo_taxonomy.qza
!qiime tools import \
    --input-path $fasta_hash \
    --output-path combo_rep-seqs.qza \
    --type 'FeatureData[Sequence]'

!rm -r Biom Taxa Rep-seqs

After it you can create a rooted tree using modified files and proceed with any other analysis in Qiime2, but in feature table you will have last available taxonomy unit added to the feature ID.

Authors: Timur Yergaliyev and Amir Szitenberg.

3 Likes