Normalize ASV by 16S copy number

Hi all,
I recently wrote a Python script that normalizes ASVs by 16S rRNA gene copy number. I hope it works well, and suggestions are welcome!

Introduction:

Normalize sequences by 16S rRNA gene copy number (GCN) based on the rrnDB database (version 5.6). The script matches the taxonomy of each sequence against the rrnDB-5.6_pantaxa_stats_NCBI.tsv file, starting from the lowest rank. If a match is found, the mean GCN for that taxon is assigned; if not, the script tries the next higher rank, and so on until the highest rank is reached. All unassigned sequences are assumed to have a GCN of one.
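The rank fallback described above can be sketched roughly as follows. This is only an illustration, not the script's actual code: the MEAN_GCN dictionary, its values, and the assign_gcn helper are hypothetical stand-ins for a lookup built from the rrnDB pan-taxa file.

```python
# Hypothetical taxon name -> mean GCN lookup. In practice this would be
# built from the rrnDB pan-taxa TSV; the numbers here are invented.
MEAN_GCN = {
    "Bacteria": 2.0,
    "Firmicutes": 6.0,
    "Bacillus": 10.0,
}


def assign_gcn(lineage, mean_gcn, default=1.0):
    """Return (gcn, matched_taxon) for a lineage ordered from highest to lowest rank.

    The lineage is walked from the lowest rank upward, and the first taxon
    found in the lookup wins. If nothing matches, the default of 1 is used.
    """
    for name in reversed(lineage):
        if name in mean_gcn:
            return mean_gcn[name], name
    return default, None


# The genus matches, so its mean is used.
print(assign_gcn(["Bacteria", "Firmicutes", "Bacilli", "Bacillus"], MEAN_GCN))
# The class "Bacilli" is not in the lookup, so the phylum is used instead.
print(assign_gcn(["Bacteria", "Firmicutes", "Bacilli"], MEAN_GCN))
# Nothing matches, so the sequence falls back to a GCN of one.
print(assign_gcn(["Unassigned"], MEAN_GCN))
```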

Note that the mean column in rrnDB-5.6_pantaxa_stats_NCBI.tsv is, according to the rrnDB manual, calculated from the means of the pan-taxa of the immediately lower rank. Therefore, the mean GCN might differ from the rrnDB online search result. For example, the “mean” GCN for Bacteria is 2.02 in the downloaded TSV file, whereas the mean GCN for all bacterial taxa is 5.0 if you search the rrnDB online database.
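To see why a mean of lower-rank means can differ from a mean over all records, here is a small arithmetic sketch. The taxa and copy numbers are invented for illustration, and it is an assumption that the online search result behaves like a pooled mean.

```python
from statistics import mean

# Invented copy numbers for two hypothetical child taxa of one parent taxon.
children = {
    "taxon_A": [1.0, 1.0],                       # 2 genomes, 1 copy each
    "taxon_B": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],   # 6 genomes, 7 copies each
}

# Mean of the child means: every child taxon counts equally.
mean_of_means = mean(mean(copies) for copies in children.values())

# Pooled mean: every genome counts equally, so well-sampled taxa dominate.
pooled_mean = mean(x for copies in children.values() for x in copies)

print(mean_of_means)  # 4.0
print(pooled_mean)    # 5.5
```

When a few high-copy taxa are heavily sampled, the pooled value is pulled upward while the mean of means is not.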

Setting path:

First, clone the repository using the command below:

git clone https://github.com/Jiung-Wen/16S_copy_num_normalize.git

To set up the path, open .bash_profile (macOS) or .bashrc (Linux) with any text editor you prefer:

vim ~/.bash_profile

In .bash_profile, append the following line:

export PATH="/YOUR_PATH/16S_copy_num_normalize/copy_num_normalize/:$PATH"

Save and close the file. Restart your terminal, or use the command below to apply the change immediately:

source ~/.bash_profile

Usage:

We assume that you have installed and activated a QIIME2 environment.

copy_num_normalize.py --table table.qza --taxonomy taxonomy.qza -d silva -o output_file_name
  • --table PATH - path of QIIME2 artifact FeatureTable[Frequency]
  • --taxonomy PATH - path of QIIME2 artifact FeatureData[Taxonomy]
  • -d STRING - database used for sequence annotation {silva, greengenes}
  • -o PATH - path of output directory and file name (If no directory is given, output files are saved to the current directory.)
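For readers curious how options like these are typically handled, below is a minimal argparse sketch. It is not taken from the script itself; the build_parser function, the dest names, and the choice to make every option required are assumptions for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the four options listed above (hypothetical)."""
    parser = argparse.ArgumentParser(prog="copy_num_normalize.py")
    parser.add_argument("--table", required=True,
                        help="path of FeatureTable[Frequency] artifact")
    parser.add_argument("--taxonomy", required=True,
                        help="path of FeatureData[Taxonomy] artifact")
    # Restricting choices makes a typo in the database name fail early.
    parser.add_argument("-d", dest="database", required=True,
                        choices=["silva", "greengenes"],
                        help="database used for sequence annotation")
    parser.add_argument("-o", dest="output", required=True,
                        help="output directory and file name prefix")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is
# runnable anywhere.
args = build_parser().parse_args(
    ["--table", "table.qza", "--taxonomy", "taxonomy.qza",
     "-d", "silva", "-o", "output_file_name"]
)
print(args.table, args.taxonomy, args.database, args.output)
```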

Running example:

We use artifacts from QIIME2’s “Moving Pictures” tutorial as test files. Use the following commands to download the files.

# DADA2 output artifact:
wget https://docs.qiime2.org/2019.10/data/tutorials/moving-pictures/table-dada2.qza

# Taxonomic analysis output artifact:
wget https://docs.qiime2.org/2019.10/data/tutorials/moving-pictures/taxonomy.qza

We can normalize the FeatureTable using the command below:

copy_num_normalize.py --table table-dada2.qza \
  --taxonomy taxonomy.qza \
  -d greengenes \
  -o table-dada2

The outputs are a GCN-normalized artifact, table-dada2_copy_number_normalized.qza, of type FeatureTable[Frequency], and a text file, table-dada2_16S_rRNA_copy_number.txt, that lists the GCN assigned to each sequence.
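Conceptually, the normalization amounts to dividing each sequence's counts by its assigned GCN. The sketch below uses plain dictionaries rather than QIIME2 artifacts; normalize_counts and the example data are hypothetical, and whether the actual script rounds or rescales the result afterwards is not covered here.

```python
def normalize_counts(counts, gcn, default=1.0):
    """Divide each feature's per-sample counts by that feature's copy number.

    counts maps feature ID -> {sample ID -> count}; gcn maps feature ID -> GCN.
    Features missing from gcn fall back to the default of one copy.
    """
    return {
        feature: {
            sample: n / gcn.get(feature, default)
            for sample, n in per_sample.items()
        }
        for feature, per_sample in counts.items()
    }


# Invented example: asv1 has 4 copies, asv2 has no assigned GCN.
counts = {
    "asv1": {"S1": 100, "S2": 40},
    "asv2": {"S1": 30, "S2": 0},
}
gcn = {"asv1": 4.0}

normalized = normalize_counts(counts, gcn)
print(normalized["asv1"])  # {'S1': 25.0, 'S2': 10.0}
print(normalized["asv2"])  # {'S1': 30.0, 'S2': 0.0}
```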

Now you can perform analyses as you usually do in QIIME2 with the GCN-normalized FeatureTable.

Thanks for sharing this @jwchen! And thanks for putting together such a nice tutorial here and on the GitHub page… I love the figures you posted there, the “before” and “after” views are great! And I also appreciate that you posted a link to some discussion of the topic on this forum :smile:

Are you interested in taking this to the next step and writing your own QIIME 2 plugin? We can help point you in the right direction.

It would be more work but there are a few benefits, the main benefits being:

  1. Provenance would be preserved. This is really important for an intermediate conversion step like copy number normalization, since it will be used before running other downstream steps. As it stands, provenance will be broken by this script (unless you are doing something fancy that I did not see!), so someone using this will only see what happened after GCN normalization.
  2. Users could access this from any of the QIIME 2 interfaces, so they would not be limited to running it on the command line.
  3. You can let QIIME 2 do some of the heavy lifting of format validation of inputs, etc.

Let me know if you are interested!

Hi @Nicholas_Bokulich, I am excited about making a new plugin! Actually, I was worried that users might accidentally normalize their data multiple times and get the wrong result, because the script is so far not type-aware. I think that with help from you guys, I can write the plugin and make a contribution to the QIIME 2 community!

Hi @jwchen,

That’s awesome! In case you haven’t seen it, we have some developer documentation here:
https://dev.qiime2.org/latest/

And of course, ask any questions you may have on the forum; we can help!
