Building full-length rRNA database using RESCRIPt

TLBR007 · February 8, 2022, 3:33am

Hi,

I was wondering if it is possible to use RESCRIPt to build a full-length rRNA database for QIIME2 analysis. Our lab was able to amplify and sequence the entire bacterial rRNA operon (>4 Kb). As current reference databases (NCBI/Silva) only have either LSU or SSU, they can't take full advantage of the full-length rRNA we generated. Hence we were hoping to build a custom database that contains all full-length rRNA reference sequences from NCBI.

Any assistance would be greatly appreciated.

TLBR007 · February 9, 2022, 6:35am

Hi Mike @SoilRotifer,

Thanks for your quick reply. Due to the database issue I couldn't see your reply here, but luckily there was a system notification email that has your reply .

I have been trying different search strategies with rescript get-ncbi-data since yesterday. Using --p-query '((txid2[ORGN] OR txid2157[ORGN])) AND 16S ribosomal RNA gene, complete sequence[Title] AND 23S ribosomal RNA gene, complete sequence[Title] produced 1675 entries, which was not ideal. Then I tried --p-query 'txid2[orgn] AND "latest refseq"[Properties] AND 224116[BioProject IDs and Accessions], however no entry was identified, leading to an error message saying Plugin error from rescript:Taxonomy format requires at least one row of data. I think this error is due to the fact that rescript get-ncbi-data only searches the NCBI nucleotide database. Could you please let me know if it would be possible for rescript get-ncbi-data to search the NCBI assembly database?

Another option I am exploring is to use an existing fasta file. I have a huge fasta file (~100 Gb) containing all bacterial RefSeq genome sequences. I was wondering if it is possible to use rescript to add taxonomy info to those sequences.

Thanks again for your kind help!

Regards,
Tongda

SoilRotifer · February 9, 2022, 1:54pm

HI @TLBR007,

I guess my reply got sent out before I deleted it. I had realized that you would run into these issues within a few minutes of my post.

The problem I realized, as you've noted, is that the assemblies are not accessible within the nucleotide database. I was going to update my reply to say that you'd have to extract the Accessions from the headers of the downloaded FASTA files. Here is an updated version of my reply, in case others come across this post:

RESCRIPt cannot be directly used for this. Though it should be possible to prepare the data for use in QIIME 2 and RESCRIPt.

I'd recommend pulling microbial genomes from RefSeq. Basically, if you look here you can click on the links that lead to the various clades of Bacterial and Archaeal Assemblies. That is, if you click on the "Assemblies" link for "Acidobacteria", it'll take you here.

Then simply download the FASTA files of all the assemblies for each clade. Then you can merge them all into a single FASTA. Or you can simply use the search term to download all RefSeq Prokaryotic genomes:*

(txid2[orgn] OR txid2157[orgn]) AND "latest_refseq"[Properties]

But this is about 127 GB of data! So, I'd not recommend doing this for fear of a failing internet connection! If you do, I'd highly recommend doing this during NIH "off-hours". Better yet, I'd simply stick with the approach of downloading each clade of genome assemblies individually, and then merge them.

After these are in one big FASTA file you should be able to run qiime feature-classifier extract-reads ... using your primer sequences, to extract the operon region from these genomes. Actually, it might be better to run this command separately on each clade's FASTA file, as it can take a very long time for the extract-reads command to process a single large file. Then again, you can use multiple-cpu-threads for the extraction... Anyway, depending on how the genomes are linerarized this may or may not work.

The main issue is fetching the taxonomy. That is, you'd need to pull the accessions from the FASTA headers of the downloaded assemblies. Which would mean downloading and parsing NCBI's taxonomy on your own somehow....

Perhaps you can use one of the following tools to extract a fixed-rank taxonomy based on the accessions within the assembly FASTA files?

-Mike

TLBR007 · February 14, 2022, 6:29am

Hi @SoilRotifer ,

Thanks for your further explanation. I managed to download all bacterial genomes from NCBI RefSeq during the weekend, and have extracted more than 20000 full length rRNA sequences from those genomes. I agree with you that the actual hard work is to assign taxonomy to those sequences. Taxonkit seems to be really handy (thank you so much for recommending it!). My current plan is to create a separate tsv file that contains the taxonomy info of all those sequences, and then import them to qiime2, as suggested in this post.

Creating a custom reference database - User Support - QIIME 2 Forum

Thanks again for your help.

Tongda