Create NCBI database for identification with ITS region

AliciaD · April 30, 2021, 2:45pm

Hello everyone,

I am looking for a way to create taxonomy and reference-sequence files for a list of about 1900 plants (I only have the species names for the moment). I first tried the PLANiTS and UNITE reference databases, but less than 50% of my plant list is present in them.
These files will be used for the taxonomic classification (with a Naive Bayes classifier) of my sample. The primers target the ITS region (more precisely ITS1-u1 / ITS1-u2) and my sample contains at least fungi and plant species.

I found the RESCRIPt pipeline to get the NCBI database but I am not sure how to specify in the p-query what I am looking for.

The command I used is the following (I only show here a short part of the query I wrote):
qiime rescript get-ncbi-data \
--p-query 'Blepharis ciliaris [ORGN] Nucleotide OR
Acorus calamus [ORGN] Nucleotide OR
Sambucus nigra [ORGN] Nucleotide OR
Viburnum lantana [ORGN] Nucleotide OR
Scilla siberica [ORGN] Nucleotide' \
--o-sequences ncbi-refseqs-unfiltered_part1.qza \
--o-taxonomy ncbi-refseqs-taxonomy-unfiltered_part1.qza

But the resulting taxonomy is completely wrong (I compared the taxonomy for the same feature ID I had with the UNITE and PLANiTS databases and it did not match).

I am using QIIME 2 version 2021.2 (installed in a conda environment).

Thank you for reading,

Alicia.

SoilRotifer · April 30, 2021, 4:26pm

Can you provide examples?

Also, not all databases necessarily follow the same taxonomic or nomenclatural rules. I would expect there to be differences. For more details see:

I'd also suggest reading: NCBI Taxonomy: A Comprehensive Update on Curation, Resources and Tools.

-Mike

SoilRotifer · April 30, 2021, 4:37pm

I forgot to mention that you can try setting up more generic queries, like:

Streptophyta:
--p-query "txid35493[ORGN] AND (ITS OR Internal Transcribed Spacer) NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]"

or

Embryophyta:
--p-query "txid3193[ORGN] AND (ITS OR Internal Transcribed Spacer) NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]"

etc...

More examples here:

-Mike

AliciaD · May 5, 2021, 8:00pm

Thank you for your help but I am still stuck with this command.

I can't provide examples because I have already deleted the files. What I did was take species that appeared in the taxonomy results obtained with the UNITE dataset (and the naive Bayes classification), download the NCBI data for those species, and run the same classification to see whether the same species were assigned to the same feature IDs. They were not assigned to those feature IDs at all, so I concluded that I might not have the right sequences for the taxonomy assignment.

I tried the generic query with a list of species in place of the txid in your query, but I have about 1900 species to identify and the command doesn't accept long queries, so it might take a long time (I can only put a few species in each query).
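One way to cope with a query-length limit is to split the species list into batches and generate one query per batch. The following is only a minimal sketch: the function name `batch_queries`, the file name `species.txt`, and the batch size of 50 are assumptions for illustration, and the actual maximum query length accepted by NCBI is not stated in this thread.

```shell
# batch_queries N: read species names (one per line) on stdin and print
# one Entrez query per batch of N names, restricted to ITS records.
# The function name and the batch size are illustrative assumptions.
batch_queries() {
    awk -v n="$1" '
        NF == 0 { next }    # skip blank lines
        {
            term = $0 "[ORGN]"
            # start a new batch every n names, otherwise extend the current one
            q = (count % n == 0) ? term : q " OR " term
            count++
            if (count % n == 0)
                print "(" q ") AND (ITS OR Internal Transcribed Spacer)"
        }
        END {
            # flush the last, incomplete batch
            if (count % n != 0)
                print "(" q ") AND (ITS OR Internal Transcribed Spacer)"
        }
    '
}

# Hypothetical usage: one get-ncbi-data call per batch (file names are assumptions).
# i=0
# batch_queries 50 < species.txt | while IFS= read -r q; do
#     i=$((i + 1))
#     qiime rescript get-ncbi-data \
#         --p-query "$q" \
#         --o-sequences "part${i}-seqs.qza" \
#         --o-taxonomy "part${i}-tax.qza"
# done
```

The "NOT ...[Title]" filters from the generic queries above can be appended to the printed string in the same way.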

So I tried this command, putting the taxon IDs of my species in a txt file (I just chose 6 taxon IDs from my list to test it). Here is the command:

qiime rescript get-ncbi-data \
--p-query 'txid3193[ORGN] AND (ITS OR Internal Transcribed Spacer) NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]' \
--m-accession-ids-file txid_accessions.txt \
--p-n-jobs 5 \
--o-sequences ncbi-refseqs-unfiltered_all.qza \
--o-taxonomy ncbi-refseqs-taxonomy-unfiltered_all.qza

And this error appears :

"Plugin error from rescript:

Partial download. Expected 6 records, but got 1.
The following ids were missing: 100290, 100277, 100279, 100302, 1000418, 10028"

Does this mean that the data does not exist on NCBI, or that there is still a problem with the query? (Or maybe I have to replace the taxon IDs with accession IDs?)

Best regards,

Alicia

SoilRotifer · May 5, 2021, 9:40pm

The problem is with the command: note that the flag is labeled --m-accession-ids-file, so it only takes GenBank accessions, not taxonomy IDs.

You can update the query as follows:

--p-query '(txid3193[ORGN] OR txid100290[ORGN] OR txid100277[ORGN] OR txid100279[ORGN] OR txid100302[ORGN] OR txid1000418[ORGN]) AND (ITS OR Internal Transcribed Spacer) NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]'
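For a longer list, the parenthesised txid clause of such a query could be generated from a file rather than typed by hand. This is a small sketch under assumptions: the function name `taxid_clause` and the file name `taxids.txt` are hypothetical, and the input is expected to hold one NCBI taxonomy ID per line.

```shell
# taxid_clause: read NCBI taxonomy IDs (one per line) on stdin and print
# them as "(txidA[ORGN] OR txidB[ORGN] ...)".
# The function name is an illustrative assumption.
taxid_clause() {
    ids=$(sed -e '/^[[:space:]]*$/d' -e 's/.*/txid&[ORGN]/' \
          | paste -s -d '|' - \
          | sed 's/|/ OR /g')
    printf '(%s)\n' "$ids"
}

# Hypothetical usage (taxids.txt is an assumed file name):
# query="$(taxid_clause < taxids.txt) AND (ITS OR Internal Transcribed Spacer)"
# qiime rescript get-ncbi-data --p-query "$query" \
#     --o-sequences ncbi-seqs.qza --o-taxonomy ncbi-tax.qza
```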

Or you can simply grab all of the Magnoliopsida by just using txid3398[ORGN]. Also note that txid10028[ORGN] is a chordate.

So, why not download all the main plant groups by using the txids of the groups rather than the txid of each species? In my initial suggestion, txid3193[ORGN] covers all of Embryophyta; you can also simply grab all of the Viridiplantae with txid33090[ORGN]. Or break these up into separate queries by broad taxonomic group and then merge the outputs using the merge-taxa and merge-seqs commands.
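The merge step might look like the sketch below. It assumes that merge-seqs and merge-taxa are the actions of the `feature-table` plugin and that each group was saved as PREFIX-seqs.qza and PREFIX-tax.qza; the function name `merge_refs` and that naming scheme are hypothetical.

```shell
# merge_refs OUT_PREFIX PREFIX1 PREFIX2 ...
# Merge per-group sequence and taxonomy artifacts into one reference set.
# Assumes (hypothetically) that each group is stored as
# PREFIX-seqs.qza and PREFIX-tax.qza.
merge_refs() {
    out="$1"; shift
    seq_args=""
    tax_args=""
    for p in "$@"; do
        seq_args="$seq_args --i-data ${p}-seqs.qza"
        tax_args="$tax_args --i-data ${p}-tax.qza"
    done
    # $seq_args and $tax_args are left unquoted on purpose so that they
    # split into separate arguments (prefixes must not contain spaces).
    qiime feature-table merge-seqs $seq_args \
        --o-merged-data "${out}-seqs.qza" || return 1
    qiime feature-table merge-taxa $tax_args \
        --o-merged-data "${out}-tax.qza"
}

# Hypothetical usage:
# merge_refs all-plants embryophyta chlorophyta
```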

AliciaD · May 5, 2021, 10:46pm

Thank you for your help !

I wanted to test this command first to see what results I could get and also to familiarize myself with QIIME 2 and RESCRIPt (I am new to DNA metabarcoding analysis).
Also, I had to wait until 9pm (US time) to run the command with all the plants as the tutorial asks.

Thank you again

Alicia

SoilRotifer · May 6, 2021, 2:34pm

Great. Keep us posted!

AliciaD · May 12, 2021, 8:49am

I have created my own taxonomy with the Viridiplantae txid. As my sample contains both fungal and plant species, I used an NCBI BioProject to retrieve the ITS sequences of the fungi instead of using the UNITE database, and then merged the plant and fungal databases. The classification with the Naive Bayes classifier gives good results.
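For readers following along, the last step (training a Naive Bayes classifier on a merged reference set and classifying the sample) could be sketched as follows. These are not the exact commands used here; the function name `train_and_classify` and all file names are assumptions, and any filtering or dereplication of the reference is left out.

```shell
# train_and_classify REF_SEQS REF_TAX QUERY_SEQS OUT_PREFIX
# Train a Naive Bayes classifier on the merged reference data, then
# classify the sample's representative sequences.
# The function name and output naming are illustrative assumptions.
train_and_classify() {
    qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads "$1" \
        --i-reference-taxonomy "$2" \
        --o-classifier "$4-classifier.qza" || return 1
    qiime feature-classifier classify-sklearn \
        --i-classifier "$4-classifier.qza" \
        --i-reads "$3" \
        --o-classification "$4-taxonomy.qza"
}

# Hypothetical usage:
# train_and_classify merged-seqs.qza merged-tax.qza rep-seqs.qza its
```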

Thank you for your help,

Alicia D

Nicholas_Bokulich · May 12, 2021, 11:09am

Glad you got it working @AliciaD !

Would you mind sharing your commands here (including merging the plant and fungal databases)? Then other forum readers could follow in your footsteps later on.

I have been planning to put together an NCBI ITS tutorial using RESCRIPt and post it on this forum at some point... you are very welcome to write this too if you are interested.

AliciaD · May 17, 2021, 8:53pm

Here are my commands : NCBI_custom_database_creation.txt (3.5 KB)

I hope it will be useful to others