Enrich a standard database

Steph_Hp · November 20, 2021, 12:56am

Hello everyone,
Thanks for taking the time to help me with this issue!
I'm working with data from ITS barcoding and wanted to use the UNITE database, which worked without problems. But I want to identify a specific group of fungi. Some of their sequences I can download from the NCBI database, and others I already have from previous work; they are not that common in standard databases such as UNITE.
So I'd like to somehow enrich the UNITE database by adding these new sequences, then train my classifier on it and assign taxonomy again. Is that possible, and if so, how can I do it?

colinbrislawn · November 20, 2021, 3:48am

Hello Stephanie,

Sounds like a cool project and a great way to build off previous work!

I don't know of an easy way to do this within QIIME 2...

But it is possible!

Here are two ways I would do this:

1. Use a 'mini-database'
   After reporting the results from the existing database, I would run a classifier or alignment against only the taxa of interest from the previous work. For ASVs that matched this 'mini-database,' I would offer this as an 'alternative' or 'improved' classification. During the discussion, I would talk about these ASVs using their new classification.
2. Build a new database
   I like your description of an 'enriched' database. The only way I know to build this is to add these sequences at the very start of the database building process... which ends with a new database. Fortunately, we have the RESCRIPt plugin for building databases and an awesome tutorial!

Databases are hard! I'm also interested in an elegant way to enrich a database like you described!

If you want to try one of those options, we would be happy to help!

Nicholas_Bokulich · November 20, 2021, 9:39am

Hi @Steph_Hp ,

Out of curiosity, what is the fungal group you are interested in?

@colinbrislawn took the words out of my mouth! I just want to add some more details to this comment:

Databases really are hard! The main reason why merging two databases is hard is that they might use different taxonomic nomenclatures/formats. So you cannot simply glue them together and expect them to work.

Gluing them together is the easy part... RESCRIPt can be used to query sequences from NCBI; q2-feature-table has a method for merging sequences (e.g., to merge UNITE and NCBI seqs); and q2-feature-table and RESCRIPt both have methods for merging taxa. But the taxonomy formats would not be compatible, so merging them as-is would lead to very strange results downstream with some taxonomy classification methods. RESCRIPt also has a method for editing taxonomies, so you could use this to re-format the NCBI taxonomy strings to be compatible with UNITE, but this might be complicated depending on how many different sequences you plan to add...
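To make the format problem concrete, here is a minimal Python sketch of the kind of re-formatting involved. It is an illustration only, not RESCRIPt's method. It assumes the target is a UNITE-style string with rank prefixes (`k__`, `p__`, ...) joined by semicolons and underscores inside species names, and that the NCBI lineage is already available as a rank-to-name mapping. The function name `to_unite_style` and the `unidentified` placeholder for missing ranks are assumptions made for this example.

```python
# Ranks and prefixes, in the order assumed for UNITE-style taxonomy strings.
RANK_PREFIXES = [
    ("kingdom", "k__"),
    ("phylum", "p__"),
    ("class", "c__"),
    ("order", "o__"),
    ("family", "f__"),
    ("genus", "g__"),
    ("species", "s__"),
]


def to_unite_style(lineage):
    """Build a 7-rank, semicolon-delimited string from a rank -> name mapping."""
    parts = []
    for rank, prefix in RANK_PREFIXES:
        name = lineage.get(rank, "").strip()
        # Assumption: missing ranks become 'unidentified'; spaces become underscores.
        label = name.replace(" ", "_") if name else "unidentified"
        parts.append(prefix + label)
    return ";".join(parts)


# Hypothetical lineage for one downloaded record.
ncbi_lineage = {
    "kingdom": "Fungi",
    "phylum": "Basidiomycota",
    "class": "Agaricomycetes",
    "order": "Gomphales",
    "family": "Gomphaceae",
    "genus": "Ramaria",
    "species": "Ramaria stricta",
}
print(to_unite_style(ncbi_lineage))
```

Before merging, it is worth comparing a few converted strings against real entries from the UNITE release in use, since the exact labels for unassigned ranks may differ.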

So @colinbrislawn 's suggestion of a "mini-database" for 2-step classification (first UNITE, then reclassification of the clade of interest using an NCBI mini-database) is probably the easiest and most transparent approach!
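The 2-step idea can be sketched in plain Python once both classification results are exported as feature ID to taxonomy string tables. This is a rough illustration under stated assumptions, not a QIIME 2 feature: the taxonomy strings are assumed to use UNITE-style rank prefixes, `unidentified` is assumed to mark unassigned ranks, and the names `deepest_rank` and `two_step` are made up for this example.

```python
def deepest_rank(taxon):
    """Return the prefix letter of the deepest identified rank, or None."""
    deepest = None
    for part in taxon.split(";"):
        prefix, _, name = part.strip().partition("__")
        if name and name != "unidentified":
            deepest = prefix
    return deepest


def two_step(primary, secondary, clade="o__Gomphales"):
    """Keep primary calls, but use the secondary call for features in
    `clade` that the primary database could not resolve below order."""
    result = {}
    for feature_id, taxon in primary.items():
        ranks = [part.strip() for part in taxon.split(";")]
        stuck_at_order = clade in ranks and deepest_rank(taxon) == "o"
        alternative = secondary.get(feature_id)
        if stuck_at_order and alternative is not None:
            result[feature_id] = (alternative, "mini-database")
        else:
            result[feature_id] = (taxon, "UNITE")
    return result


# Hypothetical classification results keyed by feature ID.
unite_calls = {
    "asv1": "k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Gomphales",
    "asv2": "k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Fusarium",
}
mini_db_calls = {
    "asv1": "k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Gomphales;f__Gomphaceae;g__Ramaria;s__Ramaria_stricta",
}
for feature_id, (taxon, source) in two_step(unite_calls, mini_db_calls).items():
    print(feature_id, source, taxon)
```

Recording the source next to each call keeps the reclassification transparent, so the original UNITE result can still be reported alongside the alternative.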

Steph_Hp · November 20, 2021, 6:41pm

Thanks, Colin, for your nice response! You guys are always so quick and nice in this forum!
Option 1, the 'mini-database', is the easier way, I think. Still, I'd prefer to try the RESCRIPt plugin first, and if for some reason I can't or it is too hard for me, I'll continue with option 1.
So, is it possible to use RESCRIPt to enrich the UNITE database? By this I mean adding these sequences when importing the database.

Steph_Hp · November 20, 2021, 6:47pm

Hi Nicholas! Thanks also for your response. I see, so the main issue could be the format of the taxonomy... I'll see if I can do something about it. I think I once saw a tutorial by Devon where he used the BOLD and NCBI databases for COI taxonomy assignment. Maybe I could re-check this. But I guess the easy way, the 'mini-database', is a reasonable option.
I want to identify some fungi of the order Gomphales, which are highly abundant at my study site. But with the standard databases I can't manage to identify them!

Thank you guys for your suggestions. If you have other ideas, I'll be more than happy to hear them!

Nicholas_Bokulich · November 21, 2021, 9:51am

Are there specific groups, or the entire order? I see a couple hundred Gomphales in the UNITE database, so I am guessing that it's certain groups? Or is the issue just that UNITE cannot distinguish Gomphales? (in which case the issue could just be that the ITS is not variable enough within this group to identify species)

If you are interested in adding specific groups, you could:

1. Use RESCRIPt to download and format seq/taxonomy files for those specific groups from NCBI. This would take a complex query, but you can use the RESCRIPt tutorial on this forum as a starting point. Check the query directly on GenBank before attempting to download with RESCRIPt, to make sure that you are narrowing the query to the right group/gene/etc.
2. Use RESCRIPt to filter by length/etc. if necessary.
3. Use RESCRIPt to edit the taxonomy artifact (e.g., to adjust the labels if necessary so that the NCBI taxonomy strings match UNITE and the lineage labels are consistent with other Gomphales).
4. Merge the seqs and taxonomies from UNITE and this mini-NCBI subset.
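As a rough illustration of the length-filtering step, here is a small standalone Python sketch. It is not RESCRIPt itself; it only shows the logic on FASTA text, and the function names, record IDs, and length thresholds are all made up for this example. Suitable thresholds depend on the marker and would need to be chosen for the actual data.

```python
def parse_fasta(text):
    """Parse FASTA text into an ID -> sequence mapping (ID = first header word)."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            records[seq_id] = ""
        elif seq_id is not None:
            records[seq_id] += line.upper()
    return records


def filter_by_length(records, min_len, max_len):
    """Split records into those inside and outside the inclusive length range."""
    kept = {i: s for i, s in records.items() if min_len <= len(s) <= max_len}
    discarded = {i: s for i, s in records.items() if i not in kept}
    return kept, discarded


# Hypothetical toy records; real ITS sequences are far longer.
fasta_text = """>seqA example record
ACGT
ACGT
>seqB
AC
>seqC
ACGTACGTACGT
"""
kept, discarded = filter_by_length(parse_fasta(fasta_text), min_len=4, max_len=10)
print("kept:", sorted(kept))
print("discarded:", sorted(discarded))
```

Keeping the discarded records makes it easy to check that the thresholds are not removing genuine sequences from the group of interest.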

But as said above, it might be easiest to just follow step 1 and use that mini-database for reclassification of Gomphales that you detect with UNITE but are unable to identify below order level.

Either way, a challenging task! Good luck!

Steph_Hp · November 22, 2021, 4:44pm

Hello Nicholas,

Thanks again for your answers.
Actually, my sequences came from metagenomic data (i.e., I sequenced metagenomic eDNA). I'm testing various approaches to assign taxonomy, such as Kraken, FindFungi, MetaPhlAn and others. But recently I found this paper: "Metagenomic data reveal diverse fungal and algal communities associated with the lichen symbiosis" (Symbiosis). The authors mapped their sequence data to a reference and then processed the results in QIIME 2. I found this approach interesting, since I couldn't identify the order I mentioned before.
I also have data from species found at my study site, which is why I want to try this with RESCRIPt.
So, I understand all the points you suggested. But how do I merge the sequences and taxonomies? With cat or bash in the terminal, or is there another plugin in QIIME 2 that I can use?

Thanks again!

Nicholas_Bokulich · November 23, 2021, 4:07pm

Yes, see above — RESCRIPt can merge the taxonomies (FeatureData[Taxonomy] artifacts), and q2-feature-table can merge the sequence artifacts (FeatureData[Sequence])

You could also merge with cat before importing, but I would recommend the other way, since then your steps are preserved in QIIME 2 provenance, which makes it more transparent and possible to trace the steps that you used prior to classification.
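Whichever route is used, two things are worth checking that a plain cat would not catch: duplicate IDs with conflicting content, and sequences without a matching taxonomy entry. The Python sketch below illustrates those checks on simple ID-keyed tables. It is not how QIIME 2 merges artifacts, and the function names and record IDs are assumptions made for this example.

```python
def merge_reference(base, extra):
    """Merge two ID -> value mappings; fail if an ID maps to different values."""
    merged = dict(base)
    for key, value in extra.items():
        if key in merged and merged[key] != value:
            raise ValueError(f"conflicting entries for ID {key!r}")
        merged[key] = value
    return merged


def check_consistent(seqs, taxa):
    """Return IDs lacking a taxonomy entry and IDs lacking a sequence."""
    missing_taxa = sorted(set(seqs) - set(taxa))
    missing_seqs = sorted(set(taxa) - set(seqs))
    return missing_taxa, missing_seqs


# Hypothetical toy tables standing in for the UNITE and NCBI subsets.
unite_seqs = {"ref1": "ACGTACGT"}
unite_taxa = {"ref1": "k__Fungi;p__Ascomycota"}
ncbi_seqs = {"new1": "GGCCGGCC", "new2": "TTAATTAA"}
ncbi_taxa = {"new1": "k__Fungi;p__Basidiomycota"}

all_seqs = merge_reference(unite_seqs, ncbi_seqs)
all_taxa = merge_reference(unite_taxa, ncbi_taxa)
missing_taxa, missing_seqs = check_consistent(all_seqs, all_taxa)
print("sequences without taxonomy:", missing_taxa)
print("taxonomy without sequences:", missing_seqs)
```

In this toy case `new2` has a sequence but no taxonomy, which is the kind of mismatch to resolve before training a classifier on the merged reference.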

Good luck!

Steph_Hp · December 14, 2021, 7:28pm

Thanks so much, Nicholas, for your nice and clear answers!
