Reference database curation (qiime rescript evaluate-fit-classifier)

Guillermo_U · March 14, 2023, 4:55pm

Hello Qiime2 community!

I have been following the tutorial "Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs" in order to curate my reference database. In principle everything went okay. However, when I tried to train a classifier on my curated database, the code crashed with an error saying that I do not have enough memory. This surprised me, because I am running the analysis on quite a powerful laptop (output of the free -h --giga command):

I am running QIIME 2 on WSL2 and I installed the latest version of it. Is there anything I could do to run the analysis? Or do you recommend using a computing facility?

The code that I have used is the following:
Reference library.py (7.1 KB)
dada2_pipeline.py (790 Bytes)

Since I could not train my classifier, I opted to try alignment-based methods like VSEARCH; however, the results were weird. Most of my rep sequences were not assigned to any taxonomic rank. Why is this happening? For this I used the reference database and the associated taxonomy that I obtained in the tutorial mentioned above. Can both methods be combined? Only 6 out of 643 sequences were assigned to a taxonomic rank. Moreover, those sequences were assigned to ferns, and in my study I expect angiosperms from Africa.

Thank you very much !!!

Guillermo_U · March 14, 2023, 4:56pm

Here are the visualizations for the reference database:
ITS-2-gh-extracted-seq-segments-derep-cull-keep-eval-02.qzv (323.2 KB)
ITS-2-gh-extracted-tax-segments-derep-cull-keep-eval-02.qzv (7.8 MB)

SoilRotifer · March 14, 2023, 9:22pm

You likely do not have enough RAM. Most custom databases that I've trained required upwards of 24-64 GB of RAM.
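
One way to check how much RAM the WSL2 environment actually sees (WSL2 can be configured to give the Linux VM less memory than the host has) is to read /proc/meminfo from inside it. The following is a minimal sketch, assuming a Linux-style /proc/meminfo; the function names are hypothetical and the 24 GB default is simply the lower bound mentioned above.

```python
def parse_meminfo_gb(meminfo_text):
    """Return {field: size in decimal GB} for the kB-valued lines of /proc/meminfo."""
    sizes = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Lines look like "MemTotal:       16314420 kB"; the "kB" unit is 1024 bytes.
        if len(parts) == 2 and parts[1] == "kB":
            sizes[name.strip()] = int(parts[0]) * 1024 / 1e9
    return sizes


def has_enough_ram(meminfo_text, needed_gb=24):
    """True if MemTotal meets the assumed lower bound for training a classifier."""
    return parse_meminfo_gb(meminfo_text).get("MemTotal", 0.0) >= needed_gb
```

Inside WSL2 this could be called as `has_enough_ram(open("/proc/meminfo").read())`.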

...

I noticed that, in your GenBank download command, you are using txid3398[ORGN]. This refers to Magnoliopsida (flowering plants). That is, your database does not contain any ferns. I think you'd want Polypodiopsida, txid241806[ORGN].

Guillermo_U · March 15, 2023, 12:05pm

Hi @SoilRotifer.

Thanks for your answers.

Regarding the classifier: I will use the computing facilities of my university and request the amount of RAM you suggested.
Regarding the taxonomic assignment:

I think I uploaded the incorrect file, but the workflow is the same. I am interested in flowering plants. In our study we are trying to reconstruct the network of interactions between plants and their pollinating birds (i.e. African sunbirds), so we sequence pollen collected from the birds' bills. Hence, I expect African plants in my results. For this purpose I downloaded Streptophyta (txid35493) from GenBank and then followed all the curation steps from the tutorial. Since I did not have enough computing power, I performed the taxonomic assignment with VSEARCH, using the following code:

qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs-ITS-2-dada_2.qza \
  --i-reference-reads ITS-2-gh-extracted-seq-segments-derep-cull-keep-02.qza \
  --i-reference-taxonomy ITS-2-gh-extracted-tax-segments-derep-cull-keep-02.qza \
  --p-perc-identity 0.97 \
  --p-no-top-hits-only \
  --p-threads 10 \
  --o-classification taxonomic_asingment_3.qza \
  --o-search-results taxo_results_3.qza \
  --verbose > taxonomic_assignment_3.txt
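
To quantify how many features were left unclassified, the classification artifact can be exported with qiime tools export and the resulting taxonomy.tsv summarized. The following is a minimal sketch, assuming the exported table is tab-separated with "Feature ID" and "Taxon" columns and that unclassified features carry the label "Unassigned"; the function name is hypothetical and the sample rows are made up for illustration.

```python
import csv
import io


def summarize_assignments(tsv_text, unassigned_label="Unassigned"):
    """Split feature IDs from an exported taxonomy.tsv into assigned vs. unassigned."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    assigned, unassigned = [], []
    for row in reader:
        is_unassigned = row["Taxon"].strip() == unassigned_label
        (unassigned if is_unassigned else assigned).append(row["Feature ID"])
    return {"assigned": assigned, "unassigned": unassigned}


# Made-up example rows, not real results.
example = (
    "Feature ID\tTaxon\tConsensus\n"
    "feat1\tk__Viridiplantae;p__Streptophyta\t1.0\n"
    "feat2\tUnassigned\t1.0\n"
    "feat3\tUnassigned\t1.0\n"
)
summary = summarize_assignments(example)
print(len(summary["assigned"]), "assigned;", len(summary["unassigned"]), "unassigned")
```

The list of unassigned IDs is also a convenient starting point for checking individual sequences by hand.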

This is the correct file to produce the curated reference library:
Reference library.py (7.3 KB)

However, only six out of 643 rep-seqs were assigned to some taxonomic rank, and these were ferns. So these results are totally meaningless for us. Any ideas why this is happening? Can the curated database from RESCRIPt be combined with feature-classifier classify-consensus-vsearch?

I hope I have now explained myself better

Thanks a lot again!!!

SoilRotifer · March 15, 2023, 1:37pm

Thank you for the clarification @Guillermo_U!

I think you've been quite clear all along.

Have you tried BLAST on any of these sequences? If so, do they match the assignments of the sequences that were identified by the classifier? For the others that were not assigned, what do they return as?
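
To BLAST a handful of sequences by hand, it helps to pull just the features of interest out of the exported representative sequences. The following is a minimal sketch, assuming a plain FASTA export in which the first word of each header is the feature ID; the function name and the sample records are hypothetical.

```python
def subset_fasta(fasta_text, wanted_ids):
    """Return FASTA text containing only the records whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    kept, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted
        if keep and line.strip():
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Made-up sequences for illustration only.
example_fasta = (
    ">feat1\nACGTACGT\nTTGA\n"
    ">feat2 extra description\nGGGCCC\n"
    ">feat3\nTTTT\n"
)
print(subset_fasta(example_fasta, ["feat2", "feat3"]), end="")
```

The resulting text can be saved to a file and pasted into a BLAST search.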

Would you be able to send me, via private DM, a portion of your data along with the sequence and taxonomy files you are using to classify your reads? You may have to link them via Dropbox or a similar service, as they might be too big to share via private DM. I'd like to look into this further.

I am wondering if we need to increase the number of iterations, and/or the --p-perc-identity of extract-seq-segments? There have been times I've had to run anywhere from 3-5 iterations. I only performed two iterations in the tutorial to keep things short and simple.

Guillermo_U · March 16, 2023, 11:05am

Hi @SoilRotifer !

I will BLAST the sequences individually to compare the results, as you suggest! I will also adjust the pipeline by increasing the perc-identity parameter of extract-seq-segments, as you suggested.
I will prepare the files and send them to you!

Thank you very much for your kind help