Import reference sequences database and train classifier for mcrA sequences

baehsung · March 1, 2022, 5:54pm

Hi everybody,

I am analyzing sequences of mcrA, a functional gene involved in methanogenesis.
The mcrA seqs have already been processed with DADA2, so I have a feature table and rep-seqs.

Now I want to assign taxonomy to them. I would appreciate any tips; more specifically, how do I obtain a reference sequence database and build a classifier for mcrA?

Thanks,

Hee-Sung

SoilRotifer · March 1, 2022, 7:32pm

Hi @baehsung,

I'd recommend trying out our super awesome RESCRIPt plugin. This plugin will enable you to construct and curate your own marker gene reference set. You can start by reading through our get-ncbi-data tutorial. For another example, check out the notebook for making a 12S rRNA reference set.

-Cheers!
-Mike

baehsung · March 3, 2022, 5:39pm

Thanks Mike!

I read the tutorial that you suggested. I could not understand all of it, but I will try to download mcrA sequences from NCBI first. I recently updated qiime2 to version 2022.2. Do I need to reinstall RESCRIPt in the updated qiime2? I had installed RESCRIPt in the old version of qiime2 (2021.11).

Hee-Sung

SoilRotifer · March 3, 2022, 6:37pm

I'd start with the 12S notebook and simply replace the gene name terms (there may be a few different terms to search for). If you are able to get that to work, we can help you refine your reference sequence database preparation.

Yes, a plugin must be installed in each qiime environment where you want it to be available.
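A quick way to confirm whether RESCRIPt is visible in whichever environment is currently active is sketched below. It assumes the QIIME 2 conda environment has already been activated; the variable name `rescript_status` is just for illustration.

```shell
# Check whether the RESCRIPt plugin is available in the active environment.
# Assumption: the QIIME 2 environment you care about is already activated.
if ! command -v qiime >/dev/null 2>&1; then
    # No qiime executable on PATH at all (environment not activated?)
    rescript_status="no-qiime"
elif qiime rescript --help >/dev/null 2>&1; then
    # The plugin's help page loads, so it is registered in this environment
    rescript_status="installed"
else
    # qiime is present, but the rescript plugin is not registered
    rescript_status="missing"
fi
echo "RESCRIPt status in this environment: ${rescript_status}"
```

Running `qiime info` also lists the plugins installed in the active environment.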

Keep us posted!

baehsung · March 3, 2022, 11:24pm

Hi Mike,

I tried to retrieve mcrA seqs from NCBI Entrez with the keywords "methyl coenzyme m reductase alpha subunit mcrA" and "euryarchaeotes", which selected 1637 seqs.
If I want to retrieve those seqs, how should I write --p-query [text]?

When I used --p-query ((methyl coenzyme m reductase alpha subunit mcrA) AND "euryarchaeotes"[porgn:__txid28890] as shown in the query box, the command failed with: Got unexpected extra argument..

SoilRotifer · March 4, 2022, 11:42pm

You need to place everything within quotes like so:

--p-query ' ((methyl coenzyme m reductase alpha subunit mcrA) AND "euryarchaeotes" ... '

When you have quotes as part of your query search term you have to use a different quote type to encompass the entire search string. In this case I am using single-quotes: ' so that we can make use of the double-quotes " within the search string.
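To see why the outer quotes matter, here is a small sketch that counts how many arguments the shell actually hands to a command. `count_args` is a hypothetical helper standing in for `qiime`, and the query text is only an illustrative fragment of the search terms used in this thread.

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Single quotes around the whole string keep the inner double quotes literal.
query='(methyl coenzyme m reductase alpha subunit mcrA) AND "euryarchaeotes"'

# Quoted expansion: the option name plus ONE query argument.
n_quoted=$(count_args --p-query "$query")

# Unquoted expansion: the shell splits the query on whitespace, so the
# command receives many separate words (the "extra arguments").
n_split=$(count_args --p-query $query)

echo "quoted: ${n_quoted} arguments, unquoted: ${n_split} arguments"
```

The quoted form is what `--p-query` expects: a single string, however many spaces or double quotes it contains.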

-Mike

baehsung · March 8, 2022, 6:40pm

Thanks Mike,

I ran the plugin as below:

qiime rescript get-ncbi-data
--p-query ' "methyl coenzyme m reductase alpha subunit mcrA" AND "euryarchaeotes" '
--p-ranks domain .....species
--p-rank propagation
--o-sequences ....
--o-taxonomy ....
--verbose

and got a plugin error from rescript (see the attached picture).

Do you have any suggestions for how to solve this error?

Best regards,

Hee-Sung

SoilRotifer · March 8, 2022, 7:26pm

Hi @baehsung,

The issue is occurring due to an incorrect query statement. I suggest you read NCBI's documentation on composing queries. I was able to run this command successfully on my local machine (below):

qiime rescript get-ncbi-data \
	--p-query '(methyl coenzyme m reductase alpha subunit OR mcrA) AND txid28890[ORGN]' \
	--p-ranks domain superkingdom kingdom phylum class order family genus species \
	--p-rank-propagation \
	--o-sequences mcrA-seqs.qza \
	--o-taxonomy mcrA-tax.qza \
	--verbose

If you are searching for a particular taxonomic group I suggest you always provide a txid statement, e.g. txid28890[ORGN] which basically means, "return Euryarchaeota". You can search the NCBI Taxonomy resource to determine the txid numbers associated with a given taxonomic group.

Also note the OR statement contained within the (), and the AND statement. Breaking down the query, we are basically saying that we'd like records that:

(are annotated as either methyl coenzyme m reductase alpha subunit OR mcrA)
AND the record must be from txid28890[ORGN] # i.e. Euryarchaeota
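The same breakdown can be sketched as a small script that assembles the query from its two parts, which makes it easier to swap in another gene or taxon. The variable names and the `DRY_RUN` switch are my own illustration, not part of RESCRIPt; the qiime command itself is the one shown above.

```shell
# Part 1: alternative annotations for the gene, joined with OR
gene_terms='methyl coenzyme m reductase alpha subunit OR mcrA'
# Part 2: NCBI taxonomy ID to restrict to (28890 = Euryarchaeota, per this thread)
taxid='28890'

# Combine: (gene terms) AND taxon restriction
query="(${gene_terms}) AND txid${taxid}[ORGN]"
echo "Query: ${query}"

# Hypothetical safety switch: by default only print the query.
# Set DRY_RUN=0 to actually download (needs qiime, RESCRIPt and network access).
DRY_RUN="${DRY_RUN:-1}"
if [ "$DRY_RUN" = "0" ]; then
    qiime rescript get-ncbi-data \
        --p-query "$query" \
        --p-ranks domain superkingdom kingdom phylum class order family genus species \
        --p-rank-propagation \
        --o-sequences mcrA-seqs.qza \
        --o-taxonomy mcrA-tax.qza \
        --verbose
fi
```

Note that `"$query"` is double-quoted when passed to qiime so that it arrives as a single argument.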

baehsung · March 9, 2022, 5:44pm

Hi Mike,

I am so happy that I could get the mcrA seqs and taxonomy with your advice. Thanks very much for this.

The seqs that I retrieved include seqs from uncultured strains, which may interfere with the classification of my seqs. Could I exclude those by changing the --p-query string?

Cheers,

Hee-Sung

SoilRotifer · March 9, 2022, 6:45pm

Yes, you can exclude items from a search using operators like NOT. See the 12S notebook example and the NCBI documentation I referred to earlier in this thread.
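As a rough sketch of what that could look like for the mcrA query in this thread: the exclusion terms and the use of the [Title] field below are assumptions for illustration (the thread does not record the final query that was used), so check the result against the NCBI documentation before relying on it.

```shell
# Base query from earlier in this thread
base_query='(methyl coenzyme m reductase alpha subunit OR mcrA) AND txid28890[ORGN]'

# Hypothetical single-word terms to exclude when they appear in the record title
exclude_terms='uncultured environmental'

# Append one NOT clause per term
query="$base_query"
for term in $exclude_terms; do
    query="${query} NOT \"${term}\"[Title]"
done
echo "Query: ${query}"
```

The resulting string can then be passed as `--p-query "$query"` to `qiime rescript get-ncbi-data`, with the other parameters unchanged.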

-Mike

baehsung · March 10, 2022, 12:54am

Thanks Mike,

I did it as you advised!

Hee-Sung
