I am analyzing sequences of mcrA, a functional gene involved in methanogenesis.
The mcrA seqs have already been processed with DADA2, so I have the feature table and rep-seqs.
Now I want to assign taxonomy to them. I would appreciate any tips; more specifically, how do I get a reference sequence database and build a classifier for mcrA?
I'd recommend trying out our super awesome RESCRIPt plugin. This plugin will enable you to construct and curate your own marker gene reference set. You can start by reading through our get-ncbi-data tutorial. For another example, check out the notebook for making a 12S rRNA reference set.
I read the tutorial that you suggested. Although I could not understand all of it, I will try to download mcrA sequences from NCBI first. I recently updated qiime2 to version 2022.2. Do I need to reinstall RESCRIPt in the updated qiime2? I had installed RESCRIPt in the old version of qiime2 (2021.11).
I'd start with the 12S notebook and simply replace the gene name terms (there may be a few different terms to search for). If you are able to get that to work, we can help you refine your reference sequence database preparation.
Yes, a plugin must be installed in each qiime environment where you want it to be available.
I tried to retrieve mcrA seqs from NCBI Entrez with the keywords "methyl coenzyme m reductase alpha subunit mcrA" and "euryarchaeotes", which selected 1637 seqs.
If I want to retrieve those seqs, how should I write --p-query [text]?
When I used --p-query ((methyl coenzyme m reductase alpha subunit mcrA) AND "euryarchaeotes"[porgn:__txid28890] as shown in the query box, the command failed with: Got unexpected extra argument..
You need to place everything within quotes like so:
--p-query ' ((methyl coenzyme m reductase alpha subunit mcrA) AND "euryarchaeotes" ... '
When you have quotes as part of your query search term you have to use a different quote type to encompass the entire search string. In this case I am using single-quotes: ' so that we can make use of the double-quotes " within the search string.
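To see why the quoting matters, here is a small sketch you can run in a bash-like shell without QIIME 2. The count_args helper is hypothetical (it is not part of QIIME 2 or RESCRIPt), and the query is a balanced-parentheses version of the one above, used here only for illustration. It shows that a single-quoted query reaches a command as one argument, while the same text left unquoted is split on spaces, which is the kind of thing that leads to an "unexpected extra argument" error.

```shell
# Hypothetical helper, for illustration only: print how many arguments it received.
count_args() { echo "$#"; }

# Single quotes wrap the whole query, so the inner double quotes stay literal.
QUERY='(methyl coenzyme m reductase alpha subunit OR mcrA) AND "euryarchaeotes"[porgn:__txid28890]'

# Quoted expansion: the whole query is passed as a single argument.
quoted_count=$(count_args "$QUERY")

# Unquoted expansion: the shell splits the query on spaces.
# set -f turns off filename globbing so the [ ] characters are left alone.
set -f
unquoted_count=$(count_args $QUERY)
set +f

echo "quoted: $quoted_count argument(s); unquoted: $unquoted_count argument(s)"
```

The same idea applies when the query is typed directly after --p-query: wrap the whole thing in single quotes.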
The issue is occurring due to an incorrect query statement. I suggest you read NCBI's documentation on composing queries. I was able to run this command successfully on my local machine (below):
qiime rescript get-ncbi-data \
--p-query '(methyl coenzyme m reductase alpha subunit OR mcrA) AND txid28890[ORGN]' \
--p-ranks domain superkingdom kingdom phylum class order family genus species \
--p-rank-propagation \
--o-sequences mcrA-seqs.qza \
--o-taxonomy mcrA-tax.qza \
--verbose
If you are searching for a particular taxonomic group, I suggest you always provide a txid statement, e.g. txid28890[ORGN], which basically means "return Euryarchaeota". You can search the NCBI Taxonomy resource to determine the txid numbers associated with a given taxonomic group.
Also note the OR statement contained within the (), and the AND statement. Breaking down the query, we are basically saying that we'd like records that:
(are annotated as either methyl coenzyme m reductase alpha subunit OR mcrA)
AND the record must be from txid28890[ORGN] # i.e. Euryarchaeota
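If it helps, the breakdown above can be sketched by assembling the query from its two parts in shell variables. The variable names GENE_TERMS, TAXON and QUERY are just placeholders I made up for illustration; the resulting string is the same one passed to --p-query in the command above.

```shell
# Part 1: the gene can be annotated under either name, so the terms are joined with OR.
GENE_TERMS='(methyl coenzyme m reductase alpha subunit OR mcrA)'

# Part 2: restrict records to one taxonomic group via its NCBI txid (28890, Euryarchaeota).
TAXON='txid28890[ORGN]'

# Both conditions must hold, so the parts are joined with AND.
QUERY="${GENE_TERMS} AND ${TAXON}"

# Print the final string to check it before using it with --p-query "$QUERY".
printf '%s\n' "$QUERY"
```

Keeping the parts separate makes it easier to swap in another gene name or another txid later.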
I am so happy that I could get the mcrA seqs and taxonomy with your advice, thanks very much for this.
The seqs that I retrieved include seqs from uncultured strains, which may interfere with the classification of my seqs. Could I exclude those by changing the --p-query string?
Yes, you can exclude items from a search using operators like NOT. See the 12S notebook example and the NCBI documents I referred to earlier in this thread.
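As a rough sketch of what that could look like, the snippet below appends NOT terms to the query used earlier in this thread. It assumes the unwanted records have words such as "uncultured" or "environmental" in their titles, which you should verify on the NCBI website first; the exact exclusion terms here are my guess and not a tested recipe for mcrA. The variable names are placeholders.

```shell
# Query from earlier in this thread.
BASE_QUERY='(methyl coenzyme m reductase alpha subunit OR mcrA) AND txid28890[ORGN]'

# Assumption: unwanted records carry these words in their titles.
# Adjust the list after inspecting the search results on the NCBI website.
EXCLUDE='NOT uncultured[Title] NOT environmental[Title]'

QUERY="${BASE_QUERY} ${EXCLUDE}"

# Print the final string to check it before use.
printf '%s\n' "$QUERY"

# It can then replace the query in the earlier command, for example:
#   qiime rescript get-ncbi-data \
#     --p-query "$QUERY" \
#     --o-sequences mcrA-seqs.qza \
#     --o-taxonomy mcrA-tax.qza \
#     --verbose
```

It is worth pasting the printed query into the NCBI search box first to confirm the record count drops as expected.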