Import RDP classifier output taxonomy file into QIIME 2

YuZhang · July 27, 2021, 4:24am

Dear all,
Following up on my earlier post (https://forum.qiime2.org/t/strange-taxanomy-based-on-silva-database/20031?u=yuzhang), I would like to use a different database to assign taxonomy to my representative sequences.
First, I exported my representative sequences from QIIME 2.
Second, I used the RDP web tool (http://rdp.cme.msu.edu/classifier/classifier.jsp) to classify the sequences, which gave me a txt file like the one below:

Next, I want to convert it to a QIIME 2-compatible format. But it has more than seven taxonomic levels, and one species has two sequences assigned to it.

So how do I deal with this problem?

YuZhang · July 27, 2021, 5:15am

It seems like it has suborder and subclass.
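One way to handle the extra ranks is to keep only the seven ranks that QIIME 2 taxonomies conventionally use and to drop the rest (subclass, suborder, and so on). The sketch below assumes a tab-separated RDP classifier output in which each line holds a sequence ID, an orientation column, and then repeated (name, rank, confidence) triplets. That layout, the `d__`-style prefixes, the function names, and the example line with its confidence values are all assumptions for illustration, so please check them against your own file.

```python
# Canonical ranks to keep, in order, with assumed "d__"-style prefixes.
RANK_PREFIXES = [
    ("domain", "d__"), ("phylum", "p__"), ("class", "c__"), ("order", "o__"),
    ("family", "f__"), ("genus", "g__"), ("species", "s__"),
]
PLACEHOLDERS = {prefix for _, prefix in RANK_PREFIXES}


def rdp_line_to_taxon(fields):
    """Turn one tab-split RDP line into (feature_id, taxon, confidence)."""
    # Assumed layout: fields[0] = ID, fields[1] = orientation, then triplets.
    feature_id, triplets = fields[0], fields[2:]
    by_rank = {}
    for i in range(0, len(triplets) - 2, 3):
        name, rank, conf = triplets[i:i + 3]
        by_rank[rank.strip().lower()] = (name.strip().strip('"'), float(conf))

    levels, confidence = [], 0.0
    for rank, prefix in RANK_PREFIXES:
        if rank in by_rank:
            # Ranks are visited in order, so this ends up as the confidence
            # of the deepest canonical rank that is present.
            name, confidence = by_rank[rank]
            levels.append(prefix + name)
        else:
            levels.append(prefix)  # empty placeholder for a missing rank
    # Drop trailing empty placeholders (e.g. no species call).
    while levels and levels[-1] in PLACEHOLDERS:
        levels.pop()
    return feature_id, "; ".join(levels) or "Unassigned", confidence


def rdp_to_qiime2_tsv(rdp_text):
    """Build a 'Feature ID / Taxon / Confidence' TSV from RDP output text."""
    rows = ["Feature ID\tTaxon\tConfidence"]
    for line in rdp_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        feature_id, taxon, conf = rdp_line_to_taxon(fields)
        rows.append(f"{feature_id}\t{taxon}\t{conf}")
    return "\n".join(rows) + "\n"


# Hypothetical example line; subclass and suborder are dropped on conversion.
example_line = "\t".join([
    "asv1", "", "Root", "rootrank", "1.0", "Bacteria", "domain", "1.0",
    "Actinobacteria", "phylum", "0.99", "Actinobacteria", "class", "0.99",
    "Actinobacteridae", "subclass", "0.98", "Actinomycetales", "order", "0.98",
    "Corynebacterineae", "suborder", "0.95", "Mycobacteriaceae", "family", "0.95",
    "Mycobacterium", "genus", "0.95",
])
print(rdp_to_qiime2_tsv(example_line), end="")
```

If the output is saved as, say, `taxonomy.tsv`, it should be importable with `qiime tools import --type 'FeatureData[Taxonomy]' --input-path taxonomy.tsv --output-path taxonomy.qza`.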

Nicholas_Bokulich · July 27, 2021, 6:00am

Hi @YuZhang,

The consensus in that post was to try using the SILVA 132 release. Why not do that? As noted in that post by @SoilRotifer, you can use RESCRIPt to download and automatically format the SILVA 132 release.

If you want to switch to RDP, why not import the RDP sequences/taxonomy into QIIME 2 and classify with q2-feature-classifier? This might be an easier task (re-formatting RDP to fit the standard taxonomy format accepted by QIIME 2) than parsing the RDP classifier outputs.

Good luck!

YuZhang · July 27, 2021, 7:12am

Thanks for your attention.
First, I did not use SILVA 132 because I found some articles in which Burkholderia-Caballeronia-Paraburkholderia still appears with the 132 release, such as this article:
Second, I had indeed overlooked the second approach you suggested, because from browsing the forum it seemed difficult to import RDP. I will try it.

Nicholas_Bokulich · July 27, 2021, 7:22am

Okay, I am not sure which release that label first appeared in. It just indicates that the genus is unresolvable/contested between those 3 genera, so they concatenated the labels.

If you just want a different database, not specifically RDP, you could use the NCBI RefSeq 16S database; see the tutorial on this forum for using RESCRIPt to download it automatically. That way you would not run into any import errors.

We do also plan to add support in RESCRIPt for automated download/formatting of the RDP database in the future, but I do not have an ETA on that.

Good luck!

YuZhang · July 27, 2021, 7:26am

Thanks, sir. I will try the NCBI RefSeq 16S database first.

YuZhang · July 27, 2021, 8:31am

Hi, sir. When I run the code below:
qiime rescript get-ncbi-data \
  --p-query '33175[BioProject] OR 33317[BioProject]' \
  --o-sequences ncbi-refseqs-unfiltered.qza \
  --o-taxonomy ncbi-refseqs-taxonomy-unfiltered.qza

I get this error.

What's wrong?

Nicholas_Bokulich · July 27, 2021, 9:25am

Please check the forum archive; this is due to an incorrect installation or a compatibility issue and has been answered elsewhere:

YuZhang · July 27, 2021, 2:25pm

Sir, thanks for your attention. I managed to use the NCBI database, but a new problem has appeared.
I want to retain Bacteria only, so I ran the code below:

# Bar plot before filtering
qiime taxa barplot \
  --i-table table-dada2-4.qza \
  --i-taxonomy ncbi-taxonomy-4.qza \
  --m-metadata-file metadata.txt \
  --o-visualization ncbi-bar-plots-4.qzv

# Filter the feature table
qiime taxa filter-table \
  --i-table table-dada2-4.qza \
  --i-taxonomy ncbi-taxonomy-4.qza \
  --p-include k__Bacteria \
  --o-filtered-table ncbi-Bacteria-dada2-table-4.qza

# Filter the sequences
qiime taxa filter-seqs \
  --i-sequences rep-seq-dada2-4.qza \
  --i-taxonomy ncbi-taxonomy-4.qza \
  --p-include k__Bacteria \
  --o-filtered-sequences ncbi-Bacteria-sequences-4.qza

# Summarize the filtered table and sequences
qiime feature-table summarize \
  --i-table ncbi-Bacteria-dada2-table-4.qza \
  --o-visualization ncbi-Bacteria-dada2-table-4.qzv \
  --m-sample-metadata-file metadata.txt

qiime feature-table tabulate-seqs \
  --i-data ncbi-Bacteria-sequences-4.qza \
  --o-visualization ncbi-Bacteria-sequences-4.qzv

# Bar plot after filtering
qiime taxa barplot \
  --i-table ncbi-Bacteria-dada2-table-4.qza \
  --i-taxonomy taxonomy-4.qza \
  --m-metadata-file metadata.txt \
  --o-visualization ncbi-Bacteria-taxa-bar-plots-4.qzv

But the bar plot still shows d__Archaea, etc.

What is wrong?

I also changed k__ to d__, but it didn't work.

SoilRotifer · July 27, 2021, 3:31pm

Use --p-exclude Archaea,Eukaryota (multiple search terms are comma-delimited by default).

However, I'd advise against removing these groups, as you'll want to keep off-target / outgroup taxa in your database. Otherwise you'll likely identify many taxa incorrectly as "d__Bacteria; ;", when in fact they may be Archaea or Eukaryota.
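To make the include/exclude behaviour concrete, here is a small Python model of the matching. It is only a simplified sketch of how I understand the default "contains" mode of `qiime taxa filter-table` to behave (comma-delimited terms, substring match, treated here as case-insensitive), not the plugin's actual code. The function name and the toy taxonomy are made up.

```python
def filter_features(taxonomy, include=None, exclude=None):
    """Keep IDs whose taxon contains any include term and no exclude term.

    `taxonomy` maps feature ID -> taxon string; `include` / `exclude` are
    comma-delimited search terms, mirroring the CLI parameters.
    """
    def terms(value):
        return [t.strip().lower() for t in value.split(",")] if value else []

    inc, exc = terms(include), terms(exclude)
    kept = []
    for feature_id, taxon in taxonomy.items():
        label = taxon.lower()
        if inc and not any(term in label for term in inc):
            continue  # include terms were given and none of them matched
        if any(term in label for term in exc):
            continue  # at least one exclude term matched
        kept.append(feature_id)
    return kept


# Toy taxonomy using the d__ prefix seen in the bar plot above.
taxonomy = {
    "asv1": "d__Bacteria; p__Proteobacteria",
    "asv2": "d__Archaea; p__Euryarchaeota",
    "asv3": "d__Eukaryota",
    "asv4": "Unassigned",
}

print(filter_features(taxonomy, include="k__Bacteria"))        # no label contains k__
print(filter_features(taxonomy, include="d__Bacteria"))
print(filter_features(taxonomy, exclude="Archaea,Eukaryota"))
```

In this model, `k__Bacteria` matches nothing because the labels use `d__`, while the exclude form keeps unassigned features that an include filter would drop. Separately, the last barplot command in the question reads `taxonomy-4.qza` while the filtering steps read `ncbi-taxonomy-4.qza`; if those two files hold different classifications, that mismatch might explain why d__Archaea still shows up, so it may be worth checking.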

See:

YuZhang · July 28, 2021, 12:59am

Thanks, sir. I'm still a little confused. Even if I retain Archaea and Eukaryota, I would only focus on bacteria in the downstream analysis, so I would still filter out the archaea later (I use phyloseq). How is this different from deleting them now?

YuZhang · July 28, 2021, 1:31am

Hi Nicholas, another question: I found that the NCBI database is much smaller than SILVA and RDP. If I classify my 16S sequences against this database and write an article, will the reviewers question it?

SoilRotifer · July 28, 2021, 2:25am

Correct.

The idea is to correctly classify what is and is not Bacteria. Again, if there are no outgroup taxa in your reference database, then you might incorrectly over-classify sequences as being Bacteria when in fact they are not Bacteria. With outgroup taxa included, you will have greater confidence that you are removing (or retaining) the appropriate data.

This is no different than removing chloroplast and mitochondria sequences after you classify your reads, as shown in the filtering tutorial. It is usually better to identify everything and then filter your table based on what you need for your analyses.
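A toy illustration of the outgroup point: the sketch below uses a naive best-hit rule based on shared k-mers, which is not how `classify-sklearn` works internally, and the sequences and labels are invented. It only shows that a classifier can never return a label that is absent from its reference.

```python
def kmers(seq, k=4):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(query, reference, k=4):
    """Return the label of the reference with the highest k-mer Jaccard similarity."""
    query_kmers = kmers(query, k)

    def jaccard(seq):
        ref_kmers = kmers(seq, k)
        return len(query_kmers & ref_kmers) / len(query_kmers | ref_kmers)

    return max(reference, key=lambda label: jaccard(reference[label]))


# Invented reference sequences keyed by taxonomy label.
bacteria_only = {
    "d__Bacteria; g__A": "ACGTACGTTAGCCGATAGCT",
    "d__Bacteria; g__B": "TTGACCGGTATCGGATCCAA",
}
with_outgroup = dict(bacteria_only)
with_outgroup["d__Archaea; g__C"] = "GGCATCGTAGGCTTAACGGC"

# A query that is really archaeal (identical to the archaeal reference).
query = "GGCATCGTAGGCTTAACGGC"

print(best_hit(query, bacteria_only))   # forced onto some bacterial label
print(best_hit(query, with_outgroup))   # can now be recognised as Archaea
```

With the bacteria-only reference the archaeal query has nowhere to go except a bacterial label, so a later "keep Bacteria" filter would retain it. With the outgroup present it is labelled Archaea and can be filtered out deliberately.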

SoilRotifer · July 28, 2021, 2:43am

If you read through this tutorial, you'll see that it focuses on downloading from the RefSeq target loci data. Specifically, under the "Bacteria and Archaea: 16S ribosomal RNA project" section, you'll see that this data only contains sequences from "...bacteria and archaea type materials."

This is unlike the other, larger reference databases (e.g. SILVA, GTDB, RDP), which contain a mix of environmental sequence data, type material, etc.

YuZhang · July 28, 2021, 3:03am

I am new to this field. My DNA sequences are from soil, so is the NCBI database unsuitable for processing my 16S data?

SoilRotifer · July 28, 2021, 3:13am

Not necessarily. It depends on what your goals are. But in my general experience you might identify a broader range of taxa with SILVA and GTDB. Or you can simply use all of the databases and see if there is a general consensus on which taxa you can consistently identify.
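The consensus check could be sketched roughly as follows, assuming you have exported each database's taxonomy to a mapping of feature ID to lineage string with `g__`-prefixed genera. The function names and the example assignments are hypothetical, not real results.

```python
from collections import Counter


def genus_of(taxon):
    """Return the genus from a 'g__'-prefixed lineage string, or None."""
    for level in taxon.split(";"):
        level = level.strip()
        if level.startswith("g__") and len(level) > 3:
            return level[3:]
    return None


def genus_consensus(assignments):
    """Map feature ID -> (most common genus, number of databases agreeing)."""
    features = set().union(*(db.keys() for db in assignments.values()))
    result = {}
    for feature_id in sorted(features):
        genera = [genus_of(db[feature_id])
                  for db in assignments.values() if feature_id in db]
        counts = Counter(g for g in genera if g is not None)
        result[feature_id] = counts.most_common(1)[0] if counts else (None, 0)
    return result


# Hypothetical per-database assignments for two features.
assignments = {
    "silva": {
        "asv1": "d__Bacteria; g__Bacillus",
        "asv2": "d__Bacteria; g__Burkholderia-Caballeronia-Paraburkholderia",
    },
    "gtdb": {
        "asv1": "d__Bacteria; g__Bacillus",
        "asv2": "d__Bacteria; g__Paraburkholderia",
    },
    "ncbi": {
        "asv1": "d__Bacteria; g__Bacillus",
        "asv2": "d__Bacteria; g__Paraburkholderia",
    },
}

for feature_id, (genus, votes) in genus_consensus(assignments).items():
    print(feature_id, genus, f"{votes}/{len(assignments)}")
```

Exact string matching is crude, since databases differ in naming conventions (concatenated labels in SILVA, suffixed genus names in GTDB), so treat disagreements as a prompt for manual inspection rather than as a verdict.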

YuZhang · July 28, 2021, 3:27am

Thanks, sir. I want to use the RDP and SILVA databases. But, as I said above, Burkholderia-Caballeronia-Paraburkholderia appears with SILVA, and QIIME 2 does not seem to support RDP, so I used NCBI. But if I classify 16S sequences from soil against this database and write an article, will the reviewers question it?

yileiwu · July 28, 2021, 6:22am

This is a Python script to parse the RDP Unaligned file. Please don't hesitate to contact me if you encounter any bugs.

YuZhang · July 28, 2021, 8:01am

Thanks, yilei. I'm not good at Python. Do you have an R script for these steps?

YuZhang · July 29, 2021, 12:54am

Thanks. But I still have some questions.
My previous analysis procedure was to remove all categories other than Bacteria after classification, as shown in the code below. But after reading your explanation, I am still confused and don't know how to proceed. I want to do downstream analysis in QIIME 2, such as alpha diversity, so I should focus on Bacteria. If I don't delete Archaea and Eukaryota, I will get an incorrect result. What should I do? I already filter the sequences after classification, following the filtering tutorial. So what should I do to filter the sequences correctly?
My understanding is that this has something to do with the categories contained in the database. If the database contains more categories, is removing the unwanted categories more reliable? But the current databases, such as the RDP and NCBI ones I used, usually contain bacteria and archaea together; does that mean the deletion is not guaranteed to be correct?

Train classifier

        # Fit and evaluate a classifier on the full-length reference
        qiime rescript evaluate-fit-classifier \
            --i-sequences ../database/ncbi_refseqs/ncbi-refseqs.qza \
            --i-taxonomy ../database/ncbi_refseqs/ncbi-refseqs-taxonomy.qza \
            --o-classifier ../database/ncbi_refseqs/ncbi-refseqs-classifier.qza \
            --o-evaluation ../database/ncbi_refseqs/ncbi-refseqs-classifier-evaluation.qzv \
            --o-observed-taxonomy ../database/ncbi_refseqs/ncbi-refseqs-predicted-taxonomy.qza

        # Extract the 515F-907R region from the reference sequences
        qiime feature-classifier extract-reads \
            --i-sequences ../database/ncbi_refseqs/ncbi-refseqs.qza \
            --p-f-primer GTGCCAGCMGCCGCGGTAA \
            --p-r-primer CCGTCAATTCCTTTGAGTTT \
            --p-n-jobs 5 \
            --p-read-orientation 'forward' \
            --o-reads ../database/ncbi_refseqs/classifier/ncbi-515F-907R-ref-seqs.qza

        # Train a naive Bayes classifier on the extracted region
        qiime feature-classifier fit-classifier-naive-bayes \
            --i-reference-reads ../database/ncbi_refseqs/classifier/ncbi-515F-907R-ref-seqs.qza \
            --i-reference-taxonomy ../database/ncbi_refseqs/ncbi-refseqs-taxonomy.qza \
            --o-classifier ../database/ncbi_refseqs/classifier/ncbi-515F-907R-classifier.qza

        # Classify the representative sequences
        qiime feature-classifier classify-sklearn \
            --i-classifier ../database/ncbi_refseqs/classifier/ncbi-515F-907R-classifier.qza \
            --i-reads rep-seq-dada2-4.qza \
            --o-classification ncbi-taxonomy-4.qza

        # Tabulate and plot the resulting taxonomy
        qiime metadata tabulate \
            --m-input-file ncbi-taxonomy-4.qza \
            --o-visualization ncbi-taxonomy-4.qzv

        qiime taxa barplot \
            --i-table table-dada2-4.qza \
            --i-taxonomy ncbi-taxonomy-4.qza \
            --m-metadata-file metadata.txt \
            --o-visualization ncbi-bar-plots-4.qzv

Filter sequences

# Filter the feature table
qiime taxa filter-table \
  --i-table table-dada2-4.qza \
  --i-taxonomy ncbi-taxonomy-4.qza \
  --p-include d__Bacteria \
  --o-filtered-table ncbi-Bacteria-dada2-table-4.qza

# Filter the sequences
qiime taxa filter-seqs \
  --i-sequences rep-seq-dada2-4.qza \
  --i-taxonomy ncbi-taxonomy-4.qza \
  --p-include d__Bacteria \
  --o-filtered-sequences ncbi-Bacteria-sequences-4.qza

# Summarize the filtered table and sequences
qiime feature-table summarize \
  --i-table ncbi-Bacteria-dada2-table-4.qza \
  --o-visualization ncbi-Bacteria-dada2-table-4.qzv \
  --m-sample-metadata-file metadata.txt

qiime feature-table tabulate-seqs \
  --i-data ncbi-Bacteria-sequences-4.qza \
  --o-visualization ncbi-Bacteria-sequences-4.qzv

# Bar plot after filtering
qiime taxa barplot \
  --i-table ncbi-Bacteria-dada2-table-4.qza \
  --i-taxonomy taxonomy-4.qza \
  --m-metadata-file metadata.txt \
  --o-visualization ncbi-Bacteria-taxa-bar-plots-4.qzv