rescript filter-taxa and dereplicate

buzic · October 24, 2023, 3:10pm

I am using RESCRIPt to build a custom reference database for the rbcL gene. I used the get-ncbi-data command to download rbcL sequences. After examining the taxonomy file, I wanted to remove entries whose assignments are 'unclassified', 'environmental', etc.

I performed a filter taxonomy step:

qiime rescript filter-taxa \
--i-taxonomy rbcl/ncbi-refseqs-taxonomy-unfiltered-rbcl.qza \
--p-exclude 'k__Synthetic and Chimeric' 'k__Environmental samples' 'k__Unassigned' \
--o-filtered-taxonomy rbcl/ncbi-refseqs-taxonomy-filtered-rbcl.qza

Then I dereplicated my sequences:

qiime rescript dereplicate \
--i-sequences rbcl/ncbi-refseqs-unfiltered-rbcl.qza \
--i-taxa rbcl/ncbi-refseqs-taxonomy-filtered-rbcl.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences rbcl/basic-rbcl-ref-seqs-derep.qza \
--o-dereplicated-taxa rbcl/basic-rbcl-ref-tax-derep.qza

resulting in the error:

Plugin error from rescript:

  'DM462432.1'

I've seen this post, which suggests the error likely occurs because dereplicate found a sequence with this ID but could not find a corresponding entry in the taxonomy file.
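
One rough way to check this is to export both artifacts with qiime tools export and list the sequence IDs that have no taxonomy row. This is only a sketch: the helper name missing_taxonomy_ids is made up for illustration, and the file names in the comments assume the default export names (dna-sequences.fasta and taxonomy.tsv, the latter with one header line).

```shell
# Hypothetical helper: print sequence IDs that have no row in the taxonomy table.
#   $1 = FASTA file (assumed: exported dna-sequences.fasta)
#   $2 = tab-separated taxonomy file with one header line (assumed: exported taxonomy.tsv)
missing_taxonomy_ids() {
  seq_ids=$(mktemp)
  tax_ids=$(mktemp)
  # Sequence IDs: header lines, without '>' and without any description text.
  grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1 | sort -u > "$seq_ids"
  # Taxonomy IDs: first column, skipping the header line.
  tail -n +2 "$2" | cut -f1 | sort -u > "$tax_ids"
  # Lines found only in the first list, i.e. sequences without taxonomy.
  comm -23 "$seq_ids" "$tax_ids"
  rm -f "$seq_ids" "$tax_ids"
}

# Example call (paths are assumptions):
#   missing_taxonomy_ids exported-seqs/dna-sequences.fasta exported-tax/taxonomy.tsv
```

If the ID from the error message shows up in that list, the sequence file still contains records that the taxonomy filter removed.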

My question is: how do I filter out taxa that I do not need once I have the taxonomy and sequence files? Should I just do it after I've dereplicated and culled the sequences and before building the final classifier? Or can this only be dealt with at the download step, in my Entrez query (as is done here)?

Many thanks,

SoilRotifer · October 24, 2023, 6:27pm

Hi @buzic,

You're almost there. The easier way is to follow this approach.

That is:

qiime taxa filter-seqs \
  --i-sequences rbcl/ncbi-refseqs-unfiltered-rbcl.qza \
  --i-taxonomy rbcl/ncbi-refseqs-taxonomy-unfiltered-rbcl.qza \
  --p-exclude 'k__Synthetic and Chimeric,k__Environmental samples,k__Unassigned' \
  --o-filtered-sequences rbcl/ncbi-refseqs-filtered-rbcl.qza
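
Before filtering, it can help to check how many taxonomy entries each exclusion term actually matches. The following is a minimal sketch that works on an exported taxonomy table rather than on the .qza itself. The helper name count_excluded is hypothetical, and it does case-sensitive fixed-string matching, which may not be identical to how the plugin matches terms.

```shell
# Hypothetical helper: count taxonomy rows containing each exclusion term.
#   $1 = tab-separated taxonomy file with one header line (assumed: exported taxonomy.tsv)
#   remaining arguments = search terms, matched as fixed strings
count_excluded() {
  tsv=$1
  shift
  for term in "$@"; do
    # grep -c prints 0 but exits non-zero when nothing matches, hence '|| true'.
    n=$(tail -n +2 "$tsv" | grep -c -F -- "$term" || true)
    printf '%s\t%s\n' "$term" "$n"
  done
}

# Example call (path is an assumption):
#   count_excluded exported-tax/taxonomy.tsv 'k__Environmental samples' 'k__Unassigned'
```

A count of zero for a term usually means the label is spelled differently in your taxonomy file than in the search term.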

Then you can run the rescript dereplicate command. In this case, it is okay to have more entries in your taxonomy file than in your sequence file. But if you'd like to keep things in sync, you can run the following command to remove taxonomy entries that are not in your sequence file:

qiime rescript filter-taxa \
    --i-taxonomy rbcl/ncbi-refseqs-taxonomy-unfiltered-rbcl.qza \
    --m-ids-to-keep-file rbcl/ncbi-refseqs-filtered-rbcl.qza \
    --o-filtered-taxonomy rbcl/ncbi-refseqs-taxonomy-filtered-rbcl.qza

Now you should be good to go!
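
To illustrate what "in sync" means here, the sketch below does the same kind of ID-based filtering on exported plain-text files: it keeps the header line plus only those taxonomy rows whose ID appears in the FASTA file. It is not the plugin's implementation. The helper name keep_taxonomy_for_seqs is hypothetical, and the file layout (one header line, ID in the first tab-separated column) is an assumption based on the usual exported taxonomy.tsv.

```shell
# Hypothetical helper: keep only taxonomy rows whose ID appears in the FASTA file.
#   $1 = FASTA file (assumed: exported dna-sequences.fasta)
#   $2 = tab-separated taxonomy file with one header line (assumed: exported taxonomy.tsv)
keep_taxonomy_for_seqs() {
  awk -F'\t' '
    # First file (FASTA): remember each ID, dropping ">" and any description text.
    NR == FNR {
      if (/^>/) { id = substr($0, 2); sub(/[ \t].*/, "", id); keep[id] = 1 }
      next
    }
    # Second file (taxonomy): print the header line and rows with a remembered ID.
    FNR == 1 || ($1 in keep)
  ' "$1" "$2"
}

# Example call (paths are assumptions):
#   keep_taxonomy_for_seqs exported-seqs/dna-sequences.fasta exported-tax/taxonomy.tsv > taxonomy-in-sync.tsv
```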

SoilRotifer · October 24, 2023, 6:39pm

Hi @buzic,

I forgot to mention: if you need to fix or alter specific taxonomic ranks, e.g. to deal with mis-annotations, you can use edit-taxonomy.

buzic · October 25, 2023, 7:24am

That's perfect! Thank you!
