I am using RESCRIPt to build a custom reference database for the rbcL gene. I used the get-ncbi-data command to download rbcL sequences. After examining the taxonomy file, I wanted to remove entries whose assignments are 'unclassified', 'environmental', etc.
I performed a filter taxonomy step:
qiime rescript filter-taxa \
--i-taxonomy rbcl/ncbi-refseqs-taxonomy-unfiltered-rbcl.qza \
--p-exclude 'k__Synthetic and Chimeric' 'k__Environmental samples' 'k__Unassigned' \
--o-filtered-taxonomy rbcl/ncbi-refseqs-taxonomy-filtered-rbcl.qza
Then I dereplicated my sequences:
qiime rescript dereplicate \
--i-sequences rbcl/ncbi-refseqs-unfiltered-rbcl.qza \
--i-taxa rbcl/ncbi-refseqs-taxonomy-filtered-rbcl.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences rbcl/basic-rbcl-ref-seqs-derep.qza \
--o-dereplicated-taxa rbcl/basic-rbcl-ref-tax-derep.qza
resulting in the error:
Plugin error from rescript:
'DM462432.1'
I've seen this post, which tells me the error is likely raised because dereplicate found a sequence with that ID but could not find a corresponding entry in the taxonomy file.
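To illustrate the mismatch described above, here is a minimal sketch of how one might list sequence IDs that have no taxonomy entry. It assumes both artifacts have been exported to plain files named dna-sequences.fasta and taxonomy.tsv (the names that qiime tools export normally produces). The toy records, taxonomy strings, and the AB00000x accessions below are made up purely for illustration; only DM462432.1 comes from the actual error.

```shell
# Toy stand-ins for the exported files (hypothetical contents, not real data).
workdir=$(mktemp -d)

# Unfiltered sequences: three records.
cat > "$workdir/dna-sequences.fasta" <<'EOF'
>AB000001.1
ATGTCACCACAAACAGAGACT
>DM462432.1
ATGTCACCACAAACAGAAACT
>AB000002.1
ATGTCACCACAAACGGAGACT
EOF

# Filtered taxonomy: the entry for DM462432.1 has been removed.
{
  printf 'Feature ID\tTaxon\n'
  printf 'AB000001.1\tk__Viridiplantae; p__Streptophyta\n'
  printf 'AB000002.1\tk__Viridiplantae; p__Chlorophyta\n'
} > "$workdir/taxonomy.tsv"

# Collect the sequence IDs (header text up to the first space, without ">").
grep '^>' "$workdir/dna-sequences.fasta" \
  | sed 's/^>//' | cut -d' ' -f1 | LC_ALL=C sort -u > "$workdir/seq_ids.txt"

# Collect the taxonomy IDs (first column, skipping the header line).
tail -n +2 "$workdir/taxonomy.tsv" \
  | cut -f1 | LC_ALL=C sort -u > "$workdir/tax_ids.txt"

# IDs present in the sequences but absent from the taxonomy.
missing=$(LC_ALL=C comm -23 "$workdir/seq_ids.txt" "$workdir/tax_ids.txt")
echo "$missing"
```

On the toy data this reports the one ID that was dropped from the taxonomy but is still present in the sequences, which is the kind of mismatch that seems to trigger the error.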
My question is: how do I filter out taxonomic assignments that I do not need from both the taxonomy and sequence files once they are downloaded? Should I just do it after I've dereplicated and culled the sequences, and before building the final classifier? Or can this only be dealt with at the download step, in my Entrez query (as is done here)?
Many thanks,