I want to remove Eukaryota from SILVA 138. Following these instructions, I ran:
qiime rescript filter-seqs-length-by-taxon \
--i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
--i-taxonomy silva-138.1-ssu-nr99-tax.qza \
--p-labels Archaea Bacteria \
--p-min-lens 900 1200 \
--o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
--o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza
But Eukaryota is still being assigned to my sequences.
Hi @kevin_SalOrt,
I'd highly recommend keeping the eukaryotes within the reference database for your classifier. These eukaryotic sequences act as "outgroups" or "decoys" to ensure that you are not erroneously assigning these sequences to Bacteria. Remember, reads can be assigned to something they are not, simply because that is the closest representative within the database. Also, you do not need to use filter-seqs-length-by-taxon if you do not want to; that tutorial is just an example of what you can do. Anyway, if some of your reads hit those references and are identified as eukaryotes, you can remove them after classification, see below.
Once you have classified your reads, you'd follow this approach to remove Eukaryotes and organelle sequences, prior to your downstream analysis.
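As a rough sketch of that post-classification filtering, something like the following should work with the `qiime taxa filter-table` and `qiime taxa filter-seqs` actions. The artifact names (`table.qza`, `rep-seqs.qza`, `taxonomy.qza`, and the output names) are placeholders I made up, not files from your run, and the exclude terms assume SILVA-style labels, so check them against your own taxonomy strings first.

```shell
# Hypothetical artifact names; substitute the files from your own analysis.
TABLE="table.qza"
REP_SEQS="rep-seqs.qza"
TAXONOMY="taxonomy.qza"   # output of your classifier
# Terms to drop; assumes SILVA-style labels in the taxonomy strings.
EXCLUDE="Eukaryota,Mitochondria,Chloroplast"

if command -v qiime >/dev/null 2>&1; then
  # Remove features whose taxonomy contains any of the excluded terms.
  qiime taxa filter-table \
    --i-table "$TABLE" \
    --i-taxonomy "$TAXONOMY" \
    --p-exclude "$EXCLUDE" \
    --o-filtered-table table-no-euk.qza

  # Apply the same filter to the representative sequences.
  qiime taxa filter-seqs \
    --i-sequences "$REP_SEQS" \
    --i-taxonomy "$TAXONOMY" \
    --p-exclude "$EXCLUDE" \
    --o-filtered-seqs rep-seqs-no-euk.qza
else
  echo "qiime not found on PATH; activate your QIIME 2 environment first"
fi
```

Filtering the table and the representative sequences with the same exclude list keeps the two artifacts consistent for downstream steps.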
But if you'd like to remove these sequences prior to making your classifier, you can follow this approach. Again, for purposes of making a classifier I'd strongly suggest you leave the Eukaryote sequences. If your sequences are being identified as eukaryotes, then you likely have contamination... or simply have many eukaryotes within your environment.
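If you do go the pre-classifier route, here is a minimal sketch using `qiime taxa filter-seqs` on the reference sequences. As far as I know, `filter-seqs-length-by-taxon` only applies length thresholds to the labels you list and does not by itself drop other taxa, which would explain why Eukaryota is still in your database. The input names are taken from your command; the output name is just a placeholder I picked.

```shell
# Input names come from the command in the question; the output name is hypothetical.
REF_SEQS="silva-138.1-ssu-nr99-seqs-cleaned.qza"
REF_TAX="silva-138.1-ssu-nr99-tax.qza"
OUT_SEQS="silva-138.1-ssu-nr99-seqs-no-euk.qza"

if command -v qiime >/dev/null 2>&1; then
  # Drop every reference sequence whose taxonomy contains "Eukaryota".
  qiime taxa filter-seqs \
    --i-sequences "$REF_SEQS" \
    --i-taxonomy "$REF_TAX" \
    --p-exclude Eukaryota \
    --o-filtered-seqs "$OUT_SEQS"
else
  echo "qiime not found on PATH; activate your QIIME 2 environment first"
fi
```

You would then run the length filtering and the rest of the tutorial on the filtered sequences instead of the original file.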
Thanks, that was very helpful. So I can remove the contaminant sequences and continue with the remaining steps?
Yes. This is quite a common thing to do, e.g. removing host sequences. See the links I provided above.