Annotation and Classifier question

Thank you for sharing this, @ankurnaqib.

I think you can skip the filter-seqs-length-by-taxon step, or replace it with filter-seqs-length set to the smallest length you expect from your amplicon. There are no Eukaryotes in the GTDB files, and I think the length filtering, with the options you specified, might be removing too much reference data. I'd try skipping it and see what happens. Definitely keep the cull-seqs step, though.
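
As a rough sketch of the simpler length filter: the file names and the 900 bp minimum below are placeholders I made up, so pick the minimum from your own amplicon. The small run wrapper only echoes the command when qiime is not on your PATH; drop it and call qiime directly if you prefer.

```shell
# Echo each command; execute it only when QIIME 2 is on the PATH.
run() { echo "+ $*"; if command -v qiime >/dev/null 2>&1; then "$@"; fi; }

# Placeholder inputs: swap in your own file name and expected minimum length.
SEQS=gtdb-seqs-culled.qza
MIN_LEN=900

# Apply one global minimum length instead of per-taxon thresholds.
run qiime rescript filter-seqs-length \
  --i-sequences "$SEQS" \
  --p-global-min "$MIN_LEN" \
  --o-filtered-seqs gtdb-seqs-culled-minlen.qza \
  --o-discarded-seqs gtdb-seqs-culled-discarded.qza
```

Looking at how many sequences end up in the discarded artifact is a quick way to see whether the filter is too aggressive.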

Another option is to try the qiime rescript extract-seq-segments approach. That is, use your own data, or quality sequence data that you trust, to align against the GTDB reference sequences and extract the matching segments. This will obviate the need for length filtering.
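
Something along these lines, as a sketch only: the artifact names are hypothetical, my-rep-seqs.qza stands in for whatever trusted amplicon sequences you use, and the identity threshold and thread count are values to tune rather than recommendations. The run wrapper just echoes the command if qiime is not on the PATH.

```shell
# Echo each command; execute it only when QIIME 2 is on the PATH.
run() { echo "+ $*"; if command -v qiime >/dev/null 2>&1; then "$@"; fi; }

# Placeholder inputs: full-length GTDB references plus your own trusted
# amplicon sequences, which act as the segments to search for.
run qiime rescript extract-seq-segments \
  --i-input-sequences gtdb-seqs-culled.qza \
  --i-reference-segment-sequences my-rep-seqs.qza \
  --p-perc-identity 0.7 \
  --p-threads 4 \
  --o-extracted-sequence-segments gtdb-seqs-extracted.qza \
  --o-unmatched-sequences gtdb-seqs-unmatched.qza
```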

There is no need to export and re-import data into QIIME 2. You can simply use the built-in functions qiime taxa filter-seqs ... and qiime rescript filter-taxa ... to remove unwanted sequences and taxonomy from each of the files. You can also use qiime rescript edit-taxonomy ... to edit problematic taxon labels. This way, everything you do is recorded in provenance and it is clear to others what you've done. Heck, if you can confidently assert that a taxonomic label should be something else, then you should be able to simply replace the label.
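
Here is a minimal sketch of that in-QIIME route. Every file name and label (g__UnwantedGenus, g__OldName, g__NewName) is a made-up placeholder, so substitute the labels you actually want to drop or fix. The run wrapper only echoes the commands when qiime is not on the PATH.

```shell
# Echo each command; execute it only when QIIME 2 is on the PATH.
run() { echo "+ $*"; if command -v qiime >/dev/null 2>&1; then "$@"; fi; }

# 1. Drop reference sequences whose taxonomy contains an unwanted label.
run qiime taxa filter-seqs \
  --i-sequences gtdb-seqs.qza \
  --i-taxonomy gtdb-tax.qza \
  --p-exclude "g__UnwantedGenus" \
  --o-filtered-sequences gtdb-seqs-filt.qza

# 2. Drop the same label from the taxonomy so both files stay in sync.
run qiime rescript filter-taxa \
  --i-taxonomy gtdb-tax.qza \
  --p-exclude "g__UnwantedGenus" \
  --o-filtered-taxonomy gtdb-tax-filt.qza

# 3. Replace a label you are confident is wrong (placeholder strings).
run qiime rescript edit-taxonomy \
  --i-taxonomy gtdb-tax-filt.qza \
  --p-search-strings "g__OldName" \
  --p-replacement-strings "g__NewName" \
  --o-edited-taxonomy gtdb-tax-filt-edited.qza
```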

Keep in mind that much of the SILVA database tutorial is not meant to be an SOP; it is just a series of examples organized for simplicity. As mentioned throughout the tutorial, alter the curation steps as needed for your use case.

I'd strongly suggest that you also try assigning your taxonomy with SILVA 138.2, just as a sanity check on some of the taxonomic assignments. For some projects, I run my data through SILVA first to identify host, mitochondria, and chloroplast reads, remove those reads from the data, and then re-classify using GTDB. In my experience, GTDB will incorrectly assign bacterial/archaeal taxonomy to things that are very clearly mitochondria, etc.
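
Roughly, that two-pass workflow could look like the sketch below. All artifact and classifier names are hypothetical, and the exclude terms only cover mitochondria and chloroplast; add whatever host labels actually show up in your SILVA results. The run wrapper only echoes the commands when qiime is not on the PATH.

```shell
# Echo each command; execute it only when QIIME 2 is on the PATH.
run() { echo "+ $*"; if command -v qiime >/dev/null 2>&1; then "$@"; fi; }

# 1. First pass: classify against SILVA (placeholder classifier name).
run qiime feature-classifier classify-sklearn \
  --i-classifier silva-138.2-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy-silva.qza

# 2. Remove features that SILVA labels as mitochondria or chloroplast.
run qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy-silva.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-clean.qza

# 3. Keep only the sequences still present in the filtered table.
run qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table table-clean.qza \
  --o-filtered-data rep-seqs-clean.qza

# 4. Second pass: re-classify the cleaned sequences against GTDB.
run qiime feature-classifier classify-sklearn \
  --i-classifier gtdb-classifier.qza \
  --i-reads rep-seqs-clean.qza \
  --o-classification taxonomy-gtdb.qza
```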