Hi @alexkrohn,
This is often the case in eDNA surveys.
Have you performed any other QA/QC? It is common to perform additional filtering of the data. I often try the following, in the given order:
- Strengthen the de novo chimera detection by adjusting the `pooling_method` and `chimera_method` of dada2.
- Run reference-based chimera detection with uchime-ref on your DADA2 output.
- After you assign taxonomy, you can filter out any taxa that do not have at least a phylum-level assignment, etc.
- To avoid spurious taxonomy assignments to bad sequences, you can run quality-control exclude-seqs to remove any reads that do not match your curated reference to a high degree. Often you can use the same reference sequences that were used to generate your classifier, assuming the reference taxonomy and sequences were well curated. That is, you can remove reads that do not have at least 90% identity and 90% query coverage to your reference sequences. I would not go much higher than this, so that you can still catch potentially real but unfamiliar taxa.
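
To make the logic of the last two filters concrete, here is a minimal Python sketch. It is only an illustration of the thresholds, not how QIIME 2 implements them. The `Hit` record, the feature IDs, the `p__` rank prefix, and the example values are all assumptions made up for this sketch; in practice the alignment and filtering are done by the plugins themselves.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """Best alignment of one query read to the reference (hypothetical record)."""
    query_id: str
    percent_identity: float  # fraction, 0.0 to 1.0
    query_coverage: float    # fraction of the query that is aligned, 0.0 to 1.0


def has_phylum(lineage: str, phylum_prefix: str = "p__") -> bool:
    """True if the lineage carries a non-empty phylum-level label."""
    for rank in lineage.split(";"):
        rank = rank.strip()
        if rank.startswith(phylum_prefix) and len(rank) > len(phylum_prefix):
            return True
    return False


def keep_features(taxonomy: Dict[str, str], hits: List[Hit],
                  min_identity: float = 0.90,
                  min_coverage: float = 0.90) -> List[str]:
    """Return the sorted feature IDs that pass both filters."""
    # Reads that match the curated reference at or above both thresholds.
    well_matched = {
        h.query_id for h in hits
        if h.percent_identity >= min_identity and h.query_coverage >= min_coverage
    }
    # Of those, keep only features classified to at least phylum level.
    return sorted(
        fid for fid, lineage in taxonomy.items()
        if fid in well_matched and has_phylum(lineage)
    )


# Made-up example data.
taxonomy = {
    "asv1": "d__Eukaryota; p__Chordata; c__Amphibia",
    "asv2": "d__Eukaryota",                 # no phylum-level label
    "asv3": "d__Eukaryota; p__Arthropoda",  # phylum present, but poor match below
    "asv4": "Unassigned",                   # no hit to the reference at all
}
hits = [
    Hit("asv1", 0.97, 0.99),
    Hit("asv2", 0.95, 0.98),
    Hit("asv3", 0.82, 0.95),
]

kept = keep_features(taxonomy, hits)
print(kept)
```

Raising `min_identity` or `min_coverage` much above 0.90 would start discarding reads from real but poorly represented taxa, which is the trade-off described above.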
Often many reference reads are missed when only using primer pair extraction of GenBank data. Have you tried the rescript extract-seq-segments approach to extract as many reference reads as possible that span your amplicon region? Note the many repeated levels of data culling we do in that tutorial. There are also some additional QA/QC suggestions in the various RESCRIPt and other tutorials that you can apply too.
Very likely. At least it is a contributing factor.
Other than the filtering steps I outlined above... not much. Remember, general sequence repositories like GenBank are not necessarily curated. They serve to archive sequences generated by scientists. Often what gets deposited by a researcher is not what the researcher thinks it is. Unless the researcher updates the record themselves, that record will stay as is. This is why many other curated repositories like SILVA, GTDB, etc. exist. They mine the sequence archives and curate them to the best of their abilities. This is also why tools like RESCRIPt exist: to give users a chance at curating the reference data as best they can.
Yeah, this is basically the issue of over-classification: assigning to a more specific taxonomic group than you probably should.
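
One way to picture over-classification is as a lineage that keeps going after the evidence runs out. The sketch below is a rough illustration under stated assumptions, not any classifier's actual code: the per-rank confidence values, the labels, and the 0.7 cutoff are all hypothetical. It trims a lineage at the first rank whose confidence drops below the cutoff.

```python
from typing import List, Tuple


def trim_lineage(ranks: List[Tuple[str, float]],
                 min_confidence: float = 0.7) -> str:
    """Keep ranks, from the top down, until one falls below min_confidence."""
    kept = []
    for label, confidence in ranks:
        if confidence < min_confidence:
            break  # everything deeper than this rank is not trusted either
        kept.append(label)
    return "; ".join(kept) if kept else "Unassigned"


# Made-up lineage with made-up per-rank confidences.
lineage = [
    ("p__Chordata", 0.99),
    ("c__Amphibia", 0.95),
    ("g__ExampleGenus", 0.81),
    ("s__ExampleGenus_species", 0.42),
]

trimmed = trim_lineage(lineage)
print(trimmed)
```

Here the species label is dropped and the read is reported only to genus. Note that this kind of trimming cannot rescue a reference record that is itself mislabeled.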
The general rule is to never assume the data downloaded from any sequence repository or curated reference database is perfect. Thus, it's best to perform some curation on your end prior to using it to classify sequences.
I hope this helps.
-Mike