And I am currently taking all the abundance table ClosedTable95.qza into R for data manipulation.
I was wondering if there was a way of assigning all ‘non-hits’ a default name for further analysis, as my reference database is only a small proportion of the expected sequences (15%) and in order to be correctly formatted for ‘downstream analysis’ any non hits must be assigned something like “No_blast_hit”, is this possible using open-reference sequencing?
EDIT: Adding an identifier tag within non-hit sequences would also be fine, as I could change them all in R later.
EDIT2: I have found a way within R (changing all feature IDs with over 25 characters to No_Hits) but a Qiime alternative if known would be good for methods not using R manipulation.
Using closed-reference OTU picking to assign taxonomy is essentially just assigning the top hit taxonomy. I do not recommend that approach as discussed elsewhere, and would instead recommend assigning taxonomy subsequently using the vsearch classifier available in q2-feature-classifier. This classifier has a perc-identity parameter that you can adjust for the same purposes. The consensus taxonomy assignment provided by that classifier will also determine whether other near hits exist, in which case the top hit may not be a reliable classification. However, the consensus assignment can also be disabled by setting maxaccepts to 1, if you prefer to just grab the top hit.
Additionally, that classifier would report unclassified sequences as unclassified... hence no workarounds would be required.
Using the vsearch classifier (rather than OTU picker) would be the correct way to do this in QIIME2, and provides the necessary output.
You may also be interested in exclude-seqs (see also this tutorial). That would allow you to just remove all sequences that do not match a reference database above a specified percent identity. You can use this to split your feature table into two — e.g., to analyze these data in separate batches (e.g., classify diatoms with a diatom database and analyze non-diatoms separately downstream) — or just to discard other sequences (e.g., if you do not want diatom sequences, just get rid of them).
It's not just duplicates — your sequences probably won't be a perfect match so there may be many hits that are near matches within a similar percent identity. But theoretically, yes, you could do things this way if you really just want the top hit. Even then, I'd still recommend using classify-consensus-vsearch as that assigns a proper taxonomy to all sequences without needing to do a workaround.
Really? If none of your sequences are being classified, we should examine the parameters you are using... it sounds like maybe the parameters need to be tuned for your specific data (these methods have been tuned for bacteria and fungi, not diatoms! which may need a bit of tweaking)
Unfortunately most of the problems seem to arise from this methods reliance on it being similar to a previous method (microscopy based).
I will get working on it when I get into work tomorrow! Hopefully with some parameter altering I can get it working, if not I’ll come back and ask for more help, thank you very much!