I have 16S rRNA data (V5-V6) and ran the taxonomy classification with the command below. I am fairly confident in the accuracy of my data, but I am not sure what to do with the unassigned taxa that you usually get with 16S amplicon sequencing. The easiest option seems to be deleting these sequences, but when I BLAST some of them they are 90-96% similar to some bacteria (below the 97% I set for --p-perc-identity). What would be the best way to deal with these sequences? Should I keep them when I calculate my alpha and beta diversity indices, or should I filter them out first?
Please see the attached pictures.
Typically, I like to remove any sequences that cannot be reliably assigned to the phylum level, while also removing chloroplast and mitochondrial sequences. We outline some of this approach here. Note: you may need to select Command Line (q2cli) from the drop-down menu at the top of the page to view the command-line version of this command.
What is the query coverage? This is important because you can have 100% identity over only 50% of the query sequence, which is not informative. Remember that BLAST (Basic Local Alignment Search Tool) shows the best local alignments, not global alignments.
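If it helps, here is a minimal sketch of how per-hit query coverage could be computed from BLAST tabular output. It assumes BLAST+ was run with a custom format such as `-outfmt "6 qseqid sseqid pident qstart qend qlen"`. The function name and the example row are hypothetical, and it only considers a single alignment (HSP) rather than merging several.

```python
def query_coverage(qstart, qend, qlen):
    """Percent of the query covered by one local alignment (a single HSP)."""
    aligned = abs(qend - qstart) + 1  # coordinates are 1-based and inclusive
    return 100.0 * aligned / qlen


# Hypothetical row with columns: qseqid, sseqid, pident, qstart, qend, qlen
row = "ASV_1\tsubject_1\t100.0\t1\t140\t280".split("\t")
pident = float(row[2])
coverage = query_coverage(int(row[3]), int(row[4]), int(row[5]))

# A perfect identity that only spans half of the query
print(f"identity={pident:.1f}% coverage={coverage:.1f}%")
```

A hit like this one would look perfect if you only read the identity column, which is why it is worth checking both numbers together.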
Also, make sure that you are not simply hitting other 'environmental' or 'uncultured' sequences, as they may be unreliable. BLAST has an option to ignore these.
Thank you so much for your reply. That is what I was doing in the past (removing any sequences that cannot be reliably assigned to the phylum level). But I thought this might underestimate the community diversity, as some of them are identified only as Bacteria with no further information or, in my case, are unassigned but highly similar to some bacterial species.
What is the query coverage?
They vary (99%, 100%, 97%, etc.).
I agree with you and I would delete any unidentified taxa that did not have phyla information (or were chloroplast or mitochondria).
Another question. Does
--p-exclude 'p__;,Chloroplast,Mitochondria' \
delete taxa such as "Cyanobacteriia; o__Chloroplast", or would it only delete taxa that have Chloroplast at the domain level? For example:
d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast; f__Chloroplast; g__Chloroplast
Any feature with "Chloroplast" anywhere in its taxonomy string will be removed. Remember, taxonomy also provides some information on evolutionary history. Chloroplasts evolved from cyanobacteria, just as mitochondria evolved from proteobacteria. I often worry when I see a feature that is only classified as "Cyanobacteria", as it is unclear whether it is really a cyanobacterium or plant material. Often I'll build a phylogeny with other plant and cyanobacterial sequences to confirm. But I think you are safe most of the time simply removing features that are mitochondria or chloroplasts.
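To make the matching behaviour concrete, here is a small sketch that mimics "contains" style exclusion on taxonomy strings. It is not QIIME 2's actual implementation. The helper name, the feature IDs, and all taxonomy strings except the chloroplast one quoted above are hypothetical, and the case-insensitive comparison is an assumption on my part.

```python
def is_excluded(taxonomy, exclude_terms):
    """Return True if any exclude term occurs anywhere in the taxonomy string."""
    lowered = taxonomy.lower()  # assumption: matching ignores case
    return any(term.lower() in lowered for term in exclude_terms)


# Same comma-separated terms as in the --p-exclude option above
exclude_terms = "p__;,Chloroplast,Mitochondria".split(",")

# Hypothetical feature IDs mapped to taxonomy strings
taxa = {
    "feat1": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; "
             "o__Chloroplast; f__Chloroplast; g__Chloroplast",
    "feat2": "d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
             "o__Rickettsiales; f__Mitochondria",
    "feat3": "d__Bacteria; p__; c__",          # empty phylum label
    "feat4": "d__Bacteria; p__Cyanobacteria",  # phylum only, stays in
    "feat5": "d__Bacteria; p__Firmicutes; c__Bacilli",
}

kept = [fid for fid, tax in taxa.items() if not is_excluded(tax, exclude_terms)]
print(kept)
```

Note that the feature classified only as Cyanobacteria survives this kind of filter, which is exactly the case I would check by hand.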