I ran the dada2 pipeline in QIIME2 and assigned taxonomy using SILVA on 16S water samples. In phyloseq, I was looking for the most abundant taxa in the dataset and realized that the first one corresponded to this particular ASV xx: D_0__Bacteria. When blasting this ASV, I found "uncultured bacterium clone". I was wondering what to do with it. Should I filter it or keep it?
Thank you very much in advance for your advice,
Leïla
Hi!
Maybe it is not the best option, but I prefer to delete all ASVs that are assigned only to D_0__Bacteria, since it is unclear to me what to do with them. I am afraid of the misinterpretations these reads can cause, since they could be a bunch of completely unrelated bacteria. I would be glad to hear different ways to deal with them.
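For anyone who wants to try this kind of filter outside of QIIME 2, here is a minimal sketch in plain Python. It assumes the taxonomy has already been exported as semicolon-delimited SILVA-style strings; the feature IDs and the helper names `taxonomy_depth` and `split_domain_only` are made up for illustration and are not part of any QIIME 2 or phyloseq API.

```python
from typing import Dict, Tuple


def taxonomy_depth(taxon: str) -> int:
    """Count the named ranks in a semicolon-delimited taxonomy string."""
    ranks = [rank.strip() for rank in taxon.split(";")]
    # A rank such as "D_1__" has a prefix but no name, so it is not counted.
    return sum(1 for rank in ranks if rank and not rank.endswith("__"))


def split_domain_only(
    taxonomy: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate features with at most one named rank from the rest."""
    kept: Dict[str, str] = {}
    flagged: Dict[str, str] = {}
    for feature_id, taxon in taxonomy.items():
        target = flagged if taxonomy_depth(taxon) <= 1 else kept
        target[feature_id] = taxon
    return kept, flagged


# Hypothetical feature IDs and assignments, for illustration only.
taxonomy = {
    "asv_1": "D_0__Bacteria",
    "asv_2": "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria",
    "asv_3": "D_0__Bacteria;D_1__",
    "asv_4": "Unassigned",
}

kept, flagged = split_domain_only(taxonomy)
print("flagged for review:", sorted(flagged))
print("kept:", sorted(kept))
```

Returning the flagged features instead of silently dropping them makes it easier to look at their abundance and BLAST hits before deciding what to do.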
I tend to filter out things that only have bacterial designations, especially if you're using Silva. I think, though, like so many things in microbiome research, it's a judgement call. The questions I'd ask myself, though, are things like:
What's your confidence that this is a real organism vs a chimera? Are you willing to accept the possibility of a chimera to retain the organism?
Given the environment you're working in, how likely is it that you're seeing a truly novel taxon compared to the probability you're seeing noise? (I work in a well defined environment, so I tend toward "noise" but it may be different for your water samples)
How abundant are they? How uniform? Are they a high or low percentage of your sample?
What will you do with them analytically? How will you describe them in your paper/analysis/to your collaborators?
In my case, I think I would filter before going forward, describe the filtering, and continue with my day. But, others who work in other environments may have really different perspectives.
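The abundance and uniformity questions above can be answered with a quick summary per feature. This is only a sketch in plain Python: it assumes the feature table has been exported to a nested dictionary of raw counts (sample, then feature), and the sample names, feature IDs, and the `summarize_feature` helper are hypothetical.

```python
from typing import Dict


def summarize_feature(
    counts: Dict[str, Dict[str, int]], feature_id: str
) -> Dict[str, float]:
    """Summarize the relative abundance of one feature across samples."""
    if not counts:
        raise ValueError("the count table has no samples")
    rel = {}
    for sample, features in counts.items():
        total = sum(features.values())
        # Samples with no reads at all contribute a relative abundance of 0.
        rel[sample] = features.get(feature_id, 0) / total if total else 0.0
    values = list(rel.values())
    return {
        # Fraction of samples in which the feature was observed at all.
        "prevalence": sum(1 for v in values if v > 0) / len(values),
        "mean_rel": sum(values) / len(values),
        "min_rel": min(values),
        "max_rel": max(values),
    }


# Hypothetical counts, for illustration only.
counts = {
    "sample_1": {"asv_x": 50, "asv_y": 50},
    "sample_2": {"asv_x": 25, "asv_y": 75},
    "sample_3": {"asv_y": 100},
}

summary = summarize_feature(counts, "asv_x")
print(summary)
```

A feature that is abundant in one sample and absent everywhere else tells a different story from one that is present at a steady level in every sample, so it helps to look at the spread and not only the mean.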
I'm not sure if I would filter it... At least not unless you can confidently answer these questions from Justine.
On one of my papers, we were working with a pretty strange environment, and our most common microbe wasn't getting classified down to the species or even genus level. And yet we kept it in because
It was the most common microbe during the first days of community formation and
We had metagenomics done already and were pretty sure of its taxonomy, even though SILVA was not finding it
If we had dropped it, our study would have looked very different!
I think it's interesting to talk about the microbes you identified, but it's also interesting to talk about an unknown microbe that was common and could later be identified by other researchers!
Hi,
If this is the most abundant taxon and corresponds to a particular ASV, then I would not filter it out. The question is what you are studying. If it is a comparative study, then it may be very relevant and important whether the given organism is present in every sample, and how its abundance changes in response to treatment, collection time, geographical location, etc.
One thing is the identification of the given organism (or ASV or OTU), and another thing is whether you can tell its correct taxonomic classification.
The fundamental notion and basis of metagenomics is that the majority of bacteria cannot be propagated or isolated and maintained in axenic cultures. Therefore, unknown taxa can always be expected.
On the other hand, dada2 is very rigorous in filtering out chimeras, so if dada2 has not discarded your ASV as a chimera, then there is good reason to believe that it is a correct sequence. Nevertheless, I do not really believe that we should rely on single-nucleotide resolution. If you pool or collapse the ASVs with a threshold of 1 or 3% difference (which almost takes you back to the old OTU world) and THEN do the taxonomic classification, you may well find some known taxa in the pool.
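To make the idea of collapsing at a 3% threshold concrete, here is a toy sketch of greedy, abundance-ordered clustering in plain Python. It is not how dada2 or any QIIME 2 plugin does it: it assumes the sequences are already aligned to the same length, uses a simple per-position identity, and the function names and example sequences are invented for illustration.

```python
from typing import Dict


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_collapse(
    abundances: Dict[str, int], threshold: float = 0.97
) -> Dict[str, int]:
    """Merge each sequence into the first centroid it matches at >= threshold.

    Sequences are visited from most to least abundant, so the most abundant
    member of each cluster becomes its centroid.
    """
    clusters: Dict[str, int] = {}
    ordered = sorted(abundances.items(), key=lambda kv: kv[1], reverse=True)
    for seq, count in ordered:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid] += count
                break
        else:
            # No centroid was close enough, so this sequence starts a cluster.
            clusters[seq] = count
    return clusters


# Toy 100 bp sequences, for illustration only.
base = "ACGT" * 25
one_mismatch = "T" + base[1:]        # 99% identical to base
five_mismatches = "TGCAT" + base[5:]  # 95% identical to base

abundances = {base: 100, one_mismatch: 10, five_mismatches: 5}
clusters = greedy_collapse(abundances, threshold=0.97)
print(len(clusters), "clusters; centroid counts:", list(clusters.values()))
```

With these toy inputs, the 99% variant is absorbed into the abundant centroid while the 95% variant stays separate. Real tools work on unaligned reads and handle gaps, so treat this only as an illustration of what the threshold means.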
I agree with @colinbrislawn and @Peter_Kos that it is worth examining the abundance and identity of this feature, following @jwdebelius's steps, before throwing it out. At the end of the day, though, if a sequence does not classify to at least phylum level, it is usually host DNA or other non-target DNA and should be thrown out after some investigation, as @jwdebelius and @timanix advise.
Check out the related topics on this forum: 95% of the time, either this is due to human error (e.g., using the wrong reference database) or the sequences BLAST to some sort of host or other non-target DNA (e.g., human or mouse DNA), caused by either primer mismatch or possibly index hopping (i.e., cross-contamination from genomic samples sequenced in the same lab that use the same index/barcode sequences). Since you have water samples, it could be other environmental DNA that is being mis-amplified.
Since this is the most abundant ASV, I suspect you may have used the wrong classifier. What primers are you using, and what classifier did you use? Did you train your own or use one of the QIIME 2 pre-trained classifiers?
The GenBank nt database contains a bit of junk. Use the "exclude uncultured" option to filter these out and there is a very good chance you will find that this blasts to a non-bacterial species. If you find some very close hits, this is a very good indication that this is a technical error. If you don't get any good hits, then you could consider whether this may actually represent a novel phylum... look very carefully at the alignment results and rule out the possibility that this may be a chimera or contain non-biological DNA (e.g., adapters) in the sequence.
This is true! Especially as @leila is looking at water samples. However, finding an entirely novel phylum is rather unlikely unless you are looking at a very under-characterized environment, so any time the classifier fails to reach phylum level you should be suspicious.
If the taxonomic classifier has trouble with the ASV, then collapsing into OTUs should have no effect whatsoever for sequences that do not classify at phylum level. Collapsing will lose some resolution and impact species-level classifications, but not basal-rank affiliation.
That sounds very reasonable.
If you compare the reads, and then take the similarity score as a measure of distance between two sequences, then this is a one dimensional problem.
What makes me slightly uncertain is that if you have a 500 bp amplicon, then the differences in the sequences can be randomly scattered. In practice this is therefore a 500-dimensional sequence space, which may make the distances (between a given read and the set of all database sequences) a bit more complicated... or maybe I am just overcomplicating things.
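A small toy example may make this point clearer. It is a sketch in plain Python with made-up 100 bp sequences (not real 16S reads), and `mismatch_positions` is a hypothetical helper that assumes the sequences are already aligned to the same length.

```python
from typing import List


def mismatch_positions(a: str, b: str) -> List[int]:
    """Return the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def identity(a: str, b: str) -> float:
    """Fraction of matching positions, derived from the mismatch list."""
    return 1 - len(mismatch_positions(a, b)) / len(a)


# Toy sequences, for illustration only.
reference = "ACGT" * 25
read_a = "TG" + reference[2:]    # two mismatches at the start
read_b = reference[:98] + "CA"   # two mismatches at the end

print("reference vs read_a:", mismatch_positions(reference, read_a))
print("reference vs read_b:", mismatch_positions(reference, read_b))
print("read_a vs read_b:", mismatch_positions(read_a, read_b))
print("identity read_a vs read_b:", round(identity(read_a, read_b), 2))
```

Both reads have the same similarity score to the reference, yet they differ from it at different positions and are further from each other than either is from the reference. A single score cannot show that, which is the sense in which the space has many dimensions.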