I ran a 16S data set through DADA2 and SWARM, and both give me lots of kingdom-only / mitochondria / unassigned hits that are not very informative and likely non-target DNA. I therefore got interested in this previous post where Mehrbod_Estaki came up with a solution for removing those non-target reads from the analysis.
Here is my question: I don't really understand where you remove those seqs from ... from the reference database against which you then assign your remaining reads? Meaning that you remove them from the Greengenes / SILVA etc. reference database and then run the taxonomy assignment again based on that "reduced" Greengenes / SILVA database?
Or do you remove those seqs from the raw fastq files and re-run the whole shebang through DADA2? (That would seem a stretch.)
Can't you just remove them altogether from the final OTU table and not bother re-assigning? But what about the abundance estimates? Can they still be used "as if" the non-targets were not there? There is something I don't get about the removal solution, and I would like to understand it better.
For OTU tables, I use DADA2 and SWARM. I assign taxonomy to the seqs I get out of there using the qiime feature-classifier classify-sklearn plugin together with a primer-specific SILVA classifier.
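For context, my assignment step looks roughly like this (the artifact names are just placeholders for my own files):

```
# Assign taxonomy to the representative sequences with a pre-trained,
# primer-specific SILVA classifier (file names are placeholders).
qiime feature-classifier classify-sklearn \
  --i-classifier silva-primer-specific-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
```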
The positive filtering that I did in that post happens after DADA2. So once you have created your feature-table.qza and rep-seqs.qza (post DADA2/SWARM), you can use your rep-seqs.qza to filter out any reads that don't look anything like a real 16S sequence, based on a reference database of your choice (GG/SILVA, etc.). In your case, if you do have lots of host DNA, this should be able to get rid of it easily. Once you have this filtered-rep-seqs.qza file, you can use it to filter your unfiltered feature-table.qza and remove any reads that are not present in the filtered-rep-seqs.qza. Follow the example in that link.
I should note that in that linked example I unnecessarily used the 99% clustered Greengenes database as my reference database; you should use something like the 88% OTU database, which will be much faster and give the same result. I also used a 97% identity and 95% coverage parameter for searching against the reference reads, but in hindsight that was way too restrictive. I would now recommend using the same parameters that Deblur uses by default, which are 60% coverage and 65% identity. This should be quite fast and do a great job of getting rid of host DNA.
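Roughly, the two steps could look like the sketch below; the artifact names are placeholders, and it assumes your post-DADA2/SWARM data are available as feature-table.qza and rep-seqs.qza:

```
# Step 1: keep only rep-seqs that look like 16S by searching them against a
# reference database (e.g. the Greengenes 88% OTU reads, here gg_88_otus.qza
# as a placeholder), using Deblur-like permissive thresholds.
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences gg_88_otus.qza \
  --p-method vsearch \
  --p-perc-identity 0.65 \
  --p-perc-query-aligned 0.60 \
  --o-sequence-hits filtered-rep-seqs.qza \
  --o-sequence-misses discarded-rep-seqs.qza

# Step 2: drop the features that failed the search from the feature table.
qiime feature-table filter-features \
  --i-table feature-table.qza \
  --m-metadata-file discarded-rep-seqs.qza \
  --p-exclude-ids \
  --o-filtered-table filtered-feature-table.qza
```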
Thanks a lot Mehrbod!
Just that I get this right:
the feature-table.qza holds the OTU seqs coming out of DADA2 (incl. the unwanted host seqs);
and the rep-seqs.qza holds the unwanted host seqs?
Yes to the feature-table.qza, no to the rep-seqs.qza.
When you perform DADA2 in QIIME 2, it gives you a table of type FeatureTable[Frequency] (like the classic OTU table) and a set of representative sequences of type FeatureData[Sequence]. The rep-seqs.qza is essentially just the unique features from your feature table with their DNA sequences stored in them. We use these for alignment, filtering, tree building, etc., instead of the table. Your unwanted host contaminants are still in there; this is the file you use to filter against the GG database.
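For illustration, both artifacts come out of the same denoising call, roughly like this (the truncation length and file names are placeholders):

```
# DADA2 in QIIME 2 produces the feature table and the representative
# sequences in one call (pick --p-trunc-len from your own quality plots).
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len 250 \
  --o-table feature-table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```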
Ok, thanks again. I don't use DADA2 within QIIME ...
But basically, what you suggest is to remove any sequence from my OTU table that is not found in GG/SILVA (with some identity threshold), and that's it(?).
That brings me back to my original question: why not simply "ignore" OTUs that were assigned to host or left unassigned and only consider the ones that were assigned properly (all after the taxonomy assignment)? Wouldn't that amount to the same result?
Either I go your way: filter my rep-seqs.qza to remove sequences that are not found in GG, and THEN use this filtered rep-seqs file for the taxonomy assignment ...
OR I run the taxonomy assignment first, and THEN remove OTUs from my abundance table that were found to be host or unassigned?
Basically, yes. The criteria here are very permissive; we just want to get rid of sequences that look really odd compared to 16S, such as host sequences.
You could do this and you would probably be fine, but this isn't exactly the same as the above.
First, you won't get anything assigned to host, because these are not host databases. Second, you could technically have sequences that are 16S but go unassigned because they are not represented in the database. This is very rare in well-known communities, but in poorly characterized communities you risk tossing away true, rare or novel bacteria. You could also be getting unassigned values because of some classification error, for instance if you forgot to remove non-biological reads or used the wrong region of the database in your classifier. So it is useful, for troubleshooting purposes, to do a positive-control filter first; if you still get unassigned features, you can look into them more carefully to see why they are not being classified.
I think a lot of people do both, actually: use a positive filter, and then remove anything else that can't be classified to at least, say, phylum level (see the sketch below).
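If you go that route, the taxonomy-based cleanup after classification might look something like this (the artifact names are placeholders, and the p__ label assumes Greengenes-style rank prefixes; with SILVA the prefixes differ, so adjust --p-include accordingly):

```
# Keep only features classified to at least phylum level, and drop
# mitochondrial/chloroplast hits along the way (names are placeholders).
qiime taxa filter-table \
  --i-table filtered-feature-table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include p__ \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table final-feature-table.qza
```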
At the end of the day you might get very similar results, but the two approaches are technically different.
Play around and see which one fits your case better.
This discussion helped me very much. Of course, all the points you raised regarding hosts in the database, rare communities, unassigned OTUs, etc. are valid and important to consider.
This forum is a great place to see that such problems are not the exception and that it always takes individual cleaning and curation steps that are quite specific to a particular dataset/analysis.
I am already more confident in the steps I am taking towards my final data table.
Thanks a lot for your time and keep up the great support work here!