Hi all - I am following the shotgun metagenomics workflow outlined here to analyze my demultiplexed whole-genome fungal data. I noticed that the shotgun analysis information I've found (in this workflow and elsewhere) does not include any quality control steps. Should shotgun analysis include any filtering/denoising that I'm missing? Or do these steps apply only to amplicon data (and if so, why)?
Also, is Kraken the only usable database for the shotgun workflow? Is Kraken appropriate for all metagenomic data (I am mostly dealing with fungal DNA)?
Thanks!
Welcome to the forums, Mae! :qiime2:
I can help answer this part:
This is a neat question because it goes all the way back to the difference in diversity (i.e. sequence entropy/duplication rate) between targeted and untargeted sequencing, and how that difference shaped the pipelines we built over the last 20 years.
Shotgun sequencing is untargeted by design, so lots of stuff is sequenced and reads are diverse (say, <10% duplicate reads). To make sense of all this highly diverse data, reads are matched against a database. Everything that does not match is discarded, which acts as a strong filter! The feature table we carry into downstream analysis contains database hits.
Amplicons are highly targeted by PCR, so the data is much less diverse, say with >90% duplicate reads! Many amplicon pipelines are designed to keep all the data and denoise it. The feature table we carry into downstream analysis contains real observed reads, which we later annotate with taxonomy from a database. Because reads are used directly, their quality is more important, thus the emphasis on filtering.
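To make the contrast concrete, here is a toy Python sketch of the two ways a feature table gets built. Everything in it is an assumption for illustration: the reads are made up, the function names are hypothetical, and the exact-match dictionary lookup is only a stand-in for a real classifier or aligner. It is not how any particular pipeline is implemented.

```python
from collections import Counter


def duplicate_rate(reads):
    """Fraction of reads that are exact copies of another read in the set."""
    if not reads:
        return 0.0
    return 1 - len(set(reads)) / len(reads)


def shotgun_feature_table(reads, reference):
    """Count reads per database hit; reads with no hit are silently dropped.

    `reference` maps sequence -> taxon label. Exact matching is a toy
    stand-in (assumption) for a real database classifier.
    """
    return Counter(reference[r] for r in reads if r in reference)


def amplicon_feature_table(reads):
    """Count every distinct observed sequence (plain dereplication only,
    no denoising). Features are the reads themselves, so read errors
    would show up directly as extra features."""
    return Counter(reads)


# Made-up toy data: diverse shotgun reads vs. highly duplicated amplicon reads.
shotgun_reads = ["ACGT", "TTGA", "GGCC", "ACGT", "CATG"]
amplicon_reads = ["AAAA"] * 6 + ["AAAT"] * 3 + ["CCCC"]
reference = {"ACGT": "Taxon A", "GGCC": "Taxon B"}

print(duplicate_rate(shotgun_reads), duplicate_rate(amplicon_reads))
print(shotgun_feature_table(shotgun_reads, reference))  # only database hits survive
print(amplicon_feature_table(amplicon_reads))           # every observed sequence is kept
```

In the shotgun-style table, the reads with no reference match simply vanish, which is the implicit filter. In the amplicon-style table nothing is thrown away, so a single-base error such as `AAAT` becomes its own feature unless it is filtered or denoised first.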
If you drop everything unassigned at the phylum level on a 16S project, that is a lot like the shotgun database filter.
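As a rough sketch of what dropping everything unassigned at the phylum level looks like on a feature table, here is a small Python example. The feature IDs, counts, and helper names are hypothetical, and the `k__...; p__...` lineage format with rank prefixes is an assumption; your taxonomy strings may look different.

```python
def has_phylum(lineage):
    """True if the lineage contains a non-empty phylum rank.

    Assumes semicolon-separated ranks with prefixes like 'p__Ascomycota'
    (an assumption about the taxonomy format).
    """
    ranks = [rank.strip() for rank in lineage.split(";")]
    return any(rank.startswith("p__") and len(rank) > 3 for rank in ranks)


def drop_unassigned_at_phylum(table, taxonomy):
    """Keep only features whose taxonomy resolves to at least a phylum.

    Features missing from `taxonomy` are treated as unassigned.
    """
    return {
        feature_id: count
        for feature_id, count in table.items()
        if has_phylum(taxonomy.get(feature_id, "Unassigned"))
    }


# Hypothetical feature counts and taxonomy assignments.
table = {"f1": 10, "f2": 4, "f3": 7, "f4": 2}
taxonomy = {
    "f1": "k__Fungi; p__Ascomycota; c__Sordariomycetes",
    "f2": "k__Fungi; p__",  # kingdom only, phylum rank is empty
    "f3": "Unassigned",
}

filtered = drop_unassigned_at_phylum(table, taxonomy)
print(filtered)
```

Only `f1` survives here: the features the database could not place are discarded, just as unmatched shotgun reads are.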
It's also historical: CD-HIT (2006) and UCLUST (2010) were popular and did clustering without a database, an approach that benefits from quality filtering.