Hey,
I’m fairly new to bioinformatics and am working with a professor on fungal shotgun metagenomic sequences. I am trying to use these reads to determine the composition of the fungal community, preferably down to at least family-level classification. I have imported the paired-end data as fastq.gz files, one for the forward reads and one for the reverse reads.
After reading through tutorials and thinking it over, I believe the two best approaches in QIIME 2 would be:
After importing my data, use q2-metaphlan2 profile-paired-fastq. I believe that metaphlan2 includes fungal classification as well, but I may be wrong, in which case this approach would not be viable.
Which do you think would be best, if either? Or is there another approach you believe would be better?
Option 1: this looks like a traditional analysis for PCR amplicons, which is great… but it’s not a good fit for your shotgun data. Dereplication and clustering make sense when you have a single region, but they are not designed to work with data from the full genome.
Option 2: this is a common method for working with shotgun reads, and that’s perfect for your sequences. Do this!
Thanks for the answer! One follow-up question (I’m a bit naive here): even if the sequences haven’t been amplified, if I take a de novo approach, create “OTUs” from my data, and then classify those sequences against a taxonomic classifier that hasn’t been trimmed/extracted, would I not get fairly decent classification results? Or is the problem the way vsearch clusters the sequences into contigs (maybe assuming that they are ASVs)? I don’t know much about any of this, which is why I’m asking, so thanks for taking the time to answer!
Because these reads are so very, very different, they are also processed differently.
Amplicons get clustered into OTUs or denoised into ASVs. Each OTU or ASV represents a unique region of a single gene that was targeted for PCR, and is ~90-300 bp long.
Shotgun reads get assembled into much longer contigs. Each contig holds many genes and is ~1000-50,000 bp long. Totally different, right?
Nope! Vsearch clusters sequences into OTUs, dada2 denoises sequences into ASVs, and programs like MEGAHIT and SPAdes assemble sequences into contigs. Metaphlan2 does not assemble your reads, and it doesn't cluster your reads, and it doesn't denoise your reads.
So what does Metaphlan2 do?
Metaphlan2 "relies on ~1M unique clade-specific marker genes" for
unambiguous taxonomic assignments;
accurate estimation of organismal relative abundance;
species-level resolution for bacteria, archaea, eukaryotes and viruses;
strain identification and tracking;
orders of magnitude speedups compared to existing methods.
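To make the contrast with clustering and assembly a bit more concrete, here is a toy Python sketch of the marker-based idea: reads are matched to clade-specific markers and the hits are tallied per clade. This is only an illustration, not Metaphlan2's actual implementation. The marker IDs, the family names, and the `profile` function are all made up for the example, and it assumes each read has already been aligned to at most one marker. The real tool does considerably more than count hits.

```python
from collections import Counter

# Hypothetical toy data: each marker ID is unique to exactly one clade.
MARKER_TO_CLADE = {
    "marker_A1": "Saccharomycetaceae",
    "marker_A2": "Saccharomycetaceae",
    "marker_B1": "Aspergillaceae",
}


def profile(read_hits, marker_to_clade):
    """Return the relative abundance per clade from per-read marker hits.

    read_hits holds one entry per read: the ID of the marker the read
    aligned to, or None if it hit no marker. Reads without a hit (and
    hits to unknown markers) are ignored.
    """
    counts = Counter(
        marker_to_clade[marker]
        for marker in read_hits
        if marker is not None and marker in marker_to_clade
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {clade: n / total for clade, n in counts.items()}


# Six toy reads: four hit a marker, two hit nothing.
hits = ["marker_A1", "marker_A2", None, "marker_B1", "marker_A1", None]
abundance = profile(hits, MARKER_TO_CLADE)
print(abundance)
```

Notice that nothing here clusters, denoises, or assembles the reads. Each read is simply asked "do you match a marker, and whose marker is it?"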
Many thanks @colinbrislawn for the answers and thanks @haydenbs for the interest in MetaPhlAn2!
However, the MetaPhlAn2 database is mainly for human metagenomics, so even though we have some Eukaryotes in the database, there are not many of them (compared to the bacterial markers). From the manual:
MetaPhlAn 2 relies on ~1M unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic)
So, my guess is that your results will depend heavily on the limited number of eukaryotic markers present in the database, and the diversity will potentially be underestimated.
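As a quick back-of-the-envelope check of how small the eukaryotic share is, the approximate genome counts quoted above can be turned into fractions. This is only arithmetic on the rounded numbers from the manual, and the variable names are mine. Keep in mind that fungi can be at most all of those ~110 eukaryotic genomes.

```python
# Approximate reference genome counts as quoted from the MetaPhlAn2 manual.
genomes = {
    "bacterial_and_archaeal": 13_500,
    "viral": 3_500,
    "eukaryotic": 110,
}

total = sum(genomes.values())
fractions = {group: count / total for group, count in genomes.items()}

# Share of the reference genomes that are eukaryotic, as a percentage.
eukaryotic_pct = round(100 * fractions["eukaryotic"], 2)
print(f"{total} genomes in total, {eukaryotic_pct}% eukaryotic")
```

The eukaryotic share comes out well under 1% of the reference genomes.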
I think that in your specific case, if you can use a database of only fungal sequences and map your reads against it, your results will be much more accurate. Of course, you should check the mapping results carefully to avoid false positives and tune the parameters of the mapping tool, but at least your results will not be limited by the size of the database.
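One simple way to check the mapping results is to keep only confidently mapped reads before counting hits per reference. Below is a minimal Python sketch, assuming your mapping tool writes standard SAM output. The function name, the example records, and the MAPQ cutoff of 30 are my own assumptions for illustration. MAPQ is scored differently by different aligners, so treat the threshold as something to tune rather than a recommendation.

```python
from collections import Counter


def confident_hits(sam_lines, min_mapq=30):
    """Yield (read_name, reference_name) for mapped reads with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM records have 11 mandatory fields
            continue
        read_name, flag, ref_name, mapq = (
            fields[0], int(fields[1]), fields[2], int(fields[4])
        )
        if flag & 0x4:  # FLAG bit 0x4 means the read is unmapped
            continue
        if mapq < min_mapq:  # drop ambiguous or low-confidence alignments
            continue
        yield read_name, ref_name


# Hypothetical SAM records, for illustration only.
seq, qual = "A" * 50, "I" * 50
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    f"read1\t0\tfungal_ref_1\t100\t42\t50M\t*\t0\t0\t{seq}\t{qual}",
    f"read2\t0\tfungal_ref_2\t200\t3\t50M\t*\t0\t0\t{seq}\t{qual}",
    f"read3\t4\t*\t0\t0\t*\t*\t0\t0\t{seq}\t{qual}",
]

hits = list(confident_hits(example_sam))
hits_per_reference = Counter(ref for _, ref in hits)
print(hits_per_reference)
```

In this toy input only read1 survives: read2 maps with a low MAPQ and read3 is unmapped.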
@haydenbs,
Since it sounds like metaphlan2 might not work so well for fungi, you could consider checking out q2-shogun.
This plugin is still in active development, and we do not yet have reference databases that you can use (you will need to make your own!), which is why I did not recommend it earlier.
You will need to make your own bowtie2 database, then import to QIIME 2 and you should be ready to rock. The nobunaga method described in the tutorial at the link above will perform taxonomic profiling, which sounds like just what you need.
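For the database step, here is a rough Python sketch that puts together a `bowtie2-build` call for a FASTA file of fungal reference genomes. The file names and the helper functions are hypothetical, and it only covers building the index itself. q2-shogun may expect additional files or a particular layout around the index, and the QIIME 2 import step is not shown, so please follow the plugin's tutorial for those parts.

```python
import subprocess


def bowtie2_build_command(fasta_path, index_prefix, threads=1):
    """Return the argument list for a bowtie2-build call."""
    return ["bowtie2-build", "--threads", str(threads), fasta_path, index_prefix]


def build_index(fasta_path, index_prefix, threads=1, dry_run=True):
    """Build a bowtie2 index, or just return the command when dry_run is True."""
    cmd = bowtie2_build_command(fasta_path, index_prefix, threads)
    if not dry_run:
        # Requires bowtie2-build on your PATH; raises if the command fails.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_index("fungal_genomes.fna", "fungal_db/fungi", threads=4)
print(" ".join(cmd))
```

With `dry_run=True` (the default here) nothing is executed, so you can inspect the command first and switch to `dry_run=False` once the paths are right.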
I am looking for a plugin in qiime2 that could help me classify SSU reads derived from metagenomes (i.e. not amplicons but random fragments of the SSU rRNA) and stumbled across this post. I was planning on using SILVA as my database, and was wondering if q2-shogun would be an appropriate tool for this job? Or does anyone have any other recommendations? Seems like there should be a tool for this purpose, I’m just probably not aware of it.
Hi @jcmcnch,
You could try qiime quality-control exclude-seqs, which will align sequences against a reference database using blastn (e.g., you could align against full-length 16S). That might not be super efficient for shotgun metagenome sequences, though (I just don’t know, but it was not optimized for large sequence counts). You may want to consider using something like SortMeRNA outside of QIIME 2, or another tool designed specifically for shotgun metagenome sequences, and then import the results to QIIME 2 for downstream steps.
q2-shogun is definitely not the tool for the job: it is intended for alignment against full genome sequences for shotgun sequence classification.