Metaphlan2 vs vsearch for fungal metagenomic shotgun sequences

haydenbs · October 19, 2018, 4:00pm

Hey,
I'm fairly new to bioinformatics and am working with a professor on fungal metagenomic shotgun sequences. I am trying to use these reads to find the composition of the fungal community, ideally classified to at least the family level. I have imported the paired-end data as fastq.gz files, one for the forward reads and one for the reverse reads.

As I've been reading through tutorials and thinking about it I have thought the two best approaches on Qiime2 would be:

1. A vsearch de novo or closed-reference approach, in which I would run (a) vsearch join-pairs, (b) quality-filter q-score-joined, (c) vsearch dereplicate-sequences, (d) vsearch cluster-features-de-novo or vsearch cluster-features-closed-reference (using the full, untrimmed UNITE database as reference), and (e) feature-classifier classify-sklearn with the complete, untrimmed UNITE fungal database.

or

2. After importing my data, use q2-metaphlan2 profile-paired-fastq. I believe that metaphlan2 covers fungal classification as well, but I may be wrong, which would make this approach not viable.

Which do you think would be best, if either? Or is there another approach you believe would be better?

Thanks,
Hayden

colinbrislawn · October 19, 2018, 5:47pm

Hello Hayden,

I can take a shot at answering this question!

Option 1: this looks like a traditional analysis for PCR amplicons, which is great... but it's not a good fit for your shotgun data. Dereplication and clustering make sense when you have a single region, but they are not designed to work with data from the full genome.
Option 2: this is a common method for working with shotgun reads, and that's perfect for your sequences. Do this!

I think the best person to answer this question is @fasnicar, who built the metaphlan2 plugin for QIIME2. But first, you should check out this tutorial he wrote for metaphlan2:
https://library.qiime2.org/plugins/q2-metaphlan2/12/

As you can see, metaphlan works directly on your imported fastq file, just like you described in option 2.

Colin

haydenbs · October 19, 2018, 7:50pm

Hey Colin,

Thanks for the answer! One follow-up question (I'm a bit naive here): even if the sequences haven't been amplified, if I take a de novo approach, create "OTUs" for my data, and then classify those sequences with a taxonomic classifier that hasn't been trimmed/extracted, would I not get fairly decent classification results? Or does it have to do with the way vsearch clusters the sequences into contigs (maybe assuming that they are ASVs)? I don't know much about any of this, which is why I'm asking, so thanks for taking the time to answer!

Hayden

colinbrislawn · October 19, 2018, 10:56pm

Hello Hayden,

And you are asking all the right questions!

I want to clarify why PCR amplicons are different than genomic reads, and why this difference means they need to be processed differently.

PCR amplicons (16S v4, 18S, ITS, etc) are all from the same region of the same gene.

read1 ------------
read2 --------------
read3 -------------

Shotgun reads (metagenomes, metatranscriptomes) are from ALL regions of ALL genes.

read1        -----------
read2 ------------
read3         ----------------

Because these reads are so very, very different, they are also processed differently.
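
A minimal sketch of that difference, assuming a toy random genome and error-free reads (the genome, read length, and function names here are made up for illustration, not taken from any QIIME 2 plugin). Reads that all start at the same position collapse to one sequence when dereplicated, while reads from random positions almost never do:

```python
import random
from collections import Counter


def simulate_reads(genome, n_reads, read_len, amplicon_start=None, seed=0):
    """Return n_reads substrings of genome.

    If amplicon_start is given, every read starts there (PCR amplicon).
    Otherwise each read starts at a random position (shotgun).
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        if amplicon_start is not None:
            start = amplicon_start
        else:
            start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


def dereplicate(reads):
    """Collapse identical reads, keeping a count for each unique sequence."""
    return Counter(reads)


# Toy random genome (hypothetical data, 5000 bp).
genome_rng = random.Random(42)
genome = "".join(genome_rng.choice("ACGT") for _ in range(5000))

amplicon_reads = simulate_reads(genome, 200, 100, amplicon_start=1000)
shotgun_reads = simulate_reads(genome, 200, 100)

amplicon_unique = dereplicate(amplicon_reads)
shotgun_unique = dereplicate(shotgun_reads)

print("unique amplicon sequences:", len(amplicon_unique))
print("unique shotgun sequences:", len(shotgun_unique))
```

Real amplicon data has sequencing errors and more than one organism, so you would see more than one unique sequence, but the contrast with shotgun reads stays the same.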

Amplicons get clustered into OTUs or denoised into ASVs. Each OTU or ASV represents a unique region of a single gene that was targeted for PCR. Each OTU or ASV is ~90-300 bp long.
Shotgun reads get assembled into much longer contigs. Each contig holds many genes and is ~1000-50,000 bp long. Totally different, right?

To answer your question about vsearch: nope, it does not build contigs. Vsearch clusters sequences into OTUs, dada2 denoises sequences into ASVs, and programs like MEGAHIT and SPAdes assemble sequences into contigs. Metaphlan2 does not assemble your reads, and it doesn't cluster your reads, and it doesn't denoise your reads.

So what does Metaphlan2 do?

Metaphlan2 "relies on ~1M unique clade-specific marker genes" for

unambiguous taxonomic assignments;
accurate estimation of organismal relative abundance;
species-level resolution for bacteria, archaea, eukaryotes and viruses;
strain identification and tracking
orders of magnitude speedups compared to existing methods.
metagenomic strain-level population genomics

(That's from the metaphlan2 documentation.)
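
To make the marker-gene idea concrete, here is a toy sketch. It is not Metaphlan2's actual implementation: the marker table, read counts, and the simple "mean coverage per clade" formula are all assumptions for illustration only.

```python
from statistics import mean

# Hypothetical markers: marker_id -> (clade, marker_length_bp).
markers = {
    "m1": ("clade_X", 1000),
    "m2": ("clade_X", 500),
    "m3": ("clade_Y", 800),
}

# Hypothetical number of reads mapped to each marker.
read_counts = {"m1": 40, "m2": 20, "m3": 8}


def relative_abundance(markers, read_counts):
    """Average length-normalized coverage per clade, then scale to sum to 1."""
    per_clade = {}
    for marker_id, (clade, length) in markers.items():
        coverage = read_counts.get(marker_id, 0) / length
        per_clade.setdefault(clade, []).append(coverage)
    clade_cov = {clade: mean(covs) for clade, covs in per_clade.items()}
    total = sum(clade_cov.values())
    if total == 0:
        return {}
    return {clade: cov / total for clade, cov in clade_cov.items()}


abundance = relative_abundance(markers, read_counts)
for clade, fraction in abundance.items():
    print(f"{clade}: {fraction:.2f}")
```

The point is only that reads are compared against a fixed set of clade-specific markers, so no clustering, denoising, or assembly is needed.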

Let me know if that helps. I think Metaphlan2 is the perfect Qiime 2 plugin for your metagenomic reads, so let us know what you find.

Have a good Friday,
Colin

haydenbs · October 20, 2018, 1:53am

Thanks Colin! That’s perfect!

fasnicar · October 22, 2018, 5:47pm

Many thanks @colinbrislawn for the answers and thanks @haydenbs for the interest in MetaPhlAn2!

However, the MetaPhlAn2 database is mainly for human metagenomics, so even though we have some Eukaryotes in the database, there are not many of them (compared to the bacterial markers). From the manual:

MetaPhlAn 2 relies on ~1M unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic)

So, my guess is that your results will depend heavily on the restricted number of fungal markers present in the database, and that you may end up underestimating diversity.
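
As a quick back-of-the-envelope check, the approximate genome counts quoted from the manual can be turned into shares of the reference set. Note these are genome counts, not marker counts, and not all of the eukaryotic genomes are fungal:

```python
# Approximate reference genome counts quoted from the MetaPhlAn 2 manual.
genomes = {
    "bacterial and archaeal": 13_500,
    "viral": 3_500,
    "eukaryotic": 110,
}

total = sum(genomes.values())
shares = {group: n / total for group, n in genomes.items()}

for group, share in shares.items():
    print(f"{group}: {share:.1%}")
```

Eukaryotes come out at well under 1% of the reference genomes.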

I think that in your specific case, if you can use a database of only fungal sequences and map your reads against it, your results will be much more accurate. Of course you should carefully check the mapping results to avoid false positives and tune the parameters of the mapping tool, but at least your results will not be limited by the size of the database.
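
A minimal sketch of the "check the mapping results" step, assuming the mapper's output has already been parsed into tuples of read, reference, percent identity, and aligned length. The hits, the reference-to-family lookup, and the 97% / 100 bp thresholds are all hypothetical and would need tuning for real data:

```python
from collections import Counter

# Hypothetical mapping hits: (read_id, reference_id, percent_identity, aligned_length).
hits = [
    ("read1", "refA", 99.0, 150),
    ("read2", "refA", 97.5, 148),
    ("read3", "refB", 82.0, 60),   # weak hit, likely a false positive
    ("read4", "refC", 98.2, 150),
    ("read1", "refC", 91.0, 120),  # secondary hit for read1
]

# Hypothetical reference -> family lookup.
ref_to_family = {"refA": "family_A", "refB": "family_B", "refC": "family_C"}


def family_profile(hits, ref_to_family, min_identity=97.0, min_length=100):
    """Keep each read's best hit that passes the thresholds, then tally by family."""
    best = {}
    for read_id, ref_id, identity, length in hits:
        if identity < min_identity or length < min_length:
            continue  # drop weak or short alignments
        if read_id not in best or identity > best[read_id][1]:
            best[read_id] = (ref_id, identity)
    counts = Counter(ref_to_family[ref_id] for ref_id, _ in best.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: n / total for family, n in counts.items()}


profile = family_profile(hits, ref_to_family)
for family, fraction in sorted(profile.items()):
    print(f"{family}: {fraction:.2f}")
```

Loosening the thresholds lets the weak refB hit through, which is exactly the kind of false positive to watch for.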

I hope this can help you.

Many thanks,
Francesco

Nicholas_Bokulich · October 22, 2018, 6:05pm

@haydenbs,
Since it sounds like metaphlan2 might not work so well for fungi, you could consider checking out q2-shogun.

This plugin is still in active development, and we do not yet have reference databases that you can use (you will need to make your own!), which is why I did not recommend it earlier.

You will need to make your own bowtie2 database, then import it into QIIME 2, and you should be ready to rock. The nobunaga method described in the tutorial at the link above will perform taxonomic profiling, which sounds like just what you need.

Good luck!

jcmcnch · November 20, 2018, 1:13am

Hi folks,

I am looking for a plugin in qiime2 that could help me classify SSU reads derived from metagenomes (i.e. not amplicons but random fragments of the SSU rRNA) and stumbled across this post. I was planning on using SILVA as my database, and was wondering if q2-shogun would be an appropriate tool for this job? Or does anyone have any other recommendations? Seems like there should be a tool for this purpose, I'm just probably not aware of it.

Thanks in advance for any advice you can offer!

Cheers,

Jesse

Nicholas_Bokulich · November 20, 2018, 1:47pm

Hi @jcmcnch,
You could try qiime quality-control exclude-seqs, which will align sequences against a reference database using blastn (e.g., you could align against full-length 16S). That might not be super efficient for shotgun metagenome sequences, though (I just don't know! But it was not optimized for large seq counts)— you may want to consider using something like SortMeRNA outside of QIIME 2 or another tool designed specifically for shotgun metagenome seqs, and then import to QIIME 2 for downstream steps.

q2-shogun is definitely not the tool for the job — it is intended for alignment against full genome sequences for shotgun seq classification.
