I’m analyzing human gut microbial diversity using MEGAHIT assemblies of shotgun metagenomic sequence data, but I’m having trouble identifying most of the assembled contigs. My workflow is as follows:
- Megahit Assembly
- Import into Q2
- Feature-Classification (classify-consensus-blast [--p-query-cov 0.05 --p-perc-identity 0.5 --p-evalue 10])
For ‘qiime feature-classifier classify-consensus-blast’ I’m using the “Silva 97% rep-set-all” qiime reference database.
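For anyone trying to reproduce the classification step, the call with the thresholds listed above can be sketched roughly as follows. This is only a sketch under assumptions: the `.qza` file names are placeholders, the helper `build_classify_command` is hypothetical, and newer QIIME 2 releases may require outputs beyond `--o-classification` (for example a search-results artifact).

```python
import shlex


def build_classify_command(query_qza, ref_reads_qza, ref_taxonomy_qza, out_qza,
                           query_cov=0.05, perc_identity=0.5, evalue=10):
    """Build the argument list for classify-consensus-blast.

    Defaults mirror the thresholds quoted in the workflow; all file names
    passed in are placeholders, not real paths.
    """
    return [
        "qiime", "feature-classifier", "classify-consensus-blast",
        "--i-query", query_qza,
        "--i-reference-reads", ref_reads_qza,
        "--i-reference-taxonomy", ref_taxonomy_qza,
        "--p-query-cov", str(query_cov),
        "--p-perc-identity", str(perc_identity),
        "--p-evalue", str(evalue),
        "--o-classification", out_qza,
    ]


# Hypothetical artifact names, for illustration only.
cmd = build_classify_command(
    "contigs.qza", "silva_97_rep_set.qza", "silva_97_taxonomy.qza", "taxonomy.qza"
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` in an environment where QIIME 2 is installed.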
I’ve experimented a lot with the query coverage, e-value, and percent-identity thresholds to try to maximize identification, yet I can’t seem to get more than ~1,300 out of 70,000 contigs identified. I also found that increasing the number of steps and the max k-mer size in MEGAHIT roughly doubled the number of identified contigs.
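For context, 1,300 out of 70,000 is roughly 1.9%. A minimal sketch of how that rate could be tallied from an exported taxonomy table is shown below. It assumes a tab-separated file with `Feature ID` and `Taxon` columns in which unclassified features are labelled `Unassigned`; the function name and the example rows are made up for illustration.

```python
import csv
import io


def classification_rate(tsv_text):
    """Return (assigned, total) counts from a taxonomy table given as TSV text.

    Assumes a header row containing a 'Taxon' column and that unclassified
    features carry the label 'Unassigned'.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    assigned = total = 0
    for row in reader:
        total += 1
        if row["Taxon"].strip() != "Unassigned":
            assigned += 1
    return assigned, total


# Made-up example rows; real contig IDs and taxa will differ.
example = (
    "Feature ID\tTaxon\tConsensus\n"
    "k141_1\tD_0__Bacteria;D_1__Firmicutes\t1.0\n"
    "k141_2\tUnassigned\t1.0\n"
    "k141_3\tUnassigned\t1.0\n"
)
assigned, total = classification_rate(example)
print(f"{assigned}/{total} contigs assigned ({100 * assigned / total:.1f}%)")
```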
After classification I took a handful of the unidentified contigs and ran them through BLAST manually (just to see what would happen). When I searched them against any of the reference databases (RefSeq), BLAST couldn’t find any matches, but when I searched them against the nucleotide collection (nr/nt) database, most got a good hit.
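One way to pull a small random subset of unidentified contigs out of the assembly FASTA for a manual BLAST search could look like the sketch below. It is not necessarily how the handful above was chosen: the function names, the contig IDs, and the sequences are all hypothetical, and the parser assumes well-formed FASTA headers.

```python
import random


def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}, using the first header token as the ID."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def sample_unassigned(fasta_text, unassigned_ids, n, seed=0):
    """Return FASTA text for up to n randomly chosen contigs whose IDs are unassigned."""
    records = read_fasta(fasta_text)
    candidates = sorted(rid for rid in unassigned_ids if rid in records)
    rng = random.Random(seed)  # fixed seed so the same subset is drawn each run
    chosen = rng.sample(candidates, min(n, len(candidates)))
    return "".join(f">{rid}\n{records[rid]}\n" for rid in chosen)


# Made-up contigs for illustration.
fasta = (
    ">k141_1 flag=1 len=8\nACGTACGT\n"
    ">k141_2 flag=1 len=8\nGGGG\nCCCC\n"
    ">k141_3\nTTTT\n"
)
subset = sample_unassigned(fasta, {"k141_2", "k141_3"}, n=5)
print(subset)
```

The resulting text can be saved to a file and submitted to a BLAST search against whichever database is of interest.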
So this leaves me with a few questions:
- Am I doing something wrong here?
- Is the relative amount of identified:unidentified contigs typical for a human gut sample?
- Because the “unidentified” contigs aren’t matching anything in the reference databases, is it safe to assume that they can just be filtered out and ignored?
Additionally, I’ve attempted to process the same samples with MetaPhlAn2 (it identified fewer than 100 species), shogun/nobunaga (couldn’t get it to work despite my best efforts), and MetaVelvet assemblies with q2/classify-consensus-blast (identified ~600/70,000).
Any thoughts/suggestions/help with this would be greatly appreciated.