Large amount of unassigned sequences after classify-consensus-blast

Hi everyone,

I’m analyzing human gut microbial diversity using MegaHit assemblies of metagenomic shotgun sequence data, but I’m having trouble identifying most of the assemblies. My workflow is as follows:

  1. Megahit Assembly
  2. Import into Q2
  3. De-replicate
  4. Feature-Classification (classify-consensus-blast [--p-query-cov 0.05 --p-perc-identity 0.5 --p-evalue 10])

For ‘qiime feature-classifier classify-consensus-blast’ I’m using the “Silva 97% rep-set-all” qiime reference database.

I’ve played around with the query coverage, e-value, and percent identity a lot to try to maximize identification, yet I can’t seem to get more than ~1,300 out of 70,000 assemblies identified. I also found that increasing the number of steps and the maximum k-mer size in MegaHit roughly doubled the number of identified assemblies.

After classification I took a handful of the unidentified assemblies and ran them through BLAST manually (just to see what would happen). When I ran them against any of the reference databases (RefSeq), BLAST couldn’t find any matches, but when I ran them against the nucleotide collection database, most got a good hit.

So this leaves me with a few questions:

  1. Am I doing something wrong here?
  2. Is the relative amount of identified:unidentified contigs typical for a human gut sample?
  3. Because the “unidentified” reads aren’t mapping to any of the reference databases, is it safe to assume that they can just be filtered out and ignored?

Additionally, I’ve attempted to process the same samples with metaphlan2 (it identified fewer than 100 species), shogun/nobunaga (couldn’t get it to work despite my best efforts), and meta-velvet assemblies with q2/classify-consensus-blast (identified ~600/70,000).

Any thoughts/suggestions/help with this would be greatly appreciated.

Hello Nick,

I think this might be an easy fix!

Are you working with 16S/18S marker genes, or with shotgun genomic reads? Silva is only meant for 16S & 18S marker genes, so all your normal genomic or transcriptomic reads will not be in the database.

If you are working with shotgun data, try Kraken2. :octopus:
https://ccb.jhu.edu/software/kraken2/

Colin

Hi Colin,

Yes, I’m working with shotgun data. I gave Kraken2 a try and it worked perfectly! Thank you so much for the help, I really appreciate it!

If it’s not too much trouble, I have one more question about the Kraken2 output. Is there any way to take the output with the taxa ID and one-line sequence and import it into Qiime2 while maintaining the identifications? I ask because I’m trying to compare some enzymatically normalized 16S sequences that have been analyzed with Q2 against this un-normalized (control) shotgun data from the same samples, to determine the effectiveness of the normalization.

Nick

Hello Nick,

I’m glad Kraken is working for you!

I’m not sure of the best way to get this into Qiime… Maybe as feature table data? Check out these importing options and see if they work for you.
https://docs.qiime2.org/2019.10/tutorials/importing/#feature-table-data

Depending on your experience, you could try exporting your data from Qiime, then do your comparison directly using R or Python.
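
If you go the Python route, something like the following might be a starting point. It is only a minimal sketch, and it assumes Kraken2's default per-sequence output (tab-separated: a C/U flag, the sequence ID, the taxonomy ID, the sequence length, and the k-mer mapping) produced without `--use-names`. The function names and sample names are made up for illustration. Note that the counts are numbers of classified sequences (contigs, in your case), not read abundances.

```python
import csv
import sys
from collections import Counter


def count_taxids(kraken_lines):
    """Tally classified sequences per taxonomy ID.

    Assumes Kraken2's default per-sequence output, where column 1 is
    'C' (classified) or 'U' (unclassified) and column 3 is the taxonomy ID.
    """
    counts = Counter()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, taxid = fields[0], fields[2]
        if status == "C":
            counts[taxid] += 1
    return counts


def write_feature_table(sample_counts, handle):
    """Write a taxid-by-sample count table as tab-separated text.

    sample_counts maps a (hypothetical) sample name to its Counter.
    """
    samples = sorted(sample_counts)
    taxids = sorted({t for c in sample_counts.values() for t in c})
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["#OTU ID"] + samples)
    for taxid in taxids:
        writer.writerow([taxid] + [sample_counts[s].get(taxid, 0) for s in samples])


if __name__ == "__main__":
    # Tiny inline example; in practice each list would come from
    # reading one sample's Kraken2 output file line by line.
    sample_a = [
        "C\tcontig_1\t816\t500\t816:10",
        "U\tcontig_2\t0\t300\t0:5",
        "C\tcontig_3\t562\t700\t562:12",
    ]
    write_feature_table({"sampleA": count_taxids(sample_a)}, sys.stdout)
```

A table like this can be loaded straight into pandas for the comparison, or converted to BIOM (for example with the biom-format tools) before following the importing tutorial linked above.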

Colin