High percentage of unassigned reads (metagenomics)

timanix · July 3, 2025, 7:03am

Hi all,

I am working on pig fecal samples that were sequenced with a shotgun approach (metagenomic samples).
My pipeline included:

QC
Host DNA removal
Import to Qiime2
Taxonomy annotation with Kraken2 (moshpit) vs "nt" database, confidence 0

As a result, I still have a lot of reads that are not assigned to any taxon. I suspect that a lot of DNA from feed, contamination, or residual host DNA is still not annotated. My questions:

Do you have experience with similar samples?
Is it normal to have such a high percentage of unassigned reads?
Are there any things I missed in the pipeline?

Example figure:

Best,
Timur

SoilRotifer · July 3, 2025, 1:44pm

Hi @timanix,

I also have some experience with feral pig shotgun metagenomics work. Sadly nothing came of it and it never saw the light of day. But I had similar issues...

I will say that when it comes to host removal, I find that being broader helps. For example, as we describe here, it is best to use a pangenome if one exists, or use a suite of related reference genomes. I often take the approach of removing as much mammalian DNA as I can. For example, when analyzing rat microbiome data, I removed reads that mapped to rat, mouse, pig, chimp, and human (taking the output from one filtering step and using it as input for the next filtering step, in that order), then proceeded to my analyses. You may want to look at the ingredients of the feed and also remove the associated genomes, like corn, wheat, etc. That will help you figure out what these unassigned reads are.
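
The chained filtering described above (the output of one step becomes the input of the next) could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes paired-end reads and Bowtie 2 as the mapper, which the post does not specify, and the function name, index prefixes, and file names are hypothetical placeholders. The sketch only builds the commands; it does not run them.

```python
from pathlib import Path


def build_host_removal_chain(r1, r2, indexes, outdir, threads=4):
    """Return one Bowtie 2 command per reference, chained in order.

    `indexes` is an ordered list of (label, bowtie2_index_prefix) pairs.
    Read pairs that fail to align concordantly to one reference
    (--un-conc-gz) become the input for the next reference.
    """
    commands = []
    outdir = Path(outdir)
    for label, index in indexes:
        # Bowtie 2 replaces '%' with 1 or 2 to name the per-mate files.
        pattern = outdir / f"no_{label}.%.fq.gz"
        commands.append([
            "bowtie2", "-p", str(threads),
            "-x", str(index),
            "-1", str(r1), "-2", str(r2),
            "--un-conc-gz", str(pattern),
            "-S", "/dev/null",  # the alignments themselves are not needed
        ])
        # The output of this step is the input of the next one.
        r1 = outdir / f"no_{label}.1.fq.gz"
        r2 = outdir / f"no_{label}.2.fq.gz"
    return commands, (r1, r2)


if __name__ == "__main__":
    # Hypothetical index prefixes, following the order mentioned above.
    refs = [("rat", "idx/rat"), ("mouse", "idx/mouse"), ("pig", "idx/pig")]
    cmds, final_reads = build_host_removal_chain(
        "sample_R1.fq.gz", "sample_R2.fq.gz", refs, "filtered"
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

Note that keeping pairs that fail to align concordantly is a fairly permissive filter; a stricter variant would drop any pair in which either mate maps to a reference.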

Assuming you are interested in microbes and not diet items, this should help quite a bit. If you are looking for diet items, which is what we were trying to do at the time... I could not remove all mammalian reads, as feral pigs are known to eat animals, and we wanted to detect those...

-Cheers!
-Mike

timanix · July 3, 2025, 1:59pm

Hi @SoilRotifer
Thank you for the reply!

Yes, I tried to remove host DNA by providing several genomes. Kraken2 with nt db still detected a small portion of reads assigned to pigs.
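
To see where the reads end up (unclassified, bacterial, residual pig), a Kraken2 report can be tallied at the top level. The sketch below assumes the report has been exported as a plain-text file in the standard six-column, tab-separated layout (percent, clade reads, direct reads, rank code, NCBI taxid, indented name); reports written with minimizer data have extra columns and would need different indices. The function name is hypothetical, and the demo numbers are made up purely for illustration.

```python
def summarise_report(report_lines, extra_names=()):
    """Tally unclassified reads, each domain, and any named taxa of interest.

    Returns a dict: name -> (clade_reads, percent_of_all_reads).
    Assumes the standard six-column Kraken2 report layout.
    """
    summary = {}
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        rank = fields[3].strip()
        name = fields[5].strip()  # names are indented to show depth
        # Rank code 'U' = unclassified, 'D' = domain.
        if rank in ("U", "D") or name in extra_names:
            summary[name] = (int(fields[1]), float(fields[0]))
    return summary


if __name__ == "__main__":
    # Made-up toy report, NOT real data from this thread.
    toy_report = [
        " 40.00\t400\t400\tU\t0\tunclassified",
        " 60.00\t600\t5\tR\t1\troot",
        " 50.00\t500\t10\tD\t2\t    Bacteria",
        "  1.00\t10\t10\tS\t9823\t                Sus scrofa",
    ]
    for name, (reads, pct) in summarise_report(toy_report, ("Sus scrofa",)).items():
        print(f"{name}\t{reads}\t{pct}")
```

Passing an open file handle instead of the toy list works the same way, since the function only iterates over lines.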

I ended up keeping only bacterial reads for diversity metrics and DA tests. I then constructed MAGs and obtained some strangely annotated MAGs, which are likely not of bacterial origin. So I removed all non-bacterial MAGs before functional annotations.

I had also already added genomes from the feed to the host removal step.

Sounds like an amazing investigation!
Luckily for me, I am primarily interested in the bacterial community.

The steps I took to deal with the issue:

Remove all non-bacterial reads before diversity metrics and DA tests
Remove non-bacterial MAGs before functional annotation
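
A minimal sketch of the "keep only bacteria" step, assuming each feature (or MAG) has a semicolon-separated lineage string whose first rank looks like `d__Bacteria`. The exact prefix depends on the database and tool, so the `d__` convention here is an assumption, and the function and variable names are hypothetical.

```python
def keep_bacteria(taxonomy, counts, domain_label="d__Bacteria"):
    """Keep only features whose lineage starts with the bacterial domain.

    taxonomy: dict feature_id -> lineage string, e.g. 'd__Bacteria;p__...'
    counts:   dict feature_id -> dict of sample_id -> read count
    Features without a lineage are treated as unassigned and dropped.
    """
    kept = {}
    for feature_id, per_sample in counts.items():
        lineage = taxonomy.get(feature_id, "Unassigned")
        domain = lineage.split(";")[0].strip()
        if domain == domain_label:
            kept[feature_id] = per_sample
    return kept


if __name__ == "__main__":
    # Made-up toy input, for illustration only.
    taxonomy = {
        "f1": "d__Bacteria;p__Bacillota",
        "f2": "d__Eukaryota;p__Chordata",
        "f3": "Unassigned",
    }
    counts = {"f1": {"s1": 10}, "f2": {"s1": 3}, "f3": {"s1": 7}}
    print(keep_bacteria(taxonomy, counts))
```

Filtering before computing diversity metrics means the remaining counts describe only the bacterial fraction, so relative abundances are no longer relative to all sequenced reads.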

Thank you again for answering this topic. It appears that it is normal to have a high proportion of unassigned reads in pig feces.

Best,
Timur

SoilRotifer · July 3, 2025, 2:12pm

It seems like you are already on the right track.

I was recently advised to also try using GTDBtk for the MAGs. My understanding is that Kraken is better suited to classifying short reads than MAGs. Perhaps others have experience with this phenomenon?

I am currently working on a dataset where the majority of my MAGs were unclassified with Kraken. I am currently validating the results with GTDBtk, by feeding the MAG fasta files to GTDBtk.

timanix · July 3, 2025, 2:17pm

Great!

I also heard the same!

It was on my to-do list... Please share with me if you get big differences. I am using Kraken2 for both reads and MAGs just to keep the same taxonomy, but I had the same idea to check.

SoilRotifer · July 3, 2025, 2:28pm

I just looked at my results. I was able to classify 257 MAGs with Kraken and 369 MAGs with GTDBtk (only including MAGs that have at least a phylum-level classification for Bacteria).
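
The counting rule above (Bacteria with at least a phylum-level assignment) and the size of the gap can be sketched like this. It assumes GTDB-style lineage strings such as `d__Bacteria;p__Bacillota;...`, where an unresolved rank appears as a bare prefix like `p__`; the function names are hypothetical, and only the two totals (257 and 369) come from the post.

```python
def has_bacterial_phylum(lineage):
    """True if the lineage is in Bacteria and has a non-empty phylum rank."""
    ranks = [rank.strip() for rank in lineage.split(";")]
    is_bacteria = "d__Bacteria" in ranks
    # 'p__' on its own means the phylum was not resolved.
    has_phylum = any(r.startswith("p__") and len(r) > 3 for r in ranks)
    return is_bacteria and has_phylum


def count_classified(lineages):
    """Count lineages that pass the phylum-level bacterial check."""
    return sum(has_bacterial_phylum(lineage) for lineage in lineages)


def relative_gain(baseline, other):
    """Percentage increase of `other` over `baseline`."""
    return (other - baseline) / baseline * 100


if __name__ == "__main__":
    kraken_n, gtdbtk_n = 257, 369  # totals reported above
    print(gtdbtk_n - kraken_n)                          # 112
    print(round(relative_gain(kraken_n, gtdbtk_n), 1))  # 43.6
```

In other words, GTDBtk classified 112 more MAGs to at least phylum level, roughly 44% more than Kraken on the same set.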

-Mike

timanix · July 3, 2025, 2:30pm

That is a striking difference!
Thank you for the update.