MOSHPIT: removing contaminant reads from metagenome data

I have some similar questions for anyone who can provide additional insight.

I've been following the MOSHPIT tutorial for performing shotgun metagenomic data analysis. The QC section titled "Host read removal" seems to describe two distinct steps (a general removal of contaminating reads and a removal of contaminating human host reads). Given that, here's what I'm wondering:

  1. First, is my understanding correct? That is, when working with human-host microbiome samples, are two distinct filtering steps typically recommended (i.e., one for removing general (non-human) contamination and one for removing human contamination)?
  2. If the answer to 1 is yes, which reference databases are typically recommended and where can they be accessed? The tutorial doesn't comment on this and (as far as I can tell) doesn't provide the reference_seqs.fasta reference file that is used in that part of the tutorial.

(Feel free to split this into its own topic if desired.) While on the subject of the Metagenomics Tutorial, I just wanted to mention that at least for me, the mosh fondue get-all... command in the Data retrieval portion took a couple of days to complete. It might be worth creating an even smaller example dataset (if possible) or else hosting the data directly to speed up access.

Hi @charlesalexandreroy,

I am working on a metagenomics project too. My understanding of the tutorial is that the host-removal step is necessary.

It depends on what type of organism your host is. If you are dealing with a non-human host, I would first remove its associated sequences (using the general contaminant-removal method with your host genome FASTA sequence), and then remove any operator-introduced human sequences (using the human-host-specific method).
If you believe that the only possible source of host-contaminating reads is human (because human is the actual host, or because many people handled the samples...), I would use only the human host read decontamination method.

Please keep in mind that the human host decontamination step downloads a few human genomes to create the pangenome reference, so it could take a few days to finish too!
I hope it helps
Luca

1 Like

Hi @charlesalexandreroy ,

Just to add to @llenzi 's response: these are different options for contaminant filtering. It is possible to use both, as @llenzi proposes, but they are just different options and it is not necessary to chain them together. To put it another way, option 1 is a generic approach for filtering any contaminant (i.e., it describes the steps to create a custom index for filtering contamination from specific genomes; this could be any type of host contamination, e.g., when working with non-human models, plants, etc., or environmental samples that may contain contamination from non-microbial organisms). Option 2 is a specific approach for filtering human reads by mapping to the human pangenome reference and will not work for filtering non-human reads (e.g., if working with another host organism).
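To make option 1 concrete, here is a minimal sketch using the q2-quality-control action names (the `mosh` wrapper in the tutorial may expose these under slightly different names and parameters, so treat the invocations below as illustrative and check the tutorial or the `--help` output for your version). Option 2 is the single `filter-reads-pangenome` action shown in the tutorial, so I won't repeat it here.

```bash
# Option 1 (generic contaminant filtering):
# 1) build a Bowtie2 index from whatever reference genome(s) you want to
#    filter against (host, suspected contaminants, etc.)
qiime quality-control bowtie2-build \
  --i-sequences host-reference.qza \
  --o-database host-bowtie2-index.qza

# 2) discard reads that map to that index and keep everything else
qiime quality-control filter-reads \
  --i-demultiplexed-sequences reads.qza \
  --i-database host-bowtie2-index.qza \
  --p-exclude-seqs \
  --o-filtered-sequences reads-no-host.qza
```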

2 Likes

Thanks so much @llenzi and @Nicholas_Bokulich, that's all very helpful!

Two quick follow-ups:

  1. For removing non-human contamination, the author of the original thread (from which this one was split) mentioned using Kraken as the reference database. To your knowledge, is that the gold standard, or are there better alternatives like Greengenes2, SILVA, etc.? (Note to anyone reading this: I might be making categorical mistakes with these suggestions - no idea if they are usable for this scenario.)
  2. I am working with human gut samples, and a cursory look at the literature indicates that DNA in consumed food can apparently survive digestion. In your work, have you seen DNA from non-microorganism, non-host sources before (e.g. if the host happened to eat a salad for their last meal :leafy_green::bell_pepper::carrot::cucumber::tomato::green_salad:)?

Hi @charlesalexandreroy ,

Kraken2 is a taxonomic classification method, not a database. So you are comparing :red_apple:s and :tangerine:s

Yes, I would say that the current best practice is to use multiple methods for contaminant removal — so you can use kraken2 for read classification (using an appropriate database) in addition to mapping to a relevant host reference.

Greengenes2 and SILVA are 16S rRNA gene databases, NOT metagenome databases, so they are not suitable for classifying metagenome reads or detecting contaminants in metagenome datasets.

Yes, absolutely. There is a whole field of study on leveraging these data, mostly in animals but also in humans, e.g., see:

https://doi.org/10.1038/s42255-025-01220-1

But you bring this up in the context of contamination, not because you want to see what was in that salad. This is indeed an issue for contaminant removal and why using multiple approaches (like read mapping to a reference + taxonomic classification of reads, and then discarding off-target hits) is current best practice... otherwise salad-derived reads might pass downstream and wind up skewing things like diversity metrics, classification models, etc...
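If it helps to see what the classification half looks like outside of MOSHPIT, here is a minimal sketch with the standalone Kraken2 CLI (the database path and sample names are placeholders; MOSHPIT wraps Kraken2 in its own actions, so the interface you actually use may differ):

```bash
# Classify host-filtered reads against a prebuilt Kraken2 database; the
# per-read output and the report can then be used to spot and discard
# off-target hits (human, salad, ...) downstream.
kraken2 \
  --db /path/to/k2_standard \
  --threads 8 \
  --paired \
  --report sample.kreport \
  --output sample.kraken \
  sample_R1.fastq.gz sample_R2.fastq.gz
```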

I wrote a short perspective on this topic with @SoilRotifer last year... well, it was specifically on bioinformatics challenges like host read removal and other contaminants in cancer microbiome studies, but these are really quite general issues, so the concerns and advice we bring up there were written with exactly this type of thing in mind (e.g., failing to remove sequence reads from that salad and then mistakenly identifying them as some type of "biomarker"). That's a problem in any human (or really, any) study, not just in a cancer patient. Some of the references in this paper will be interesting reads for you as well, as we covered a few key recent papers on the topic of human read filtering.

5 Likes

Thanks again @Nicholas_Bokulich, this is great!

With your help, I think I identified the source of my confusion. The MOSHPIT tutorial talks about removing contaminating reads by mapping them to a "reference database". The term "database" made me think I needed some kind of specialized dataset, but as I now understand, it's just talking about reference genomes, which can be obtained from any of the usual sources. And then, the filter-reads-pangenome action is essentially just a special case (a convenience wrapper) of the filter-reads action where the reference genome is the human pangenome + the GRCh38 reference genome. If helpful, one other imprecision in the tutorial is the use of the term "mapping" when I think they mean "alignment" (e.g. see here).

The approach you outline, "mapping to a reference + taxonomic classification of reads, and then discarding off-target hits," makes good sense. There's probably better terminology for this, but I see it as essentially applying negative and positive filters to isolate the organisms you care about. That is, you align the reads to host and contaminant reference genomes to filter out reads that align to organisms you know you don't want (a sort of negative filtering), and then you classify what remains using classifiers trained on organisms you know you want and keep what gets classified (a sort of positive filtering).

For the "negative filtering" side of the equation, since it's typically impossible to know what has been consumed by the host, do you know if there exists a sort of pan-pangenome of common contaminants? Such a genome would probably be pretty unwieldy due to its size, but just curious if you're aware of something like this for use in metagenomics work?

1 Like

About this, I think one more explanation couldn't hurt.
To efficiently remove any unwanted reads or taxonomic assignments (regardless of their origin - salad or human), there are multiple approaches, which may be used by themselves or in combination:

  1. mapping (or, as you called it, alignment; same thing) the reads to a reference database/genome and removing them. This is what you called negative filtering. This part of the equation you got completely right.
  2. taxonomic classification of reads and discarding off-target hits. This is a separate approach, but it is often combined with approach 1. After you have removed reads with the previous approach, you assign a taxonomic origin to your remaining reads by mapping/aligning them to a reference database (containing the organisms you know and want, as you said). Depending on this reference database, you may still end up with taxonomic assignments outside of your scope. For instance, when you apply Kraken2 to your reads with the Standard DB, you can still get some reads assigned as human. That happens, firstly, because the Standard DB contains archaeal, bacterial, viral, plasmid, and human sequences, and secondly, because the algorithm Kraken2 (or any other classifier) uses to map/align reads differs from the one used in approach 1 (Bowtie2, BWA, or any other aligner).
     Here comes the "discarding off-target hits" part: next, you tell QIIME to remove any hits (in this case reads) that happen to be taxonomically classified as human, based on their taxonomic assignment. You can only remove them as hits from your feature table (see the sketch below). This second approach is also a kind of negative filter, as you might say. You see, with approach 2 you can now remove taxa that you didn't expect to end up in your table - provided, of course, that the database contains them. For instance, if you use a large database that includes plants, you will get taxa belonging to salads or vegetables. So the benefit of approach 2 is that you can remove any hits you don't want, and you now know which hits you have, whereas with approach 1 you have to do this blindly. That's why approach 1 is usually used to remove the obvious contaminants that you are absolutely sure will be there, such as human reads in a human gut or sputum sample.
     Altogether, the two approaches complement each other; neither is superior.
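To make the "discarding off-target hits" part concrete, a rough sketch at the feature-table level could look like the following (the taxon strings in `--p-exclude` are only examples; adjust them to whatever off-target assignments actually show up in your table, and note that this assumes you already have a feature table plus a taxonomy artifact):

```bash
# Drop features whose taxonomic assignment matches off-target taxa
# (e.g., residual human hits or plant/"salad" hits) from the feature table.
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude "Homo sapiens,Viridiplantae" \
  --o-filtered-table table-no-offtarget.qza
```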
2 Likes

Thanks for the input @Deyan_Donchev!

Given your outline, I suppose there are more or less three main filtering steps for contaminant removal:

  1. Use filter-reads and/or filter-reads-pangenome to remove expected contaminants (negative filter).
  2. Classify what remains; presumably, anything that can't be classified is discarded and everything else is kept (negative and positive filter).
  3. Of the classified reads, discard any off-target classifications which may or may not be expected but are known to be contaminants (negative filter).

After performing step 1 from the above list (to filter out reads from your host), in step 2, do you know whether it's better to use a Kraken 2 database that includes your host (as a sort of decoy, leading to step 3), or a database without your host, relying on failure to classify as a way to handle any remaining contaminant reads that got past step 1? I can't tell if you're recommending the use of a Kraken DB with the host or just mentioning this possibility as a thought exercise.

A few other notes:

  • To be clear (for anyone who skipped from the OP to here), the two steps you describe (now three :sweat_smile:) are different from the two steps I originally asked about (as described in the "Host read removal" section of the current iteration of the MOSHPIT tutorial). Those steps are combined in your step 1.
  • Include / Exclude, Retain / Remove, or Select / Reject are probably clearer terms than negative / positive filtering (or filtering in vs filtering out) - apologies for that.
  • Mapping and alignment aren't technically the same thing (see the link I included), but that distinction is probably overly pedantic and irrelevant to this discussion.

Hi @charlesalexandreroy,

Building your own Kraken2 database without your host, although possible, would be time- and resource-consuming, so I would say it is much more practical and common to work with one of the provided Kraken2 databases that include your host.

But it is an evaluation you have to do on a project-by-project basis, I suppose.

Cheers
Luca

1 Like

I've read the host-removal steps within the MOSHPIT tutorial again for more clarity.
You can basically differentiate between the two approaches as: 1. filter out reads that map to a reference (the host or another known contaminant), and 2. filter out hits/classifications from your feature table after classification.
The host-removal part of the MOSHPIT tutorial shows two variants of the same approach 1.
The "Removal of contaminating reads" section is, let's call it, the generic way. Here you import a reference genome (most often the host), create an indexed version of this genome (as this is the way mapping tools work), and map your reads against it for negative filtering. Here you can use any genome.
Then the "Human host reads" section explains the filter-reads-pangenome action, which is what you called a convenience wrapper. I have never tried the filter-reads-pangenome action, but from what I read, it's better, as it will use the human pangenome reference data instead of just GRCh38; I assume it will be more thorough. On the negative side, this is a utility action only for a human host (possibly the most common host/contamination in the research area), and it will probably take more time to finish, as "under the hood" it has to download something, create the index, and then perform the mapping and filtering. But on the positive side, you have three separate steps running together, and next time you have a similar set of samples and want to perform host read removal, you can use the previously created index from filter-reads-pangenome with the filter-reads action (which will save time).

The three steps that you describe are technically and logically correct, but I don't think anybody refers to the second step that way; in my head, that is something that comes by default. Let's say I want to classify mainly bacteria, I use the Standard DB (because I have this one, or our design requires it, or for the sake of comparison to other results), and I want to use Kraken2. This is an example where I do not want to classify reads against viruses and human, but some may still end up in my feature table. As @llenzi said, it's way faster to download a pre-compiled database (ready to use with Kraken2) than to go through the hassle myself. ALSO, here comes the 3rd filtering step you described: sometimes Kraken2 may classify reads as human, and you then have the option to remove them, something like a double protection.

You're now asking: if my classification database doesn't include human genomes, will the human-derived reads remain unclassified and be naturally filtered out (as in your second step), or is there a risk that they could be misclassified as something else?
I do not know to what extent this could be true, but I strongly believe that they will remain unclassified. Somebody correct me here or add more info. If you want only highly reliable classifications, you can increase Kraken2's confidence threshold, and this hypothetical scenario will be resolved.
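For example, a stricter run might look like this (0.5 is only an illustrative threshold; the database path and file names are placeholders):

```bash
# Rerun Kraken2 with a higher confidence threshold so ambiguous reads
# (e.g., residual host-derived fragments) stay unclassified instead of
# being misassigned to something else.
kraken2 \
  --db /path/to/k2_standard \
  --confidence 0.5 \
  --threads 8 \
  --paired \
  --report sample.conf05.kreport \
  --output sample.conf05.kraken \
  sample_R1.fastq.gz sample_R2.fastq.gz
```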
Also, I remember when I tried to build Kraken2 DBs myself. That was a nightmare: having to use only one core, running for days, getting errors from downloading from the wrong FTP link (I don't think they have even resolved that), and having to constantly redownload and rebuild databases to stay up to date. Ever since I discovered that website with regularly updated versions of the databases, I use them all the time. Sadly, there is no bacteria-only DB, so I do all the steps.
Excuse me for mixing up mapping and aligning.

1 Like

One more clarification before this topic closes. And by the way, is there a way to prevent a topic from closing after 30 days of inactivity?

Thanks @llenzi and @Deyan_Donchev for your responses.

To clarify, I wasn't proposing that I make my own Kraken2 database. Assuming you've done step 1 (using filter-reads and/or filter-reads-pangenome to remove expected host contaminants), I was just wondering whether, of the databases included here, it is better to choose one with your host or without your host. For example, if your host is human, is it better to use the Standard database, which includes human (plus RefSeq archaea, bacteria, viral, plasmid, and UniVec_Core), or, e.g., GTDB v226, which only includes bacteria and archaea?

This question is partly motivated by what Deyan was saying here, where reads can be classified as host in step 2 even if you’ve previously filtered out host reads in step 1.

These classifications are presumably very likely incorrect, so I was thinking that including the host might actually decrease accuracy. It's sort of like what happens if you try aligning your reads to the wrong reference genome: due to sequence similarities, you'll likely still get a small number of aligned reads. Deyan's suggestion to increase Kraken2's confidence threshold makes good sense, but I'm still curious whether anyone has looked into the effects of prefiltering and database choice on classification accuracy and precision, e.g., by using reads from known organisms and generating ROC curves and so on.

Yeah, we just need to post this in General, which I think is a pretty good category given the high-level discussion in this thread.

I have recategorized the topic :slight_smile:

4 Likes