Should I filter out my mock community?

Hi everyone,

Tried searching the forum but I couldn’t find anyone having asked this before! I have positive (mock community) and negative (water put through whole lab protocol) controls alongside my 160 “real” samples.

  • The mock community is massively different from my actual samples
  • The -ve control resembles my actual samples almost perfectly in composition, just with a tiny fraction of the reads - leading me to believe that this is just cross-indexing from the sequencer itself. It doesn’t appear in my results because - I assume - it’s been dropped from the core-metrics-phylogenetic analysis because of the sampling depth parameter. So I’m thinking I should maybe just remove it anyway.

I’ve already done most of my analysis but now I have to go back and start again anyway, removing a few of the samples that shouldn’t have been in there in the first place. I’m wondering if I should also remove the mock community before I generate diversity measurements and use gneiss and ANCOM. I know that for most (all?) of the statistical tests (whose results you see on Qiime 2 View), it will be excluded anyway because of missing metadata - i.e. it doesn’t have an associated age/sex/pen etc., but I notice that the mock community still shows up in my Emperor plots.

Basically, I’m wondering if the mere presence of the mock community could skew my diversity measurements and/or affect clustering on the PCoA plots. And is there any other reason why it might be a good idea to remove both controls for some downstream analysis that I haven’t done yet?

Thank you in advance for your help!

Hi @xchromosome,
That is unfortunate about your controls, but it is great that you included them in your analysis. There are lots of discussions on the forum regarding controls, so I won’t go down that path here; feel free to browse around for those.
A couple of notes: instead of going all the way back to re-analyze your data after filtering, you could simply edit your metadata file so that it lists only the samples you want and use that, which might save you a few steps. You can also simply hide the samples you don’t want in Emperor, though I find my plots are generally cleaner when I run them without extra samples.

If they have an ID in the metadata file they will show up.

Yes and no. Yes, in that the plot will look a bit different because it has to zoom and center around additional points that are likely very different. No, in that the pairwise distances between your other samples will not change, since each distance depends only on the two samples being compared. That means tests like PERMANOVA should not change, as long as your mock community is not assigned to a group category that is being compared…
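To make the "distances only depend on the pair" point concrete, here is a minimal sketch in plain Python. It uses Bray-Curtis as an example metric; the sample IDs and counts are made up for illustration, and this is not how QIIME 2 computes its matrices internally.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def distance_matrix(table):
    """Return {(id1, id2): distance} for every unordered pair of samples."""
    ids = sorted(table)
    return {
        (a, b): bray_curtis(table[a], table[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }


# Hypothetical counts for three features in two real samples.
real = {"chick-1": [10, 5, 0], "chick-2": [8, 7, 0]}
# The same table plus a mock community made of a feature the real samples lack.
with_mock = dict(real, mock=[0, 0, 50])

dm_real = distance_matrix(real)
dm_all = distance_matrix(with_mock)

# The chick-1 vs chick-2 distance is the same with or without the mock sample.
unchanged = dm_real[("chick-1", "chick-2")] == dm_all[("chick-1", "chick-2")]
print(unchanged)
```

The ordination itself is a different matter: PCoA axes are computed from the whole matrix, so the coordinates drawn in the plot can shift when a very distant sample is added, even though the underlying pairwise distances do not.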

Not really, as long as you are not including them in your tests, though I think one place where they may be influential is in building a de novo tree. If the unique features in your mock community are drastically different, they may produce some artificially long branches. My phylogenetics is not great, so someone else will have to confirm this.
Personally, once I know which samples I need for a project, I tend to filter out the rest and keep only those. This makes sharing the data a lot less confusing as well.



Thanks for taking the time to reply. That’s reassured me, apart from this part!..

First of all I should clarify just in case I hadn’t made it clear before, when I said my mock community was massively different to my actual samples, I meant it’s different to my other 160 “real” samples. I didn’t mean that its true community structure was poorly identified/assigned. The Qiime 2 pipeline actually did a pretty good job of elucidating its true composition! The mock community I used was the BEI resources HM-782D which has DNA from a really varied mix of bacteria - Neisseria meningitidis, Deinococcus radiodurans, Helicobacter pylori,… all the famous ones :grin: but not at all what you’d expect to find in the chicken caecum (what my real samples are)! So I was wondering about my tree and if these extra branches would impact on my results, and if so, how bad that would be. Is there any way of tagging a phylogenetic-y person to ask about how this might affect things?

I actually missed that you’d replied and powered ahead with re-doing my analysis anyway over the weekend while I was waiting (or so I thought!) so the thought of doing it all again makes me feel ill. However if it needs done then so be it :sob:



Hi @xchromosome,
Thanks for the clarification. The fact that the mock community is so distinct from your samples was actually the reason I brought up my concern, which stems from the case presented in the fragment-insertion paper/tutorial. There you can see that some outgroup taxa COULD potentially distort the overall tree and thus the clustering. After an internal discussion, we think this probably won’t be an issue here, but it would still be safer to remove the mock community and rebuild your tree. Your statistical tests shouldn’t be influenced by this, since in those you will only be comparing your groups and not the mock community; in any case, only tests with a phylogenetic component would be affected.
In short, it would be safer to remove the mock community before tree building, but if time and resources are limited you’ll probably be OK anyway. If you do happen to do it both ways, feel free to share the results with us; it will certainly help us with future inquiries like this!
Good luck!



Thanks for the advice and the link - that was really interesting! I think I will remove the mock community and negative control and start again. I’ll post the results on here so you can see the difference!

So just to check I’m doing this right… I want to re-run core-metrics-phylogenetic, so for that I need a rooted tree, which I need to rebuild, and for that I need my masked-aligned-representative-sequences.qza, which I get as output from Deblur… so I need to re-run Deblur, is that right? And make a new metadata file without the controls in it? How do I actually remove the control samples? While I’m doing that, I should maybe also remove the 6 samples that I don’t want to include in my analysis too (due to contamination on the farm). I had already removed them from my table before running alpha/beta diversity measurements but maybe I’d be better off just removing them completely so they don’t interfere with the tree either.

Is there any reason why I would have to assign taxonomy to my reads again? I’m assuming no, but just checking! I actually think I might have to anyway for a different reason but I’ll post a new topic on that!

Hi @xchromosome,

Not needed; you don’t have to re-run Deblur. You can take your existing feature table and the rep-seqs.qza file that you got from Deblur or DADA2 and work from there.

  1. Filter your feature table down to the samples you want using the filter-samples method of the feature-table plugin.
  2. Use the new filtered table to remove the now-unused sequences from your rep-seqs.qza file with the filter-seqs method of the same plugin.

Then you can build your new tree from this filtered rep-seqs file. You can also use the fragment-insertion plugin for tree building.
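The two filtering steps and the tree rebuild described above could look roughly like the commands assembled below. This is a dry-run sketch: it only prints the command lines and does not call QIIME 2. All file names are hypothetical, `samples-to-keep.tsv` is assumed to be a metadata file listing only the samples you want to retain, and the flag names are those of recent QIIME 2 releases, so check `--help` for your installed version. `shlex.join` needs Python 3.8 or newer.

```python
import shlex


def filter_samples_cmd(table, keep_metadata, out_table):
    """argv that keeps only the samples listed in keep_metadata."""
    return [
        "qiime", "feature-table", "filter-samples",
        "--i-table", table,
        "--m-metadata-file", keep_metadata,
        "--o-filtered-table", out_table,
    ]


def filter_seqs_cmd(rep_seqs, filtered_table, out_seqs):
    """argv that keeps only sequences whose features remain in the table."""
    return [
        "qiime", "feature-table", "filter-seqs",
        "--i-data", rep_seqs,
        "--i-table", filtered_table,
        "--o-filtered-data", out_seqs,
    ]


def build_tree_cmd(rep_seqs):
    """argv for a de novo tree (MAFFT alignment, masking, FastTree, rooting)."""
    return [
        "qiime", "phylogeny", "align-to-tree-mafft-fasttree",
        "--i-sequences", rep_seqs,
        "--o-alignment", "aligned-rep-seqs.qza",
        "--o-masked-alignment", "masked-aligned-rep-seqs.qza",
        "--o-tree", "unrooted-tree.qza",
        "--o-rooted-tree", "rooted-tree.qza",
    ]


# Hypothetical file names; replace with your own artifacts.
commands = [
    filter_samples_cmd("table.qza", "samples-to-keep.tsv", "filtered-table.qza"),
    filter_seqs_cmd("rep-seqs.qza", "filtered-table.qza", "filtered-rep-seqs.qza"),
    build_tree_cmd("filtered-rep-seqs.qza"),
]

# Print each command for review instead of executing it.
for cmd in commands:
    print(shlex.join(cmd))
```

The resulting rooted tree and filtered table would then be the inputs to core-metrics-phylogenetic.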

In the future, if you just want to exclude some samples from your feature table, you can either a) use the filtering options I mentioned above or b) simply make a new metadata file with those samples removed. In that case, any downstream tests you run will draw only from the samples listed there and ignore the rest. Some tests even let you choose specific groups within your metadata file.
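For the metadata-file route, a small helper like the one below can write out a copy of the metadata without the controls. It is only a sketch: the sample IDs, column names, and the presence of a `#q2:types` row are assumptions about your file, so adjust them to match your own metadata.

```python
import csv
import io


def drop_samples(metadata_text, exclude_ids):
    """Return TSV text without rows whose first column is in exclude_ids.

    The header row and any '#q2:' directive row are always kept.
    """
    reader = csv.reader(io.StringIO(metadata_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for i, row in enumerate(reader):
        if not row:
            continue  # skip blank lines
        is_header = i == 0
        is_directive = row[0].startswith("#q2:")
        if is_header or is_directive or row[0] not in exclude_ids:
            writer.writerow(row)
    return out.getvalue()


# Hypothetical metadata with two real samples and two controls.
metadata = (
    "sample-id\tpen\tsex\n"
    "#q2:types\tcategorical\tcategorical\n"
    "chick-1\tA\tF\n"
    "chick-2\tB\tM\n"
    "mock\t\t\n"
    "neg-ctrl\t\t\n"
)

filtered = drop_samples(metadata, {"mock", "neg-ctrl"})
print(filtered)
```

In practice you would read the text from your metadata file and write `filtered` to a new file, keeping the original untouched.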

As long as you haven’t rerun your denoising step with new parameters, the taxonomy you already have should be fine. The taxonomy artifact will simply contain some extra taxa that won’t be looked up once you’ve filtered your table. This is one bonus of keeping taxonomy independent of your table.
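As a toy illustration of why the old taxonomy still works: taxonomy behaves like a lookup keyed by feature ID, so entries for features that are no longer in the table are simply never consulted. The feature IDs, counts, and labels below are invented for the example and are not taken from any real dataset.

```python
# Taxonomy assigned once, before any samples were filtered out.
taxonomy = {
    "feat-1": "p__Firmicutes",
    "feat-2": "p__Bacteroidetes",
    "feat-3": "p__Deinococcus-Thermus",  # only ever observed in the mock sample
}

# Feature table after the controls have been removed (sample -> feature -> count).
filtered_table = {
    "chick-1": {"feat-1": 10, "feat-2": 5},
    "chick-2": {"feat-1": 8, "feat-2": 7},
}


def taxa_in_table(table, taxonomy):
    """Look up taxonomy only for features with a non-zero count in the table."""
    features = {
        feature
        for counts in table.values()
        for feature, n in counts.items()
        if n > 0
    }
    return {feature: taxonomy[feature] for feature in sorted(features)}


used = taxa_in_table(filtered_table, taxonomy)
print(used)
```

The reverse does not hold: if denoising is rerun and produces new feature IDs, those IDs would be missing from the old taxonomy and the lookup would fail, which is why taxonomy has to be reassigned in that case.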

Keep us posted!

