Generating and analyzing mock communities

Hi All,

I am currently conducting a study examining the effect of 16S primer choice and how it impacts the community profiles. For this analysis, we utilized a Zymo standard community of known composition. The analysis of the individual amplicon regions went well. However, as part of the analysis I wanted to compare the community profiles for each region to the theoretical "ideal community" so that I could run 'qiime quality-control evaluate-composition'. To do this, I downloaded the full-length 16S sequences from the product data sheet and generated an in-silico fastq file for the mock community (1x10^6 total reads and set all q-scores to 40). I took factors such as copy number and sequence variants into account.

After importing, I dereplicated the reads using 'qiime vsearch dereplicate-sequences'

zymo_mock_community_rep_seqs.qza (14.0 KB)
zymo_mock_community_dereplicated.qza (12.5 KB)

I used these as inputs for taxonomy calling and to generate a barplot. I trained a a naïve-Bayes classifier using the the formatted Silva_138 available on the data resources page. However, the resulting taxonomy file has everything classified as archaea. Very strange. Any suggestions on how to trouble shoot this would be appreciated. Thanks.

zymo_mock_community_taxa_barplot.qzv (363.7 KB)
zymo_mock_community_taxonomy.qza (73.9 KB)

Hi @micro_guy ,

Great questions.

You are starting the in silico analysis at the wrong point:

If you are trying to compare the observed vs. theoretical composition (as is the goal of evaluate-composition), then you should start with a feature table of the expected taxonomic composition (based on the taxonomic labels given in the chosen reference database that you are using for classification of your true observed sequences). To build this table you should take the list of species in your mock community (Zymo standard), find how these species are annotated in your reference database (SILVA 138), and create a table of their relative abundances in the mock community (example).

This tutorial is a bit old, but shows what this process should look like (though it does not show how the expected composition table should be constructed):

This is a more direct way to get from point A to point B. Synthesizing a FASTQ, only to dereplicate etc adds a few unnecessary steps, which might also somehow distort the actual expected abundances.

Good luck!


Hi @Nicholas_Bokulich, thanks for your response, it was very helpful. I generated the tab delimited frequency table and corresponding BIOM table which was imported.

zymo_mock_rel_abund.txt (1.0 KB)
zymo_mock_rel_abund.biom (1.6 KB)
zymo_mock_rel_abund.qza (7.1 KB)

I ran the following commands. It seemed to run fine until the final 'qiime quality-control evaluate-composition'. It generated a plugin error from quality-control: min() arg is an empty sequence. The full log output is below.

qiime taxa collapse
--i-table ~/OCS_16S_data/3_denoising/group1/Group1_V1V3_denoised.qza
--i-taxonomy ~/OCS_16S_data/4_taxonomic_analysis/group1/V1V3_taxonomy.qza
--p-level 7
--o-collapsed-table ~/OCS_16S_data/4_taxonomic_analysis/group1/Group1_V1V3_collapsed.qza

qiime feature-table relative-frequency
--o-relative-frequency-table ~/OCS_16S_data/4_taxonomic_analysis/group1/Group1_V1V3_collapsed_rel_freq.qza

qiime quality-control evaluate-composition
--i-expected-features ~/OCS_16S_data/zymo_community/zymo_mock_rel_abund.qza
--i-observed-features ~/OCS_16S_data/4_taxonomic_analysis/group1/Group1_V1V3_collapsed_rel_freq.qza
--o-visualization Group1_V1V3_quality_control.qzv

Group1_V1V3_collapsed.qza (100.3 KB)
Group1_V1V3_collapsed_rel_freq.qza (110.2 KB)

Traceback (most recent call last):
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/q2cli/", line 339, in call
results = action(**arguments)
File "", line 2, in evaluate_composition
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/qiime2/sdk/", line 234, in bound_callable
outputs = self.callable_executor(scope, callable_args,
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/qiime2/sdk/", line 443, in callable_executor
ret_val = self._callable(output_dir=temp_dir, **view_args)
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/q2_quality_control/", line 78, in evaluate_composition
results = _evaluate_composition(
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/q2_quality_control/", line 127, in _evaluate_composition
score_plot = _pointplot_multiple_y(
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/q2_quality_control/", line 281, in _pointplot_multiple_y
sns.pointplot(data=results, x=xval, y=score, ax=axes, color=color)
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/seaborn/", line 2839, in pointplot
plotter = _PointPlotter(x, y, hue, data, order, hue_order,
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/seaborn/", line 1603, in init
self.establish_colors(color, palette, 1)
File "/opt/miniconda/envs/qiime2/lib/python3.8/site-packages/seaborn/", line 707, in establish_colors
lum = min(light_vals) * .6
ValueError: min() arg is an empty sequence

Hi @micro_guy,
One quick suggestion - have you viewed summaries of those feature tables that you're creating (i.e., the result of qiime feature-table summarize)? That can be useful to let you know if something went wrong in one of the previous steps (e.g., you have no features or samples in the table, for some reason). That's usually my first step in debugging something like this.

1 Like

@gregcaporaso Thank you for the response. I have viewed the summaries of the feature tables and have included them with the response. I am having a bit of trouble interpreting the summaries of these tables to determine what the issue is. Any assistance would be greatly appreciated. Thanks.

Group1_V1V3_collapsed_rel_freq.qzv (486.2 KB)
zymo_mock_rel_abund.qzv (374.2 KB)
Group1_V1V3_collapsed.qzv (482.5 KB)

Hi @micro_guy ,
Thanks for sharing the QZVs. Your error message is implying that one of the data inputs to seaborn (for generating plots) is empty.

It looks like the issue is that your sample IDs do not match between the expected and observed tables. This is typical — as you have multiple standards in your observed table so each needs a unique ID. In this case, you should pass a metadata file that specifies which sample in the expected table corresponds to each sample in the observed table. From the help docs:

  --m-metadata-file METADATA
  --m-metadata-column COLUMN  MetadataColumn[Categorical]
                       Optional sample metadata that maps observed-features
                       sample IDs to expected-features sample IDs.  [optional]

It also looks like the input table that you are passing to evaluate-composition contains a mix of mock community standards as well as some other samples. If these other samples are not standards but real samples, you should remove them before running this action (as the comparison of real samples to a mock community would be meaningless).

Please give that a try and let us know!

Thank you for the feedback. It was just a case of RTFM. I didn't realize every sample in the observed feature table had to have a corresponding sample in the expected feature table. I just had one sample in the expected table. I went ahead and added all the samples to the expected feature table and it ran great. TDR and bray-curtis dissimilarity look great but the TAR and obs/expected look pretty poor. Thanks.

Group1_V1V3_quality-control.qzv (485.0 KB)

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.