Questions about mock microbiome data sets

Hi QIIME2 support,

We want to use some ground-truth mock microbiome data sets to test our internal pipeline. The one we found is mockrobiota on GitHub, from your group. However, I ran into some issues when running it through QIIME2. For the mock-8 data set, I generated a qzv file, and it shows a “Some of the PHRED quality values are out of range” warning. I also hit some other errors when using those mock data sets.

But in general, do you think this data source is the best mock data set for testing with QIIME2? We would appreciate any more information about that.

Thanks,
Yunhu Wan

Hi @mentorwan,
Welcome back to the forum!

@Nicholas_Bokulich would obviously be the best person to answer this question for you, however as I believe he is away till next week, I’ll take a crack at this.
The description of this dataset on the mockrobiota GitHub page includes this note:

Note: These barcode reads contain golay barcodes, and the mapping barcodes need to be reverse-complemented to match the reads. Run in qiime-1 using the following command: split_libraries_fastq.py -i mock-forward-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq.gz --rev_comp_mapping_barcodes

The original QUAL scores for the index/barcode reads were not recovered, and thus mock-index-read.fastq.gz contains artificial index/barcode QUAL scores. QUAL scores from all other files are original.
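For anyone who wants to see what reverse-complementing the mapping barcodes means in practice, here is a minimal Python sketch. The barcode below is made up for illustration and is not taken from the mock-8 metadata.

```python
# Complement table for unambiguous DNA bases plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(barcode: str) -> str:
    """Return the reverse complement of a DNA barcode."""
    return barcode.translate(_COMPLEMENT)[::-1]


# Hypothetical 12-nt barcode, for illustration only.
mapping_barcode = "AGCTGACTAGTC"
print(reverse_complement(mapping_barcode))
```

Applying the function twice returns the original barcode, which is a quick sanity check on any barcode list.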

Sounds like maybe these artificial qual scores are the issue. You can either use a different community, change these quality scores to something else, or, perhaps even easier, use the older Phred score variant in your import command: PairedEndFastqManifestPhred64V2.

Regarding your other question about whether these are good data sets to use in qiime2, Nick can answer in more detail when he is back in the office.
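If it helps to check which encoding a file uses before importing, a rough heuristic is to look at the range of ASCII codes in the quality lines. The sketch below is not part of QIIME 2; the function names and the thresholds (59 and 74) are my own assumptions based on the usual Phred33 and Phred64 character ranges, and the guess can be ambiguous for narrow quality ranges.

```python
import gzip
from itertools import islice
from typing import Iterable, Optional


def quality_lines(fastq_gz_path: str, max_records: int = 10000) -> Iterable[str]:
    """Yield the quality line (every 4th line) of the first max_records records."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        for i, line in enumerate(islice(handle, 4 * max_records)):
            if i % 4 == 3:
                yield line.rstrip("\n")


def guess_phred_offset(qualities: Iterable[str]) -> Optional[int]:
    """Guess the offset (33 or 64) from the ASCII codes seen; None if ambiguous."""
    codes = [ord(ch) for q in qualities for ch in q]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:   # assumed: codes this low do not occur with a +64 offset
        return 33
    if highest > 74:  # assumed: codes above 'J' are not expected in Phred33 Illumina data
        return 64
    return None       # range fits both encodings; cannot tell


# Toy quality strings; on a real file, pass quality_lines("some-reads.fastq.gz").
print(guess_phred_offset(["IIII#5"]))
print(guess_phred_offset(["hhhhdddd"]))
```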

Thanks for the quick response. I believe I used the correct command. I have attached the qzv I generated. If you click on the quality plot, you will see the PHRED error. Also, it seems that my numbers match the following thread: https://github.com/caporaso-lab/mockrobiota/issues/59. For this data set, the reverse reads seem to have low quality. I think most demultiplexed reads are generated from forward reads.

We can wait for @Nicholas_Bokulich’s answer. The most important question for us is whether there are good mock data sets (paired-end, processed by QIIME2) that we can use. I think each data set may have its own unique issues. If someone in your group has some insights about that, it would be hugely appreciated.

demux.qzv (294.0 KB)

Hi @mentorwan,

It sounds like these have the wrong PHRED encoding, as @Mehrbod_Estaki mentioned (only the barcodes have artificial qual scores so that is not the issue, but the encoding is). It is not your fault — you are using the correct command but right now the EMP format does not have a way to select the Phred encoding, and it just assumes that you are using Phred33 (Phred64 is an old encoding! mock-8 is a very old dataset). If you want to use mock-8, see the workaround described here:

For all benchmarking it is best to use multiple datasets so that you aren’t overfitting on a single dataset or single mock community. So there is no “best”, but I can offer some advice. The smaller the number, the older the dataset, so try mock-16 through mock-23; these are more up-to-date 16S datasets (i.e., more “modern” read lengths, and probably Phred33, but I don’t know off-hand).

Some of the older datasets are very useful, and e.g., mock-3 is very small so it is useful for setting up a test pipeline quickly, but older datasets can have issues with obsolete data formats as you have discovered.
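For the obsolete-encoding case specifically, converting a quality string from a +64 offset to a +33 offset is simple arithmetic on the character codes. This is only a sketch of the idea, assuming plain Phred64 data (no negative Solexa-style scores); the function name is hypothetical and this is not the workaround referenced above.

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a Phred64 quality string with a Phred33 offset."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the old offset
        if score < 0:
            # Assumption: the input is plain Phred64, so scores are never negative.
            raise ValueError(f"character {ch!r} is not valid Phred64")
        converted.append(chr(score + 33))  # encode with the new offset
    return "".join(converted)


# Toy quality string, for illustration only.
print(phred64_to_phred33("hhhh@@"))
```

A full conversion would apply this to every fourth line of each FASTQ file and leave the header, sequence, and separator lines unchanged.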

If you have or find more recent mock communities please feel free to post them in mockrobiota so that others can re-use.

I hope that helps!

Thanks, Nicholas, for the insights. Please don’t close this issue yet; I may have related follow-up questions. Thanks, and happy new year.

No worries, forum topics only close after 30 days of inactivity. I have “unqueued” this topic, meaning that moderator attention is not needed at this time. Please feel free to re-post to this same topic if you have any follow-up questions or comments, and the topic will be re-queued. Happy new year.

Hi Nicholas,

Thanks for your comments above. We are focusing on mock-20, one of the more recent 16S datasets. We have tried different filtering settings, which yield different numbers of sequence variants and different numbers of unique taxonomies. According to your GitHub files, you list 20 expected taxonomies for this sample based on the SILVA database. We got 22 unique genus-level taxonomies in our run.

I compared each taxonomy. It seems there is only one taxonomy that is absent from your expected list: Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;NA. There are two sequence variants associated with it. Can you tell me whether this extra taxonomy is right or wrong? Thanks for sharing your insights.
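The comparison described here boils down to a set difference between observed and expected lineages. A minimal sketch follows; apart from the Planococcaceae lineage quoted above, the lineage strings are illustrative placeholders and not the actual mock-20 expected list.

```python
def compare_taxa(observed, expected):
    """Split lineages into unexpected, missing, and shared groups."""
    observed, expected = set(observed), set(expected)
    return {
        "unexpected": sorted(observed - expected),  # reported but not expected
        "missing": sorted(expected - observed),     # expected but not reported
        "shared": sorted(observed & expected),
    }


# Placeholder expected lineages (NOT the real mock-20 list).
expected = [
    "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus",
    "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus",
]
observed = [
    "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus",
    "Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;NA",
]

for group, lineages in compare_taxa(observed, expected).items():
    print(group, lineages)
```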

One feature of mock communities (a strength or a weakness, depending on how you look at it) is that they will amplify contaminants and other biological noise. So I suspect this is a background or reagent contaminant that is being detected. The expected taxonomy files are based on the list of species added to the mock communities by the creators of these data and will only contain the strains that were physically added. It does not look like any Planococcaceae are in that list, so this is probably reagent contamination or sequence error.

This is one reason why mock communities are useful for benchmarking — they reflect the noise inherent in actual biological conditions, where things like contamination and sequencing error do occur all the time. So in practice mock communities should be used to assess the relative performance of any pipeline that you use, AND multiple mock communities should be tested to avoid overfitting.
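As a rough illustration of scoring a pipeline across several mock communities instead of one, the sketch below computes precision and recall per community and then averages them. The community names and genus lists are made up, and this is only one simple way to summarize performance.

```python
from statistics import mean


def precision_recall(observed, expected):
    """Precision and recall of observed taxa against the expected composition."""
    observed, expected = set(observed), set(expected)
    true_positives = len(observed & expected)
    precision = true_positives / len(observed) if observed else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical (observed, expected) genus lists for two mock communities.
runs = {
    "mock-a": (["Bacillus", "Listeria", "Planococcus"], ["Bacillus", "Listeria"]),
    "mock-b": (["Escherichia"], ["Escherichia", "Salmonella"]),
}

scores = {name: precision_recall(obs, exp) for name, (obs, exp) in runs.items()}
mean_precision = mean(p for p, _ in scores.values())
mean_recall = mean(r for _, r in scores.values())
print(scores)
print(mean_precision, mean_recall)
```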

I hope that helps!

One of the sequences is TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCCAGTAGTCCGGCTGAC. I don’t know why this sequence was assigned to Planococcaceae.

Is this a DADA2 issue or real contamination in the mock data? Thanks.

That sequence is BLASTing to Bacillus cereus (or another Bacillus sp.), which is one of the expected taxa added to the mock community.

It sounds like this is not contamination but rather an issue with sequence error or taxonomy misclassification (quite possibly a misannotation in your reference database?).

I’ve used this mock community pretty extensively for benchmarking and never had any hits assigned to Planococcaceae, though I was using Greengenes and not SILVA as a reference database, e.g., you can check out some precomputed results for various taxonomy classifiers here: https://github.com/caporaso-lab/tax-credit-data/tree/master/data/precomputed-results/mock-community/mock-20/gg_13_8_otus

So this might be related to the database you are using?

Thanks. I tried GG_13_8_97 as you suggested, and it seems to give the correct result. But I tried different versions of the SILVA database, from v123 to v132, and it still shows Planococcaceae at the family level.

Did you use SILVA for the mock dataset? Thanks a lot for your comments.

There might be a Bacillus sequence in the SILVA database that is misannotated as Planococcaceae.
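One way to look for such an entry is to search the reference sequences for exact matches to the query and tally the lineages they carry. The sketch below works on an in-memory list of (sequence, lineage) pairs with toy data; loading a real reference FASTA and taxonomy file is left out, and the U-to-T normalisation is an assumption in case the reference is distributed as RNA.

```python
from collections import Counter


def lineages_matching(query, references):
    """Count lineages of reference sequences that contain the query exactly."""
    query = query.upper().replace("U", "T")
    hits = Counter()
    for sequence, lineage in references:
        # Normalise case and RNA alphabet before comparing (assumption).
        if query in sequence.upper().replace("U", "T"):
            hits[lineage] += 1
    return hits


# Toy reference entries, for illustration only.
references = [
    ("GGACGUACGUAACC", "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus"),
    ("ttacgtacgtaagg", "Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;NA"),
    ("CCCCCCCCCCCC", "Bacteria;Proteobacteria"),
]
for lineage, count in lineages_matching("ACGTACGTAA", references).items():
    print(count, lineage)
```

If the same query matches entries with conflicting lineages, that points toward an annotation problem in the reference. Exact matching is strict, so a BLAST-style search would be needed to catch near-identical entries.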

The expected taxonomies are provided for both SILVA and GG, but I have not tested mock-20 specifically with the SILVA database.
