We want to use some ground-truth mock microbiome datasets to test our internal pipeline. The one we found is mockrobiota, on GitHub, from your group. But when I used it with QIIME 2, I ran into some issues. For the mock-8 dataset, when I generate the qzv file, it reports a “Some of the PHRED quality values are out of range” error, and I hit other errors with these mock datasets as well.
But in general, do you think this source provides the best mock datasets for testing with QIIME 2? We would appreciate any more information about that.
@Nicholas_Bokulich would obviously be the best person to answer this question for you; however, as I believe he is away until next week, I’ll take a crack at this.
From the mockrobiota GitHub page, a note in the description of this dataset:
Note: These barcode reads contain golay barcodes, and the mapping barcodes need to be reverse-complemented to match the reads. Run in qiime-1 using the following command: split_libraries_fastq.py -i mock-forward-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq.gz --rev_comp_mapping_barcodes
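Conceptually, the `--rev_comp_mapping_barcodes` step just reverse-complements each barcode in the mapping file so it matches the index reads. A minimal Python sketch of that operation (not part of mockrobiota or QIIME; the barcode shown is made up):

```python
# Sketch of what --rev_comp_mapping_barcodes does conceptually:
# reverse-complement each mapping-file barcode to match the index reads.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def rev_comp(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

# Example with a made-up 12-nt barcode (Golay barcodes are 12 nt):
print(rev_comp("AGCTGACTAGTC"))  # GACTAGTCAGCT
```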
The original QUAL scores for the index/barcode reads were not recovered, and thus mock-index-read.fastq.gz contains artificial index/barcode QUAL scores. QUAL scores from all other files are original.
Sounds like maybe these artificial qual scores are the issue. You can either use a different community, change those quality scores to something else, or, perhaps even easier, use the older PHRED score variant setting in your import command: PairedEndFastqManifestPhred64V2.
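If you are unsure which encoding a FASTQ file uses, a rough rule of thumb is to look at the range of quality characters: anything below `;` (ASCII 59) can only occur in Phred33 data, while characters well above `J` suggest Phred64. A small illustrative sketch of that heuristic (this is not a QIIME function, just an assumption-laden helper):

```python
def guess_phred_offset(qual_strings):
    """Guess the PHRED offset (33 or 64) from FASTQ quality strings.

    Heuristic: Phred33 characters start at '!' (ASCII 33), Phred64 at
    '@' (ASCII 64). Characters below ';' (59) can only be Phred33;
    characters above 'J' (74) are unlikely in Phred33 data.
    Returns None when the strings are ambiguous.
    """
    min_ord = min(min(ord(c) for c in q) for q in qual_strings)
    max_ord = max(max(ord(c) for c in q) for q in qual_strings)
    if min_ord < 59:
        return 33
    if max_ord > 74:
        return 64
    return None  # ambiguous range; inspect the file by other means

print(guess_phred_offset(["IIIIHH#"]))  # 33
print(guess_phred_offset(["hhffgg"]))   # 64
```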
Regarding your other question, about whether these are good datasets to use in QIIME 2, Nick can answer in more detail when he is back in the office.
Thanks for the quick response. I believe I used the correct command. I have attached the qzv I generated; if you click the quality plot, you will see the PHRED error. My numbers also seem to match this thread: https://github.com/caporaso-lab/mockrobiota/issues/59. For this dataset, the reverse reads have low quality, and I think most demultiplexed reads are generated from the forward reads.
We can wait for @Nicholas_Bokulich’s answers. The most important question for us is whether there are good mock datasets (paired-end, processable with QIIME 2) that we can use. Each dataset may have its own unique issues; if someone in your group has insights about that, it would be hugely appreciated.
demux.qzv (294.0 KB)
It sounds like these have the wrong PHRED encoding, as @Mehrbod_Estaki mentioned (only the barcodes have artificial qual scores, so that is not the issue, but the encoding is). It is not your fault: you are using the correct command, but right now the EMP format does not have a way to select the PHRED encoding and just assumes Phred33 (Phred64 is an old encoding; mock-8 is a very old dataset). If you want to use mock-8, see the workaround described here:
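For what it's worth, re-encoding Phred64 data as Phred33 is just an ASCII shift of 31 (64 minus 33) on every quality character. A toy sketch of the transformation (for real files you would use an established tool rather than rolling your own, e.g., seqtk):

```python
def phred64_to_phred33(qual: str) -> str:
    """Re-encode a Phred64 quality string as Phred33.

    Each character is shifted down by 31 ASCII positions (64 - 33),
    so the underlying quality scores are unchanged.
    """
    return "".join(chr(ord(c) - 31) for c in qual)

print(phred64_to_phred33("hhgg"))  # IIHH
```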
For any benchmarking it is best to use multiple datasets so that you aren’t overfitting to a single dataset or single mock community. So there is no “best”, but I can offer some advice: the smaller the number, the older the dataset, so try mock-16 through mock-23. These are more up-to-date 16S datasets (i.e., more “modern” read lengths, and probably Phred33, though I don’t know off-hand).
Some of the older datasets are very useful; e.g., mock-3 is very small, so it is handy for setting up a test pipeline quickly. But older datasets can have issues with obsolete data formats, as you have discovered.
If you have or find more recent mock communities, please feel free to post them to mockrobiota so that others can reuse them.
No worries, forum topics only close after 30 days of inactivity. I have “unqueued” this topic, meaning that moderator attention is not needed at this time; please feel free to re-post to this same topic if you have any follow-up questions or comments, and it will be re-queued. Happy new year.
Thanks for your comments above. We are focusing on mock-20, one of the more recent 16S datasets. We have tried different filtering settings, which yield different numbers of sequence variants and unique taxonomies reported. According to your GitHub files, 20 expected taxonomies are listed for this sample based on the SILVA database; our run reported 22 unique genus-level taxonomies.
I compared each taxonomy, and it seems only one of ours is absent from your expected list: Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;NA. Two sequence variants are associated with it. Can you tell me whether this extra taxonomy is right or wrong? Thanks for sharing your insights.
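This kind of observed-vs-expected comparison is just a set difference on the taxonomy strings. A sketch with made-up placeholder entries (the real expected list comes from the mockrobiota files for the dataset):

```python
# Illustrative only: the taxonomy strings below (other than the
# Planococcaceae hit from this thread) are placeholders, not the
# actual mock-20 expected taxonomy.
expected = {
    "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus",
    "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;"
    "Enterobacteriaceae;Escherichia",
}
observed = expected | {
    "Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;NA",
}

# Taxa we reported that are not in the expected list (possible contaminants):
unexpected = sorted(observed - expected)
# Expected taxa we failed to detect:
missing = sorted(expected - observed)

print(unexpected)
print(missing)  # []
```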
One feature of mock communities (a strength or a weakness, depending on how you look at it) is that they will amplify contaminants and other biological noise. So I suspect this is a background or reagent contaminant that is being detected. The expected taxonomy files are based on the list of species added to the mock communities by the creators of these data, and will only contain the strains that were physically added. It does not look like any Planococcaceae are in that list, so this is probably reagent contamination or sequencing error.
This is one reason why mock communities are useful for benchmarking: they reflect the noise inherent in actual biological conditions, where things like contamination and sequencing error occur all the time. So in practice, mock communities should be used to assess the relative performance of any pipeline that you use, AND multiple mock communities should be tested to avoid overfitting.
One of the sequences is TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCCAGTAGTCCGGCTGAC. I don’t know why this sequence was assigned to Planococcaceae.
Is this a DADA2 issue or real contamination in the mock data? Thanks.