Question how to evaluate sequence quality

Hi Everyone,

I’m trying to follow the "Evaluating and controlling data quality with q2-quality-control" tutorial using data from our lab. We used the following MOCK community: HM-783D, which contains 20 bacterial strains. How can I obtain the equivalents of reference-seqs.qza and qc-mock-3-expected.qza for this MOCK community? I only know its composition.

At the same time, I received two files related to the MOCK sample (one for the forward reads and the other one for the reverse reads) from the sequencing center, along with the rest of the raw data.

So, I created a folder, where I put only the two MOCK files (forward and reverse reads). From there, I imported this sample into Qiime2 (using Casava 1.8 paired-end demultiplexed fastq option) --> denoised using DADA2, which allowed me to get a table.qza file and a rep-seqs.qza. Is this approach correct? These files would be the query-table.qza and the qc-mock-3-observed.qza that I need for this tutorial, respectively?

Many thanks in advance,
FS

Good morning,

Yep, your process should work great! With those two files, you should be able to use the quality control plugin.

Let us know if you have any questions about your results!
Colin

Hi @fstudart,

I have some good news for you:

There is a growing database of mock community datasets and resources in mockrobiota that you should check out. In particular, it sounds like your mock community may have the same exact composition as mock-21 and/or mock-23. These were generated from (I think) the same mock community from BEI, but using the high concentration rather than low concentration (HM-783D) product. So check the expected taxonomies and expected sequences in the source directory of each of those mock communities — you may be able to just use those files as-is for your data.

(if you use any of the resources in mockrobiota, please make sure to cite mockrobiota and the original source publication for any datasets that you use, since these are not part of QIIME2)

If mockrobiota does not have what you need:

qc-mock-3-expected.qza is essentially just a composition table converted to a biom file and then imported as a FeatureTable[RelativeFrequency] artifact. You can export the data from that file to take a look at the original tab-separated table, and figure out how to format/convert your own.
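As a rough illustration of the tab-separated table described above, here is a minimal Python sketch that normalizes a known composition to relative frequencies and writes it with taxa as rows and one sample column. The taxon labels, proportions, and sample ID are hypothetical placeholders, and the "#OTU ID" header follows the classic BIOM text layout as I understand it; compare the output against the table exported from qc-mock-3-expected.qza before converting and importing.

```python
import csv
import io


def write_composition_tsv(abundances, sample_id, handle):
    """Write a one-sample relative-frequency table as tab-separated text."""
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("abundances must sum to a positive number")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed header layout: feature IDs down the rows, one column per sample.
    writer.writerow(["#OTU ID", sample_id])
    for taxon, value in abundances.items():
        # Divide by the total so the column sums to 1.
        writer.writerow([taxon, f"{value / total:.6f}"])


# Hypothetical placeholder taxa and amounts, NOT the real HM-783D composition.
expected = {
    "k__Bacteria;p__PhylumA;g__GenusA": 2.0,
    "k__Bacteria;p__PhylumB;g__GenusB": 1.0,
    "k__Bacteria;p__PhylumC;g__GenusC": 1.0,
}

buffer = io.StringIO()
write_composition_tsv(expected, "mock-sample", buffer)
composition_tsv = buffer.getvalue()
print(composition_tsv)
```

The taxon labels presumably need to be written the same way as the labels in your observed table, otherwise the expected and observed features cannot be matched up.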

reference-seqs.qza is only used by the evaluate-seqs action, so you can still use evaluate-composition without it. This is a fasta of expected sequences that correspond to the members of the mock community; if you don’t have that, you are unable to use the evaluate-seqs action.

Yes, as @colinbrislawn advised, you are on the right track.

Yes, you need your own equivalent of qc-mock-3-observed.qza. You do not need query-table.qza, though; that file is used in a different tutorial on the q2-quality-control tutorial page, so it is not relevant to evaluate-composition.

I hope that helps!

Hi, thanks for your reply. It helped me a lot. I checked out the mockrobiota website and downloaded the expected-sequences.fasta for MOCK-21. Now I’m trying to import this expected fasta file into QIIME 2, using:

qiime tools import \
    --input-path MOCK \
    --output-path sequences.qza \
    --type 'FeatureData[Sequence]'

But, I’m getting an error: Missing one or more files for DNASequencesDirectoryFormat: ‘dna-sequences.fasta’. Could you help me with that?

Thanks,
FS

I believe your import command should be something like the following:

qiime tools import \
    --input-path expected-sequences.fasta \
    --output-path expected-sequences.qza \
    --type 'FeatureData[Sequence]'

It looks like you are attempting to import a directory called “MOCK”, so the importer is expecting a named file inside. Import the file explicitly instead.

Hi, Thanks for your reply,

I actually downloaded only the fasta file (from https://github.com/caporaso-lab/mockrobiota/blob/master/data/mock-21/source/expected-sequences.fasta). I created a folder that contains only this fasta file (expected-sequences.fasta). When I try to import it into QIIME 2 using the command you typed above, I get a similar error:

There was a problem importing expected-sequences.fasta: expected-sequences.fasta is not a(n) DNAFASTAFormat file.

I don’t know if there is a problem with this fasta file. Should I try to use another one?

Thanks very much for all your support,
FS

Hi @fstudart,
I am able to import this file without issue. Here you go: expected-sequences.qza (10.5 KB)

Either there is something wrong with the file you downloaded, or with your version of QIIME2. What version of QIIME2 are you running? Did you inspect the file that you downloaded to make sure it looks okay?
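One way to inspect a downloaded file before importing it is a quick format check. This is only a sketch in Python under my own assumptions: the function name is made up, the allowed alphabet is the IUPAC DNA codes, and the real QIIME 2 validator may apply different rules.

```python
# Assumed alphabet: IUPAC DNA codes; the actual importer may accept a different set.
IUPAC_DNA = set("ACGTRYKMSWBDHVN")


def check_dna_fasta(text):
    """Return a list of problems found in FASTA text (empty list if none)."""
    problems = []
    pending_header = None  # line number of a header still waiting for sequence
    seen_content = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if pending_header is not None:
                problems.append(f"line {pending_header}: header has no sequence")
            pending_header = number
        else:
            if not seen_content:
                problems.append(f"line {number}: file does not start with a '>' header")
            bad = sorted(set(line.upper()) - IUPAC_DNA)
            if bad:
                problems.append(f"line {number}: unexpected characters {''.join(bad)!r}")
            pending_header = None
        seen_content = True
    if not seen_content:
        problems.append("file is empty")
    elif pending_header is not None:
        problems.append(f"line {pending_header}: header has no sequence")
    return problems


# Hypothetical inputs: a well-formed file and something that is not FASTA at all.
good_fasta = ">seq1\nACGTACGT\n>seq2\nGGCCTTAA\n"
not_fasta = "<!DOCTYPE html>\n<html>\n"

good_problems = check_dna_fasta(good_fasta)
bad_problems = check_dna_fasta(not_fasta)
print(good_problems)
print(bad_problems[0])
```

To check a real file, read it first, for example with `open("expected-sequences.fasta").read()`, and pass the text to the function.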

In any case, the file I attached should work for you.

Good luck!

Hi, thanks very much for sending me the qza file. I think I was actually using the wrong file, as I was able to import another fasta file with no issues. I’m using QIIME 2 2018.2 (via VirtualBox). I was also able to run qiime quality-control evaluate-seqs (from the "Evaluating sequence quality" tutorial). The comparison between my query sequences and the expected sequences looked good, as far as I understood, although there were some mismatches. Is there a way of obtaining an overall sequencing error rate based on the eval-seqs-test.qzv file?

Thanks very much for all your support,
FS

Great! Some amount of mismatches is to be expected… no denoising method is perfect 🙂

I suppose you could tally the total number of mismatches across all sequence variants, weighted by the abundance of each sequence. Would that accomplish what you need? To do so, you can download the results from the QZV as a TSV and analyze in R or jupyter notebooks.
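A minimal Python sketch of that abundance-weighted tally, assuming you have already pulled the mismatch count and alignment length for each sequence variant out of the downloaded TSV, and the abundance of each variant out of your feature table. The function name, the IDs, and all numbers below are hypothetical; the TSV column names are not hard-coded because I am not certain what they are.

```python
def weighted_mismatch_rate(alignments, abundances):
    """Abundance-weighted mismatches per aligned base.

    alignments: {sequence_id: (mismatches, alignment_length)}
    abundances: {sequence_id: count of that sequence variant}
    """
    total_mismatches = 0.0
    total_aligned = 0.0
    for seq_id, (mismatches, aligned_length) in alignments.items():
        # Variants missing from the abundance table get zero weight.
        weight = abundances.get(seq_id, 0)
        total_mismatches += weight * mismatches
        total_aligned += weight * aligned_length
    if total_aligned == 0:
        raise ValueError("no aligned bases with non-zero abundance")
    return total_mismatches / total_aligned


# Hypothetical example values, not real results.
alignments = {"asv1": (0, 250), "asv2": (2, 250), "asv3": (1, 250)}
abundances = {"asv1": 900, "asv2": 50, "asv3": 50}

rate = weighted_mismatch_rate(alignments, abundances)
print(f"{rate:.4%} mismatches per aligned base")
```

Keep in mind that this measures the mismatches remaining after denoising relative to the expected sequences, so it is not the same thing as the raw error rate of the sequencing run.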

Good luck!

@fstudart,
Note that a bug was found in quality-control evaluate-composition that caused TAR and TDR scores to be reversed. See this announcement for more details if you are affected.