Problems with q2-quality-control using Mockrobiota

xchromosome · June 22, 2019, 2:27pm

Hiya,

I am having a nightmare trying to run q2-quality-control using my mock community. Apologies for the length of this post! It's a beast. So, specifically, I am trying to run evaluate-composition. My command was based on the following:

qiime quality-control evaluate-composition
--i-expected-features qc-mock-3-expected.qza
--i-observed-features qc-mock-3-observed.qza
--o-visualization qc-mock-3-comparison.qzv

My first step was to generate my own versions of qc-mock-3-expected.qza and qc-mock-3-observed.qza. For the observed.qza, I re-imported my mock community reads on their own, and in the manifest file, I called them Mock-community. Then joined them, filtered them and denoised using Deblur and got my table.
For the "expected.qza", I used the Mockrobiota resource that I found mentioned in these forums (great idea! ). At first I downloaded the "expected-sequences" file, as was suggested here, using this link (I checked which mock community was appropriate for me (BEI Resources HM-782D - 20 species, even and low concentration): mockrobiota/data/mock-22/source/expected-sequences.fasta at master · caporaso-lab/mockrobiota · GitHub and tried to run the command, but it threw up an error:

Plugin error from quality-control:

Parameter 'expected_features' received an argument of type FeatureData[Sequence]. An argument of subtype FeatureTable[RelativeFrequency] is required.

No worries, I thought, it wants a table. So I downloaded the forward and reverse reads from Mockrobiota, imported them and called them "mockrobiota-expected", joined and filtered them, ran them through Deblur and got my table. Then I tried running qiime quality-control evaluate-composition again:

Plugin error from quality-control:

Parameter 'expected_features' received an argument of type FeatureTable[Frequency]. An argument of subtype FeatureTable[RelativeFrequency] is required.

So I converted both my tables to relative frequency as follows:

qiime feature-table relative-frequency
--i-table deblur/mockrobiota-table.qza
--o-relative-frequency-table mockrobiota-relative-frequency.qza

Plugin error from quality-control:

"None of [Index([Mock-community'], dtype='object')] are in the [index]"

At this point I noticed that in the tutorial, it states: "evaluate_composition compares the feature composition of pairs of observed and expected samples containing the same sample ID in two separate feature tables." I figured this might be the problem, as I had my two feature tables, but called one sample Mock-community and the other mockrobiota-expected, so I set about trying to name them the same thing. Following the advice I found here, I tried using the grouping function as described below:

What you can do is add a new column to your metadata file that contains the new sample IDs. Then, when you run the group command, point the --m-metadata-column to the column containing your new sample IDs. The --p-mode parameter don’t matter (you can pick whatever you want) - because you are “grouping” the values into the same “groups”, just with new names, none of the abundances will change. Et voila! Moving forward, just drop the old sample ID column from your metadata and replace it with the new sample ID column.

I made a metadata file featuring just those two samples. It looks like this:

#SampleID                    Day    Treatment   Sex   Pen   NewName
Mock-community                                              Mock-community
mockrobiota-expected                                        Mock-community

I ran the following command:

qiime feature-table group
--i-table deblur/mockrobiota-table.qza
--p-axis sample
--m-metadata-file deblur/mock-and-mockrobiota-metadata.txt
--m-metadata-column NewName
--p-mode sum
--o-grouped-table deblur/mockrobiota-new-table.qza

Thank you so much for sticking with me so far. Now I decided to use Qiime 2 View to have a look at my new table because I couldn't visualise what I was trying to achieve. On inspection, the overview stated "Number of samples: 1, Number of features: 1, Total frequency, 11".

Weird, I thought. I would be expecting the number of features to be 20, no? I wonder if something went wrong. So I viewed the mockrobiota-relative-frequency.qzv file from earlier on, and it says "Number of samples: 1, Number of features: 1, Total frequency, 1". Now I'm really confused!

I don't even know where to go from here. I was originally writing this post to ask advice on where to go next, after adding my metadata column and making a new table, so that I could rename my Mockrobiota sample. But now I've discovered this anomaly, it's a more pressing issue! Should I just give up on q2-quality-control?? All I wanted was a nice graph showing theoretical composition of my mock community vs Qiime 2 output, but I can probably do that myself in Excel and I won't get that from q2-quality-control anyway, will I?!

Thank you so much for reading, for any advice and for maintaining this excellent forum! It has been invaluable to this noob.

Best wishes,
Lindsay

Nicholas_Bokulich · June 22, 2019, 5:08pm

Hi @xchromosome,
I think you took a wrong turn pretty early on. It sounds like you were taking the fastq sequence data from mockrobiota, processing it, and then attempting to use that as the "expected composition"? That's not correct — those data are essentially the equivalent of your own, being "observed" compositions of that mock community following some sequencing protocol.

Instead, you need to find the "expected taxonomy" file for that mock community and import that as a relative frequency table.

See this tutorial; it is using a fungal mock community but gives you a good idea of how the "expected taxonomy" from mockrobiota should be used:

yeah, q2-quality-control will show you actual accuracy scores (excel will not), but it will not make a bar chart of the compositions (excel will). You could do this with QIIME 2 taxa barplot, but it would take several steps to get your data in an appropriate format:

filter your feature table to only contain the mock community(s)
import the mockrobiota expected taxonomy as a feature table
merge the filtered and expected tables
make a modified taxonomy file for the expected taxonomy table. You would take the same expected taxonomy you used but copy the columns so you get something like this:

feature-id	Taxon
Bacteria;blah;blah;blah	Bacteria;blah;blah;blah
Bacteria;blah;blah;blah	Bacteria;blah;blah;blah
Bacteria;blah;blah;blah	Bacteria;blah;blah;blah

import that modified file as FeatureData[Taxonomy]
merge the taxonomy file for your observations with that modified taxonomy file
run qiime taxa barplot

Sorry! Not a direct route. mockrobiota was made in a pre-QIIME 2 era and has not been updated yet to conveniently mesh with QIIME 2; and QIIME 2 was certainly not built to cater to the needs of QIIME 2. One day this integration will occur...

xchromosome · June 23, 2019, 12:03pm

Hiya,

Thanks for the quick response, especially on a Saturday!

Yup, that's exactly what I was doing I was trying to follow the instructions I'd found from previous threads but I must have got mixed up!

Ok, I have got to the page where there are 2 things I can download: expected-sequences.fasta or taxonomy.tsv. I right-clicked on taxonomy.tsv and chose "save link as...".
According to the fungal ITS analysis tutorial, the next step is as follows:

Fungal ITS analysis tutorial

That file was annotated with an older version of the UNITE database, so let’s fix some of the annotations (to match the version we are using) before proceeding:

sed 's/s__Tylospora_asterophora/s__unidentified/' expected-taxonomy.tsv | sed 's/p__Zygomycota;c__Mortierellomycotina_cls_Incertae_sedis/p__Mortierellomycota;c__Mortierellomycetes/' | sed 's/c__Chytridiomycetes/c__Spizellomycetes/' | sed 's/g__Coprinopsis/g__Coprinopsis;s__unidentified/' | sed 's/g__Tricholoma/g__Tricholoma;s__Tricholoma_vaccinum/'> expected-taxonomy-mod.tsv

I don't really know what this means - is it relevant to me? I thought it might be something that only applied to older versions of Qiime 2 so I skipped it to see if I could. The next thing I ran was this:

biom convert \
  -i mockrobiota-taxonomy.tsv \
  -o mockrobiota-taxonomy.biom \
  --table-type="OTU table" \
  --to-json

This was the error message I got:

Traceback (most recent call last):
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/biom/parse.py", line 660, in load_table
table = parse_biom_table(fp)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/biom/parse.py", line 412, in parse_biom_table
t = Table.from_tsv(fp, None, None, lambda x: x)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/biom/table.py", line 4631, in from_tsv
t_md_name) = Table._extract_data_from_tsv(lines, **kwargs)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/biom/table.py", line 4747, in _extract_data_from_tsv
md_name = header[-1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/qiime2/miniconda/envs/qiime2-2019.1/bin/biom", line 11, in
sys.exit(cli())
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 764, in call
return self.main(*args, **kwargs)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/biom/cli/table_converter.py", line 114, in convert
table = load_table(input_fp)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/biom/parse.py", line 662, in load_table
raise TypeError("%s does not appear to be a BIOM file!" % f)
TypeError: mockrobiota-taxonomy.tsv does not appear to be a BIOM file!

Where have I screwed up now?!

It sounds like it would be beneficial to do both! That's my plan!

Massive thanks,
Lindsay

Nicholas_Bokulich · June 24, 2019, 12:04pm

No — that's specific to that mock community (the names listed in the expected taxonomy conform to an older version of the UNITE database and need to be updated).

What mock community are you trying to grab? You probably want one of the taxonomies formatted for a specific database, not the "source" taxonomy, so you may be grabbing the wrong file.

For mock-3, this appears to work:

wget https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-3/greengenes/13-8/99-otus/expected-taxonomy.tsv

biom convert --to-json --table-type="OTU table" -i expected-taxonomy.tsv -o expected-taxonomy.biom

I hope that helps!

xchromosome · June 26, 2019, 10:42am

Hey,

I'm trying to get mock-22. I can't work out the difference between that and mock-20 - do you happen to know if they're the same? I just went with 22 because I figured: higher number = newer = better!

Nicholas_Bokulich:

For mock-3, this appears to work:

wget https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-3/greengenes/13-8/99-otus/expected-taxonomy.tsv

biom convert --to-json --table-type="OTU table" -i expected-taxonomy.tsv -o expected-taxonomy.biom

Thanks - I tried replacing the "mock-3" with "mock-22" and the address works and shows the right composition but I notice some taxa have matches at species level and others genus - I used the SILVA database for my mock community and experimental samples and I see in the above link it says Greengenes - does that make a difference? And if so, is there a SILVA version?

Thank you!
Lindsay

Nicholas_Bokulich · June 26, 2019, 10:53am

I do not know off-hand if there are differences — you can check the "source" taxonomy in the data directory of each of those datasets to be sure.

Mockrobiota provides expected taxonomies formatted for different reference databases — greengenes and SILVA for all 16S rRNA mock communities. In many cases, the species added to the mock community will not be represented in these databases, and so the "expected" taxonomy will list the closest match, which may not be at the species level.

Mockrobiota also provides the "source" taxonomy listing the actual species/strains that were added to the community, but this is not what you want to use since it will almost certainly not match your observed results. Your observed results will be based on either greengenes or SILVA, and so can only achieve the level of resolution provided by those databases.

Yes, you should always match the database that you used. Here is the link that you want:

wget https://raw.githubusercontent.com/caporaso-lab/mockrobiota/master/data/mock-22/silva/123/99-otus/expected-taxonomy.tsv

Good luck!

xchromosome · June 26, 2019, 11:07am

Cool - I knew not all of the mock community members would be matched at species level which is why I wondered if the database mattered.

Thank you! Hopefully all should be good from now on.

xchromosome · June 30, 2019, 12:28pm

Hi, me again. Sorry.

I thought I had it sorted but when I tried to run it again:

qiime quality-control evaluate-composition /
--i-expected-features mockrobiota22-expected-taxonomy-NEW.qza /
--i-observed-features deblur/mock-table-NEW-relative-freq.qza /
--o-visualization QC-NEW-mock-mockrobiota-comparison.qzv

I got the following error:

/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_utilities.py:174: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

obs_collapsed.loc[sample]], axis=1).fillna(0)
Traceback (most recent call last):
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2cli/commands.py", line 274, in call
results = action(**arguments)
File "</home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/decorator.py:decorator-gen-195>", line 2, in evaluate_composition
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 231, in bound_callable
output_types, provenance)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/qiime2/sdk/action.py", line 427, in callable_executor
ret_val = self._callable(output_dir=temp_dir, **view_args)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/quality_control.py", line 76, in evaluate_composition
plot_observed_features_ratio=plot_observed_features_ratio)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_utilities.py", line 104, in _evaluate_composition
results, vectors = _compute_per_level_accuracy(exp, obs, metadata, depth)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_utilities.py", line 162, in _compute_per_level_accuracy
obs_collapsed = _collapse_table(obs, level)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_utilities.py", line 305, in _collapse_table
table.columns, index=table.columns, name='Taxon'), level)
File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_taxa/_method.py", line 27, in collapse
(level, max_observed_level))
ValueError: Requested level of 2 is larger than the maximum level available in taxonomy data (1).

Plugin error from quality-control:

Requested level of 2 is larger than the maximum level available in taxonomy data (1).

See above for debug info.

I tried searching the forum but the only other person who reported this error had a different version (level 5 vs my level 2) and you suggested to him that he had collapsed his taxomomy to level 4. I haven't done any collapsing, unless it can happen without me knowing!

Any ideas?
Thanks again,
Lindsay

thermokarst · July 2, 2019, 12:57pm

Hi @xchromosome! @Nicholas_Bokulich is out of the office right now, I will try and lend a hand

Have you had a look at the help text (--help)? Looks like there is a parameter for controlling the depth:

  --p-depth INTEGER    Maximum depth of semicolon-delimited taxonomic ranks
                       to test (e.g., 1 = root, 7 = species for the greengenes
                       reference sequence database).              [default: 7]

So, it sounds like (from your error message above), that you only have one taxonomic level, one option is to run with --p-depth 1. I will mention though that one taxonomic level seems odd to me, and it strikes me as an indication that something went wrong with one of your pre-importing steps (but I could be mistaken).

xchromosome · July 2, 2019, 2:12pm

Hey,

Thanks! Yeah I knew about the depth parameter but the way I understood this was that level 1 = kingdom level? Is that right? Or does it mean that all my features are at the same taxonomic level (eg. all at level 6)? If the former, would that mean comparing two tables of, say, 18 features vs 23 features, where all the features are just "bacteria"?

I'm also not sure which table the above error message is referring to - the mockrobiota one or my real mock community?

I used qiime feature-table summarize to look at the contents of the tables and I see that in the Mockrobiota "expected features" table, features have taxonomic assignments, e.g. "D_0_Bacteria;D_1_Firmicutes;D_3_Bacillales;D_4_Bacillaceae;D_5_Bacillus;D_6_Bacillus cereus
whereas my actual mock community "observed features" table lists features like: "d23fbef2f31d48eda40876cdbc49933a". Is this likely to be a/the problem?

Thank you,
Lindsay

thermokarst · July 2, 2019, 11:15pm

Yep! The feature IDs in this table need to be a semicolon delimited taxon strings. If you haven't performed taxonomic classification of your features yet, you will need to do that before you proceed. Once you have that, you can run taxa collapse on your feature table, then feature-table relative-frequency, and then quality-control evaluate-composition. Hope that helps!

xchromosome · July 5, 2019, 3:55pm

Thank you! It worked!

I collapsed my mock community table to level 7, however looking at the outputs (which I'm struggling to understand to be honest but I'm doing my best!) I'm thinking I should try just doing genus level. So I collapsed my table to level 6, converted to relative frequency and tried running again. Then I got the error:
Plugin error from quality-control: Requested level of 7 is larger than the maximum level available in taxonomy data (6).

I'm guessing what this means is that there's a mismatch between my mock community table at level 6 and the Mockrobiota-22 table, which is at level 7. Is that right? The problem is, I don't think I can collapse my Mockrobiota-22 table, because 1) I need the FeatureData[Taxonomy] file relating to the Mockrobiota table, which I don't have because I just downloaded the table from the website. And 2) qiime taxa collapse expects a FeatureTable[Frequency] and I only have the relative frequencies and no way of converting this (as far as I know!).

Is there a way around this or is this the end of the road for my analysis?
Many thanks!

Nicholas_Bokulich · July 5, 2019, 5:43pm

No, this is because --p-depth is set to 7 by default. You do not need to collapse your table at genus level — you can just set the --p-depth parameter to 6 to perform comparisons up to that level.

Good luck!

xchromosome · July 7, 2019, 4:00pm

D'oh! Obviously! I forgot about the --p-depth parameter! Haha, thank you!

All worked fine. There was only one genus that was expected in the mock community that didn't appear at all (Propionibacterium). However, I notice that Neisseria and Helicobacter appear in both the "False positives: missclassifications" AND the "False negative features" sections, which is confusing me somewhat. Am I right in thinking that missclassifications = genera detected in my mock community that weren't expected? Eg. chimeras. And false negatives = genera that were expected, but not observed. Both Neisseria and Helicobacter WERE expected AND observed.
Here are my files:
mock-table-collapsed-species-level-relative-freq.qzv (336.2 KB) QC-mock-mockrobiota-comparison-genus-level.qzv (369.5 KB)

I'm also struggling to understand the meaning/significance of the r-squared values on the first graph. I've been watching loads of videos on r-squared and think I get what it means in the context of a linear regression, but I can't figure it out in the context of this graph at all.

Lastly (and I promise this is the very last thing I will ask!!), in the table with r-values, P values, Bray-Curtis etc., what do the p-values refer to?

Thanks for answering all my noob-like questions and I promise I have been struggling, reading, searching the forum A LOT over the last few months and sorted out 99% of problems on my own! I've really struggled to find enough help out there for q2-quality-control (and Gneiss!) which I find really surprising! Does everyone else just breeze through it and it's just me who gets stuck, or are these plugins just not that widely used?!

Massive thanks again,
Lindsay

Nicholas_Bokulich · July 8, 2019, 3:00pm

Hi @xchromosome,

Your definitions are correct. I like this problem because it shows us that computers are don't really have much sense — they take everything too literally!

You see these genera as both false negatives and as misclassifications (a type of false positive) because:

although the genus names match
the rest of the taxonomy string does not match!

So:

the computer sees these as mismatches because it looks at the whole taxonomy annotation, not only the deepest level being compared
this may be an annotation error in the reference database that you are using (e.g., why are there Helicobacter being annotated as Proteobacteria and others as Epsilonbacteraeota?) but the database curators may have good reason for this... I don't know! You'd better double-check the latest SILVA reference database to see that both of those taxonomy annotations exist there and then ask the SILVA curators.
As a "quick fix", you can just swap out the taxonomy annotations in your "expected taxonomy" file (and re-do the process of converting/importing), and your F-measures will improve, since this is a simple case of having the wrong name in the database.

That is the r-squared value for linear regression between the expected and observed relative abundances of each of your taxa, at each taxonomic rank shown on the plot (e.g., level 6 = genus). So look at the second plot for more context on what those linear regressions actually look like.

The p value of the pearson linear regression test

I think these plugins are just not that widely used! So we appreciated you using these, and for struggling through these questions — you can think of yourself as a pioneer and your questions/answers will help others who follow in your footsteps! Feel free to write up some more notes/advice here for others who may run into some of these same problems (actually, you could even put together a tutorial and post it to the community tutorials section of the forum!)

Thanks!

xchromosome · July 16, 2019, 10:39am

That's great! Thank you so much for all your help!

This is something I definitely want to do! I've learned so much from reading other people's questions. I'm so glad others have asked questions in the past and followed up afterwards too. Hope you have a super day!