What to do when you're expecting expected sequences you don't have

At the beginning of the quality control tutorial we first learn how to exclude sequences by alignment. The qiime quality-control exclude-seqs command generates a suite of outputs, but I'm thinking that the qc-mock-3-expected.qza and qc-mock-3-observed.qza files are not part of that output (rather, they are just part of the set of files downloaded at the beginning of the tutorial). Is this correct?

I'd like to be able to run the qiime quality-control evaluate-composition function, but I'm not clear on what this sentence is indicating:

Typically, feature composition will consist of taxonomy classifications or other semicolon-delimited feature annotations

The help menu indicates that it wants these to be FeatureTable[RelativeFrequency] types, which I certainly have in the case of the observed data. However, my mock communities aren't from a manufacturer where the expected proportions are necessarily exact, so I don't know what to input/create for the expected data. Further, the mock community here is made of arthropod amplicons, and there is absolutely going to be some primer bias (so even if I knew something about the molarity of each mock member, it's not particularly worthwhile except to demonstrate that there is primer bias, which, I guess, is kind of neat...).
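
For reference, here's roughly the call I'm hoping to end up at (a sketch using the Python API; the filenames are placeholders and the parameter names are from memory, so the plugin's --help output is the real authority):

```python
# Rough sketch only: run evaluate-composition via the QIIME 2 Python API.
# Filenames are placeholders; parameter names are from memory, so check
# `qiime quality-control evaluate-composition --help` before trusting them.
from qiime2 import Artifact
from qiime2.plugins import quality_control

expected = Artifact.load('expected-mock.qza')   # FeatureTable[RelativeFrequency]
observed = Artifact.load('observed-mock.qza')   # FeatureTable[RelativeFrequency]

results = quality_control.actions.evaluate_composition(
    expected_features=expected,
    observed_features=observed,
)
results.visualization.save('mock-comparison.qzv')
```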

One thing I do have: multiple independent sequencing runs of the same mock community. I could generate a frequency table of expected read counts by averaging across these runs, but I'd probably want to normalize the counts first before doing something crazy like that.

Thanks for any advice on whether and how to implement this.

Correct. exclude-seqs is a separate command, unrelated to the evaluation methods elsewhere in the tutorial.

Hm... do the best you can. Expected copy # would be ideal, but cell count works... as close as you can get.

This is always a problem with mock communities and any molecular method. So don't worry about that.

Examining variation across runs is useful, but not for deriving the expected composition. I can see some sense in it, e.g., for evaluating consistency across sequencing runs, but you could do that with any sort of sample, so it really gets away from the point of a mock community, where you know the input composition (with some inevitable degree of error/methodological bias to be expected).

Okay, sounds good.

This last bit is connected to this thread.

Maybe I should just stick to tax-credit for both cross-validated and mock analyses? Is there a downside to tax-credit for mock analyses compared to the QIIME 2 method?

The goal, after all, is to decide which taxonomic classifier to pick. Thus, my question about which analysis to use here boils down to understanding what tool best evaluates a classifier… which probably doesn't have one answer! But, given my small mock community, perhaps one is better suited than another.
Or I try both…

There is no one better way... tax-credit is probably better in that it is more formalized and you can build on existing workflows. The only downside of tax-credit is that it is a little more difficult to use: it should be fairly straightforward, but you need to be familiar with working with Jupyter notebooks, and you will need to export your QIIME 2 data into compatible formats. Not a big deal.

But if you are already planning on using tax-credit for cross-validation, then stick with it for mock communities... it would be easier than trying to hack together a workflow with QIIME 2.
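
If it helps, one possible way to pull a feature table out of a .qza into a plain TSV for notebook work looks something like this (a sketch; the filename is a placeholder, and it assumes the pandas DataFrame view is available for FeatureTable artifacts, which I believe it is):

```python
# Sketch: dump a QIIME 2 feature table to TSV for use outside QIIME 2
# (e.g., in tax-credit-style Jupyter notebooks). Filename is a placeholder.
import pandas as pd
from qiime2 import Artifact

table = Artifact.load('mock-observed-table.qza')
df = table.view(pd.DataFrame)            # rows = samples, columns = features
df.to_csv('mock-observed-table.tsv', sep='\t')
```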

No cell counts for me - my mock sequence data comes from amplicons that were generated from plasmids. The plasmids were constructed on a per-insect basis, where a PCR product from that particular insect's COI amplicon was cloned into a single plasmid.
The mock communities I'm working with are composed of equimolar concentrations of those plasmids. My thinking is that the expected copy number should vary only by the input concentration of the sample if those samples were run on a dedicated lane. (Unfortunately?) these mocks were spiked in with other (bat guano COI) samples in each library; thus the expected number of sequences varies depending both on how many samples I included and on the depth of sequencing I requested per library (the % of a MiSeq/HiSeq lane varied depending, in part, on how many samples I submitted).

So... I have no idea what an absolute expected copy number is, but the relative copy number is 1. Any recommendations for what I do with that?

Oh right. Amplicon sequencing cannot reliably deliver absolute abundances on its own — so for all mock communities generated by this method we can only use expected relative abundances. That is what the evaluate-composition method expects. So sounds like you have what you need!

Thanks for the clarification.
I feel like you’ve told me the answer 3 different ways but it’s still not sinking in. What values do you recommend I use for the “expected” abundances? Do I just take the average of all the reads associated with all mock sequences?

For example, if I had a mock sample with just 5 members, and the frequency table showed these data:

## observed sequences
SampleID | ASV1 | ASV2 | ASV3 | ASV4 | ASV5 |
mock     | 2000 | 8000 | 3000 | 2000 | 5000 |

If I have a total of 20,000 reads, and there are 5 members that were spiked in equimolar fashion, wouldn't the expected number just be 4,000 reads per ASV?

## expected sequences
SampleID | ASV1 | ASV2 | ASV3 | ASV4 | ASV5 |
mock     | 4000 | 4000 | 4000 | 4000 | 4000 |

If that’s the case, that would be an easy enough abundance table to create. Perhaps it’s not that simple though…

Thanks!

Yes! These are equimolar so you expect equal relative abundances. This needs to be a FeatureTable[RelativeFrequency], though, so in a 5-member mock community with equimolar abundances the expected frequency of each would be 0.2, not 4000.

Yes! It is that simple.
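
If you want to build that expected table programmatically, here is a minimal sketch (the ASV IDs are placeholders; per the tutorial quote above, the feature IDs would typically be semicolon-delimited taxonomy strings, and either way they must match the IDs in your observed table; it also assumes the pandas DataFrame importer works for this semantic type, which I believe it does):

```python
# Minimal sketch: an equimolar expected FeatureTable[RelativeFrequency]
# for a 5-member mock community. Feature IDs below are placeholders and
# must match whatever IDs (or taxonomy strings) your observed table uses.
import pandas as pd
from qiime2 import Artifact

features = ['ASV1', 'ASV2', 'ASV3', 'ASV4', 'ASV5']
expected = pd.DataFrame(
    [[1.0 / len(features)] * len(features)],   # 0.2 each for 5 equimolar members
    index=['mock'],                            # sample ID
    columns=features,
)

expected_art = Artifact.import_data('FeatureTable[RelativeFrequency]', expected)
expected_art.save('expected-mock.qza')
```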

I hope that helps!

Awesome.

alas, I can't follow your super simple math :frowning:

Is the relative frequency the proportion of reads, per sample, attributed to a given ASV? If that's the case, isn't the expected number of reads, out of 20,000 split among 5 members, just 4,000 each? And then the fraction (relative abundance) of that is just 4000/20000 = 1/5 = 0.2?

You wrote 0.5, which is why I'm all thrown off.

For what it's worth, I got a C- in my one and only college calculus class, so I generally trust others' maths.


no, I can't do math before I've eaten breakfast :frowning:

That's correct. I have fixed it above for clarity :smile:

Big mistake! At least before they are fully awake. :sleeping:

I can rent you a 3.5-year-old who will ensure you need zero cups of coffee starting at 5:45 am.
She is immune to jet lag, requires about 7 hours to recharge her batteries, and can function exclusively on lollipops. Like a glitter-covered Energizer bunny who sells nothing but chaos mixed with uninhibited love.

The only joke in the above statement is the required hours of sleep (it’s less than 7).

Thanks for the quick replies.


The only payment I can offer for said rental is my own 3.5 yr old with the same running conditions. So by “before… breakfast” I meant “until next decade”.
