Mistake in metadata file and re-running core metrics analysis.

Hello!

I’ve got myself in a bit of a muddle with three separate issues. Huge apologies for the length of this, but I feel like one solution might cover all of them. I have searched the forum relentlessly but haven’t found anything to help.

My first issue is that I have just got to the end of my initial analysis (assigned taxonomy, done alpha and beta diversity measurements, etc.) and realised that one of my samples was categorised wrongly in my metadata file (sample D23_70_C was categorised as treatment 1 when it should have been treatment 3). I have now edited the metadata file, but obviously all of my analyses are invalid.

Is there any way of fixing this without having to start back at the very beginning? Especially without having to assign taxonomy again, because my computer lacks the RAM; I had to get someone else to do it for me last time, and he’s no longer here.

Problem 2. In addition, my alpha and beta diversity measurements were run on this full data set, but only on the main effects - i.e. if I view the graphs on QIIME 2 View and group by treatment, this includes both time points. (I should note that I have 4 treatment groups.) And if I look at the graph of Day 23 vs Day 37, this includes all the treatments together. I need to be able to separate these out and look at the treatments on each day, so I wanted to split my data in two - one set per time point. I noticed in another post here that someone had a similar request and was advised to add an extra metadata column to give more groups. In my case, this would be “DayxTreatment” and contain 8 different values (2 time points x 4 treatments). Would I be able to employ this tactic? I have already added the column to the metadata file (which is when I stumbled upon problem #1) but haven’t taken it any further yet. I also noticed another post here where someone was making two new tables to analyse separately, but that involves building a new tree etc. Which approach would be more appropriate for me?

Problem 3. At the end of the analysis, I realised I should have filtered out all chloroplast and mitochondrial reads. I have now done this on my table using qiime taxa filter-table and on my rep-seqs file using qiime feature-table filter-seqs.

I have re-run my core metrics analysis, using my new table with no chloroplast reads and with sample D23_70_C in the right treatment group. But I have noted here in the protocol (from a previous postdoc) that after filtering my rep-seqs file, I have to re-run the taxonomic classification. Is this right?

Massive thank you for any help you can offer me!
Lindsay

Hi @xchromosome,

It sounds like a lot, but let's work through it!

Let me start by giving some general feedback that hopefully will set your mind (and computation) at ease. Taxonomic assignments, trees, and tables don't need to be re-built after filtering features. You can pass a tree that's a superset of the features in your table, and you should be able to have labels for sequences you don't have in your table. (And if I'm wrong and this is screwy, you can just filter rather than having to rebuild.)

Ouch! This happens. It's always hard, but it's good to catch it now, if you can. So, yay for finding it. However, it doesn't invalidate your analysis.

Your tables, tree, and taxonomy should be metadata-agnostic. You can (although I personally discourage it) generate feature tables, trees, taxonomic assignments, and even diversity results without ever knowing anything about a sample but its name and barcode. Of course, once you get to interpretation you're kind of SOL without metadata, but you can get that far.

Good that you re-ran your diversity analyses; these definitely need to be re-done. You do not need to re-run your feature classification, though. You already filtered out the taxa you don't want. At worst, you'll need to filter your feature data. If you'd filtered your sequences and then gone back and done de novo OTU clustering, then you would need to re-do your classification, but since you filtered the table, there's no need. (You also don't need to re-build your tree; the algorithms will just prune it for you.)
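
If you do end up needing to filter your feature data, here's a minimal sketch - assuming your filtered table is called filtered-table.qza; swap in your own filenames:

# Keep only the sequences whose feature IDs still appear in the filtered table
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table filtered-table.qza \
  --o-filtered-data filtered-rep-seqs.qza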

So, it looks like this step is good.

This is a place where I think breaking out of core diversity may help you. I would try both adding the extra column and separating by treatment. You don't have to re-calculate the distance matrix if you're not changing the underlying feature table (features included, rarefaction depth); you can just use qiime diversity filter-distance-matrix. Because calculating distance matrices takes a long time and tends to be computationally intense, in my own analyses I work really hard to only calculate a distance matrix once and then just filter it for whatever I need.
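
For instance, here's a sketch of pulling one time point out of an existing distance matrix - the filenames and the Day column are just placeholders for your own:

# Keep only the Day 23 samples from an existing distance matrix
qiime diversity filter-distance-matrix \
  --i-distance-matrix core-metrics-results/weighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.txt \
  --p-where "Day='Day23'" \
  --o-filtered-distance-matrix weighted-unifrac-day23.qza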

Once you've got your filtered distance matrix, you will need to calculate a new PCoA and run new statistics (PERMANOVA, etc.). PCoA is a projection based on the distances in your dataset, and the addition or removal of a point can shift it. So, that needs to be re-done every time. Luckily, it's not terribly computationally intense and it's pretty quick. (Check out qiime diversity pcoa.) And, if you discover another issue with your metadata, you can actually just update the Emperor plot for the same PCoA using the new metadata.
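
Something like this, continuing with the placeholder filenames from above:

# Re-compute the ordination from the filtered distance matrix
qiime diversity pcoa \
  --i-distance-matrix weighted-unifrac-day23.qza \
  --o-pcoa weighted-unifrac-day23-pcoa.qza

# And re-draw it with your (corrected) metadata
qiime emperor plot \
  --i-pcoa weighted-unifrac-day23-pcoa.qza \
  --m-metadata-file metadata.txt \
  --o-visualization weighted-unifrac-day23-emperor.qzv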

I would also suggest looking at q2-longitudinal because I think if you've got paired samples, you should use them! Paired samples may help decrease some of the noise and give you all sorts of shiny statistical benefits (like breaking some of the obnoxious properties of distance matrices, for instance).

So, as a wrap up:

  1. Mistakes happen; you found yours and fixed it, yay!
  2. You don't need to re-do taxonomic classification or build a new tree unless you're filtering before building your ASV table.
  3. You don't need to calculate a new distance matrix unless you're changing your feature table (like filtering features) or rarefaction depth.
  4. You do need to do a new PCoA if you change your sample set. You also need to re-do your statistical tests.
  5. You pick up power with paired samples; use them!

Best,
Justine


Hi Justine,

Thanks so much for your comprehensive reply!

Just to clarify, when you say try adding the extra column and separating by treatment (was this a typo and you meant time point? That seems more suitable for my dataset), do you mean first re-running core-metrics-phylogenetic on my "full" table with all the samples, but with the extra metadata column for DayxTreatment, and calculating alpha and beta diversity group significance using e.g. Shannon, UniFrac, etc.; and then, after that, trying it the other way, where I filter my table to generate two new tables, e.g. Table-Day-23-Only.qza and Table-Day-37-Only.qza, run core-metrics-phylogenetic on those separately, do alpha/beta significance again on them, and see how the results of the two methods compare? (And then doing more stuff... but just as a starting point!)

If so, I'm a bit confused when you talk about re-calculating the distance matrix. What exactly IS the distance matrix? I thought those were the outputs from the qiime diversity core-metrics-phylogenetic command, e.g. Bray-Curtis, weighted UniFrac, etc. So I have many? I feel I have the wrong end of the stick here.

If I have made two new tables, each containing half of my original data, won't that mean that some of the features change, because there will be ones present at one time point and not the other? Also, what's the difference between using qiime diversity filter-distance-matrix and doing this:

qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file metadata.txt \
  --p-where "Day='Day23'" \
  --o-filtered-table Table-Day-23-Only.qza

Are they achieving different things? Sorry, I'm so confused! I thought I had this all sorted and then everything turned upside down! (And I'm ill and sleep-deprived, so apologies if my brain is just not functioning properly here!)

Time isn't really an issue, actually - I have plenty of other stuff I can work on while waiting for my computer to spit out results! So if it's easier to start again at a reasonable stage, that's an option, if it's simpler for my clearly failing brain.

Huge thank yous!!
Lindsay

Oh also! I will definitely take up your suggestion of using q2-longitudinal too once I have got myself out of the tangle!

Thanks!
Lindsay

Hi @xchromosome,

I'm just going to preface this with "maybe you want to get a cup of coffee :coffee:" because it seems like a long answer as I write it. I'm also going to apologise for my spelling mistakes in advance, because I know I will not see all of them until the post gets locked in like a month.

So, first, if I've made a typo related to treatment or timepoint, I apologize. This depends on your hypothesis, so I'll leave that judgement to you.

I also think it might help if you understood a little bit more about core diversity analysis, and what's happening under the hood. Here's what it's doing:

  1. Performs rarefaction (qiime feature-table rarefy).
    This outputs core-metrics-results/rarefied_table.qza
  2. For each metric in alpha diversity, calculate alpha diversity using the rarefied table.
    • Applying qiime diversity alpha outputs
      • core-metrics-results/shannon_vector.qza
      • core-metrics-results/observed_otus_vector.qza
      • core-metrics-results/evenness_vector.qza
    • And then you use qiime diversity alpha-phylogenetic to get
      • core-metrics-results/faith_pd_vector.qza
  3. For each metric in beta diversity
    1. calculate distance (beta diversity) using the rarefied table
      • qiime diversity beta gives you
        • core-metrics-results/bray_curtis_distance_matrix.qza
        • core-metrics-results/jaccard_distance_matrix.qza
      • qiime diversity beta-phylogenetic
        • core-metrics-results/unweighted_unifrac_distance_matrix.qza
        • core-metrics-results/weighted_unifrac_distance_matrix.qza
    2. Perform PCoA analysis to, via the magic of ordination, compress the data into a semi-human-viewable 3D space (qiime diversity pcoa). (The math under the hood is still kinda magic to me :woman_mage:)
      • core-metrics-results/bray_curtis_pcoa_results.qza
      • core-metrics-results/jaccard_pcoa_results.qza
      • core-metrics-results/weighted_unifrac_pcoa_results.qza
      • core-metrics-results/unweighted_unifrac_pcoa_results.qza
    3. Use Emperor (qiime emperor plot) to generate a PCoA visualization that you can view.
      • core-metrics-results/bray_curtis_pcoa_results.qzv
      • core-metrics-results/jaccard_pcoa_results.qzv
      • core-metrics-results/weighted_unifrac_pcoa_results.qzv
      • core-metrics-results/unweighted_unifrac_pcoa_results.qzv

What that means is that you can stop at any point in the process and pick up from the intermediate artifacts on their own. You don't need to re-run core diversity from scratch each time.

If you check the help documentation for each command, you might find that the only command in the core-metrics workflow that requires metadata is qiime emperor plot. So, if you've re-done your metadata, you can just re-run qiime emperor plot on each of your four metrics and get new PCoA visualizations. You don't have to re-run the earlier commands as long as you're working with the same PCoA.
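
As a sketch, assuming the default core-metrics-results filenames and that your corrected metadata lives in metadata.txt:

# Re-draw all four Emperor plots against the corrected metadata;
# the PCoA artifacts themselves don't change
for metric in bray_curtis jaccard weighted_unifrac unweighted_unifrac; do
  qiime emperor plot \
    --i-pcoa core-metrics-results/${metric}_pcoa_results.qza \
    --m-metadata-file metadata.txt \
    --o-visualization core-metrics-results/${metric}_emperor.qzv
done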

So, beta diversity is a between-sample comparison. For each pair of samples, we can measure similarity in a number of ways. Except that, for math-y reasons, we usually work in dissimilarity, which is (1 - similarity). Even more mathematically cool, most of the ways we measure dissimilarity actually qualify for the properties of distances (the Wikipedia page is a good primer, but it's not entirely relevant here, just trust me on this?). So, the distance matrix is just the collection of all the pairwise distances between samples.

We can use this representation for a bunch of things. We can do statistical tests directly on the distances (PERMANOVA, PERMDISP, Adonis, etc.). But it can also be hard to visualize distance. Heat maps and clustering can be helpful, but what's sometimes really nice is to take all the complex data and just... project it into 3D space to look for separations and gradients. That's where we get our PCoA.
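
Running a test directly on a distance matrix looks something like this (Treatment is a placeholder for whatever your grouping column is called, and the filenames carry on from my earlier sketch):

# PERMANOVA on a distance matrix, with pairwise group comparisons
qiime diversity beta-group-significance \
  --i-distance-matrix weighted-unifrac-day23.qza \
  --m-metadata-file metadata.txt \
  --m-metadata-column Treatment \
  --p-method permanova \
  --p-pairwise \
  --o-visualization weighted-unifrac-day23-permanova.qzv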

One good way I like to think about this is that the distances are the little notations in my atlas that tell me how many miles lie between cities :page_facing_up: and the PCoA is like a map :world_map:. The problem here is that, unlike a map of physical locations, the projection into PCoA space doesn't have fixed reference points (like North, South, East, and West), so what we see is some sort of transformation of the data that can change based on what we choose to show.

With regard to filtering, because that can also be confusing...

If you filter samples, you're not changing your features. Some features may be all 0s in the new table (they weren't present in the samples you kept to begin with), but it doesn't change the per-sample counts.
If I'm comparing my book collection with my siblings' and, somehow, magically, none of us manages to buy more books before our comparison ends, the books I have that are different from my sister's aren't affected by whether or not I'm also comparing books with my brother. And if he has a book neither of us has, then for the two of us there is no difference. The features in your table aren't changing if you remove samples, just like the books in my library don't change if only my sister and I compare books.

I can also filter my distance matrix, i.e. my between-sample comparisons, to get a subset of distances. It's the same as if I only look at a distance table for California. It doesn't affect the distance between a city in Cali :palm_tree: and one in Arizona :cactus:; it just means we don't see Arizona.

Okay, so I'm guessing you meant qiime feature-table filter-features, but I'm not totally sure, so I just pulled the help documentation (god, I love help documentation, I can't do an analysis without it) for both commands.

Usage: qiime feature-table filter-samples [OPTIONS]

  Filter samples from table based on frequency and/or metadata. Any features
  with a frequency of zero after sample filtering will also be removed. See
  the filtering tutorial on https://docs.qiime2.org for additional details.

and

Usage: qiime feature-table filter-features [OPTIONS]

  Filter features from table based on frequency and/or metadata. Any samples
  with a frequency of zero after feature filtering will also be removed. See
  the filtering tutorial on https://docs.qiime2.org for additional details.

So, qiime feature-table filter-samples removes samples. And then, if a feature is all 0s after we kick out the samples, it removes that feature too. But that doesn't matter for our diversity calculations, because it was all 0s anyway, and the distance between 0 and 0 is 0. This is my sister and me kicking our brother out of a discussion about libraries :books:.

And then, qiime feature-table filter-features will remove features (ASVs, OTUs, KEGG orthologs, books from my sister's library, etc.) from the table. Maybe you want to remove low-frequency features (say, present in less than 10% of your samples), or maybe you want to focus on only one taxonomic group (my sister just texted and said we can only compare books about dragons :dragon:). However, if a sample has no features left after filtering, that sample doesn't get retained in the table.
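
A minimal sketch of that - the threshold and filenames here are purely for illustration:

# Drop features observed in fewer than 10 samples
qiime feature-table filter-features \
  --i-table table.qza \
  --p-min-samples 10 \
  --o-filtered-table table-min10.qza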

Try checking the usage for qiime diversity filter-distance-matrix and see if you can figure out how it works.

I, erm, may have a bad history of, umm, burning through laptops by running computationally intensive things on them :computer::grimacing:. It also becomes more of a problem when you work on large datasets. It may not matter right now if you have 100-200 samples. But if you've got 1,000 or 10,000, the time it takes to do the distance calculation increases with the number of pairs (because they're pairwise comparisons), and suddenly it takes really long to re-run. So it's a good habit to get into, I think.

Hopefully this helps some, but please, let me know if you've got more questions!

Best,
Justine


Hey,

Thanks again! I’ve understood everything you said and appreciate the analogies (always my go-to way of explaining things to people too!)

So I understand the distance matrix and what this is. I took your advice and tried adding a metadata column called DayxTreatment, which I can use to further split my samples into groups and compare, say, Treatment1-Day23 with Treatment2-Day23. I’ll call this method “the lazy way” for the rest of my post. :grin:

Then I used qiime diversity filter-distance-matrix on my Bray-Curtis and weighted UniFrac distance matrices to create new distance matrices containing Day 37 only and Day 23 only samples, and then ran qiime diversity beta-group-significance on these, grouped by treatment. Then I compared the two methods. Is this what you were suggesting? I hope so, because it took ages, haha! With the Bray-Curtis results, it made no difference: the pseudo-F and p-values for the two particular comparisons I looked at were identical. Yay. For weighted UniFrac, however, the p-value for Treatment1 vs Treatment2 using “the lazy way” was 0.02, whereas for the long way (filtering and focusing on each time point on its own) it was 0.015. For Treatment3 vs Treatment4, the lazy way gave p = 0.003 and the long way p = 0.001. Any idea why these would be different? Would the different q-values affect it?

Many thanks!
Lindsay

Hi @xchromosome,

I’m glad it made sense!

When you’re testing beta diversity, you use a permutative test. You essentially shuffle your data and then see if your statistic of choice is more extreme than what you’d get at random. In QIIME 2, the p-value gets calculated as (1 + number more extreme)/(1 + permutations). So, if you get a p-value of 0.003 with 999 permutations, it means that 2 of the 999 permutations were more extreme than your original value: (2 + 1)/(999 + 1) = 0.003. One problem with p-values from this method is that how far your p-value falls below the threshold doesn’t tell you as much about the strength of the association. So, a p-value of 0.001 might be similar to a p-value of 0.002, or it might mean that only 1 in 100,000 possible permutations is more extreme and you just didn’t sample them. (For most purposes, I’m not sure it matters, and somewhere in the distant past of notes three moves ago I had notes about this I need to dig out.) So, that means that with weighted UniFrac distance, you might just be seeing something that butts up against the permutation limit, and by chance/seed/random process you get a slightly different set of permutations in each test.

If you want explanatory power/an effect-size approximation, you might want to look into Adonis. I’ve been using it recently and really like it because it gives me an R^2 in addition to a p-value. I think He et al. did sort of a cool thing with it.
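
In QIIME 2, that’s available as qiime diversity adonis; a minimal sketch, where the formula crossing Treatment and Day is just an example for your design:

# Adonis (PERMANOVA with a model formula) on a distance matrix
qiime diversity adonis \
  --i-distance-matrix core-metrics-results/weighted_unifrac_distance_matrix.qza \
  --m-metadata-file metadata.txt \
  --p-formula "Treatment*Day" \
  --o-visualization weighted-unifrac-adonis.qzv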

Best,
Justine

Awesome, thank you!

One last thing - with regard to using q2-longitudinal: although I have two time points, the actual animals aren’t the same animals. We had 80 pens of 15 chickens each, and sacrificed one chicken from each pen at day 23 and another at day 37. Each chicken was taken as a representative of its pen, but they obviously weren’t the same individual. Do you think q2-longitudinal would still be appropriate?

Thank you!
Lindsay


Hi @xchromosome,

In that case, I don’t think you have truly paired samples. The power you get from pairing comes from the ability to average out an individual’s microbiome, leaving just what’s changed. But, since you have different chickens, it won’t work quite as well.

BTW, out of curiosity, are chickens coprophagic?

Best,
Justine

Hi Justine,

Thanks - that’s a shame. It’s nice to know that it exists for future studies though. Yes, chickens are coprophagic!

Best regards
Lindsay

