Mistake in metadata file and re-running core metrics analysis.

Hi @xchromosome,

It sounds like a lot, but let's work through it!

Let me start by giving some general feedback that hopefully will set your mind (and computation) at ease. Taxonomic assignments, trees, and tables don't need to be re-built after filtering features. You can pass a tree that's a superset of the features in your table, and you should be able to have taxonomic labels for sequences that aren't in your table. (And if I'm wrong and this is screwy, you can just filter rather than having to rebuild.)
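If it ever turns out a tool does complain about extra tips, a tree can be trimmed to match a table instead of being rebuilt. A minimal sketch (the artifact names here are placeholders for your own files):

```shell
# Trim a rooted tree down to only the features present in a table.
# rooted-tree.qza and table.qza are placeholder names.
qiime phylogeny filter-tree \
  --i-tree rooted-tree.qza \
  --i-table table.qza \
  --o-filtered-tree filtered-tree.qza
```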

Ouch! This happens. It's always hard, but it's good to catch it now, if you can. So, yay for finding it. However, it doesn't invalidate your analysis.

Your tables, tree, and taxonomy should be metadata agnostic. You can (although I personally discourage it) generate feature tables, trees, taxonomic assignments, and even diversity results without ever knowing anything about a sample except its name and barcode. Of course, once you get to interpretation, you're kind of SOL without metadata, but you can get there.

Good that you re-ran your diversity analyses; these definitely need to be re-done. You do not need to re-run your feature classification, though. You already filtered out the taxa you don't want; at worst, you'll need to filter your feature data. If you'd filtered your sequences and then gone in and done de novo OTU clustering, then you would need to re-do your classification, but since you filtered the table, there's no need. (You also don't need to re-build your tree; the algorithms will just prune it for you.)
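For reference, table-level taxonomy filtering (which doesn't require re-classifying afterward) looks roughly like this; the file names and excluded taxa are placeholders for whatever you actually filtered:

```shell
# Drop unwanted taxa from the table; the taxonomy artifact stays valid.
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table filtered-table.qza

# Optionally, bring the representative sequences in line with the table.
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table filtered-table.qza \
  --o-filtered-data filtered-seqs.qza
```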

So, it looks like this step is good.

This is a place where I think breaking out of core diversity may help you. I would try both adding the extra column and separating by treatment. You don't have to re-calculate the distance matrix if you're not changing the underlying feature table (features included, rarefaction depth); you can just use qiime diversity filter-distance-matrix. Because calculating distance matrices takes a long time and tends to be computationally intense, in my own analyses I work really hard to only calculate a distance matrix once and then just filter it for whatever I need.
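A minimal sketch of subsetting an existing distance matrix by metadata instead of recomputing it; the column name (`treatment`), its value, and the file names are placeholders for your own study:

```shell
# Keep only the samples matching a metadata query; no distances are recomputed.
qiime diversity filter-distance-matrix \
  --i-distance-matrix unweighted-unifrac.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where "[treatment]='control'" \
  --o-filtered-distance-matrix control-distance.qza
```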

Once you've got your filtered distance matrix, you will need to calculate a new PCoA and run new statistics (PERMANOVA, etc.). PCoA is a projection based on the distances in your dataset, and adding or removing a point can shift it, so it needs to be re-done every time. Luckily, it's not terribly computationally intense and it's pretty quick. (Check out qiime diversity pcoa.) And if you discover another issue with your metadata, you can actually just update the Emperor plot for the same PCoA using the new metadata.
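Roughly, the downstream re-runs look like this (file and column names are placeholders); note that if only the metadata changed and the sample set didn't, just the Emperor step needs re-running:

```shell
# Re-project the filtered distances.
qiime diversity pcoa \
  --i-distance-matrix control-distance.qza \
  --o-pcoa control-pcoa.qza

# Re-run the group statistics (PERMANOVA by default).
qiime diversity beta-group-significance \
  --i-distance-matrix control-distance.qza \
  --m-metadata-file sample-metadata.tsv \
  --m-metadata-column treatment \
  --o-visualization permanova.qzv

# Re-draw the ordination with the corrected metadata.
qiime emperor plot \
  --i-pcoa control-pcoa.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization control-emperor.qzv
```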

I would also suggest looking at q2-longitudinal, because if you've got paired samples, you should use them! Paired samples may help decrease some of the noise and give you all sorts of shiny statistical benefits (like getting around some of the obnoxious properties of distance matrices, for instance).
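As one example from q2-longitudinal, a paired within-subject distance comparison looks something like this; the column names (`subject-id`, `timepoint`) and states (`pre`, `post`) are placeholders for your own design:

```shell
# Compare within-subject distances between two timepoints across groups.
qiime longitudinal pairwise-distances \
  --i-distance-matrix unweighted-unifrac.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-group-column treatment \
  --p-state-column timepoint \
  --p-state-1 pre \
  --p-state-2 post \
  --p-individual-id-column subject-id \
  --o-visualization pairwise-distances.qzv
```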

So, as a wrap up:

  1. Mistakes happen, you found yours and fixed them, yay!
  2. You don't need to re-do taxonomic classification or build a new tree unless you're filtering your sequences before building your feature table.
  3. You don't need to calculate a new distance matrix unless you're changing your feature table (like filtering features) or rarefaction depth.
  4. You do need to do a new PCoA if you change your sample set. You also need to do new statistical tests.
  5. You pick up power with paired samples, so use them!

Best,
Justine
