choosing phylogeny for core-metrics

Heather_E · April 13, 2020, 8:23pm

Hello all. I am running core-metrics on some data and get slightly different results based on my rooted tree choice. But, I am not sure which tree I should be using. In the first run, I created a rooted tree using the align-to-tree-mafft-fasttree command and rep-seqs from my entire dataset (all fish collected from a river), but ran core-metrics on a table that included a subset of my original samples (blue catfish only). In the second run, I created a new tree using only sequences from my subset, then ran core-metrics again. The results were slightly different and I am wondering how the choice of tree influences that output, since the table with the features I wanted considered was the same in both. And which tree would be the better choice to use: the tree using my entire dataset, which would be a fuller tree, or a smaller tree that contains sequences specific only to my subset?

The link for the first run is BCF-faiths-sig.qzv (1.4 MB)

and the second run is faiths-sig.qzv (1.4 MB)

Collection Site is the particular metric I am looking at. Thanks for any insight.

jwdebelius · April 14, 2020, 1:05am

Hi @Heather_E,

There are two possible sources of difference here. The first is potentially in the tree alignment, although to some degree, you expect the distance between a pair of sequences (and therefore the alignment reconstruction) to be consistent regardless of other elements in the tree. This is not always strictly true, but it's usually close enough for government work.
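To make the tree's role concrete, here is a minimal Python sketch of why the same feature table can give different Faith's PD values on two trees. It is a toy under stated assumptions, not QIIME 2's implementation: the `faith_pd` helper, the tree layout (node mapped to parent and branch length), and all branch lengths are made up for illustration.

```python
# Toy illustration: Faith's PD is the total branch length spanned by the
# features observed in a sample, so it depends on the tree's branch lengths.

def faith_pd(tree, observed_tips):
    """Sum branch lengths on the paths from each observed tip to the root.

    `tree` maps node -> (parent, branch_length); the root has parent None.
    """
    visited = set()
    total = 0.0
    for tip in observed_tips:
        node = tip
        # Walk up toward the root, counting each branch only once.
        while node is not None and node not in visited:
            parent, length = tree[node]
            visited.add(node)
            total += length
            node = parent
    return total

# Hypothetical tree built from the full dataset (made-up branch lengths).
full_tree = {
    "root": (None, 0.0),
    "n1": ("root", 0.10),
    "A": ("n1", 0.20),
    "B": ("n1", 0.30),
    "C": ("root", 0.50),  # feature absent from the subset
}

# Hypothetical tree rebuilt from the subset only: A and B keep the same
# relationship, but the branch lengths come out slightly different.
subset_tree = {
    "root": (None, 0.0),
    "n1": ("root", 0.12),
    "A": ("n1", 0.19),
    "B": ("n1", 0.31),
}

sample = ["A", "B"]  # features observed in one sample
pd_full = faith_pd(full_tree, sample)
pd_subset = faith_pd(subset_tree, sample)
print(pd_full, pd_subset)
```

The sample contains the same features in both cases, yet the two values differ because the branch lengths do. Small shifts like this are what you would expect from re-aligning and rebuilding a tree with fewer sequences.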

The second potential source is rarefaction. This is a stochastic process, and so you introduce some error with the set of sequences that get selected during rarefaction. It may also explain slightly different results.

I tend to prefer to work with a larger tree. Mostly because I'm lazy, trees are computationally expensive, and then I only have to compute one once no matter what sets of data I want to analyze or reanalyze.

Best,
Justine

Heather_E · April 14, 2020, 2:12am

Thanks for your input, Justine. Rarefaction shouldn't have been an issue here since I actually pulled my subset from the rarefied table created when I ran core-metrics on the whole dataset. Then, when I ran core-metrics on the subset, I just set my sampling depth to the same number. Thus, the sequences chosen should all be the same. There must be some slight variation in the tree. But, I think you are right that it will make for a cleaner analysis if I just stick with the larger tree.

jwdebelius · April 14, 2020, 2:17am

Hi @Heather_E,

Rarefaction is a random process. I can take the same data and re-rarefy 5 times and I will get slightly different compositions. Sequences are chosen at random, meaning that a re-sampling may select slightly different sequences. If you don't believe me, try running the core-diversity-metrics command on the Parkinson's mouse tutorial and comparing your result to the tutorial result. They will be slightly different because you used different rarefactions.

So, that's where I'd assume the larger element of randomness.

Best,
Justine

Heather_E · April 14, 2020, 3:08am

Hi Justine,

Yes, I understand that rarefaction is a random process, but when you run the core-metrics command, one of the outputs is a rarefied table. Let's say I set my sampling depth to 5000. In the rarefied table, all samples will now have exactly 5000 features, randomly sampled from the original dataset. As I understand it, all of the other outputs in that core-metrics folder will use this rarefied table to run the rest of the metric data (Faith's PD, Shannon, etc.) so that all processes are using the same input.

So, when I subsampled, I used this rarefied table as my starting point so that I was pulling a specific species out of the table, but using the same rarefied data. My new table had one species, all samples with the same 5000 features they had in the original rarefied table. Then, when I reran the core-metrics, I just set my sampling-depth to 5000 again, but there were only 5000 features to begin with, so everything was used. The only difference was that my tree was also made with only those features found in the rarefied table, versus all features found in the original dataset before rarefaction and filtering.

If I'm thinking about this incorrectly, someone let me know, but I don't see how you randomly rarefy 5000 features when there are only 5000 features per sample to begin with. Which still leaves the tree as being the only source of variation I can think of.
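Both points in this exchange can be checked with a small Python sketch. It assumes rarefaction means drawing reads without replacement down to a fixed per-sample total; the `rarefy` helper, feature names, and counts are hypothetical and this is not QIIME 2's code.

```python
import random

def rarefy(counts, depth, seed):
    """Subsample `depth` reads without replacement from one sample.

    `counts` maps feature ID -> read count; returns a new dict of counts.
    """
    # Expand the counts into one entry per read, then draw without replacement.
    reads = [feature for feature, n in sorted(counts.items()) for _ in range(n)]
    if depth > len(reads):
        raise ValueError("sampling depth exceeds the sample's total count")
    drawn = random.Random(seed).sample(reads, depth)
    result = {}
    for feature in drawn:
        result[feature] = result.get(feature, 0) + 1
    return result

# Made-up sample with 10,000 reads spread over three features.
raw = {"featA": 6000, "featB": 3500, "featC": 500}

# Two independent rarefactions of the raw sample to a depth of 5000.
first = rarefy(raw, 5000, seed=1)
second = rarefy(raw, 5000, seed=2)
print("independent draws identical:", first == second)

# Rarefying the already-rarefied sample to the same depth draws every read,
# so there is nothing left to randomize.
again = rarefy(first, 5000, seed=3)
print("re-rarefied at full depth identical:", again == first)
```

Independent draws from the raw counts will generally differ a little, which is the randomness Justine describes. When the depth equals the sample's total, every read is selected and the counts cannot change, which is the case described in this post.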

jwdebelius · April 15, 2020, 3:44pm

Hi @Heather_E,

I apologize. I misread. Yes, I would then assume that the tree is the source of variation, likely due to slight differences after alignment. Again, I'd assume the differences are slight and come from tree building and having a different set of tips.
Have you run a Mantel test to check the correlation between the two distance matrices?
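For readers unfamiliar with it, a Mantel test correlates the pairwise distances in two matrices over the same samples and gets a p-value by permuting sample labels. The Python sketch below is a bare-bones version under stated assumptions: the `pearson` and `mantel` helpers and the two 4x4 matrices are made up, and in practice you would feed in the distance matrices from the two runs (QIIME 2 also has a `qiime diversity mantel` command for this, as far as I know).

```python
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def mantel(dm1, dm2, permutations=999, seed=0):
    """Two-sided permutation Mantel test on two square distance matrices.

    Both matrices must list the same samples in the same order.
    """
    n = len(dm1)
    pairs = list(combinations(range(n), 2))  # upper triangle only
    x = [dm1[i][j] for i, j in pairs]
    observed = pearson(x, [dm2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        # Shuffle sample labels of the second matrix (rows and columns together).
        rng.shuffle(order)
        y = [dm2[order[i]][order[j]] for i, j in pairs]
        if abs(pearson(x, y)) >= abs(observed) - 1e-12:
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return observed, p_value

# Made-up distances among four samples using the full tree ...
dm_full = [
    [0.00, 0.20, 0.50, 0.70],
    [0.20, 0.00, 0.40, 0.60],
    [0.50, 0.40, 0.00, 0.30],
    [0.70, 0.60, 0.30, 0.00],
]
# ... and the same samples using the subset tree (slightly perturbed).
dm_subset = [
    [0.00, 0.22, 0.49, 0.71],
    [0.22, 0.00, 0.41, 0.58],
    [0.49, 0.41, 0.00, 0.31],
    [0.71, 0.58, 0.31, 0.00],
]

r, p = mantel(dm_full, dm_subset)
print(f"Mantel r = {r:.3f}, p = {p:.3f}")
```

A correlation close to 1 would suggest the two trees produce essentially the same between-sample structure. With only four samples the permutation p-value is not informative, so treat the toy output as a demonstration of the mechanics only.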

Best,
Justine

Heather_E · April 15, 2020, 4:30pm

I have not run a Mantel test. That's a good idea. Thanks for the input.
