Core Metrics diversity analysis on taxonomic subsets of features


(Jennifer Jankowiak) #1

Hi QIIME community,

I am trying to run alpha and beta diversity analyses on my 16S dataset using the core metrics pipeline, but I ran into an issue choosing a sampling depth because I am dividing the dataset into 2 subsets based on taxonomy. For my overall analysis I want to analyze the cyanobacterial communities and all non-cyanobacterial bacterial communities in my samples. To divide the dataset I used qiime taxa filter-table to include or exclude cyanobacteria. I would now like to run the core metrics diversity pipeline on those 2 subsets. However, among the cyanobacteria the sequencing depth varies from about 300 to 50,000, since cyanobacteria are not highly abundant in some samples due to collection methods. I do not want to drop any samples, but I am worried about using such a low sampling depth of 300.

My question is whether it would be possible, and valid, to run the core metrics on the whole dataset (cyanobacteria + bacteria) so that I can use a higher sampling depth for rarefaction, then filter the resulting rarefied_table.qza by taxonomy into the bacteria and cyanobacteria subsets, and finally calculate the alpha and beta diversity metrics on those 2 tables to look at the diversity of each community.
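To make the "rarefy the whole table, then split by taxonomy" idea concrete, here is a minimal pure-Python sketch. It is not the QIIME 2 implementation; the feature IDs, taxonomy strings, counts, and the helper names rarefy and split_by_taxon are all made up for illustration.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample one sample's feature counts to `depth` reads without replacement."""
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None  # the sample would be dropped at this depth
    out = {}
    for feature in rng.sample(pool, depth):
        out[feature] = out.get(feature, 0) + 1
    return out

def split_by_taxon(counts, taxonomy, keyword):
    """Split one sample's counts into (matching, non-matching) by a taxonomy keyword."""
    inside = {f: n for f, n in counts.items() if keyword in taxonomy[f]}
    outside = {f: n for f, n in counts.items() if keyword not in taxonomy[f]}
    return inside, outside

# Hypothetical toy data (assumption): two samples with equal total depth
# but very different cyanobacterial fractions.
taxonomy = {
    "f1": "k__Bacteria; p__Cyanobacteria",
    "f2": "k__Bacteria; p__Cyanobacteria",
    "f3": "k__Bacteria; p__Proteobacteria",
    "f4": "k__Bacteria; p__Bacteroidetes",
}
table = {
    "whole_water": {"f1": 5000, "f2": 500, "f3": 4000, "f4": 2500},
    "filtered": {"f1": 200, "f2": 100, "f3": 8000, "f4": 3700},
}

rng = random.Random(0)
depth = 10000  # chosen from the total counts, not the cyanobacterial counts
cyano_depths = {}
for sample, counts in table.items():
    rarefied = rarefy(counts, depth, rng)
    cyano, other = split_by_taxon(rarefied, taxonomy, "Cyanobacteria")
    cyano_depths[sample] = sum(cyano.values())

print(cyano_depths)
```

After the split, every sample has the same total depth, but the cyanobacterial subsets still differ in depth roughly in proportion to how abundant cyanobacteria were in each sample. Diversity computed on those subsets therefore still reflects cyanobacterial read counts.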

The only other options I could think of are: (1) run the core metrics on the feature tables already filtered by taxonomy into the cyanobacteria and bacteria groups and just use the low sampling depth, which loses a lot of the data from some of the samples; or (2) export the filtered tables, normalize them by converting to relative abundance, and run the analysis in some other program such as R. I did try exporting and normalizing the tables (converting to relative abundances) and then importing them back into QIIME 2 to use the alpha and beta diversity plugins, which I saw recommended in other posts, but it was returning extra vectors.

I am currently running QIIME 2 2018.6 in VirtualBox.

Any help would be much appreciated.
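For reference, the relative-abundance normalization mentioned above amounts to dividing each feature count by its sample total. A minimal sketch, with made-up feature names and counts and a hypothetical helper to_relative_abundance:

```python
def to_relative_abundance(counts):
    """Convert one sample's feature counts to proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: n / total for feature, n in counts.items()}

# Hypothetical samples (assumption): same composition, very different depth.
shallow = {"cyano_a": 150, "cyano_b": 150}      # 300 reads
deep = {"cyano_a": 25000, "cyano_b": 25000}     # 50,000 reads

print(to_relative_abundance(shallow))
print(to_relative_abundance(deep))
```

Both samples end up with identical proportions, so the depth difference is hidden rather than removed: the estimate from 300 reads is still much less precise than the one from 50,000. The resulting values are also no longer integer counts, which some count-based diversity metrics assume.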


(Justine) #2

Hi @jjankowiak,

First, I think this is a hard problem, and I don’t think there’s one right answer. I think it may be an unbenchmarked question.

On a global level, it’s much easier, because you need to work from the same sampling depth to be consistent.

I’m wondering what your total sequence counts look like. Have you already filtered down to a sample set, excluding samples with low sequencing depth or selecting the set of samples you’d use for a deeper rarefaction depth? If your sample with 300 cyano counts has a total of, say, 750 counts, then I’d throw that sample out and count it as a “failed” sample.
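As a rough sketch of that check, one can compare each sample's total count with its cyanobacterial count and flag only the samples whose total is low. The sample IDs, counts, and the 5000-read cutoff below are assumptions for illustration, not recommendations.

```python
def flag_failed_samples(totals, min_total):
    """Return the IDs of samples whose total read count is below `min_total`."""
    return sorted(sample for sample, n in totals.items() if n < min_total)

# Hypothetical per-sample counts (assumption).
total_counts = {"s1": 52000, "s2": 12300, "s3": 750}
cyano_counts = {"s1": 30000, "s2": 310, "s3": 300}

failed = flag_failed_samples(total_counts, min_total=5000)
for sample in sorted(total_counts):
    fraction = cyano_counts[sample] / total_counts[sample]
    status = "failed" if sample in failed else "ok"
    print(sample, total_counts[sample], cyano_counts[sample], round(fraction, 3), status)
```

In this toy table, s2 and s3 both have about 300 cyanobacterial reads, but only s3 is flagged: s2 sequenced well overall and simply contains few cyanobacteria.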

I think this would probably be my approach over re-normalization and rarefaction. I think it’s slightly biased: you’re essentially still biasing your diversity with sequence counts. But, you’re also dealing with a bounded subset of the community, where the abundance of sequences is something you’re kind of interested in.

I would also suggest a rarefaction curve for your cyanobacteria. It might answer questions about saturation that could be interesting. I’d look at multiple metrics: weighted and unweighted, with the caveat that weighted tends to saturate quickly and unweighted not so much. But, it will give you a better sense of depth.
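A minimal sketch of what such a rarefaction curve computes, assuming that "weighted" here means an abundance-weighted metric such as Shannon and "unweighted" means a presence/absence metric such as observed features. The toy sample and the helper names are made up; this is not the QIIME 2 implementation.

```python
import math
import random

def subsample(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample."""
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    out = {}
    for feature in rng.sample(pool, depth):
        out[feature] = out.get(feature, 0) + 1
    return out

def observed_features(counts):
    """Unweighted: number of features with at least one read."""
    return sum(1 for n in counts.values() if n > 0)

def shannon(counts):
    """Abundance-weighted: Shannon entropy in bits."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n > 0)

def rarefaction_curve(counts, depths, metric, iterations=10, seed=0):
    """Average `metric` over repeated subsamples at each depth."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        values = [metric(subsample(counts, depth, rng)) for _ in range(iterations)]
        curve.append((depth, sum(values) / len(values)))
    return curve

# Hypothetical sample (assumption): one dominant feature plus 50 rare ones.
sample = {"dominant": 40000}
sample.update({"rare_%d" % i: 10 for i in range(50)})

depths = [300, 1000, 5000, 20000]
for name, metric in [("observed_features", observed_features), ("shannon", shannon)]:
    print(name, rarefaction_curve(sample, depths, metric))
```

Comparing how the two curves change between 300 reads and the deeper depths gives a sense of what a 300-read cutoff would miss for each kind of metric.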

Best,
Justine


(Jennifer Jankowiak) #3

Hi Justine,

Thanks for your quick response!

I’m wondering what your total sequence counts look like. Have you already filtered down to a sample set, excluding samples with low sequencing depth or selecting the set of samples you’d use for a deeper rarefaction depth? If your sample with 300 cyano counts has a total of, say, 750 counts, then I’d throw that sample out and count it as a “failed” sample.

All my samples had higher bacteria counts, ranging from 12,000 to about 80,000, so when all sequences were considered the samples seemed OK. The low cyanobacteria counts were across one group of samples, which we were expecting: they were sampled from filtered water that we size fractionated, and the main cyanobacterium found in our samples (Microcystis) was excluded from this one group due to its size. I was trying to avoid dropping this group of samples because one of my main objectives was to compare the cyanobacteria across these size-fractionated communities. That being said, I know from the taxonomic barplot that a large portion of the sequences in the samples with higher counts are Microcystis, as expected since it was not excluded by size in these samples, and therefore the diversity of these communities is likely still being captured with the rarefaction.

I think this would probably be my approach over re-normalization and rarefaction. I think it’s slightly biased: you’re essentially still biasing your diversity with sequence counts. But, you’re also dealing with a bounded subset of the community, where the abundance of sequences is something you’re kind of interested in.

I think I will try this then. I am mostly interested in the bacteria which on their own have a very similar sequencing depth to the total dataset. The cyanobacteria are something I wanted to look at but were not the main focus of the project so hopefully this will give me an idea of the community diversity.

I would also suggest a rarefaction curve for your cyanobacteria. It might answer questions about saturation that could be interesting. I’d look at multiple metrics: weighted and unweighted, with the caveat that weighted tends to saturate quickly and unweighted not so much. But, it will give you a better sense of depth.

Is there currently any way to run the rarefaction curves without a phylogenetic tree, so I can start exploring this with my data? I have not gotten to tree building yet: the last time I tried, it ran for several days, so I was holding off on that for now, but I did see there were new plugins for tree building.

Thanks,

Jennifer


(Justine) #4

Hi @jjankowiak,

So, it sounds like your experiment with filtering worked and you got good sequencing! That’s good.

First, I’d suggest just biting the bullet and building the tree. At a sequencing depth of 12,000-80,000 seqs/sample, everything is going to take time, particularly if you have a large number of features from a diverse environment (like marine samples). I think everyone has their opinions on trees, but I’m a big fan of phylogeny.

As far as the rarefaction curve, you can pass an argument for metric. As long as you don’t request a phylogenetic metric (i.e., Faith’s PD), you don’t need the phylogenetic tree.
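To illustrate why only the phylogenetic metric needs a tree, here is a small sketch: observed features uses the counts alone, while Faith’s PD sums branch lengths and so cannot be computed without them. The tiny tree, its branch lengths, and the helper names are assumptions for illustration, not QIIME 2 code.

```python
def observed_features(counts):
    """Non-phylogenetic: needs only the feature counts."""
    return sum(1 for n in counts.values() if n > 0)

def faith_pd(counts, parent, branch_length):
    """Phylogenetic: total branch length connecting observed tips to the root."""
    visited = set()
    for tip, n in counts.items():
        node = tip
        # Walk toward the root, stopping once we reach an already-counted branch.
        while n > 0 and node in parent and node not in visited:
            visited.add(node)
            node = parent[node]
    return sum(branch_length[node] for node in visited)

# Hypothetical rooted tree (assumption): tips t1 and t2 under A, t3 under B.
parent = {"t1": "A", "t2": "A", "t3": "B", "A": "root", "B": "root"}
branch_length = {"t1": 1.0, "t2": 2.0, "t3": 1.5, "A": 0.5, "B": 0.25}

counts = {"t1": 10, "t2": 0, "t3": 5}
print(observed_features(counts))
print(faith_pd(counts, parent, branch_length))
```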

Best,
Justine


(Jennifer Jankowiak) #5

I will go ahead and try the tree building then.

Also, in case anyone else is interested in splitting their datasets by taxonomy, I tried running the core metrics on my dataset both ways: splitting into bacteria and cyanobacteria and then running the core metrics plugin, and running the core metrics plugin on the whole dataset and then splitting the rarefied table into bacteria and cyanobacteria. Both ways I got essentially the same results: the boxplots were almost identical, and for alpha group significance almost all the same pairs were significant, apart from a few that were right on the border of 0.05.

Thank you for all your help. This answers all my questions.