Bray-Curtis PCoA results visualisation problem

Sapsanas · December 12, 2017, 6:47pm

Hello,
I got a strange result when visualizing bray_curtis_emperor.qzv after computing the alpha and beta diversity metrics. It seems like the data doesn't match the metadata.

The other .qzv outputs from this command look OK (an example, unweighted_unifrac_emperor.qzv, is attached).

The command is the following:
qiime diversity core-metrics-phylogenetic --i-phylogeny rooted-tree.qza --i-table table.qza --p-sampling-depth 5000 --m-metadata-file sample-metadata.tsv --output-dir core-metrics-results

Results from
qiime diversity beta-group-significance --i-distance-matrix core-metrics-results/bray_curtis_distance_matrix.qza --m-metadata-file sample-metadata.tsv --m-metadata-category Group --o-visualization core-metrics-results/bray-curtis-group-siqgnificance.qzv --p-pairwise

are also meaningless:

Have you ever seen such a problem before? Is there any possible solution?
Thank you in advance!

P.S. This is a separate question, but it is not clear from the QIIME 2 info which database is used during the early steps of core-metrics-phylogenetic generation.

ebolyen · December 13, 2017, 11:46pm

Hey @Sapsanas,

Thanks for the screenshots, those results definitely indicate an issue.

What you are seeing probably stems from a problem upstream in your analysis.

So that I can confirm that, could you post a .qzv (or a small .qza file) from your core-metrics-phylogenetic run?
I'll be able to look at the provenance to double-check.
Could you also briefly describe what you are analyzing, and where your sequences came from (how many runs, is it a meta-analysis, etc)?

Since we are seeing the strange results with Bray-Curtis, we should think about its definition and what that means. It is 1 minus twice the abundance shared between two samples (the smaller of the two counts for each feature, summed) over the total abundance of both samples. So in order to see a value of 1, there must be no shared features.
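As a minimal sketch of that definition (plain Python, not the code QIIME 2 itself runs), the function below computes Bray-Curtis for two samples given as feature-to-count dictionaries. The sample names, feature IDs and counts are made up for illustration.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples.

    u and v map feature IDs to counts (hypothetical toy data below).
    """
    features = set(u) | set(v)
    # Shared abundance: the smaller of the two counts for every feature.
    shared = sum(min(u.get(f, 0), v.get(f, 0)) for f in features)
    total = sum(u.values()) + sum(v.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * shared / total


# Made-up samples: a and b share one feature, a and c share none.
sample_a = {"asv1": 10, "asv2": 5}
sample_b = {"asv1": 4, "asv3": 6}
sample_c = {"asv4": 7, "asv5": 3}

d_ab = bray_curtis(sample_a, sample_b)  # some overlap, so below 1
d_ac = bray_curtis(sample_a, sample_c)  # no overlap, so exactly 1
print(d_ab, d_ac)
```

Any pair of samples with no feature in common lands at the maximum of 1, no matter how similar the underlying communities really are.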

This gets into your P.S. question about databases a bit:

Depending on your analysis, there may not be any database used. If you were following the Moving Pictures Tutorial for example, a database doesn't actually come into play until we do taxonomic analysis (well after core-metrics).

Instead of having OTUs that map to something like Greengenes, we are generally using what are called Amplicon Sequence Variants (or ASVs). These are effectively 100% OTUs with some denoising to correct for sequencing error. What this means is that the ASVs are only comparable if they are from the same amplicon target, and are the same length.

If you had multiple runs of different amplicon targets and then merged them into the same table, there would be no shared features between runs, and you would end up with many samples which had a Bray-Curtis distance of 1 from each other.

Similarly, if you had multiple runs but trimmed at different lengths (trunc-len with paired-end is a special case), you would have representative sequences which, while coming from the same amplicon target, do not match the representative sequences of other runs. Once again this results in features that never match, and samples that always have a Bray-Curtis distance of 1.
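To illustrate why trimming length matters, here is a small sketch that assumes feature IDs are derived from the exact sequence, for example as an MD5 hash of it (treat that scheme as an assumption for this example). The read below is made up, and the two truncation lengths are arbitrary.

```python
import hashlib


def feature_id(seq):
    """Hypothetical feature ID: MD5 hex digest of the exact sequence."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


# Made-up read standing in for the same organism sequenced in two runs.
read = "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGG"

run1_seq = read[:40]  # run 1 truncated at 40 nt
run2_seq = read[:50]  # run 2 truncated at 50 nt

same_prefix = run2_seq.startswith(run1_seq)  # biologically the same start
ids_match = feature_id(run1_seq) == feature_id(run2_seq)

print(same_prefix, ids_match)
```

Even though one sequence is a prefix of the other, exact matching treats them as two unrelated features, so the runs end up sharing nothing in the merged table.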

This also explains a bit why you do see "normal" separation for unweighted UniFrac: it has a phylogenetic component. In QIIME 2 we don't use a reference phylogeny; instead we construct a quick-and-dirty one on the fly with MAFFT and FastTree. So even though your features aren't comparable with each other, some alignment will still exist, and therefore some phylogeny can be constructed. So the spread you see in the unweighted UniFrac PCoA is really just because there exists a phylogeny between your representative sequences (ASVs).

If you are doing a meta-analysis with different targets or don't have the raw sequence data available, there are some OTU-based methods which you can use to resolve this, but I would need to know more about your dataset to really recommend something.

Let me know if that makes sense!

Sapsanas · December 18, 2017, 3:19pm

Hi @ebolyen!

Thank you for the quick response, the problem with Bray-Curtis is already solved. The problem was with NA (in table.qza we had samples that were not present in the metadata).
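For anyone hitting the same thing, a quick way to spot this kind of mismatch is to compare the sample IDs in the feature table against those in the metadata file. The sketch below uses hard-coded, made-up ID lists; how you obtain the real IDs from table.qza and sample-metadata.tsv is left out here.

```python
# Made-up sample IDs; in practice these would come from the feature table
# and from the first column of the metadata file.
table_ids = {"S1", "S2", "S3", "S4"}
metadata_ids = {"S1", "S2", "S3", "S5"}

# Samples that have data but no metadata row.
missing_from_metadata = sorted(table_ids - metadata_ids)
# Samples that have a metadata row but no data.
missing_from_table = sorted(metadata_ids - table_ids)

print("in table, not in metadata:", missing_from_metadata)
print("in metadata, not in table:", missing_from_table)
```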

jairideout · February 16, 2018, 4:59pm

In the QIIME 2 2018.2 release, Metadata now supports IDs and column names that are NA; this name will no longer be interpreted as missing data.

There are a number of other changes to QIIME 2 Metadata in the 2018.2 release. See this forum announcement for details on what changed, as well as the updated Metadata tutorial.