Difference between weighted UniFrac distance and bray curtis emperor plot?

danielsebas · April 7, 2019, 5:08pm

Dear Friends,

I ran qiime analysis on 10 cDNA paired end illumina samples with read length 150-20 primers = 130. After bray curtis and weighted UniFrac distance analysis I see that the emperor plots show difference? I am trying to understand that since these two plots represent quantitative measure of community dissimilarity, shouldn't the results be same or similar? I have attached the plots. I would really appreciate your comments.

bray-curtis: bray-curtis-emperorwith-label hosted at ImgBB — ImgBB

Weighted Unifrac distance: weighted-unifrac-emperor-withlabel hosted at ImgBB — ImgBB
Thanks!

jwdebelius · April 7, 2019, 10:02pm

Hi @danielsebas,

The key difference between Bray Curtis and weighted UniFrac distance is that UniFrac is phylogenetically aware. This means it treats closely related organisms as more similar, while Bray-Curtis doesn't weight them in the same way. So, depending on your underlying community structure, this may be some of what you're seeing.

I'm a bit surprised that your PC1, PC2, and PC3 in the Bray-Curtis are all explaining 11% of the variation. It seems... unlikely to have such an even distribution along all three axes.

Have you checked your unweighted UniFrac yet?

Best,
Justine

danielsebas · April 9, 2019, 11:31am

Thanks for the explanation @jwdebelius. I really appreciate. After you pointing out the 11% variation in bray curtis, I looked at the unweighted and jaccard emperor plots: (they follow similar trend due to phylogeny involvement in unweighted)

jaccard: jaccard-emperorwith-label hosted at ImgBB — ImgBB

unweighted: unweighted-unifrac-emperor-withlabel hosted at ImgBB — ImgBB

I would really appreciate your comments when looking at these plots, about the qualitative and quantitative similarity/dissimilarity among the samples. Also, could you please let me know what you meant when you said even distribution along the three axes? I understand that these are variations but can't the variations be even across the axes? Thanks much.

jwdebelius · April 9, 2019, 12:03pm

Hi @danielsebas,

As I look at the plots, the axis loading on hte Bray-Curtis and Jaccard look wrong. It's not clear to me how you have each axis explaining 11.1% of the variation. (You can see it on the axes labels). In 5 years and hundreds of PCoAs, Ive never seen a distribution where each of the axes explained the same variation. Typically, PC1 explains the most, then PC2, then PC3... I might expect a large amount of loading on PC1 (like you see in the weighted UniFrac figure), but Ive never see compeltely even loading. It's basically saying that the algorithm is ranking the three largest splits in your data to be even. And, that the three largest splits in your data are likely also even across subsequent splits. Because it's showing up in taxonomic metrics, I have some concerns that the pattern you're seeing here is because few/no features are shared across your samples. This is theoreticlally possible, but seems unlikely in practice.

Would you be willing to share general information about sample type(s), whether or not these are replicates, etc and then the sample preprocessing?

Best,
Justine

danielsebas · April 9, 2019, 2:47pm

Thanks @jwdebelius,

I understand your point. the way I processed the data is as below:

These are cDNA paired end Illummina reads (150bp) sequenced with forward and reverse primers. I cut the primers using cutadapt of Qiime2 then left with 130bp. I checked the presence of primers in forward and reverse and then trimmed the primers. Both primers are present in both R1 and R2 fastq files; hence I trimmed them from both R1 and R2.

The trimmed demuxed qzv file is:
ww-cDNA_1to11_paired-end-demux-trimmed-2.qzv (294.8 KB)

The denoised stats are:
ww-cDNA_1to11_paired-end-demux-trimmed_dada2-rep-seqs-stats-dada2.qzv (1.2 MB)

The steps I did as as below:

> qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path manifest_file_1to11-paired-end-cDNA.csv --output-path ww-cDNA_1to11_paired-end-demux.qza --input-format PairedEndFastqManifestPhred33

qiime cutadapt trim-paired --i-demultiplexed-sequences ww-cDNA_1to11_paired-end-demux.qza --p-front-f ^GTGYCAGCMGCCGC --p-front-r ^GGACTACNVGGGTWT --p-times 2 --p-cores 4 --p-error-rate 0 --p-match-read-wildcards --p-match-adapter-wildcards --o-trimmed-sequences ww-cDNA_1to11_paired-end-demux-trimmed-1.qza

qiime cutadapt trim-paired --i-demultiplexed-sequences ww-cDNA_1to11_paired-end-demux-trimmed-1.qza --p-front-f ^GGACTACNVGGGTWT --p-front-r ^GTGYCAGCMGCCGC --p-times 2 --p-cores 4 --p-error-rate 0 --p-match-read-wildcards --p-match-adapter-wildcards --o-trimmed-sequences ww-cDNA_1to11_paired-end-demux-trimmed-2.qza

qiime dada2 denoise-paired --i-demultiplexed-seqs ww-cDNA_1to11_paired-end-demux-trimmed-2.qza --p-trunc-len-f 130 --p-trunc-len-r 130 --p-trim-left-f 0 --p-trim-left-r 0 --p-chimera-method consensus --p-n-threads 4 --o-representative-sequences ww-cDNA_1to11_paired-end-demux-trimmed_dada2-rep-seqs.qza --o-table ww-cDNA_1to11_paired-end-demux-trimmed-dada2-rep-seqs-table.qza --o-denoising-stats ww-cDNA_1to11_paired-end-demux-trimmed_dada2-rep-seqs-stats-dada2.qza

After this I trained a classifier with my data; the database I used is greengene database, and used the trained classifier to classify the taxonomy of the dada2 denoised representative sequences:

qiime feature-classifier extract-reads --i-sequences gg_13_5.qza --p-f-primer GTGYCAGCMGCCGC --p-r-primer GGACTACNVGGGTWT --p-trunc-len 150 --p-min-length 100 --p-max-length 400 --o-reads /gg_13_5_ww_cDNA_1to11_paired_ref_seq.qza

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads gg_13_5_ww_cDNA_1to11_paired_ref_seq.qza --i-reference-taxonomy gg_13_5_taxonomy.qza --o-classifier ww-cDNA_1to11_paired-end-demux-trimmed-classifier.qza

qiime feature-classifier classify-sklearn --i-classifier ww-cDNA_1to11_paired-end-demux-trimmed-classifier.qza --i-reads ww-cDNA_1to11_paired-end-demux-trimmed_dada2-rep-seqs.qza --o-classification ww-cDNA_1to11_paired-end-demux-trimmed_dada2-rep-seqs_taxonomy.qza --verbose

This gave me the taxonomy classification and after this I performed the diversity analysis bray-curtis etc.

Could you please have a look at the trimmed result and the denoised result if there lies the reason for such behaviour of bray-curtis. I have made sure that I used right parameters. your comments are very appreciated. Thanks. Please let me know if any further information is required.

jwdebelius · April 9, 2019, 3:01pm

HI @danielsebas,

The bioinformatics look okay in theory. Are you able to tell me whether these are host associated or enviromental, if they're host associated whether they come from the same bodysite, whether they're technical replicates, a case control design... something. We also need to verify that your primer regions were, in fact, different.

The algorithmic behavior may be entirely correct, but microbiome doesn't operate without context and diagnosing this somewhat bizarre situation requires additional information.

Best,
Justine

danielsebas · April 9, 2019, 3:16pm

Thanks @jwdebelius. These are environmental waste water samples from which phages were sequences using bacterial host. Please let me know of your comments and what you think could be the reason for the weird behaviour of bray and jaccard plots.
I am wondering what could I interpret from this data. I do get to know from the weighted and unweighted emperor plots the similarity and dissimilarity across the samples; however, the variation distribution on bray and jaccard plot are raising questions or they can be explained too?
Thaks for your time.

llenzi · April 9, 2019, 4:11pm

Hi @danielsebas,
can I ask a basic question on the overall design? Are these sequences really 2x150, as above?
I never work with that length so far, so I may be wrong, but would that result in enough overlapping between R1 and R2 for dada2 to work on? Am I correct on seeing the primers as 515F and 806R, giving roughly 300bp amplicons? If I am wrong, sorry for that but at least we starting from the basic!
Luca

jwdebelius · April 9, 2019, 8:24pm

@danielsebas,

Thanks! I probably should have said that I think the problem may be that your Bray Curtis distance and Jaccard distance are both saturated and that the distance is 1 between all samples. I might expect that kind of behavior to samples from very different enviroments or sequenced with a variety of primers. (Denoising on different regions would give you different ASVs). I think the UniFrac distance safely rules out a rarefaction depth issue, where low variation might be due ot a few sequences. So, we're left with a question of why (I suspect) there's saturation in the taxonomic metrics. (Actually, would you be willing to check the distance matrix files and see if the distances are at or close to 1 or all the same?)

That's a possibility, thanks for spotting it! Ive worked with eprimers and that length a lot, but we typically used single-end sequencing for a variety of reasons. If its an issue, it will show up in your DADA2 stats.

But, it leads to another question: what was your rarefaction depth?

Sorry for the sort of circuitous approach, its just something Ive never seen before.

Best,
Justine

danielsebas · April 10, 2019, 9:14am

Thank you for your explanation @llenzi and @jwdebelius. The rarefraction file is here:

ww-cDNA_1to11_paired-end-demux-trimmed_dada2-rep-seqs-alpha-rarefaction.qzv (367.1 KB)

the sampling depth is 200 here. Does that answer our questions? Thanks for your time.

llenzi · April 10, 2019, 9:51am

Hi @danielsebas,
I'll try my point of view, I'm glad if @jwdebelius will give hers or correct me!.
For you rarefaction plot, it seems that all the samples are saturated and 200 may looks reasonable from the plot. However, for me is a very low number, I would not go under 1000 (ideally more). The reason is that you may get 200 sequences by chance or because you sequencing contaminating bacteria from kits (especially not having negative control to compare with). I would replot the rarefaction using 41000 (worked out from the above denoising stats) as max number of sequences to see if does change the conclusion.
From the denoising stats, you are left with 1%-5% of the initial number of sequences. This is a low number for my taste (but then it may be enough for your purpose?).

Do you have the visualiser for the 'ww-cDNA_1to11_paired-end-demux-trimmed-dada2-rep-seqs-table.qza' file? I would expect most of the ASVs are present in most of the samples (from the top plot, a small peak on the left side of the plot and a major one on the right side), is it the case? If not, may be another hint on what Justine were saying on samples being very different.

I would take in @jwdebelius suggest and repeat the denoising with R1 only.

Hope it makes sense.
Best,
Luca

danielsebas · April 10, 2019, 10:15am

Thanks @llenzi Here is the ww-cDNA_1to11_paired-end-demux-trimmed-dada2-rep-seqs-table.qzv file.

I would really appreciate your comments.ww-cDNA_1to11_paired-end-demux-trimmed-dada2-rep-seqs-table.qzv (531.6 KB)

llenzi · April 10, 2019, 10:35am

Thanks, my interpretation of the top plot (but on that I may be wrong) is that there is a low number of ASVs co-present in 4 samples, while there is a number of ASVs present in 3 or less samples. Kind of same information looking at the 'Feature Detail' tab, in the '# of Samples Observed in' column. This would implies a very sparse abundance table (i.e. containing lots of '0' values), same to say that samples are very different. If I not misunderstanding Justine, this may a reason for

Best,
Luca

danielsebas · April 10, 2019, 11:00am

Thanks @jwdebelius. I am trying to understand the interpretation here: could you please elaborate on what you mean by "Bray Curtis distance and Jaccard distance are both saturated and that the distance is 1 between all samples" and I can come to this interpretation by looking at what analysed data. I am sorry if the answer is obvious but I am a newbie in Qiime. Thanks!

jwdebelius · April 10, 2019, 12:12pm

llenzi:

Hi @danielsebas,
I’ll try my point of view, I’m glad if @jwdebelius will give hers or correct me!.
For you rarefaction plot, it seems that all the samples are saturated and 200 may looks reasonable from the plot. However, for me is a very low number, I would not go under 1000 (ideally more). The reason is that you may get 200 sequences by chance or because you sequencing contaminating bacteria from kits (especially not having negative control to compare with). I would replot the rarefaction using 41000 (worked out from the above denoising stats) as max number of sequences to see if does change the conclusion.
From the denoising stats, you are left with 1%-5% of the initial number of sequences. This is a low number for my taste (but then it may be enough for your purpose?).

I dont think I could have said it better! Im 100% in agreement with you on all of this! I - personally - wont work with a rarefaction depth of less than 1000 sequences/sample. I would also recommend doing a single-end protocol and re-processing.

Welcome! It's a totally complex issue, and so definitely not obvious why you'd have saturation. Im having to work through explainations with equations, so please excuse the random apperance of forumulas and symbols.

The way bray-curtis and jaccard are calculated is, as you mentioned, taxonomic. So, we compare based on each of the features. But, say that A and B share 0 features, then our distance is 1 (1 - \frac{A \cap B}{A \cup B}) where A \cap B = 0

So, if none of your ASVs are shared across any samples, you're getting a distance of 1 for all your samples: saturating your metric at 1. The reason you don't see that in the phylogenetic metrics is because those are scaled based on the phylogenetic tree and even if two ASVs are not the same, there's still some distance between the two of them, \delta_{1,2}. So then, the numerator for the weighted metrics is greater than 0, because \delta_{A \cap B} > 0.

I realize I'm getting mathy, but hopefully the equations give more context to think about this?

Best,
Justine

danielsebas · April 10, 2019, 12:21pm

Perfect @jwdebelius I get it now. Thanks!

Is this the reason why the bray-curtis and jaccard emperor plots show PC1, PC2 and PC3 axis at same 11% variation; I mean what's the rational behind this 11%, the number? Thanks for your time!

jwdebelius · April 10, 2019, 1:16pm

I think the rationale has to do with the number of PCs. Im not sure how the number of PCs by default is determined, but eh. It looks like you had 9 PCs calculated. So, equal weighting on all PCs because 100%/9 = 11.11% along each PC.

Best,
Justine

danielsebas · April 10, 2019, 1:22pm

Thanks @jwdebelius I just checked the number of axes are 5 with 11.11% for all of them. Is this normal? Am asking because in the beginning of our discussion you mentioned that equal variation across PCs is not normal and we should expect different variations across PC1, PC2 and PC3 . Thanks!

danielsebas · April 10, 2019, 1:45pm

Just did rarefractionwith depth of 32680 and got this:

ww-cDNA_1to11_paired-end-demux-trimmed_dada2-rep-seqs-alpha-rarefaction-41000.qzv (340.0 KB)

Open for your comments. Thanks @llenzi @jwdebelius

jwdebelius · April 10, 2019, 1:50pm

Hi @danielsebas,

That's very abnormal, and one of the major things that suggested there was a problem.

As far as rarefaction depth goes... this is one of those things where, beyond >1000 seqs/sample as a threshold (although this is also somewhat discussed based on several factors), you need to make a judgement which optimizes the amount of depth you keep in rarefaction and the number of samples. There is no good answer, and there have been a lot of discussions of rarefaction and picking the right depth on the forum. I'll refer you to those.

Best,
Justine