Odd PCoA from distance matrix

Greetings everyone,

I followed the movingpictures example and had these weird (i.e. too symmetrical) point distribution on my data shown. These are from core-metrics-results/jaccard_emperor.qzv and core-metrics-results/bray_curtis_emperor.qzv.

I once saw similar thing like this in R and if I remember it correctly, it was due to the data transformation I did. Therefore, was this also because of --p-sampling-depth parameter?

Thank you.

Jaccard

Bray-curtis

This sort of plot happens when there is no overlap between groups. I.e. there are no shared features between your red, orange, and blue groups. If you explore your distance matrix, you will likely find the distances to be maxed out between groups. Is there any possibility of this being biologically/technically true? An example would be if each of the groups was sequenced using a different variable region. If not, you may want to carefully examine how the data was processed to spot any irregularities.

5 Likes

Hi @jbisanz thank you for your response,

I think there are features shared between red, orange, and blue. Although it is difficult to see that there are features that is present in all samples. table.qzv in feature detail tab is showing maximum only 12 (of 42 samples). The samples are sequence using same variable region of 16S.

Both uwtUnifrac and wtUnifrac from core-metrics-results folder demonstrate different pattern (i.e. irregular like typical PCoA plot). Would these mean that there are shared features between the colors?

1 Like

When the phylogenetic relationships are considered in calculating unifrac you won’t end up with the maxed-out distances which would mask the issue. I would recommend making a venn diagram of the features detected in each of your groups to see to what extent they overlap! A heatmap of ASVs would also likely show this.

2 Likes

You are right! I also learned that this is called maxed-out distances. FYI, I use qiime2r::qza_to_phyloseq() form you and then MicEco::ps_venn() for the venn below.

I wonder why… It must be the qiime2 parameter that I set, because I conducted a separate analysis with R (dada2>phyloseq) and ASV seemed to be rather distributed well than this one below.

Is it probably because of the merging process by qiime feature-table merge function? I was following FMT tutorial as a reference. Is it possible that the features ID are made three times for one identical sequence?

So these numbers (1, 2, and 3) or colors (red, orange, and blue) are actually batches. I wanted to know how different my samples based on batch grouping.

Thank you very much sir!

venndiagram

Is it possible that the features ID are made three times for one identical sequence?

I believe the MD5 hashing is deterministic (i.e. no). So when you say batches, is this 3 separate sequencing runs denoised individually and then merged? I would start by ensuring that all the clipping parameters are identical. You could check the size distribution of the SV sequences as a first pass to make sure they are the same (~252/253 for V4).

2 Likes

I see. Yes, the runs are denoised individually and then merged.

Oh, there is no identical clipping parameter for each batches, because I only approximated them by looking at demux.qzv. What is the average size for V3-V4 region?
First --p-trim-left-f 17 --p-trim-left-r 21 --p-trunc-len-f 299 --p-trunc-len-r 221
Second --p-trim-left-f 23 --p-trim-left-r 12 --p-trunc-len-f 250 --p-trunc-len-r 177
Third --p-trim-left-f 20 --p-trim-left-r 15 --p-trunc-len-f 280 --p-trunc-len-r 199

These are from rep-seqs.qzv. What do you think I should do?


Thank you.

Yes, therein is your problem. Assuming that the the sequencing run was set up identically for each of the 3 runs, you need to match the trim-left parameters across sequencing runs. Conceptually your problem is that when combining sequencing runs, it is done on perfect matching of the sequences, and if the length of the ends is variable, they will be treated as different sequences as below:

Run1 ..AGCGCGT...
Run2 .GAGCGCGT...
Run3 AGAGCGCGT...

Mehrbod has done a better job of explaining it here:

Unrelated issue which I see from your length distribution is the large number of features you have that are around ~250bp. I recall there is a bit of length variation in V3-V4 but I don’t think that much. You may want to investigate how others are trimming and overlapping their reads for V3-V4 on MiSeq (I assume from the truncation lengths you are using).

3 Likes

Excuse my long absence,

I try to redo the process again, with the script as the following:
First --p-trim-left-f 2 --p-trim-left-r 0 --p-trunc-len-f 284 --p-trunc-len-r 235
Second --p-trim-left-f 3 --p-trim-left-r 0 --p-trunc-len-f 284 --p-trunc-len-r 230
Third --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 285 --p-trunc-len-r 207

But, the venn diagram also showed no shared features at all, does this mean I should adjust all the --p-trim-left-[f OR r] to 0? Maybe it is also worth mentioning that prior to this, I conduct qiime cutadapt trim-paired to remove adapter-onwards/preceded part (illumina miseq kit). This was without --p-discard-untrimmed.

Also, will this result affect the taxa-bar-plot.qzv?

Thank you very much for your kind assistance.

The trim parameters need to be identical (assuming sequencing strategy was identical). Ie for all runs something like:

--p-trim-left-f 23 --p-trim-left-r 21 --p-trunc-len-f 250 --p-trunc-len-r 177

That appears to be your the most extreme of your settings, but double check that will leave enough sequence for overlap.

Hi,
Thank you very very much for the suggestion. I have tried the parameter but ended up with very low shared ASVs (~55). I am planning on doing the cutadapt first, and then re-investigate more suitable identical parameter for three batches but there is a problem about it. Is it okay to post it here?

Moved here

Thank you very much for your help @jbisanz !