Beta diversity graphs

TM3197 · December 11, 2023, 8:29pm

Hi,
bray_curtis_emperor.qzv (862.6 KB)
jaccard_emperor.qzv (862.5 KB)
unweighted_unifrac_emperor.qzv (865.1 KB)
weighted_unifrac_emperor.qzv (865.2 KB)
Has anyone been able to determine why the data is clustered into two major parts? Is this intentional or indicative of an error? I've followed the exact same procedure for previous analyses but have never encountered graphs like these before. I'm just curious about this discrepancy.

Any insight would be highly appreciated. Thank you.

colinvwood · December 11, 2023, 9:02pm

I think you're familiar with this!

jphagen · December 11, 2023, 9:11pm

Hi @TM3197,

We only usually see this type of pattern when two completely different datasets are combined together. The samples in the line on the left are fundamentally different from the samples in the curving cluster on the right. They have no ASV or OTU in common with the clustered set.

Are these two datasets you have combined plotted here?

-Hannah

TM3197 · December 11, 2023, 10:59pm

Thanks for the reply, Hannah. No, they are one data set (salivary samples from pregnant mothers).

jphagen · December 11, 2023, 11:16pm

@TM3197

Hmm. This is interesting! Can you please send me your feature table?

-Hannah

TM3197 · December 11, 2023, 11:19pm

oral_table_v4.qzv (773.9 KB)
Here it is. Thank you.

jphagen · December 11, 2023, 11:25pm

@TM3197,

Which type of sequencing was used? also, can you please send me the .qza version of the table? feel free to DM me if you don't feel comfortable posting it here on the forum.

-Hannah

TM3197 · December 11, 2023, 11:45pm

Thanks, Hanna I dm'ed the file. we used Illumina Miseq

TM3197 · December 11, 2023, 11:58pm

I also checked whether the left cluster is from a different batch but has samples from all the batches.

jphagen · December 12, 2023, 12:23am

Hi @TM3197,
Thank you for checking the batches! I am thinking that this may be a problem coming from the sequencer because after looking at the data locally I can see that there is no overlap in the features of the separated samples. I think it would be reasonable to reach out to your sequencer to see if something happened on their end.
Another user had a similar issue, this might be helpful to checkout!
@moderators does any one else have any ideas? Thanks!
-Hannah

jwdebelius · December 12, 2023, 3:17am

Hi @TM3197 and @jphagen,

I have a couple of ideas, which may or may not be helpful! I agree with the assesment that there's a disconnect between the features nad you essentially have two sample sets with two different sets of features.

So, if you don't think there's a biological reason for it (i.e. totally different enviroments) or a technical reason (i.e. diferent hypervariable regions), I would recommend investiaging your parameters more closely. If this is Bray Curtis, a good check is to also look at a UniFrac plot and see if you have a similar pattern. If you don't, then the issue is a trim length. If you do, then the issue is more complex! (But at least you've ruled one thing out.)

I would double check your provenance and make sure you're not mixing samples and trim parameters since that can be a culprit in denoising. (ASVs are identified off their sequences so things like failing to trim primers or having different denoising parameters can have a big impact ontaxonomic metrics but wont show up in phylogenetic ones!)

Another option would be to throw this in an empire plot from the Empress Plugin which should let you highligth a cluster of sequences on a plot and then see it on phylogenetic tree. Possibly a more discrete way to check for the taxonomy vs phylogeny break down.

Best,
Justine

TM3197 · December 12, 2023, 4:05am

Thank you for the insights, Justine.

We conducted sequencing using the MiSeq Illumina platform targeting the V3-V4 region. Currently, I have less evidence to suggest the issue lies with the sequencing process itself. These samples are from the same environment (saliva samples). When examining the BC and other graphs, I notice a similar trend between BC and unweighted UF, but the discrepancy in UF seems less apparent. I wonder if this might be related to the trimming process.
I used Cutadapt to remove primers, following these trimming points:
qiime dada2 denoise-paired
--i-demultiplexed-seqs trimmed-paired-end-demux_oral_Baby1000_v2.qza
--p-trim-left-f 10
--p-trim-left-r 20
--p-trunc-len-f 270
--p-trunc-len-r 245
--o-representative-sequences oral_rep-seqs-DADA2_v4.qza
--o-table oral_table_v4.qza
--o-denoising-stats oral_DADA2-stats_v4.qza
For this case, what trimming lengths would you recommend? Should I extend the reverse read beyond 245? How would you determine this? It appears there was enough overlapping region.
trimmed_demux_oral_Baby1000_v2.qzv (320.9 KB)

timanix · December 12, 2023, 11:22am

Hi all,
I am very puzzled by PCoA plots that are shown here.
Just for sanity check, could you try following steps:

Run cutadapt to remove proímers with "discard untrimmed" option and double-check the command.
Filter the feature table to remove features found less than 10 times and in less than 3 samples.

Could you also provide us with merging stats from dada2?

jwdebelius · December 12, 2023, 3:08pm

Hi @TM3197,

I agree with @timanix on the primer trimming issue!

It looks like you didn't discard untrimmed sequences, and maybe that's worth exploring before you change anything else. It should be the --p-discard-untrimmed parameter in cutadapt. IIRC, primers can mess up phylogeny as well, which might explain the split.

I couldn't check the unweighted UniFrac, how did you build your tree? Is it de novo or did you use fragment insertion?

Because we're seeing the issue in Bray Curtis, I wouldn't expect it to be restricted to low abundance taxa. So, Im not sure the Filtering low abundance/low prevelance features is the best test of the hypothesis. If we were seeing a bigger overlap in BC than UU, maybe, but I think this has to do with features.

Best,
Justine

wasade · December 12, 2023, 4:58pm

@jwdebelius's idea with Empress community plot is really powerful and one I've used many times for this exact exploratory purpose. I've found it less tedious than trolling feature tables directly, and a guided approach to identify problems.

For qualitative metrics like unweighted UniFrac and Jaccard, it is just a little clicking to identify what features are present only in one cluster of samples. In the PCoA plot, you can shift-click to drag a box over one cluster of samples and the corresponding portion of the phylogeny will be expressed. You can then click around in the phylogeny to find some ASVs that are present in one cluster but not the other.

I highly recommend doing this, to get specific ASVs to examine, which can then be used to work backwards to ask why these specific ASVs are driving a difference, and contrasted against ASVs when others are not.

Best,
Daniel

TM3197 · December 13, 2023, 3:05am

Thank you all. I started from the imported file and ran cut adapt as follows;
qiime cutadapt trim-paired
--i-demultiplexed-sequences paired-end-demux_oral_Baby1000.qza
--o-trimmed-sequences cutad.qza
--p-cores 6
--p-front-f CCTACGGGNGGCWGCAG
--p-front-r GACTACHVGGGTATCTAATCC
--p-discard-untrimmed

But the issue is that lots of samples ended up having very less sequences (less than 100) (please see the attached files).

For your question, "how did you build your tree? Is it de novo or did you use fragment insertion?"

I used the align-to-tree-mafft-fasttree method for constructing the phylogenetic tree. From my quick search, it appeared that this approach constructs a tree based on sequence alignment without explicitly using a reference-based method or fragment insertion.

qiime phylogeny align-to-tree-mafft-fasttree
--i-sequences oral_rep-seqs-DADA2_v4.qza
--o-alignment aligned-oral_rep-seqs_v4.qza
--o-masked-alignment masked-aligned-oral_rep-seqs_v4.qza
--o-tree unrooted-tree_oral_v4.qza
--o-rooted-tree rooted-tree_oral_v4.qza

For taxonomic analysis, Taxonomic analysis I used HOMD reference sequence set
HOMD_16S_rRNA_RefSeq_V15.23.fasta
HOMD_16S_rRNA_RefSeq_V15.23.qiime

Thanks for all the input. I have to figure out how to proceed from Cutadapt to dada2 without losing many sequences. Any insight is appreciated.
Thank youcutad.qzv (321.5 KB)
cutad.qzv (321.5 KB)

TM3197 · December 13, 2023, 3:20am

Thank you for the suggestion. I ran the cutadapt with the imported file but the issue is 68 samples out of 184 have
cutad.qzv (321.5 KB)
sequences lower than 100.
I used the following script.

qiime cutadapt trim-paired
--i-demultiplexed-sequences paired-end-demux_oral_Baby1000.qza
--o-trimmed-sequences cutad.qza
--p-cores 6
--p-front-f CCTACGGGNGGCWGCAG
--p-front-r GACTACHVGGGTATCTAATCC
--p-discard-untrimmed

timanix · December 13, 2023, 11:58am

Does it mean that the PCoA plot you showed in the post was obtained with sequences not trimmed with cutadapt?
I think I suggested it already in another topic, but did you try to play with the error rate in the cutadapt?
Did you perform any quality control (trimmomatic, etc.) before the cutadapt? It may affect primer sequences and be undesirable.

I would try to focus on the cutadapt step since the fact that while searching for primers with the option "discard untrimmed" enabled in cutadapt, you lose sequences from some samples (like they do not contain primers!) but keep in other samples (primers in sequence?). Is there any difference between those samples?

If you want to test if it is really an issue, you can add corresponding info to the metadata file (mark samples that lose a lot of sequences after cutadapt differently from samples that do not) and recreate PCoA to check if this column is responsible for sample separation.

TM3197 · December 16, 2023, 4:26am

Hi Tiimur, thanks for your reply. I did remove the primers with the following command; paired-end-demux_oral_Baby1000_v3.qza is the imported file with out any trimming.
qiime cutadapt trim-paired
--i-demultiplexed-sequences paired-end-demux_oral_Baby1000_v3.qza
--o-trimmed-sequences cutad.qza
--p-cores 6
--p-front-f CCTACGGGNGGCWGCAG
--p-front-r GACTACHVGGGTATCTAATCC
--p-discard-untrimmed
However, 68 samples ended up having less than 100 reads (attached qzv) . Then checked with beta diversity, those 68 made a different cluster

. When I checked with the sequencing facility, nothing was different between in lib preparation and sequencing. I am not sure how to proceed from here, im thinking the only option now is to remove those samples from the analysis. Any insight is appreciated.
cutad.qzv (324.3 KB)