I am working on a 16S dataset of 102 soil samples from 4 different sites. All are forward reads targeting the V3 region.
Quality drops from around base 260, so I decided to evaluate whether truncating the reads at position 260 (or not) would improve the diversity analyses.
Here you can find a QC file:
I ran DADA2 denoising twice, truncating at position 260 and not truncating, both with --p-max-ee 1.5.
I then ran the core-metrics diversity pipeline and collected the alpha and beta diversity results, and this is where I start asking for advice.
The number of features dropped from 9806 to 8315, which I don't consider an issue. Nevertheless, the diversity outputs are quite divergent for the Shannon index and observed_features: both rose after trimming.
Indeed, in my experience Shannon indices for soil samples usually range from 6.5 to 8.5-9, not 7.6 to 9.7, and those intervals match the untrimmed and trimmed datasets here, respectively.
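For intuition on why Shannon can rise after trimming even when the total feature count falls: the index rewards evenness as well as richness, so if truncation redistributes reads across ASVs (e.g. an abundant feature resolves into several), the per-sample Shannon value goes up. A toy sketch of the metric itself (made-up counts, not the real data):

```python
import math

def shannon(counts):
    """Shannon index H = -sum(p * ln p) over the nonzero relative abundances."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Splitting one abundant feature into two raises H even though reads are the same:
print(round(shannon([50, 50]), 3))      # two even features
print(round(shannon([50, 25, 25]), 3))  # one of them resolved into two ASVs
```

This is only the natural-log definition used by most tools; it does not explain which mechanism operated in this dataset, just why the index is sensitive to how reads are partitioned.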
Here is a snapshot of a few samples:
Beta diversity group differences are strongly significant by PERMANOVA in both cases, and the PCoA plots of weighted UniFrac and Bray-Curtis dissimilarities also look similar for trimmed/untrimmed, so I don't think trimming affects beta diversity.
However, there are differences in the adonis output.
Disclaimer: I don't yet have access to the chemical measurements of soil minerals, so I simulated values for a factor called "measurement" and tested beta diversity against it.
I would expect the significances to be similar, but there are huge differences in the p-value and R2, as well as the sums of squares...
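For readers wondering where adonis's R2 and p come from, here is a minimal permutation sketch of the simplest one-factor categorical case with made-up 1-D points (the real adonis run above used the simulated continuous "measurement" factor, which adonis fits by regression instead, but the R2 and permutation p-value have the same meaning):

```python
import itertools
import random

def permanova(points, groups, n_perm=999, seed=0):
    """Toy one-factor PERMANOVA on Euclidean distances: returns (R2, p-value)."""
    n = len(points)
    # Squared pairwise distances, keyed by (i, j) with i < j
    d2 = {(i, j): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
          for i, j in itertools.combinations(range(n), 2)}

    def ss_within(labels):
        ss = 0.0
        for g in set(labels):
            idx = [i for i, l in enumerate(labels) if l == g]
            ss += sum(d2[i, j] for i, j in itertools.combinations(idx, 2)) / len(idx)
        return ss

    ss_total = sum(d2.values()) / n
    a = len(set(groups))

    def pseudo_f(labels):
        ssw = ss_within(labels)
        return ((ss_total - ssw) / (a - 1)) / (ssw / (n - a))

    f_obs = pseudo_f(groups)
    r2 = 1 - ss_within(groups) / ss_total  # fraction of variance explained by the factor
    rng = random.Random(seed)
    perm = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)  # permute labels, keep distances fixed
        if pseudo_f(perm) >= f_obs:
            hits += 1
    p = (hits + 1) / (n_perm + 1)
    return r2, p

# Two well-separated 1-D clusters of 5 samples each: R2 near 1, small p
pts = [[0.0], [0.1], [0.2], [0.3], [0.4], [10.0], [10.1], [10.2], [10.3], [10.4]]
grp = ["A"] * 5 + ["B"] * 5
r2, p = permanova(pts, grp)
print(round(r2, 3), p)
```

Because p is estimated from label permutations and R2 from the partitioning of squared distances, both are sensitive to how trimming reshapes the distance matrix, which is consistent with seeing shifts in p, R2 and sums of squares here.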
So, could someone kindly help me understand why diversity differs so much after trimming the reads? Also, given the quality improvement, does it sound reliable to keep the trimmed reads despite these differences?
My first big question is: what do the numbers in your DADA2 summaries look like? Are there big changes in your read depths? The second is whether your rarefaction depth changed. And finally, is trim quality/read length associated with the group? Is there something that might be affecting fragmentation around that 260 position in your samples that differs across groups? Were they processed separately somehow? Do you have reason to believe the variation is technical?
1st) Not quite. Depths are on average 1.12x higher for the trimmed reads. 2nd) No; both were rarefied at 15,000 reads.
Yes, I believe it may be associated with the "soil group": samples from isolated bacteria (from other projects) were run in the same sequencing run and they were fine, which is something we routinely do here. I didn't have to trim those, as their quality was OK. The soil samples were DNA-extracted separately, but all the libraries were prepared in the same round.
There is a technical issue regarding QC in this sequencing run, though I'm not sure whether the deviation in the diversity indices is caused by it. The QC profile of this particular run was lower than our routine runs. But if I look at the samples by type, namely isolated bacteria, fermentation broth and soil, the soil group is where quality drops at ~260 (I actually checked this with FastQC). I haven't trimmed reads from soil samples before because all the data I'd had were OK, so I am now working on two things: 1) a test: go back to normal data from other soil samples, trim, denoise and see whether the diversity changes; maybe this is something normal that I've never noticed; 2) a hypothesis: untrimmed reads are ~280-290 nt and trimmed reads are 260 nt, and somehow losing those 20-30 nucleotides triggered a higher diversity, which would be weird, because 20-30 nt doesn't seem enough to cause all these differences.
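A quick way to check the read-length part of hypothesis 2 is to tally lengths straight from a FASTQ; a minimal sketch (the tiny inline FASTQ here is made up for illustration):

```python
from collections import Counter

def length_distribution(fastq_lines):
    """Count read lengths from FASTQ text: the sequence is line 2 of each 4-line record."""
    seqs = fastq_lines[1::4]
    return Counter(len(s) for s in seqs)

# Tiny made-up example: two 8-nt reads and one 5-nt read
fastq = (
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTT\n+\nIIIII\n"
    "@r3\nTTGCAGCA\n+\nIIIIIIII\n"
).splitlines()
print(length_distribution(fastq))  # Counter({8: 2, 5: 1})
```

Run against the trimmed and untrimmed soil FASTQs, this would confirm whether the two datasets really differ only by that ~20-30 nt tail.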
Hi Justine, thanks.
As an update, and for future readers: I followed your hint and dug a little deeper before making a decision. I am keeping the "not trimmed --max-ee 1.5" dataset.
I took a good dataset as a positive control and compared its trimmed and untrimmed versions (--max-ee 1.5). Both look similar in terms of alpha and beta diversity metrics, the number of reads after denoising, and how many reads were discarded at each DADA2 step.
A mock adonis (with the simulated factor) also gave similar p-values for both datasets.
Below are correlation plots of the number of input reads versus the number of reads remaining after each DADA2 step, for the trimmed (tr) and untrimmed (nt) datasets:
This means that, for this good dataset, trimming or not makes no difference in terms of read retention.
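The retention comparison behind those plots boils down to correlating input reads with the reads surviving each step; a minimal sketch with made-up per-sample counts (the real numbers come from the exported DADA2 denoising stats):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up per-sample counts: input reads vs reads kept after filtering and denoising
input_reads    = [21000, 18500, 25000, 30000, 16000]
filtered_reads = [19800, 17300, 23600, 28100, 15100]
denoised_reads = [18900, 16500, 22400, 27000, 14300]

print(round(pearson(input_reads, filtered_reads), 3))
print(round(pearson(input_reads, denoised_reads), 3))
```

A correlation near 1 at every step, with similar slopes for tr and nt, is what "it doesn't matter if the reads were trimmed" looks like numerically.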
Meanwhile, the "trouble" dataset that brought me here looked similar to the good one when the reads were trimmed (--max-ee 1.5), and only somewhat similar when untrimmed (--max-ee 1.5). I also denoised both with --max-ee 0.5, and that didn't look good in either case. Altogether, I decided to keep the "not trimmed --max-ee 1.5" dataset, as it looks most like a "good" dataset overall.