Hi all,
I’m analyzing 16S rRNA sequencing data where the original data are paired-end, but the reverse reads are too short for successful merging. Because of this, I am performing the full pipeline using forward reads only (single-end DADA2).
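For context, my denoising step looks roughly like this (the truncation length and file names are placeholders for my actual values):

```
# Denoise forward reads only (reverse reads dropped); trunc-len is a placeholder
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 230 \
  --o-table table-dada2.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats-dada2.qza
```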
I wanted to ask for advice on the feature filtering step, specifically how to choose appropriate values for --p-min-frequency and --p-min-samples in qiime feature-table filter-features.
Are there any recommended workflows or considerations when filtering features derived from single-end (forward-only) reads, especially when the original data were paired-end but merging was not possible?
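The command I have in mind looks something like this, with placeholder values I have not settled on yet:

```
# Placeholder thresholds: these are the two values I am unsure about
qiime feature-table filter-features \
  --i-table table-dada2.qza \
  --p-min-frequency 10 \
  --p-min-samples 2 \
  --o-filtered-table table-filtered.qza
```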
How should I determine sensible thresholds for --p-min-frequency and --p-min-samples, particularly in a heterogeneous dataset (e.g., patients vs. controls)?
I am concerned that very strict prevalence filters may remove biologically meaningful group-specific features.
If helpful, I can share summaries from qiime feature-table summarize (sample depth distribution and feature frequency histograms).
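For reference, I generated those summaries with the standard command (metadata file name is mine):

```
# Per-sample depths and feature frequency histograms end up in table-dada2.qzv
qiime feature-table summarize \
  --i-table table-dada2.qza \
  --m-sample-metadata-file metadata.tsv \
  --o-visualization table-dada2.qzv
```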
I would really appreciate any guidance or examples of how others set these parameters for single-end 16S data or mixed clinical cohorts.
Thank you very much for taking the time to look into my query. I have gone through Justine's (@jwdebelius) advice and wanted to confirm whether my understanding is correct.
From what I understand, @jwdebelius is suggesting that the alpha-rarefaction depth can be used as a guide for setting --p-min-frequency filtering criteria (i.e., retaining only samples with sequencing depth ≥ rarefaction depth). Is that interpretation correct?
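If that interpretation is right, I assume the sample-level filter would look something like this (1,000 here is purely an example depth, not a value I have chosen):

```
# Drop samples whose total read count falls below the chosen rarefaction depth
qiime feature-table filter-samples \
  --i-table table-dada2.qza \
  --p-min-frequency 1000 \
  --o-filtered-table table-depth-filtered.qza
```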
I have attached the alpha-rarefaction plot for my dataset. Based on this curve, what value of --p-min-frequency would be reasonable for filtering, or would you recommend a different threshold based on where the curves begin to plateau?
Also, I checked the DADA2 output (table.qzv) and found that:
There are no singleton ASVs in the dataset (min total frequency = 2).
The minimum number of samples any ASV is observed in is 1 (i.e., some ASVs appear in only one sample).
Given this, it seems that feature filtering based on minimum frequency may not be necessary, particularly since I am already removing mitochondrial, chloroplast, and eukaryotic reads.
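For completeness, this is roughly how I am removing those reads (the exact --p-exclude strings depend on the taxonomy labels in my reference database, so treat these as illustrative):

```
# Remove organellar and eukaryotic features; labels must match the taxonomy strings
qiime taxa filter-table \
  --i-table table-dada2.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast,Eukaryota \
  --o-filtered-table table-no-contam.qza
```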
However, I was considering applying a prevalence-based filter (e.g., --p-min-samples 2) to retain only those features observed in at least 2 samples. But here, my concern is that this may remove rare taxa that could be biologically meaningful, especially given my samples are heterogeneous (case-control).
I would appreciate your guidance on whether prevalence filtering is advisable in this context, or whether skipping this frequency/sample-based filtering altogether is fine.
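The prevalence-only version I had in mind would be something like (again, just a sketch):

```
# Keep only features observed in at least 2 samples
qiime feature-table filter-features \
  --i-table table-dada2.qza \
  --p-min-samples 2 \
  --o-filtered-table table-min2samples.qza
```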
Hi @colinbrislawn, thanks for confirming. Here is the case-control coloured alpha diversity rarefaction plot. Hope this is what you are talking about. Please let me know if you are expecting any other form of plot visualization.
In the plot, dark blue = control; light blue (bottom curve) = case.
Looks like the x-axis still stops at 1,000 reads, which is lower than I would hope for these days... Were you able to run this with a higher sampling depth, like 10,000? I know that can take some time.
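Something like this, if it helps (swap in whatever table, tree, and metadata files you used before):

```
# Rarefaction curves up to 10,000 reads per sample
qiime diversity alpha-rarefaction \
  --i-table table-dada2.qza \
  --i-phylogeny rooted-tree.qza \
  --p-max-depth 10000 \
  --m-metadata-file metadata.tsv \
  --o-visualization alpha-rarefaction-10k.qzv
```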
@colinbrislawn, in the rarefaction curves, the Shannon diversity reached a clear plateau at approximately 250-300 reads for both the control and case groups, indicating sufficient sampling depth for alpha-diversity estimation.
As suggested, I rarefied at 10,000 and got the plot attached below. Could you please advise further?
Though the plateau has not shifted much from the previous plot, won't using 10,000 cause sample loss in downstream analysis?
If I go by Justine's (@jwdebelius) advice and use 1,000, or a slightly higher, safer value of 1,100-1,200, for the frequency filtering, won't that also lead to sample loss?
Sure! Can you run rarefaction again, this time going to 100,000?
It has not changed at all! Let's see what the 1k to 100k rarefaction plot looks like!
Ah, I think that's using 1000 as an example. The idea is to set the max rarefaction value to be the same as the number of reads in the largest sample.
However, at 80k and 100k, the other group looks smaller because it's missing samples! You can see that in the 'number of samples' graph directly below.
EDIT: here's an example from the Moving Pictures Tutorial:
Note how it's a trade-off between keeping more sequences or keeping more samples.
Hi @colinbrislawn, does that mean that when doing the diversity analysis (i.e., qiime diversity core-metrics-phylogenetic), I should set --p-sampling-depth to 70,000 in that command?
Hi @colinbrislawn, do you mean the table-dada2.qza file one gets from the DADA2 step?
In the interactive plot of the output file table-dada2.qzv, if I set a sampling depth of 70,000, it reports: "Retained 4,270,000 (67.90%) observations in 61 (81.33%) samples at the specified sampling depth."
When I generate the core metrics with qiime diversity core-metrics-phylogenetic using 70,000 for --p-sampling-depth, it leads to only 12 samples in the alpha-diversity and PCoA plots. Please find the image attached.
Alternatively, in the interactive plot of the output file table-dada2.qzv, if I set the sampling depth to 49,003, it reports: "Retained 3,430,210 (54.55%) observations in 70 (93.33%) samples at the specified sampling depth." This value (the maximum depth at which I can retain 54.55% of observations in 70 out of 75 samples) looks like a better choice. What do you suggest?
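For reference, this is the command I would run with that depth (file names as in my earlier steps):

```
# Rarefy every sample to 49,003 reads before computing the diversity metrics
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table-dada2.qza \
  --p-sampling-depth 49003 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-results-49k
```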