How to perform the filtering step for a forward-read-only analysis?

Hi all,
I’m analyzing 16S rRNA sequencing data where the original data are paired-end, but the reverse reads are too short for successful merging. Because of this, I am performing the full pipeline using forward reads only (single-end DADA2).

I wanted to ask for advice on the feature filtering step, specifically how to choose appropriate values for --p-min-frequency and --p-min-samples in:

qiime feature-table filter-features \
  --i-table table-dada2.qza \
  --p-min-frequency <value> \
  --p-min-samples <value> \
  --o-filtered-table table-filt.qza

My questions:

  1. Are there any recommended workflows or considerations when filtering features derived from single-end (forward-only) reads, especially when the original data were paired-end but merging was not possible?

  2. How should I determine sensible thresholds for --p-min-frequency and --p-min-samples, particularly in a heterogeneous dataset (e.g., patients vs. controls)?
    I am concerned that very strict prevalence filters may remove biologically meaningful group-specific features.

If helpful, I can share summaries from qiime feature-table summarize (sample depth distribution and feature frequency histograms).
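For reference, the command that produces those summaries looks like this (a sketch; the metadata file name is an assumption):

qiime feature-table summarize \
  --i-table table-dada2.qza \
  --m-sample-metadata-file metadata.tsv \
  --o-visualization table-dada2.qzv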

I would really appreciate any guidance or examples of how others set these parameters for single-end 16S data or mixed clinical cohorts.

Thanks in advance!

Hello again, :wave:

Hopefully, more of the single-end data pass the filters than the paired-end data, but there are no specific requirements.

As we discussed in our previous thread, filtering is a trade-off.

I like Justine's advice:

Let us know what you try next!

(Please don't worry too much about reviewer three; if you need to change something later, you can.)


Hi @colinbrislawn,

Thank you very much for taking the time to look into my query. I have gone through Justine's (@jwdebelius) advice and wanted to confirm whether my understanding is correct.

From what I understand, @jwdebelius is suggesting that the alpha-rarefaction depth can be used as a guide for setting the --p-min-frequency filtering criterion (i.e., retaining only samples with sequencing depth ≥ the rarefaction depth). Is that interpretation correct?
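If that reading is correct, I assume the corresponding command would be qiime feature-table filter-samples rather than filter-features, since the cutoff applies per sample (a sketch, using 1000 only as a placeholder depth):

qiime feature-table filter-samples \
  --i-table table-dada2.qza \
  --p-min-frequency 1000 \
  --o-filtered-table table-depth-filt.qza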

I have attached the alpha-rarefaction plot for my dataset. Based on this curve, what value of --p-min-frequency would be reasonable for filtering, or would you recommend a different threshold based on where the curves begin to plateau?

Also, I checked the DADA2 output (table.qzv) and found that:

  • There are no singleton ASVs in the dataset (min total frequency = 2).

  • Each ASV is present in at least 1 sample.

Given this, it seems that frequency-based feature filtering may not be necessary, particularly since I am already removing mitochondrial, chloroplast, and eukaryotic reads.
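(For context, the taxonomy-based removal I mentioned is along these lines; a sketch, assuming the taxonomy artifact is named taxonomy.qza and that the exact exclusion terms depend on the reference database used:)

qiime taxa filter-table \
  --i-table table-dada2.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-no-mito-chloro.qza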

However, I was considering applying a prevalence-based filter (e.g., --p-min-samples 2) to retain only those features observed in at least two samples. My concern is that this may remove rare taxa that could be biologically meaningful, especially given that my samples are heterogeneous (case-control).

I would appreciate your guidance on whether prevalence filtering is advisable in this context, or whether skipping frequency/prevalence-based filtering altogether is fine.

Thank you again for your time and guidance.

Good afternoon,

Your interpretations are correct, at least to my reading! Good job.

Yes, it would remove rare taxa! Do you care about rare taxa or do you want to focus on major players in the ecosystem?

(Note that this does not have a 'right' answer: YOU get to choose!)

I do have some advice on the alpha diversity rarefaction plot, if you are interested!

Looks like the script ran from 0 to 1000 reads per sample.
1000 reads per sample is pretty low these days...

Maybe try running this from 1,000 to 10,000 reads per sample?
Can you also color the lines by case vs. control so we can see if those separate?
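Something like this (a sketch; your table, tree, and metadata file names will differ, and the phylogeny input is only needed for phylogenetic metrics):

qiime diversity alpha-rarefaction \
  --i-table table-dada2.qza \
  --i-phylogeny rooted-tree.qza \
  --p-max-depth 10000 \
  --m-metadata-file metadata.tsv \
  --o-visualization alpha-rarefaction-10k.qzv

Coloring by case-control is done inside the visualization, using any column in the metadata file.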

Thanks!

Hi @colinbrislawn, thanks for confirming. Here is the case-control coloured alpha diversity rarefaction plot. Hope this is what you are talking about. Please let me know if you are expecting any other form of plot visualization.

In the plot, dark blue = control; light blue (bottom curve) = case.


That's great! I can see both groups in the plot :chart_increasing:

Looks like the x axis still stops at 1000 reads, which is lower than I would hope for these days... Were you able to run this with a higher sampling depth, like 10,000? I know that can take some time :hourglass_not_done:

@colinbrislawn, in the rarefaction curves, Shannon diversity reached a clear plateau at approx. 250-300 reads for both the control and case groups, indicating sufficient sampling depth for alpha diversity estimation.

  1. As suggested, I rarefied at 10,000 and got the plot attached below. Can you please advise further?

  2. Though the plateau has not shifted much from the previous plot, won't using 10,000 cause sample loss in downstream analysis?

  3. If I go by Justine's (@jwdebelius) advice and use 1,000, or a slightly higher (safer) value of 1,100-1,200, for filtering, won't that lead to sample loss?


Sure! Can you run rarefaction again, this time going to 100,000?

It has not changed at all! Let's see what the 1k to 100k rarefaction plot looks like!

Ah, I think that's using 1000 as an example. The idea is to set the max rarefaction value to be the same as the number of reads in the smallest sample.

Hi @colinbrislawn, here is the plot rarefied at 100,000


Thank you for sharing that!

With this new graph in hand, we can return to this comment:

This is technically correct, but we want to maximize both the number of samples and the number of reads in these samples.

As you can see, the difference between groups is larger at 70,000 than it is at 300:

However, at 80k and 100k, the other group looks smaller because it's missing samples! You can see that in the 'number of samples' graph directly below.


EDIT: here's an example from the Moving Pictures Tutorial:
Note how it's a trade-off between keeping more sequences or keeping more samples.

Hi @colinbrislawn, does that mean that while doing diversity analysis (i.e., qiime diversity core-metrics-phylogenetic), I should set --p-sampling-depth to 70,000?

Maybe?

Can you show me that Number of Samples x Sequencing Depth plot for your data?

If you were to use 70k reads per sample, how many samples would you lose?

Hi @colinbrislawn, do you mean the table-dada2.qza file one gets from the DADA2 step?

  1. In the interactive plot of the output file table-dada2.qzv, if I set a 70k frequency threshold, it reports: Retained 4,270,000 (67.90%) observations in 61 (81.33%) samples at the specified sampling depth.
  2. When I generate the core metrics with qiime diversity core-metrics-phylogenetic using a 70k --p-sampling-depth, only 12 samples remain in the alpha diversity and PCoA plots. PFA image.
  3. Alternatively, in the interactive plot of the output file table-dada2.qzv, if I set 49,003 as the frequency threshold, it reports: Retained 3,430,210 (54.55%) observations in 70 (93.33%) samples at the specified sampling depth. This value (the maximum frequency at which I can retain 54.55% of observations in 70 out of 75 samples) looks like a better choice; what do you suggest?
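For reference, the core-metrics command I ran looks like this (a sketch; the tree and metadata file names here are placeholders):

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table-dada2.qza \
  --p-sampling-depth 70000 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-70k

Every sample with fewer reads than the sampling depth is dropped before the metrics are computed.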

It's not the table from DADA2, it's the output from qiime diversity alpha-rarefaction
This one: :backhand_index_pointing_down:

I'm interested in this trade-off:

So you can keep more observations or more samples, but not both!

That graph will show you what number you need to keep ALL of your samples.

It's a difficult decision...

Are there any samples you can lose without damaging the blocking in your study design?

How many replicates do you have within each block of your study?