Optimal --p-n parameter for demux summarize

Hi all,

I wonder if there's an optimal --p-n parameter value for demux summarize.
According to the documentation, the default value is 10000.
(--p-n parameter description: The number of sequences that should be selected at random for quality score plots)

I wonder why the default value is a fixed number rather than proportional to the number of sequences.
If the total number of sequences is 100000, then the default value 10000 is about 10% of the total sequences.
However, what if the total number of sequences is 1000000? Then, the default value 10000 is only about 1%.
I'd appreciate it if you could let me know the optimal proportion for --p-n parameter.

Sincerely,

Good question!

This choice is a trade-off between a faster summary and a more complete one. If you pass a number larger than the total number of reads across all your samples, you get this message:

A subsample value was provided that is greater than the amount of sequences across all samples for the %s reads. The plot was generated using all available sequences.

If you want to use all your data to make these summaries, you could!
(It would be somewhat slower, though, and the results would be statistically comparable to those from a random subsample.)
You get to choose what is 'optimal' for you! :smiley_cat:
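
On the question of a fixed number versus a proportion: for a simple random sample, the precision of an estimate depends mostly on how many sequences you draw, not on what fraction of the total they represent. Here is a rough sketch of that idea, not anything taken from the plugin itself. It uses the textbook standard error of a sample mean with the finite population correction as a stand-in for the quality summaries, and the spread `SIGMA` and the function name are made up for illustration.

```python
import math


def standard_error_of_mean(sigma: float, n: int, population: int) -> float:
    """Standard error of a sample mean when drawing n items at random,
    without replacement, from a population of the given size."""
    if n >= population:
        # Every sequence is used, so there is no sampling error left.
        return 0.0
    # Finite population correction: shrinks the error as n approaches the total.
    fpc = math.sqrt((population - n) / (population - 1))
    return sigma / math.sqrt(n) * fpc


SIGMA = 5.0           # hypothetical spread of quality scores at one position
N_SUBSAMPLE = 10_000  # the default value of --p-n

for total in (100_000, 1_000_000, 10_000_000):
    se = standard_error_of_mean(SIGMA, N_SUBSAMPLE, total)
    fraction = N_SUBSAMPLE / total
    print(f"total={total:>10,}  fraction={fraction:.2%}  SE={se:.4f}")
```

Under these assumptions the standard error stays close to 0.05 quality-score units whether 10000 reads are 10% or 0.1% of the data, which is why a fixed default can be reasonable for large and small runs alike.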
