Hello,
I am writing because I would really appreciate your opinion on an issue I have with the data I am currently analyzing.
I will try to describe my dilemma as succinctly as possible:
- I have NGS amplicon data for three groups.
- The data was sequenced on two different runs.
- Unfortunately, the sequencing facility did not randomize the samples as instructed: all samples from group 1 were sequenced in the first run, and groups 2 and 3 were sequenced together in the second.
- The number of reads in group 1 is much higher (100,000+ compared to 10,000-20,000) than in the other two groups.
- I would like to use methods that take the compositional nature of the data into account (like those in the phylofactor or codaseq packages; see the CLR sketch after this list), and I do not want to filter the data by abundance or anything similar.
- However, I am a bit worried by the much higher number of reads in group 1.
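For context, here is a minimal sketch (in plain numpy, not the actual phylofactor/codaseq code) of the kind of centred log-ratio (CLR) transform I have in mind; the function name and the toy counts are just for illustration:

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform of a samples x taxa count matrix."""
    x = np.asarray(counts, dtype=float) + pseudocount  # pseudocount avoids log(0) for zero counts
    logx = np.log(x)
    # subtracting the per-sample mean log divides by the geometric mean
    return logx - logx.mean(axis=1, keepdims=True)

# The same relative composition at 10x the depth gives identical CLR values,
# so the transform itself is insensitive to the total read count.
a = np.array([[100, 300, 600]])
print(np.allclose(clr(a, pseudocount=0), clr(a * 10, pseudocount=0)))  # True
```

So my worry is less about the transform itself and more about systematic differences that come with depth (for example, deeper samples detecting more rare taxa and containing fewer zeros).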
So my dilemma is:
- use the data as is, or
- subsample the group 1 samples to random (varying) depths within the range of the other two groups, to account for any possible systematic variation due to read depth (a sketch of what I mean follows this list).
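Since I describe the subsampling option only in words, here is a minimal sketch of what I mean, assuming a samples x taxa count table in a pandas DataFrame; `otu`, `group1_samples`, and `depth_range` are hypothetical names and the toy table at the end is only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def subsample_counts(counts, depth, rng):
    """Draw `depth` reads without replacement from a vector of taxon counts."""
    reads = np.repeat(np.arange(counts.size), counts.astype(int))  # one entry per read, labelled by taxon index
    keep = rng.choice(reads, size=depth, replace=False)
    return np.bincount(keep, minlength=counts.size)

def subsample_group1(otu, group1_samples, depth_range, rng):
    """Subsample each group 1 sample to a random depth within depth_range."""
    out = otu.copy()
    for s in group1_samples:
        depth = int(rng.integers(depth_range[0], depth_range[1] + 1))  # random target depth per sample
        out.loc[s] = subsample_counts(otu.loc[s].to_numpy(), depth, rng)
    return out

# Toy usage: one "group 1" sample with 100,000 reads brought down into the
# 10,000-20,000 range of the other two groups.
otu = pd.DataFrame({"taxon_a": [60_000], "taxon_b": [40_000]}, index=["g1_s1"])
sub = subsample_group1(otu, ["g1_s1"], depth_range=(10_000, 20_000), rng=rng)
print(sub.sum(axis=1))  # new total read count for g1_s1
```

The idea would be that each group 1 sample gets its own random target depth within the 10,000-20,000 range of the other groups, rather than a single fixed rarefaction depth.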
I do not have anybody nearby to discuss this problem with, and I would really like to hear some opinions before I start the main analysis.
Thanks in advance,
M