Can I take a subsample of representative sequences for a phylogenetic tree?

I have a very large merged dataset, and it is too large to run typical phylogenetic inference on with align-to-tree-mafft-iqtree. The dataset is not 16S-based, so I cannot use the fragment-insertion alternative. Is there a way to subsample my data after consulting the alpha rarefaction curves and the summary visualisations of the feature table and sequences?

Ideally I would then like to use the subsampled dataset for all analysis going forward.

I'm not sure this will work.

First, 'subsampling' happens to feature counts, not the features themselves.

Example raw table:

| Feature | Sample1 | Sample2 | Sample3 | Sum of this ASV |
|---|---|---|---|---|
| ASV1 | 100 | 100 | 80 | 280 |
| ASV2 | 100 | 50 | 50 | 200 |
| ASV3 | 50 | 20 | 1 | 71 |
| Sum of this Sample | 250 | 170 | 131 | |

Example table after subsampling:
(Note how counts per sample are all the same, and all the features are still there.)

| Feature | Sample1 | Sample2 | Sample3 | Sum of this ASV |
|---|---|---|---|---|
| ASV1 | 42 | 70 | 76 | 188 |
| ASV2 | 44 | 35 | 44 | 123 |
| ASV3 | 34 | 15 | 0 | 49 |
| Sum of this Sample | 120 | 120 | 120 | |

> Ideally I would then like to use the subsampled dataset for all analysis going forward.

Because subsampling changes the counts and not the features, the tree would still be large. :deciduous_tree:
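To make the point concrete, here is a minimal sketch of per-sample subsampling without replacement, using only the Python standard library. It is not how QIIME 2 implements rarefaction; the `rarefy` function, the dictionary layout, and the seed are assumptions made for illustration. The input mirrors the raw example table above.

```python
import random

def rarefy(table, depth, seed=0):
    """Subsample every sample to `depth` reads, without replacement.

    `table` maps feature ID -> {sample ID -> count}. This layout is a
    hypothetical stand-in for a real feature table.
    """
    rng = random.Random(seed)
    samples = sorted({s for counts in table.values() for s in counts})
    out = {feature: {} for feature in table}
    for sample in samples:
        # Expand the counts into a pool of individual reads, one label per read.
        pool = [f for f, counts in table.items()
                for _ in range(counts.get(sample, 0))]
        if len(pool) < depth:
            continue  # samples shallower than the depth are left out
        drawn = rng.sample(pool, depth)
        for feature in table:
            out[feature][sample] = drawn.count(feature)
    return out

raw = {
    "ASV1": {"Sample1": 100, "Sample2": 100, "Sample3": 80},
    "ASV2": {"Sample1": 100, "Sample2": 50, "Sample3": 50},
    "ASV3": {"Sample1": 50, "Sample2": 20, "Sample3": 1},
}

rarefied = rarefy(raw, depth=120)
for feature, counts in rarefied.items():
    print(feature, counts)
```

Every sample ends up with exactly 120 reads, but all three feature IDs are still rows in the table, so a tree built from them would have the same number of tips.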


Maybe I do not mean subsampling, then. When I look at the sampling depth determined from rarefaction in the feature table summary visualisation, it shows that x% of features would be lost. Would that not result in a smaller tree that still conveys similar information, as per the theory of rarefaction?

Ah! Thank you for clarifying. In that case, yes, those features would be lost at that subsampling depth and would be dropped from the tree. (You may have to drop them using an additional command, but it's possible.)

In my example, ASV3 in Sample3 had a count of zero after subsampling. If it had a count of zero in all samples, the feature could be dropped from the table and the tree. This sounds like what you want.
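As a rough sketch of that last step, the snippet below drops any feature whose count is zero in every sample and reports which IDs remain. The `drop_empty_features` helper and the small table are hypothetical and written in plain Python; in QIIME 2 you would use its own filtering actions on the table, the representative sequences, and the tree instead.

```python
def drop_empty_features(table):
    """Keep only features with a nonzero count in at least one sample.

    `table` maps feature ID -> {sample ID -> count} (an assumed layout).
    """
    return {
        feature: counts
        for feature, counts in table.items()
        if any(count > 0 for count in counts.values())
    }

# Hypothetical table after subsampling, where ASV3 was lost from every sample.
subsampled = {
    "ASV1": {"Sample1": 60, "Sample2": 80, "Sample3": 76},
    "ASV2": {"Sample1": 60, "Sample2": 40, "Sample3": 44},
    "ASV3": {"Sample1": 0, "Sample2": 0, "Sample3": 0},
}

filtered = drop_empty_features(subsampled)
kept_ids = sorted(filtered)  # IDs to retain in the sequences and the tree
print(kept_ids)
```

The list of kept IDs is what you would use to filter the representative sequences before building the tree, or to prune an existing tree.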
