I am currently working on a dataset of human gut microbiomes, and I want to check how pooling samples affects the microbiota obtained from individual people. I have samples from 16 people and 3 samples of pooled microbiota from all 16 people.
I would like to create a Venn diagram comparing the ASVs of “Individual core microbiome”, “Pooled core microbiome”, “Individual total microbiome” and “Pooled total microbiome”.
I was able to get lists of ASVs for “Individual core microbiome” and “Pooled core microbiome”, but I have a problem creating a list of all observed features for “Individual total microbiome” (all ASVs from the samples from individual people) and “Pooled total microbiome” (all ASVs from the pooled samples). I only get the number of observed_features in each sample, and I don’t know how to create a list of all observed features to compare with the core features.
Hi @Paulina_Srednicka ,
Welcome to the forum!
One option is to use metadata-based filtering (check out this tutorial) to filter your feature table so that it contains only samples from one individual, and then treat all features remaining in the filtered table as the complete list of features detected in that individual's samples.
If you are comfortable with scripting (R, Python), it may be easier to convert the feature table to .tsv, read it into a dataframe, merge it with the metadata file, and use dataframe filtering methods to split the dataframe by individual and get a list of features for each.
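As a rough illustration of the scripting route, here is a minimal sketch that uses only Python's standard library (a pandas dataframe would work the same way). It assumes the feature table was exported to a TSV with a `#OTU ID` header row and that the metadata has a `sample-id` column plus a grouping column; the column name `sample_type`, the sample names, and the ASV names are all made up for the example.

```python
import csv
import io

# Hypothetical toy data standing in for the exported feature table and metadata.
FEATURE_TABLE_TSV = """# Constructed from biom file
#OTU ID\tperson1\tperson2\tpool1
asv_a\t10\t0\t5
asv_b\t0\t3\t0
asv_c\t0\t0\t7
"""
METADATA_TSV = """sample-id\tsample_type
person1\tindividual
person2\tindividual
pool1\tpooled
"""


def features_by_group(table_file, metadata_file, column):
    """Return {group label: set of features with a non-zero count in that group}."""
    # Map each sample ID to its group label from the chosen metadata column.
    reader = csv.DictReader(metadata_file, delimiter="\t")
    groups = {row["sample-id"]: row[column] for row in reader}

    rows = csv.reader(table_file, delimiter="\t")
    # Skip any comment lines until the header row that lists the sample IDs.
    header = next(r for r in rows if r and r[0] == "#OTU ID")
    samples = header[1:]

    observed = {}
    for row in rows:
        if not row:
            continue  # ignore blank lines
        feature, counts = row[0], row[1:]
        for sample, count in zip(samples, counts):
            if float(count) > 0 and sample in groups:
                observed.setdefault(groups[sample], set()).add(feature)
    return observed


total = features_by_group(
    io.StringIO(FEATURE_TABLE_TSV), io.StringIO(METADATA_TSV), "sample_type"
)
print(sorted(total["individual"]))
print(sorted(total["pooled"]))
```

With real files you would pass open file handles instead of the `io.StringIO` objects. The resulting sets can be fed directly into whatever tool draws the Venn diagram.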
In addition, I got a hint that you can use Jaccard distances (presence/absence) for this purpose and compare the distribution of within-individual vs. between-individual distances, but it may be a little bit tricky.
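For reference, the Jaccard distance between two samples depends only on which features are present in each. A minimal sketch, with made-up sample names and ASV sets, could look like this:

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of observed features."""
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(a & b) / len(union)


# Hypothetical presence/absence data: sample name -> set of observed ASVs.
samples = {
    "person1": {"asv_a", "asv_b"},
    "person2": {"asv_b", "asv_c"},
    "pool1": {"asv_a", "asv_b", "asv_c"},
}

# Pairwise distances between all samples.
distances = {
    (s1, s2): jaccard_distance(samples[s1], samples[s2])
    for s1, s2 in combinations(sorted(samples), 2)
}
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

Grouping these pairwise values by whether both samples are individual, both are pooled, or one of each would give the distributions to compare.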
As you suggested, I used the feature table (containing samples from all individuals, because I want to compare all observed ASVs from all people to the pooled samples) to make the Venn diagram.
But surprisingly, about 900 ASVs from the pooled samples are not found in the individual samples. Is that even possible? Or am I doing something wrong?
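One way to look into a result like this is to list the pooled-only ASVs and check how many reads support each of them. The sketch below uses invented ASV names and counts purely for illustration; it only shows the set operations behind the Venn regions.

```python
# Hypothetical ASV sets; in practice these come from the feature table.
individual_total = {"asv_a", "asv_b", "asv_d"}
pooled_total = {"asv_a", "asv_c", "asv_e"}

# Hypothetical total read counts per ASV across the pooled samples.
pooled_counts = {"asv_a": 1200, "asv_c": 3, "asv_e": 40}

# The three regions of a two-set Venn diagram.
shared = individual_total & pooled_total
pooled_only = pooled_total - individual_total
individual_only = individual_total - pooled_total

# Read support for the ASVs seen only in the pooled samples.
pooled_only_counts = {asv: pooled_counts[asv] for asv in sorted(pooled_only)}
print(pooled_only_counts)
```

If most of the pooled-only ASVs turn out to have very low counts, that would suggest they are rare features rather than a mistake in the workflow.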
It is recommended to filter out rare features (those below some frequency threshold) after DADA2, since they most probably represent errors.
I prefer to remove all features with a total frequency < 50, but you can decrease or increase this number depending on your preferences.
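To make the threshold concrete, here is a small sketch of the same filtering logic in plain Python, assuming "frequency" means the total count of a feature summed over all samples. The table contents are invented. Within QIIME 2 itself, `qiime feature-table filter-features` with `--p-min-frequency` should do the equivalent on the .qza table.

```python
def filter_rare_features(table, min_frequency=50):
    """Keep features whose count summed over all samples is at least min_frequency."""
    return {
        feature: counts
        for feature, counts in table.items()
        if sum(counts.values()) >= min_frequency
    }


# Hypothetical feature table: feature -> {sample: count}.
table = {
    "asv_a": {"person1": 40, "pool1": 30},  # total 70, kept
    "asv_c": {"person1": 0, "pool1": 7},    # total 7, removed
    "asv_e": {"person1": 25, "pool1": 25},  # total 50, kept (not below 50)
}

filtered = filter_rare_features(table)
print(sorted(filtered))
```

Rerunning the Venn comparison on the filtered table shows how many of the pooled-only ASVs survive the threshold.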