Hi QIIME forum,
I want to use DEICODE to analyze my ITS data from plant roots and was wondering why it's okay to remove singletons if the feature abundances are converted to proportions.
Most of my samples have less than 10% singletons when performing 97% clustering with vsearch, but two of them contain ~20% singletons! I am worried that by removing singletons, the relative proportions of the non-singleton reads will be inflated when using DEICODE, which applies a centered log-ratio transformation.
I am new to this field and was hoping to get some advice here on why it's okay to chuck reads if the data only make sense compositionally. Otherwise, I was thinking of rarefying first, then removing singletons, and then perhaps using alternative multivariate tools that rely on absolute abundances rather than relative proportions to examine the data. Methinks this would get around the effect of the high proportion of singletons?
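To make the concern above concrete, here is a minimal sketch for one hypothetical sample. The counts are invented for illustration, and the `clr` helper is a plain centered log-ratio on strictly positive values; it is an assumption for this example and does not reproduce DEICODE's own robust variant. It compares proportions and log-ratios before and after dropping the count-1 features.

```python
import math

# Hypothetical counts for one sample: three abundant features and four singletons.
counts = [50, 30, 16, 1, 1, 1, 1]
kept = [c for c in counts if c > 1]  # same sample after dropping singletons


def proportions(values):
    """Convert counts to relative abundances that sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def clr(values):
    """Plain centered log-ratio for strictly positive values (illustrative only)."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


prop_before = proportions(counts)
prop_after = proportions(kept)

clr_before = clr(counts)
clr_after = clr(kept)

# Log-ratio between the first two retained features, before and after filtering.
log_ratio_before = clr_before[0] - clr_before[1]
log_ratio_after = clr_after[0] - clr_after[1]

print("proportion of feature 1:", prop_before[0], "->", prop_after[0])
print("log-ratio feature 1 / feature 2:", log_ratio_before, "->", log_ratio_after)
```

In this toy case the proportion of each retained feature does rise slightly after filtering, and the individual clr values shift because the centering term changes, but the log-ratio between any two retained features stays the same. Whether that carries over to a given real dataset and to DEICODE's transform is something to check rather than assume.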
Hi @mikki,
Typically singletons are removed as a quality-filtering step, on the assumption that they may be off-target hits (e.g., host sequence) or other errors/contamination that made their way into the sample prior to sequencing. If you're not interested in this quality filter, you could leave it out, or you could run with and without singleton removal and compare the results. Have you looked into the singletons to see what they are? You could get a rough idea of this using the qiime feature-table summarize and qiime feature-table tabulate-seqs visualizers. summarize can help you identify the feature ids of singletons (see the Feature Detail tab), and then tabulate-seqs lets you easily BLAST those against the NCBI nr database. Spot-checking some of your singletons might give you a better idea of whether these are features you want to retain or discard.
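As a rough offline illustration of the same idea, the sketch below finds singleton feature ids in a small hypothetical table and builds a filtered copy for a with/without comparison. The table, sample names, and feature ids are all made up, and a plain dictionary stands in for a real feature table; this is not the QIIME 2 API.

```python
# Hypothetical feature table: sample id -> {feature id: read count}.
table = {
    "s1": {"otu1": 120, "otu2": 0, "otu3": 30, "otu4": 1},
    "s2": {"otu1": 80, "otu2": 1, "otu3": 0, "otu4": 0},
    "s3": {"otu1": 95, "otu2": 0, "otu3": 45, "otu4": 0},
}

# Total count of each feature across all samples.
feature_totals = {}
for sample_counts in table.values():
    for feature_id, count in sample_counts.items():
        feature_totals[feature_id] = feature_totals.get(feature_id, 0) + count

# Singletons here are features observed exactly once in the whole table.
singleton_ids = sorted(f for f, total in feature_totals.items() if total == 1)

# Share of each sample's reads that belong to singleton features.
singleton_read_share = {
    sample: sum(c[f] for f in singleton_ids) / sum(c.values())
    for sample, c in table.items()
}

# Filtered copy of the table, for comparing results with and without singletons.
filtered = {
    sample: {f: n for f, n in c.items() if f not in singleton_ids}
    for sample, c in table.items()
}

print("singleton feature ids:", singleton_ids)
print("singleton read share per sample:", singleton_read_share)
```

The ids in `singleton_ids` are the ones you would then spot-check by BLAST.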
I'm not an expert on DEICODE, so it may also help to have @cmartino weigh in. @cmartino, do you have input on this?
Thanks for the question @mikki and for trying RPCA! I agree with what @gregcaporaso already said.
Let's assume for a second that the singletons are real and not artifactual. RPCA and all other dimensionality reduction methods (e.g., PCoA, beta-diversity, PCA, etc.) rely on there being a low-rank structure in the data. That means the data have some underlying set of groups of similar/different samples, and each sample is not completely unique (see here for what that would look like). If you want to explain why two groups of samples are different, a singleton is inherently not very useful because it can show up at most once, in only one of the groups. So there is not much that can be said about the importance of a single occurrence across groups. That is why we usually remove them for this type of analysis regardless. If you think the singletons may have some higher-level grouping that makes them important, such as other phylogenetically similar singletons all found in one group, then I would recommend Phylo-RPCA (or grouping by taxonomy, but phylogeny is often much more informative). For Phylo-RPCA, see here for a tutorial and here for the tool methodology publication; it is included in Gemelli, our newer package that also contains RPCA, here.
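The low-rank point can be illustrated with a toy sketch. The tables below are invented, and the `matrix_rank` helper is a small exact Gaussian elimination written for this example, not part of DEICODE or Gemelli. It compares a table with two groups of similar samples, a table where every sample is completely unique, and the grouped table with one singleton feature added.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Look for a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Two groups of three samples; samples within a group share a feature profile.
profile_a = [10, 5, 1, 1]
profile_b = [1, 1, 8, 4]
grouped = [profile_a] * 3 + [profile_b] * 3

# Every sample has only its own private feature: no shared structure at all.
all_unique = [[7 if i == j else 0 for j in range(6)] for i in range(6)]

# The grouped table plus one singleton feature seen once, in the first sample.
with_singleton = [row + [1 if i == 0 else 0] for i, row in enumerate(grouped)]

rank_grouped = matrix_rank(grouped)
rank_unique = matrix_rank(all_unique)
rank_with_singleton = matrix_rank(with_singleton)
print(rank_grouped, rank_unique, rank_with_singleton)
```

Here the grouped table has rank 2, the all-unique table has full rank 6, and the singleton adds one extra dimension that describes a single sample rather than a difference between the groups. Real tables are noisy and only approximately low-rank, so treat this as intuition only.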