I lost 90% of my features after removing singletons with '--p-min-samples 2'

Hello guys, I am confused about how to interpret my results after removing all singletons. From my understanding, singletons are features (unique ASVs or OTUs) that exist in only one sample, or whose total frequency across all samples is less than 10 (this threshold is just a rough guess; it could also be 20 or 5). Singletons could be noise introduced during the sequencing process. Is this right?

So after completing the DADA2 denoising step in QIIME 2, I got my table artifact, rep-seqs, and statistics as well. I found that I have 5394 ASVs in total, which is reasonable. But I noticed that there are very many singletons (ASVs that appear in only one sample with a frequency of less than 10) among the 5394 ASVs. So I used the following command to filter out the singletons.
qiime feature-table filter-features \
  --i-table tablenp1.qza \
  --p-min-samples 2 \
  --o-filtered-table feature-frequency-filtered-tablenp1.qza
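To make the effect of `--p-min-samples 2` concrete, here is a minimal Python sketch on a made-up table. It is not QIIME 2 code: the table, the feature names, and the `filter_min_samples` helper are all hypothetical. It only illustrates that this kind of filter works on prevalence (how many samples a feature is seen in), not on abundance.

```python
# Toy illustration only: made-up counts, not QIIME 2 code.
# Keys are features (ASVs); each list holds the counts in three samples.
toy_table = {
    "asv_a": [120, 95, 210],  # abundant, present in all 3 samples
    "asv_b": [0, 500, 0],     # abundant, but present in only 1 sample
    "asv_c": [3, 0, 0],       # rare, present in only 1 sample
    "asv_d": [1, 2, 0],       # rare, but present in 2 samples
}


def filter_min_samples(table, min_samples):
    """Keep features with a nonzero count in at least `min_samples` samples."""
    return {
        feature: counts
        for feature, counts in table.items()
        if sum(1 for c in counts if c > 0) >= min_samples
    }


kept = filter_min_samples(toy_table, min_samples=2)
print(sorted(kept))
```

In this toy table, the abundant single-sample feature `asv_b` is dropped while the rare two-sample feature `asv_d` is kept, so a prevalence filter can remove more than strict one-read singletons.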
And then the filtered table looks very strange to me. I only have 866 ASVs left, so I lost most of my features. The microbiome data I am analysing is from mouse gut (treated with H. pylori). I browsed other related posts in this forum and found the post Too many singletons, so I checked whether any primers or barcodes remained in my raw data. But I don't think this fits my scenario, because my raw data is paired-end demultiplexed data, so there is no barcode in the sequences, and I had already removed the primers with the following command:
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-paired-end.qza \
  --p-front-f ^CCTACGGGNGGCWGCAG \
  --p-front-r ^GACTACHVGGGTATCTAATCC \
  --p-adapter-f GGATTAGATACCCBDGTAGTC \
  --p-adapter-r CTGCWGCCNCCCGTAGG \
  --p-times 2 \
  --p-cores 30 \
  --p-error-rate 0.1 \
  --p-match-adapter-wildcards \
  --p-match-read-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences paired-end-demux-primertrimmed.qza
So what could be the reason for so many singletons in my results? Are they true features or just noise? If they are noise, why is there so much noise in my raw data?

tablenp1.qzv (564.5 KB)
feature-filtered-tablenp1.qzv (474.7 KB)
Please find above the table qzv files from before and after filtering.

Any comment will be appreciated. Thanks in advance.

Hello @BOYANG_HUANG and welcome to the forums! :qiime2:

I appreciate your detailed post. You are on the right track!

Understood! There are a lot of low abundance features in this table.

This was also my first thought. I can see why the trimming would reduce this problem...

Let's return to the original definition of singleton.

That's right: one read from one sample
(I suppose it could be <10 reads, but I usually see 1)

Because biological things usually come in multiples, seeing only one is suspect.
And, the DADA2 pipeline removes errors that would cause singletons already.

You have done everything right, yet 84% of your features are singletons!
I'm stumped! Maybe you could make an MSA of the dada2 rep-seqs?

@colinbrislawn, thanks for your prompt response. This forum has so many nice people. I will take your suggestion and do an MSA on my DADA2 rep-seqs. In the meantime, I was inspired by the post Too many features?, so I will also try the downstream analysis with and without filtering out the rare ASVs to see what the difference is, and then decide which pipeline to focus on. I will also check other literature on mouse gut microbiota to find out what the feature tables look like in other studies. Do these thoughts make sense to you? These are all the solutions I could come up with.

Hello!
I just got a new dataset to play with, also targeting the mouse gut microbiome. I followed a pipeline similar to yours: demultiplexing, removing primers (discarding untrimmed reads), and running DADA2. I got around 2500 unique ASVs, and after filtering (10 counts, 2 samples) I got ~650. We targeted V1-V2.
IMHO, the overall count of reads that you kept after filtering is more important than the number of unique features. If you lost a significant part of the reads, that means there is definitely something wrong with the sequences, like barcodes still attached or something else that should be investigated. If you are losing only an insignificant part of the overall count, that means they are probably not ...

... biological things.

What I wrote above is just my opinion, and I mostly joined to listen to ...

... opinions regarding this issue since I had some doubts about my data as well.
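The heuristic above (compare the fraction of reads kept, not just the number of features kept) can be sketched in a few lines of Python. This is a toy example with hypothetical counts and helper names, not QIIME 2 code; the thresholds mirror the "10 counts, 2 samples" filter mentioned in this post.

```python
# Toy illustration only: hypothetical counts, not QIIME 2 code.
# Keys are features (ASVs); each list holds the counts in three samples.
toy_table = {
    "asv_a": [4000, 3500, 5000],
    "asv_b": [900, 0, 1200],
    "asv_c": [6, 0, 0],   # fails both thresholds
    "asv_d": [0, 2, 3],   # in 2 samples, but only 5 reads in total
    "asv_e": [0, 0, 40],  # 40 reads, but in only 1 sample
}


def filter_features(table, min_frequency, min_samples):
    """Keep features that meet both the total-count and prevalence thresholds."""
    return {
        feature: counts
        for feature, counts in table.items()
        if sum(counts) >= min_frequency
        and sum(1 for c in counts if c > 0) >= min_samples
    }


def total_reads(table):
    """Sum of all counts in the table."""
    return sum(sum(counts) for counts in table.values())


filtered = filter_features(toy_table, min_frequency=10, min_samples=2)
features_kept = len(filtered) / len(toy_table)
reads_kept = total_reads(filtered) / total_reads(toy_table)
print(f"features kept: {features_kept:.0%}, reads kept: {reads_kept:.1%}")
```

In this made-up table most features are removed but almost all reads survive, which is the pattern described above as unlikely to indicate a problem with the sequences.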

Hi @timanix, thanks for sharing your data. You are right, I only lost an insignificant part of the overall reads after filtering out rare ASVs: from 16,890,049 to 16,864,755. So I assume this doesn't affect the downstream analysis dramatically. And the number of unique features in your mouse gut microbiome dataset is similar to mine (864) after filtering. So I assume we can just proceed with the analysis using those few hundred features.
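For reference, the numbers quoted in this thread can be checked with a few lines of Python. The feature counts come from the first post (5394 ASVs before, 866 after; this post quotes 864, and the sketch assumes 866), and the read totals come from this post. The variable names are just for illustration.

```python
# Numbers quoted in this thread; variable names are illustrative.
features_before, features_after = 5394, 866          # ASVs before/after filtering
reads_before, reads_after = 16_890_049, 16_864_755   # total reads before/after

features_removed_pct = 100 * (features_before - features_after) / features_before
reads_removed = reads_before - reads_after
reads_removed_pct = 100 * reads_removed / reads_before

print(f"features removed: {features_removed_pct:.1f}%")
print(f"reads removed: {reads_removed} ({reads_removed_pct:.2f}%)")
```

So roughly 84% of the features account for only about 0.15% of the reads.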

Great!

Right. You could try alpha and beta diversity on the filtered and unfiltered tables to check.
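As a rough idea of what that comparison might show, here is a small Python sketch that computes Shannon diversity and Bray-Curtis dissimilarity on two made-up samples, with and without their rare features. Everything here is hypothetical toy data and hand-written helpers, not QIIME 2 output; on real data you would run the diversity analyses on both tables instead.

```python
import math

# Toy counts only (hypothetical); keys are features, values are read counts.
sample_1 = {"asv_a": 5000, "asv_b": 3000, "asv_c": 2000, "rare_1": 3, "rare_2": 2}
sample_2 = {"asv_a": 2000, "asv_b": 6000, "asv_c": 1500, "rare_3": 4}


def shannon(counts):
    """Shannon entropy (natural log) of one sample's counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples."""
    features = set(a) | set(b)
    diff = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    return diff / (sum(a.values()) + sum(b.values()))


def drop_rare(counts):
    # Stand-in for a filtered table: simply drop the "rare_*" features.
    return {f: c for f, c in counts.items() if not f.startswith("rare_")}


for label, s1, s2 in [
    ("unfiltered", sample_1, sample_2),
    ("filtered", drop_rare(sample_1), drop_rare(sample_2)),
]:
    # Columns: label, observed features in sample 1, Shannon of sample 1, Bray-Curtis
    print(label, len(s1), round(shannon(s1), 3), round(bray_curtis(s1, s2), 3))
```

In this toy case the observed feature count changes a lot, while Shannon and Bray-Curtis barely move, because the rare features carry almost no reads. Whether the same holds for a real dataset is exactly what the filtered versus unfiltered comparison would check.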

I would love to know what kind of ASV filtering is common today.