Singletons removed from feature table (QIIME 2 output)

I am handling 16S rRNA data from the human gut. After following the QIIME 2 protocol, I obtained 4500 ASVs. When I removed singletons from the data, I am left with only 1500 ASVs.

Is there any issue with my data? I have read on this forum that, after DADA2 denoising, one should not get a large number of singletons.

I am analyzing the data for the first time.
Help and suggestions are appreciated.

Welcome to the forum, @Tahseen_Abbas!
This is an interesting question, and I suspect people will have trouble answering it without more information about your data and your process. Can you share your DADA2 denoising stats, and some basic info about your study/sequences?

Best,
Chris :spider_web:

Hi Chris,

Thanks for the immediate attention to my query.

I am studying stool samples obtained from healthy individuals. The V3-V4 paired-end sequencing (600 cycles) was done in 2 batches from the same library. I processed both batches separately through DADA2 and then merged the feature tables obtained from the two runs for downstream taxonomy classification and alpha and beta diversity calculations.
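For reference, I merged the two runs' outputs with commands along these lines (the file names here are placeholders, not my exact paths):

qiime feature-table merge \
  --i-tables table-feb.qza \
  --i-tables table-june.qza \
  --o-merged-table merged-table.qza

# the representative sequences also need merging for taxonomy classification
qiime feature-table merge-seqs \
  --i-data rep-seqs-feb.qza \
  --i-data rep-seqs-june.qza \
  --o-merged-data merged-rep-seqs.qza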

I removed the Illumina primer sequences using cutadapt. The primer sequences used are:
16S Amplicon PCR Forward Primer = 5'-CCTACGGGNGGCWGCAG
16S Amplicon PCR Reverse Primer = 5'-GACTACHVGGGTATCTAATCC
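The primer removal step was roughly as follows, run once per batch (artifact names are placeholders):

# trim the amplicon primers off both reads before denoising
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux-feb.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --o-trimmed-sequences trimmed-feb.qza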

Denoising (DADA2) parameters for both runs:
1st Run (Feb Run) --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 284 --p-trunc-len-r 207
2nd Run (June Run) --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 272 --p-trunc-len-r 202
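So the full command for the first run was roughly the following, and the same for the June run with its own truncation lengths (artifact names are placeholders):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-feb.qza \
  --p-trim-left-f 0 --p-trim-left-r 0 \
  --p-trunc-len-f 284 --p-trunc-len-r 207 \
  --o-table table-feb.qza \
  --o-representative-sequences rep-seqs-feb.qza \
  --o-denoising-stats stats-feb.qza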

feb_metadata_postcutadapt_DADA2denoising.tsv (7.2 KB) June_metadata_postcutadapat_DADA2denoising.tsv (7.0 KB)

After merging the feature tables, I trained a Naive Bayes classifier on the Greengenes 99% OTU database. I first extracted the primer-targeted reference reads with the following command:

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --o-reads ref-seqs.qza
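and then trained and applied the classifier roughly like this (the reference taxonomy and other file names are placeholders):

# train a Naive Bayes classifier on the primer-trimmed Greengenes reads
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classifier gg-99-v3v4-classifier.qza

# classify the merged representative sequences
qiime feature-classifier classify-sklearn \
  --i-classifier gg-99-v3v4-classifier.qza \
  --i-reads merged-rep-seqs.qza \
  --o-classification taxonomy.qza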

Finally, I obtained an ASV table of 4500 ASVs, which I wanted to filter to reduce noise. So I first tried to remove singletons (which gave me a final number of 1500).

Also, please suggest what kinds of filtering strategies I can apply to my data before doing alpha and beta diversity analysis.

Thanks for the comprehensive response, @Tahseen_Abbas!
I'm still relatively new to 16S sequence analysis and am not a wet-lab scientist. I'll do my best to help, but more-experienced community members may have better insight for you.

I don't have a paper for you on what percentage of features to expect to lose when singletons are removed post-DADA2, but Auer 2017 treats singleton removal in an OTU study in depth, and the large-scale reductions in unique sequences it reports across different study contexts are worth looking at.

Re: post-DADA2, I have only one data point to share personally - working with fecal samples in a controlled-environment mouse study, features seen in one sample seemed likely to re-appear in other samples from the same mouse/cagemates. After filtering out all features that appeared in only one sample, we saw a drop from 8836 features (ASVs) to 3841 features.

qiime feature-table filter-features \
  --i-table paired-data/DADA2/cleaned-table5.qza \
  --p-min-samples 2 \
  --o-filtered-table paired-data/DADA2/final-table

This is more attrition than you saw in your work, which is unsurprising if we assume you filtered only true singletons. Some of our single-sample features were present 3, 4, 5+ times in that one sample, and would not have been cut by a true singleton filter.
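For comparison, a true singleton filter (dropping features with a total frequency of 1 across all samples) would look something like this (table names are placeholders):

qiime feature-table filter-features \
  --i-table merged-table.qza \
  --p-min-frequency 2 \
  --o-filtered-table table-no-singletons.qza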

Auer (esp. section 4.4) and Bokulich 2013 both discuss "rare filter" strategies. The only approach I've used personally has been removal of samples with unusually low sequencing depth (e.g. min 1000 reads for the high-quality samples from the fecal study I discussed above). I think about this not in terms of removing rare reads as in Bokulich, but in terms of removing low-quality/failed samples.
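That sample-level filter looks roughly like this; the paths and the 1000-read cutoff are just the values we happened to use, and yours may well differ:

# drop samples whose total read count fell below the chosen depth cutoff
qiime feature-table filter-samples \
  --i-table paired-data/DADA2/cleaned-table5.qza \
  --p-min-frequency 1000 \
  --o-filtered-table paired-data/DADA2/high-depth-table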

That's officially everything I know about sequence filtering! I hope there's something of value in it for your work.


Hi @Tahseen_Abbas,

Something I noted from the DADA2 commands you used: for merging purposes, the truncation lengths have to be the same in all runs. If not, an ASV present in both of your runs may not be recognised as the same sequence because of the differing lengths.
I cannot tell whether this is the major origin of the problem you are seeing, but it will surely contribute to it (especially because you are working with 2x300 bp sequences; it would matter less with V4 and 2x250 bp)!
An easy way to check for this is to plot your merged table (before any filtering!) on an Emperor plot and see how many ASVs are run-specific.
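Something along these lines should work; the metric and file names are only examples, and you would need a metadata column (e.g. 'run') saying which run each sample came from:

qiime diversity beta \
  --i-table merged-table.qza \
  --p-metric braycurtis \
  --o-distance-matrix braycurtis-dm.qza

qiime diversity pcoa \
  --i-distance-matrix braycurtis-dm.qza \
  --o-pcoa braycurtis-pcoa.qza

qiime emperor plot \
  --i-pcoa braycurtis-pcoa.qza \
  --m-metadata-file merged-metadata.tsv \
  --o-visualization braycurtis-emperor.qzv

Then colour the points by the 'run' column in the Emperor visualization.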

Hope it helps!


Hi Chris

Thanks for the papers that you referred to. I like the sample-removal strategy; I also use it to filter out samples with low depth of coverage in whole-exome studies.

Hi Luca,

You pointed that out correctly: I used different truncation lengths for the two runs. Thanks for the explanation.
But truncating both runs to equal lengths might compromise the Phred quality (I have kept Q20 for both) in one run or the other, because I can see the data for one run is better than the other. Or can I use different quality thresholds and then truncate to equal lengths?

Any comments on that?

Hi,

Hmm, it will be necessary to trade off a bit. I would keep the more stringent of the two settings and accept losing a few good sequences in the better run.
As for truncating both runs to the same length, I am not sure what would happen, but you can try it and check whether the resulting sequences make sense.
Cheers

Hi Luca,

So now I have kept the truncation parameters the same for both runs, i.e.:
--p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 272 --p-trunc-len-r 202

With this, I could get a maximum of 50% non-chimeric reads from one run and a maximum of 20-25% non-chimeric reads from the second run.
Post denoising and merging, I got 5000 ASVs, of which I am again left with 1500 after removing singletons.
So the scenario is still the same as with my previous trimming parameters.

Kindly suggest

Hi @Tahseen_Abbas,
OK, I think there are a few different points in here now:
a) For the chimeric reads, do you have other QIIME 2 DADA2 results to compare these with? What is your 'usual' range of chimeric reads? How many PCR cycles did you use for these libraries? To my eye you are losing lots of reads here, but I don't know what your usual range is. Is the non-chimeric-reads step where you are losing most of the reads in the latest denoising runs?

b) When you say 'the paired-end sequencing was done in 2 batches', do you mean that the same libraries were run twice with a gap of 4 months? Did anything happen to the library during this time? (I know it is not possible to keep them for long, but I am not a lab person...)

c) If you proceed with the beta-diversity analysis keeping the samples from the two runs as different samples (e.g. if one sample was run on both lanes, give them suffixes 1 and 2 in the metadata), what do you see? Are the two runs separated on the Emperor plot? (See the sketch after this list for one way to test this formally.)

d) Finally, from the literature, roughly how many ASVs do you expect for your samples?
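For point c), one way to test the run effect formally is a PERMANOVA on a 'run' column in your metadata; something like this, with file and column names only as examples:

qiime diversity beta-group-significance \
  --i-distance-matrix braycurtis-dm.qza \
  --m-metadata-file merged-metadata.tsv \
  --m-metadata-column run \
  --p-method permanova \
  --o-visualization run-permanova.qzv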

I hope these are helpful reflections on the issue; let's see if anybody can suggest more checks!

Cheers


No, I don't have any other QIIME 2 DADA2 results to compare with, as we are following this pipeline for the first time in my lab and I am myself new to metagenome analysis, so I don't have a usual range of chimeric reads. 30 PCR cycles were used.
Most of the reads are lost at the initial read-filtering step (I used Q20).

Yes, the libraries are the same and they were of good quality.

I will proceed with beta diversity now, but I was just curious whether to filter out singletons before computing any core metrics. Also, I calculated this singleton count before taxonomy classification. After assigning taxonomy, one kind of filtering can be removing unassigned taxa, so does singleton removal before classification still remain a relevant filtering step?
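For the unassigned-taxa filtering I was thinking of something along these lines (I have not run this yet, and the file names are placeholders):

qiime taxa filter-table \
  --i-table merged-table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude Unassigned \
  --o-filtered-table table-no-unassigned.qza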

This I need to check, but it would require some expert comments for human stool samples.

Hi,

OK, lots of things on the plate now!
First, 30 PCR cycles does not sound too high, but again I am not a lab guy! However, what you say about the loss at the filtering step is interesting. I wonder, if you tried to use the forward reads (R1) only, how many ASVs would you get? That may be an option if trimming heavily would make you lose the overlap between R1 and R2.
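If you want to try that, I believe denoise-single will use only the forward reads when given paired-end data, so something like this should work on your existing trimmed reads (the truncation length is only an example, to take from your quality plots):

qiime dada2 denoise-single \
  --i-demultiplexed-seqs trimmed-feb.qza \
  --p-trim-left 0 \
  --p-trunc-len 270 \
  --o-table table-feb-fwd.qza \
  --o-representative-sequences rep-seqs-feb-fwd.qza \
  --o-denoising-stats stats-feb-fwd.qza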

I have no real experience analysing the same library run at two different times, with that much time between the two runs; I wonder if anybody else could help here. Could you confirm that nothing major happened to the library in between (broken freezer, power cut and so on)? As an aside, were the samples kept for long before proceeding with library preparation? It is known that some species may be lost under different storage conditions (but again I am not an expert on this).

On removing singletons, I think it is very project-dependent. For example, in my analyses I usually have no idea which taxa, and at which abundance, to look for, so I don't usually do this filtering.
I understand the point that if a taxon is present in only one sample at low abundance it may be an artefact, but which command (and which settings) do you use exactly for this filtering?

Cheers

I use the qiime2R package to read the .qza file in R, then extract the feature table, convert it to a data frame, and count the number of non-zeros row-wise (i.e. the number of samples each feature is present in). This gives me a count vector that is then used to identify which features are present in just one sample.

Hi,

OK, that may work for filtering. Only a clarification: if an ASV is present in only 1 sample but at very high abundance, is your script excluding that ASV? Just to double-check!

What I can see is that the script is removing ASVs present in one sample with abundance values such as 22 and 32. The "high abundance" criterion I still need to set up. How is this criterion usually defined?

ok, they don’t seems highly abundant feature at least, that is good.
I think defining a filtering threshold is very much project dependent with no general rules. I suggest to go back on Auer (esp. section 4.4) and [Bokulich 2013 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531572/) as suggested above by @ChrisKeefe for that.

From what I can tell from my own experience, I first try the analysis without filtering and see from there whether any is required.
A question on your experimental design: did you include any negative controls in your runs? Did you include them in both runs?

Controls were included during the PCR, but unfortunately they were not sequenced. I found this paper, "Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data", that talks about removing contamination from the data using 2 strategies:

(1) Sequences from contaminating taxa are likely to have frequencies that inversely correlate with sample DNA concentration and (2) sequences from contaminating taxa are likely to have higher prevalence in control samples than in true samples

The 1st one uses auxiliary DNA quantification data, which I have with me, while the 2nd relies on sequenced negative controls, which I don't have.

I think I can use the first strategy to filter out contamination.
Your comments are needed.

@Tahseen_Abbas, if you’re interested in filtering out contamination, I’d recommend you spend some time searching :mag: the forum and the literature and reading up on your options. Filtering out noise and filtering out contamination are not the same problem, and filtering out contamination without also removing useful information can be a more complex challenge. Unless @llenzi wants to tackle it in this topic, it’s probably best to stay focused on your initial question here. Once you’ve done some more reading on contaminant filtering, you can always open a new topic to discuss.

Circling back up the thread, I’d like to suggest you work through some of @llenzi 's great suggestions before asking for additional support on this topic:

  • Find a few publications that describe similar studies to yours (esp. as regards prep/sequencing processes, sample type, and denoising tools) and decide whether your feature counts are reasonable. If you have significantly fewer ASVs than other practitioners with similar samples and processes, then try to figure out why.
  • Experiment with your DADA2 parameters, and with using forward reads only. Tinkering with these numbers a little is often the only way to optimize the number of reads/features you preserve. If the filtering step is removing most of your reads, try trimming more aggressively. If your quality is too low to do so without losing more reads during the merging step, proceeding with forward reads might be your best bet. There are lots of good posts on how to do this here, and you might find that your data is more manageable than you expected. :crossed_fingers:
  • Plot a PCoA of all of your data together, colored by sequencing run, as described in “c” above. Check to see whether the diversity of your samples is separating based on biases introduced in prep/storage/sequencing. Knowing whether significant bias has been introduced by the different timeline/handling of your two sequencing runs may inform how you treat the data from those runs going forward.

Best,
Chris


Hi @Tahseen_Abbas,
I agree with @ChrisKeefe: let's try to keep this thread on the original question, at least for the sake of the forum. I feel it is partially my fault for opening up lots of potentially different points of view, apologies for that! If needed, I suggest opening another thread on contamination.

As I said, I think it is easier for now to try the analysis without filtering singletons and see whether the data make sense compared with what you would expect from the literature and so on. Only at that point, see whether applying a filter would make things clearer!

Cheers


Thanks, Chris. Sure, I will do that.