I recently received data from a NextSeq1000 sequencing run that included 16S-V3V4 amplicon samples from either human feces or whole insect guts. All libraries were prepared using the same lab protocols (primers, enzyme, PCR conditions, etc.), although the extraction method differed: a soil kit for the insect guts and a fecal kit for the fecal samples.
I removed adapters with cutadapt, and sequencing quality looks good. However, while analyzing the data in qiime2 I noticed a marked difference in the number of chimeras detected by DADA2. Fecal microbiome samples had a lot of chimeras (up to 50% of the reads), while invertebrate gut samples had almost none.
Does anyone have a good explanation for that? As I said, library preparation and bioinformatic processing were identical. Could it be related to the extraction method? Amount of template DNA (i.e.: bacteria) in the samples? Bacterial diversity?
It would be great to understand this a bit better!
The argument here is that the fecal samples may simply have more chimeric reads in them than the insect gut samples, and DADA2 is finding and removing them. I find that keeping settings consistent is usually defensible, as long as you have 'enough' reads in both cohorts.
I don't. The consensus, as I understand it, is that chimeric reads are a product of PCR amplification 8704952, so more PCR cycles lead to higher chimeric levels PMC6531881.
PMC3044863 claims "More similar 16S genes clearly form chimeras more readily," which makes sense. So I guess the question is not the total number of different microbes but how different these microbes are from one another. If your fecal samples have many highly similar microbes, they are more likely to form chimeras.
@colinbrislawn I suppose that's indeed a possibility, especially since in most samples I still have a good number of reads left.
@SoilRotifer I tried changing --p-min-fold-parent-over-abundance to 8 and indeed got a massive decrease in chimeras!! Numbers became close to those in the insect gut dataset. Even using 4 already made a significant difference. Of course, I also got a strong increase in the number of generated ASVs, which makes sense, I guess, since we're leaving more reads in. My whole dataset is around 150 samples and ASV number went from ~2500 to ~14k when I tested this on subsampled data with 5k read-pairs per sample. This of course raises the question of whether those are real biological ASVs or undetected chimeras... but the links you showed suggest that it might be fine to use --p-min-fold-parent-over-abundance = 8, right?
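To make the effect of that parameter more concrete, here is a minimal Python sketch of the idea as I understand it: a sequence can only be flagged as a chimera of "parents" that are more than a given fold more abundant than the sequence itself, so raising the fold threshold leaves fewer candidate parents and fewer flagged sequences. This is a toy illustration, not DADA2's implementation. It assumes equal-length sequences and exact matching with no alignment, and the function names (`is_bimera`, `flag_chimeras`) and the counts are hypothetical.

```python
def is_bimera(seq, parents):
    """True if seq is a left piece of one parent joined to a right piece of another.

    Toy assumption: all sequences have the same length and must match exactly.
    """
    n = len(seq)
    for left in parents:
        for right in parents:
            if left == right or len(left) != n or len(right) != n:
                continue
            # Try every possible breakpoint between the two parents.
            for bp in range(1, n):
                if seq[:bp] == left[:bp] and seq[bp:] == right[bp:]:
                    return True
    return False


def flag_chimeras(abundance, min_fold=1.0):
    """Return the set of sequences flagged as two-parent chimeras.

    Only sequences more than `min_fold` times as abundant as the query
    are allowed to act as its parents.
    """
    flagged = set()
    for seq, count in abundance.items():
        parents = [p for p, c in abundance.items()
                   if p != seq and c > min_fold * count]
        if is_bimera(seq, parents):
            flagged.add(seq)
    return flagged


# Hypothetical read counts: two abundant "real" sequences and two hybrids.
counts = {
    "AAAAAAAAAA": 1000,
    "CCCCCCCCCC": 900,
    "AAAAACCCCC": 240,  # left half of the first, right half of the second
    "CCCAAAAAAA": 120,  # left part of the second, right part of the first
}

for fold in (1.0, 4.0, 8.0):
    print(fold, len(flag_chimeras(counts, min_fold=fold)))
```

In this toy data the hybrids stop being flagged as the threshold rises simply because their would-be parents are no longer "abundant enough", which is the same trade-off discussed above: fewer reads removed as chimeras, but no guarantee that the sequences kept are real.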
Hello!
Please allow me to qiime in as well.
I would follow the recommendation of @SoilRotifer and then filter features based on prevalence and abundance. Usually, I remove features that are found in fewer than 3 samples and with an overall count of less than 10. That will decrease the number of unique features. If the features recovered by tweaking DADA2 are indeed chimeras, I would expect both the number of unique features and the total feature count to decrease drastically. If the recovered ASVs are biological sequences, then the number of unique features should still decrease, while the total feature count should decrease only slightly.
Hi @timanix ,
Thanks for the suggestion! Just to check that I got it right: are you suggesting that I remove features that are simultaneously found in fewer than 3 samples AND have an overall count of less than 10? Or do I remove everything that fits one condition OR the other? That is, should features that are found in a single sample be removed even if they have very high abundance in that sample?
These conditions are independent, so "OR".
For example, if I have 10 samples from one group, and all the lab work was done in the same way for all the samples, I would be very suspicious about the ASVs that were found in only one sample. Are they really biological sequences? Or some contamination?
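The "OR" rule and the before/after comparison described above can be sketched in a few lines of Python. This is only an illustration on a made-up feature table stored as nested dicts. The function names (`filter_features`, `retention`) and the counts are hypothetical, and the thresholds of 3 samples and 10 reads are the ones mentioned in this thread. In QIIME 2 itself, I believe the equivalent step is `qiime feature-table filter-features` with `--p-min-samples` and `--p-min-frequency`.

```python
def filter_features(table, min_samples=3, min_total=10):
    """Drop a feature if it fails EITHER condition (the "OR" rule).

    table: {feature_id: {sample_id: count}}
    """
    kept = {}
    for feature_id, sample_counts in table.items():
        prevalence = sum(1 for c in sample_counts.values() if c > 0)
        total = sum(sample_counts.values())
        if prevalence < min_samples or total < min_total:
            continue  # too rare across samples OR too few reads overall
        kept[feature_id] = sample_counts
    return kept


def retention(before, after):
    """Fraction of unique features and fraction of total reads retained."""
    reads_before = sum(sum(c.values()) for c in before.values())
    reads_after = sum(sum(c.values()) for c in after.values())
    return len(after) / len(before), reads_after / reads_before


# Made-up table for illustration only.
table = {
    "asv1": {"s1": 500, "s2": 400, "s3": 300},      # prevalent and abundant
    "asv2": {"s1": 900},                            # abundant, but in one sample only
    "asv3": {"s1": 2, "s2": 3, "s3": 1},            # prevalent, but only 6 reads
    "asv4": {"s1": 5, "s2": 4, "s3": 3, "s4": 2},   # passes both thresholds
}

filtered = filter_features(table)
feature_fraction, read_fraction = retention(table, filtered)
print(sorted(filtered), feature_fraction, round(read_fraction, 3))
```

Comparing the two fractions is the diagnostic suggested earlier: if the fraction of reads retained drops nearly as much as the fraction of features retained, the removed features were carrying a lot of reads, which is worth a closer look before trusting them.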