About:dada2 steps: filtering, merged, non-chimeric

I am working with very few samples (4 + 4 samples in two groups) for a microbiome analysis.
The read counts are very limited, around 50-1300 in those samples.
While running dada2, only a very limited number of sequences pass (after the filtering, merging, and chimera-removal steps). In two samples I got 0 (zero) non-chimeric reads, and in the other cases the counts are also very low.
Will subsequent analysis in dada2/qiime2 be worthwhile with these limited counts?
Can anyone suggest what I should do in the next steps?

Hi @dasqiime22,

This is already a very small # of reads to begin with; in most cases, samples with this few reads are removed after processing and filtering, but here you are starting at these numbers. Any particular reason why there are so few to begin with?

Could you share your dada2 stats visualization, please?

That really depends on what you are trying to do with the data: what is the point of the study, and what questions are you trying to answer? But in all likelihood you probably don’t have enough reads to carry out meaningful comparisons. Certainly not from a statistical point of view.

Get more reads/sample!

Thanks for the quick reply!
Actually, I received the sequences (multiplexed, paired-end: R1, R2, and barcode fastq files) from the Argonne lab, with many samples. But I have to focus on only a few samples (8).
So I started with qiime2 (EMP protocol) to get demultiplexed, per-sample fastq files. Then I separated out the 8 targeted samples, where I see very few reads per sample. Then, using the Casava 1.8 format, I imported those 8 samples into a demux.qza file. Then I ran dada2 and got the following results:

Furthermore, I see the quality plots are moderately good at both the 5′ and 3′ ends (>25-30). I checked with no trimming and also with a little trimming at the beginning.
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs demux-paired-end.qza \
      --p-trim-left-f 6 \
      --p-trim-left-r 6 \
      --p-trunc-len-f 150 \
      --p-trunc-len-r 150 \
      --o-table table.qza \
      --o-representative-sequences rep-seqs.qza \
      --o-denoising-stats denoising-stats.qza

Hi @dasqiime22,

Unfortunately, these samples have extremely low sequencing depth, and there isn’t really anything meaningful you can do, even if we were to fine-tune DADA2 parameters to retain more reads. Unless these are all negative control samples where you are not expecting any diversity, but I don’t think that’s the case.
I wonder if something is happening at the demultiplexing step that makes you lose so many reads.
You might want to first check your raw .fastq files to see how many reads you are actually working with (count the number of lines in your raw read files and divide by 4). Compare that number to the total after you demultiplex; they should be pretty close to each other. If they are not, then something must have gone wrong in the demultiplexing step. If they are the same and nothing has gone wrong, you may want to contact your sequencing facility about the run, since these are abnormally low numbers.
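For example, a quick way to do that count on the command line (a minimal sketch; the demo file name is just a placeholder, and each fastq record is assumed to be the standard 4 lines):

```shell
# Demo: create a tiny gzipped fastq with 2 reads (4 lines per read).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' | gzip > demo_R1.fastq.gz

# Count reads: total line count divided by 4.
reads=$(( $(zcat demo_R1.fastq.gz | wc -l) / 4 ))
echo "reads: $reads"   # prints "reads: 2"
```

On macOS, `zcat` may not accept `.gz` files directly; use `gunzip -c` instead. Run the same count on your raw R1 file and on the demultiplexed per-sample files, and compare the totals.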


Maybe the demultiplexing is the problem.
The fastq files are extremely big (>2 GB), so I am not able to open them. But in total I am getting only 45K reads altogether for all samples, which seems very low to me.

  1. I received three .fastq files: R1, R2, and barcodes.
  2. I converted the files to .fastq.gz (as EMP import requires fastq.gz), for example:
    gzip Undetermined_S0_L001_I1_001.fastq
  3. Importing:
    qiime tools import \
      --type EMPPairedEndSequences \
      --input-path emp-paired-end-sequences \
      --output-path emp-paired-end-sequences.qza
  4. For demultiplexing with EMP I used:
    qiime demux emp-paired \
      --m-barcodes-file Sample-metadata2.tsv \
      --m-barcodes-column BarcodeSequence \
      --p-rev-comp-mapping-barcodes \
      --i-seqs emp-paired-end-sequences.qza \
      --o-per-sample-sequences demux.qza \
      --o-error-correction-details demux-details.qza

I also checked with --p-rev-comp-barcodes, which gave a plugin error, so I settled on --p-rev-comp-mapping-barcodes.

Hi @dasqiime22,
45K reads across 8 samples is certainly much more normal and totally workable. I think the hunch that you are losing your reads at the demultiplexing step might be right after all. Some things to check: are you sure your data is actually in EMP format? This format is very specific, and we’ve seen users run into similar demultiplexing issues when their data was in fact not in EMP format.
You have paired-end data, but do you have barcodes only on your forward reads, or on both F and R? If on both, then EMP demux is not what you want; you’ll want to use qiime cutadapt demux-paired to demultiplex your reads instead.

If you indeed have EMP format and can use emp-paired, then the next question is: are your barcodes actually Golay? As of the newest 2019.4 qiime2 release, the default setting for demultiplexing EMP format is to use a Golay error-correction method. If this is not the format of your barcodes, then you want to turn Golay correction off by adding --p-no-golay-error-correction to your command above.
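Concretely, the demultiplexing command from the earlier post would then look something like this (a sketch only; file names follow the earlier posts, and the only change is the added flag):

```shell
qiime demux emp-paired \
  --m-barcodes-file Sample-metadata2.tsv \
  --m-barcodes-column BarcodeSequence \
  --p-rev-comp-mapping-barcodes \
  --p-no-golay-error-correction \
  --i-seqs emp-paired-end-sequences.qza \
  --o-per-sample-sequences demux.qza \
  --o-error-correction-details demux-details.qza
```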

Try these and hopefully you’ll have a more successful demultiplexing. If you are unsure about your set up you can always ask your sequencing facility or whoever designed your amplicon libraries.

Many many thanks!

No. I got 45K for all samples, and 3.2K for my 8 samples, which is not sufficient.

My data is multiplexed, paired-end, and the barcodes are in a separate file, three files altogether (R1, R2, I1), so I think I have to use the EMP paired-end protocol. Is that right?

Today I also tried EMP with --p-no-golay-error-correction and got bad results.

For cutadapt paired-end, are the barcodes in a separate file or within the sequences? Can you give a better link for a cutadapt paired-end analysis?

Thanks

Hi @dasqiime22,

Oops, sorry, I thought you meant 45k across your 8 samples. If you have indeed confirmed ~3.2k reads for your 8 samples before demultiplexing, then it would seem your original demultiplexing was correct, because the input column of your dada2 stats totals to about that much too. Unfortunately, I just can’t see any way that these samples are usable, with the possible exception of MAY-2017-5015, but even that is a stretch, and of course without other samples there isn’t much to do statistics-wise.

Thanks 100000k!
My problem is probably solved. I see in the original files I have around 2 crore (~20 million) sequences across all samples.
Just now I executed EMP with --p-no-golay-error-correction, without --p-rev-comp-barcodes or --p-rev-comp-mapping-barcodes, and I got a total of 994,891 reads for my 8 samples.
I hope this is correct; most of the sequences (80%) pass dada2.

Thanks a lot.


A post was split to a new topic: How to pick a sampling depth for alpha & beta diversity.

Another thing you can try, to increase the number of ASVs detected in low-read/highly multiplexed samples, is to process the full dataset (or at least more of the samples) through dada2 before selecting only the 4 samples you care about.

The dada2 algorithm uses lower-abundance sequences to distinguish errors from correct sequences, so it is not able to process singleton sequences, and they are all simply removed. If you pool your samples when you run dada2, dataset singletons instead of per-sample singletons will be removed, so you will be applying less stringent abundance filtering and may retain more data. I don’t see this as an option from qiime (I may be corrected?), but if you run dada2 in R, the tutorial is also very easy to follow, and the option to pool samples is in the error-correction command: dada(... , pool = TRUE).

p.s. Another reason to bother processing the full dataset and then selecting your samples of interest is improved chimera detection.

p.p.s. The con is obviously computational time. If pooling fills up your computer’s RAM, remember that everything loaded into RStudio lives in RAM, so delete variables you don’t need anymore. Also, you could try a subset larger than 4 but smaller than all of the samples, or pool = "pseudo".

Thanks for the additions @rrohwer!
Just a couple of complementary comments to these:
You are totally correct that the pooling option can be very helpful in retaining rare features, and in fact this is needed for at least one other plugin (q2-breakaway), as it depends on those singletons/rare features for model building.
As mentioned, the pool option is not currently available in q2-dada2, but I know that it is certainly on the developers’ radar and will be added in the next release or two. If I remember correctly, pool = "pseudo" is meant to become the new default for q2-dada2, with the chimera-detection method also matching the pooling method by default.
As for processing time, if R is run on Windows alone, dada2 cannot use multiple cores, so it will be even more process-intensive. However, Linux/Unix-based machines (or a VM) can use multiple cores.

However, all these considerations will not matter in this particular case, since the # of reads in these samples is extremely low even before dada2: no matter how well we fine-tune dada2, we can never recover more than the ~3.2K input reads, which I argue is too low for meaningful analysis.

A post was split to a new topic: How to trim/truncate reads based on quality plots

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.