P sampling depth

Hello
I hope you are all doing well.

There is a sequence count column next to the sample ID in table.qzv and demux.qzv (attached). Is it the total number of reads in R1 and R2 together for each sample?

I am not sure about the p-sampling-depth for:
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth ? \
  --m-metadata-file sample-metadata.tsv \
  --output-dir core-metrics-results
I have 51 samples (16S sequencing), and 5 of them have sequence counts below 1000. I have been advised to discard samples below 1000. May I know why, and what would happen if all samples were kept regardless of their sequence counts?
For my samples the lowest sequence count is 80. Should I choose 80 as the sampling depth? If I need to discard samples below 1000, 5 samples will be removed from the analysis.

Thank you for your help. I really appreciate it.


Hi @Fatemah,
The choice of sampling depth is based on a variety of factors, many of which are discussed in the literature as well as on the forum. Look through this thread for several links.
I would say 1000 is a minimum starting point. You should certainly not keep those low-count samples, as the outcomes of both alpha and beta diversity analyses will be extremely biased in those samples, giving you rather unusable results. Unfortunate, yes, but necessary in my opinion. The overall sequencing depth across your samples seems rather low; I wonder if it's worth backtracking a bit to see why you have such low coverage.
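If it helps, qiime diversity alpha-rarefaction is a common way to explore this trade-off, since it shows how diversity estimates change with sampling depth. A sketch, reusing the file names from your command above (--p-max-depth 10000 is just a placeholder you would set near your largest sample counts):

qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --i-phylogeny rooted-tree.qza \
  --p-max-depth 10000 \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization alpha-rarefaction.qzv

The depth at which the curves level off is a reasonable lower bound for --p-sampling-depth.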


Dear Mehrbod
I used:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 219 \
  --p-trunc-len-r 200 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

  1. Someone advised me to discard the first 20 bases, which might be the primers. Is that good practice?
    (However, the first 20 bases look fine on the attached graph.)

  2. I heard that bases with a mean quality score below 20 (marked with a red line on the attached graph) need to be trimmed off. Do you think so too?

  3. That is why I chose 219 for the forward and 200 for the reverse reads. I think these truncation values can affect the coverage after DADA2, which is low in my case. Am I correct? How many bases can we discard from each end of the forward and reverse reads without affecting the analysis too much?

  4. As my last question: do the sequence counts in table.qzv and demux.qzv refer to all the forward and reverse reads together for a particular sample? How are they defined?

Do I need to rerun DADA2? Which values do you think would work better and keep the majority of samples above the sampling depth?

I have attached all the necessary files. Could you please have a look?
Thank you very much for your help.

AfterDenoise.csv (2.2 KB)
demux_qzv.csv (1.0 KB)

Hi @Fatemah

You would only need to do this if you have non-biological sequences at the 5' end of your reads. You can check with your sequencing facility to see whether they have already removed these for you.
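If it turns out the primers are still in your reads, q2-cutadapt can remove them by matching the sequence rather than trimming a fixed number of bases. A sketch, where FORWARD_PRIMER and REVERSE_PRIMER are placeholders for your actual primer sequences:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f FORWARD_PRIMER \
  --p-front-r REVERSE_PRIMER \
  --o-trimmed-sequences trimmed-demux.qza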

These are just guidelines users have come up with, and they are flexible depending on your data. The recommendation I see most often is to avoid median scores below 20; the median is the little black line in each bar.

These values certainly can affect the DADA2 results. For example, if there is insufficient overlap (i.e. less than 20 bp) between your forward and reverse reads, they will fail to merge and will be dropped. Or, if you try to include sequences with too many low quality scores, they will also be dropped. This issue is covered extensively on the forum; do some searching and have a field day :nerd_face: The odd thing here is that your quality plots actually look great, and the demux summary indicates that almost all your reads are indeed about 220 nt long. Merging doesn't seem to be an issue either, so I assume you are targeting a rather short region? You are losing most of your reads right at the filtering step of DADA2, so we need to figure out why that is.
Can you provide some additional info? What primer sets are you using, what region do they target, and what sequencing technology are these from? What are these samples of? Could you share a sample of the reads with us? You can use something like 'head filename.fastq' on the original reads. And what was done with these fastq files prior to being imported into QIIME 2?
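For example (a sketch; the file name is a placeholder, and use zcat if your files are gzipped):

head -n 8 sample1_R1.fastq               # first two records: header, sequence, '+', quality
zcat sample1_R1.fastq.gz | head -n 8     # same, for gzipped files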

In demux.qzv your reads are still un-merged, but the number refers to the number of read pairs. That is, 10 forward reads and 10 reverse reads for sample 1 would show as 10 reads in your demux.qzv.
After DADA2 your forward and reverse reads are merged, so the numbers you see refer to the number of merged sequences in that sample.

Keep us posted.

Dear Mehrbod

Infected breast implant samples were sequenced on the MiSeq platform (2x221) targeting V4 with 515F (5'-GTGCCAGCMGCCGCGGTAA) and 806R (5'-GGACTACHVGGGTWTCTAAT). No processing was done on these files prior to QIIME 2.
There are R1 and R2 files for each sample. I just imported the data into QIIME 2 using the manifest format and followed the Moving Pictures tutorial. As the barcodes are in the first line, separated by a plus sign after the identifier, they have probably already been removed.
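The import command was something like this (from memory; the exact --input-format depends on the manifest layout and PHRED offset):

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.csv \
  --input-format PairedEndFastqManifestPhred33 \
  --output-path demux.qza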

Thank you


Thanks for the update @Fatemah,
To be honest I’m not too sure why you’re seeing such a drastic drop in filtering in some samples while others behave as expected. A bit of a mystery…
One thing I noticed was that the barcodes that have been removed and put in the header lines have an 'N' in them. This is an ambiguous call from the sequencer when it can't assign a nucleotide. I have personally never seen this in barcodes, though. Do your barcode primers actually have N bases in them, or is this an artifact of the demultiplexing by your sequencer?
I also noticed that one of your actual sequences starts with an 'N'. Though I'm not 100 percent sure, I would think DADA2 automatically discards any reads with Ns in them. You may want to look through some of those problematic samples to see if they happen to have Ns in them. If they are only at the beginning of your reads, then the 'trim' parameter would be sufficient to get rid of them; otherwise they'll probably get discarded.
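Something along these lines on the raw reads would tell you how many contain an N anywhere (a sketch; the file name is a placeholder and assumes gzipped fastq):

zcat sample1_R1.fastq.gz | awk 'NR % 4 == 2' | grep -c N   # sequence lines only, count those with an N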
Another thing you could try is to truncate more from your reads, especially the reverse reads. Even though your quality plots look great, it could be that some samples have poor tails that are causing dada2 to filter them out. Since you are using the V4 region you can truncate quite a bit without worrying about insufficient overlap.
Give those a try and we'll see where it goes!

Dear Mehrbod

The barcodes have already been removed and they do not appear in the second line, which is what we need (the pure reads), so they shouldn't be problematic. Even if there are some Ns in the reads, I don't think DADA2 will discard them, as we tell DADA2 which bases to trim; I think that is what DADA2 does. Correct me if I am wrong. How much loss is expected after filtering in DADA2?
Based on the previous data, which values do you think I can use for:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f ? \
  --p-trim-left-r ? \
  --p-trunc-len-f ? \
  --p-trunc-len-r ? \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

and, accordingly, which sampling depth should I use to increase the read coverage after filtering out bad nucleotides?

I will rerun QIIME 2 based on your suggestions to see the difference. I have tried different values, but many samples still end up below 1000 sequence reads and would need to be discarded at the sampling step.

Thank you for your help.

Hi @Fatemah,

This is assuming that the Ns fall within the x nucleotides you set with your 'trim' parameter. If Ns occur in the middle of your reads, for instance, DADA2 may still discard the whole read.

From the DADA2 tutorial page:

We’ll use standard filtering parameters: maxN=0 (DADA2 requires no Ns)

This is why I was wondering whether some of those problematic samples have Ns further along in their reads for some reason. It would be worth checking. N values are just nucleotides that did not get a base call; they sometimes happen in the first few thousand calls of an Illumina run, which corresponds with the low quality of the first few bp on the 5' end in some runs. I imagine DADA2 has no idea how to model those, so it just discards the whole read.

That really depends on your sample type and your sequencing steps. With low-concentration samples you are more likely to have primer dimers, non-target amplification, etc., but in almost all cases we still generally see at least 50% of reads passing the filter, and often much higher, in the 60-90% range. So while some of your samples fall within this range, there are certainly some that raise alarms. Again, it is totally dependent on the community being sequenced.

I would say let's try a rather conservative run. Still trim 20 off both your forward and reverse reads (in the event there are a bunch of Ns), or more if you see Ns appearing later in those problematic reads.
And let's truncate at 170 on the forward and 190 on the reverse.
With your 2 x 221 = 442 bp of total read length for a paired-end run and your expected 806 - 515 = ~291 bp target size, you should have about 150 bp of overlap. So we want to make sure we don't truncate more than 150 - 30 bp (reserved for merging) = ~120 bp in total.
In the scenario above we truncate 51 bp from the forward (221 - 170) and 31 bp from the reverse (221 - 190), 82 bp in total, so we should still have plenty of overlap while tossing away some potentially poor tails.
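Put together, that conservative run would look something like this (a sketch, keeping the same file names as your earlier runs):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 170 \
  --p-trunc-len-r 190 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza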

The sampling depth really depends on the outcome of your run. In the very first run you showed, I would set it to 1323, meaning you discard all samples below that minimum coverage. Depending on the expected diversity of the community, that may or may not be sufficient: for example, in low-diversity samples such as wine that may be plenty of coverage, whereas in ocean samples it is unlikely to be enough.
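In that case, the command from your first post would become (a sketch, using the 1323 figure from that first run):

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 1323 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir core-metrics-results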
One other thing you can try is running your forward and reverse reads separately, truncating plenty (say at position 150-160), to figure out which of your read sets contains the problematic reads, because by default, if one direction is discarded, its pair is tossed too.
Try these and let's see where we go from here!


Dear Mehrbod

This part was a bit confusing to me, I mean the numbers and what corresponds to what. Could you please clarify?

" With your 221x2=442 total length for paired-ends run and your expected 806-515=291bp target size, you should have about 150 bp of overlap. So we want to make sure we don’t truncate a more than 150- 30bp (for merging)= 120 bps , In the scenario above we truncated 51 + 30 = 81, so we should still have plenty of ovlerlap and we are tossing away possibly some poor tails."

Also, you mentioned that the other option is to run the forward and reverse reads separately and truncate them, which I have not tried before. May I know how, and with which command? I always follow the Moving Pictures tutorial by default:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f x \
  --p-trim-left-r x \
  --p-trunc-len-f x \
  --p-trunc-len-r x \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

Thank you for your help. I really appreciate it.

Hi @Fatemah
So my calculation was as follows:
Your total read length is: 221 bp (your read length) x 2 (F + R) = 442 bp
Estimated amplicon size = 806 (position of the R primer) - 515 (position of the F primer) = ~291 bp
We say estimated here because there is biological variability in amplicon length.
So… your total read length of 442 minus the estimated amplicon size of ~291 = ~151 bp. This is roughly how much overlap you have. We need a minimum of 20 bp of overlap, plus some wiggle room for biological variation, say another 10 bp.
151 bp of total overlap - 30 bp (minimum overlap needed) = ~121 bp <- this is the maximum total we should truncate between the forward and reverse reads to make sure DADA2 can still merge properly.

To run just the forward reads (and ignore the reverse), you simply use qiime dada2 denoise-single instead of denoise-paired. See here.
If for some reason you wanted to use only the reverse reads, you would have to re-import your reads via a manifest file and either import just the reverse reads, or switch the forward and reverse file names in your manifest before importing.
I only mentioned using the reverse reads alone as a troubleshooting step. If you were to use just one direction, stick with the forward reads, since they are in better shape.
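A sketch of the forward-only troubleshooting run (note the single --p-trim-left/--p-trunc-len parameters, since there is only one read direction; the output names here are placeholders):

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 20 \
  --p-trunc-len 150 \
  --o-representative-sequences rep-seqs-single.qza \
  --o-table table-single.qza \
  --o-denoising-stats stats-single.qza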

Keep us posted!

Dear Mehrbod

Using the new values:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 170 \
  --p-trunc-len-r 190 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

we increased the sample coverage. If you look at the new sequence counts, there are no more samples below 1000 and all are above 2000. I think 2,635 (the lowest count, for sample 1613LC_CC) would be a good sampling depth; in this case we won't lose any samples. Do you agree?

We also removed the first 20 bases from each read, which looked fine on the graph. I guess that is just to avoid Ns at the beginning of each read? So, if a quick look at the head of each read shows no Ns, do you think we still need to remove the first 20 bases, or can we set the trim to 0?

It is very interesting that this time, despite removing more bases (truncating at position 170 in the forward reads and 190 in the reverse), we obtained more coverage, whereas last time we truncated at 219 in the forward and 200 in the reverse and ended up with many samples below 1000 counts.

If you remember, you said we usually avoid quality scores below 20, judged by the median, the small black line inside each bar (which I have marked in red). Is that what you mean?
But why did we truncate at 170 in the forward and 190 in the reverse this time? I need to know which values to choose at this stage. Should it be based on the median, or on the lower or upper quartile of the box plot? If so, why did you choose 170 and 190, where the medians seem quite high on the graph?

I really appreciate your help
Thank you