I looked into table.qzv to determine --p-sampling-depth for the qiime diversity core-metrics-phylogenetic analysis.
I found that the sequence counts per sample are very low.
I have 70 samples, but the total count is 3,839 and some samples have counts lower than 10.
Is it because my samples are low quality?
Is it okay to continue further analyses with these files?
I am attaching the table.qzv file. table.qzv (322.8 KB)
That's really low. Unless you have very low sequence counts to start with, chances are you are losing too many sequences due to sub-optimal parameter settings in your denoising protocol.
How many starting sequences do you have? When you run dada2, use the --verbose flag so that filtering summaries are reported (then you will know where the sequences are being dropped).
What marker gene/primers are you using? What is the total length of the amplicon? Since you are using paired reads, you want to make especially sure that the trimmed reads still overlap with at least 20 nt, or you will lose reads that fail to join.
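As a quick sanity check, the expected overlap is roughly the usable forward length plus the usable reverse length minus the amplicon length. The numbers below are a hypothetical worked example (the ~250 nt amplicon is an assumption, not your data):

```
# Hypothetical worked example -- these numbers are assumptions, not your data
TRUNC_F=220; TRIM_F=10   # --p-trunc-len-f / --p-trim-left-f
TRUNC_R=220; TRIM_R=10   # --p-trunc-len-r / --p-trim-left-r
AMPLICON=250             # assumed amplicon length after primer removal
# usable read length = trunc-len minus trim-left; overlap = usable F + usable R - amplicon
echo "expected overlap: $(( (TRUNC_F - TRIM_F) + (TRUNC_R - TRIM_R) - AMPLICON )) nt"
# prints "expected overlap: 170 nt"; below ~20 nt, read pairs will fail to join
```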
What does the quality of the demultiplexed sequences look like? Please share your demux quality QZV if you want us to take a look.
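(In case you still need to generate it, that visualization comes from qiime demux summarize; the input artifact name below is an assumption:)

```
# the input filename is an assumption; point --i-data at your imported reads
qiime demux summarize \
  --i-data demux-paired-end.qza \
  --o-visualization demux.qzv
```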
Chances are sequences are being dropped because:
1. You are trimming too much, so sequences fail to overlap and successfully join.
2. You are trimming too little, and error-filled sequences (probably the reverse reads) are being dropped because they are too noisy.
3. Both. If reads are noisy, you want to trim appropriately, but if so much is trimmed that you cannot join these sequences, you are stuck. This sad state of affairs has happened to me many times; in that case the best course of action is to discard the reverse reads (or whichever are noisier) and analyze the less-noisy reads as single-end data.
Definitely not; your most highly sequenced sample still contains < 1000 sequences, which is just way too low. But the good news is that this is probably a problem with your workflow/parameters, and more sequences should be recoverable if you check out the steps I've outlined above.
> How many starting sequences do you have? When you run dada2, use the --verbose flag so that filtering summaries are reported (then you will know where the sequences are being dropped).
I am attaching the dada2 summaries.
After removing chimeras, the number of merged sequences seems very low, but somebody said it's fine. (Qiime dada2 denoise-paired denoise)
Do you think I should rerun dada2?
```
(qiime2-2018.2) Jeongsus-MacBook-Pro:SM_raw Jeongsu$ qiime dada2 denoise-paired --verbose --i-demultiplexed-seqs demux-paired-end.qza --p-trunc-len-f 220 --p-trunc-len-r 220 --p-trim-left-f 10 --p-trim-left-r 10 --o-representative-sequences rep-seqs-dada2.qza --o-table table-dada2.qza
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Write output
Saved FeatureTable[Frequency] to: table-dada2.qza
Saved FeatureData[Sequence] to: rep-seqs-dada2.qza
```
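For reference, table.qzv was generated from this output along these lines (the metadata filename is a placeholder):

```
# sample-metadata.tsv is a placeholder filename
qiime feature-table summarize \
  --i-table table-dada2.qza \
  --m-sample-metadata-file sample-metadata.tsv \
  --o-visualization table.qzv
```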
> What marker gene/primers are you using? What is the total length of the amplicon? Since you are using paired reads, you want to make especially sure that the trimmed reads still overlap with at least 20 nt, or you will lose reads that fail to join.
I am working with demultiplexed paired-end fastq files, so I don't think the two paired reads overlap. Is this right? If so, could that be why I am losing too many sequences while running dada2?
> What does the quality of the demultiplexed sequences look like? Please share your demux quality QZV if you want us to take a look.
Here's the demux.qzv file. demux.qzv (289.5 KB)
It is expected that dada2 will filter out a large number of sequences at each of these steps, but this is only okay up to a point. In my experience, dada2 will usually throw out 25-75% of the reads in a given dataset. If dada2 starts filtering out more than 90% of the input reads, there is either an issue with the input data (low quality) or with the processing steps. You are losing way too many reads at the merge stage, and this is very diagnostic.
This indicates that your reads are not long enough to join. Your read quality profiles look very good, and truncating each read at 220 nt seems reasonable (quality is good enough that you could probably go longer on both).
At this point I would usually be asking how long your amplicons actually are, to determine whether 220 + 220 is enough to provide sufficient overlap. But I do not think that's the issue. (Still, please answer the question; it will be good to know whether this is a futile exercise because your amplicons are too long.)
It looks like the issue is that your reads are not all full-length. Many of them appear to be truncated to as little as 38 nt. What is going on there!? Have these reads been pre-filtered and trimmed prior to importing to QIIME2? This is going to cause problems with read joining; I would highly recommend giving dada2 the rawest form of the sequences possible.
You should check out the raw reads: determine the read-length frequencies if you can, and figure out why so many truncated reads exist in there.
Then your possible options are:
1. Use the rawest reads available and follow the same steps that you already have. The untrimmed reads may provide dada2 a little more fodder for joining reads, even if they are riddled with errors.
2. Use only the forward reads for your analysis. The fact that reads are not overlapping will not be an issue here, so you should still retain your reads (though note that you will lose reads that are shorter than trunc-len).
3. Use a shorter trunc-len so that shorter reads are not dropped, or set trunc-len to 0 so that no reads are truncated or dropped at this step. Skipping truncation risks losing some reads that have errors at the 3' end (probably a low risk; your quality profiles look really good), but it means shorter reads are not dropped at the truncation step and may be retained until the joining stage (after dada2 has done its quality filtering). Even if these reads are shorter than 220 nt, they may still be long enough to join with their pairs. A sketch of this rerun follows this list.
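For option 3, the rerun could look something like this (a sketch only; filenames and trim settings are carried over from the command you posted above):

```
# setting trunc-len to 0 disables truncation, so shorter reads are not dropped here
qiime dada2 denoise-paired --verbose \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 10 --p-trim-left-r 10 \
  --p-trunc-len-f 0 --p-trunc-len-r 0 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza
```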
> Use the rawest reads available and follow the same steps that you already have. The untrimmed reads may provide dada2 a little more fodder for joining reads, even if they are riddled with errors.
The input files for dada2 were raw fastq files, which were demultiplexed paired-end reads. No other trimming/filtering was done before dada2. Do you think those raw fastq files are the issue here? In that case, what options do I have?
> Use only the forward reads for your analysis. The fact that reads are not overlapping will not be an issue here, so you should still retain your reads (though note that you will lose reads that are shorter than trunc-len).
Do you mean I should run dada2 with the forward reads and reverse reads separately and merge them after dada2?
These trimmed reads might be the issue, but maybe not (see below). Since you did not do this trimming, I wonder if your sequencing center/service did. Who performed the sequencing? You should contact them to learn more about if/what quality control may already have been done, since these unexplained short reads seem to be preventing reads from overlapping enough to join.
No. That is effectively what dada2 already does (denoise each read separately, then join). Instead, I propose using only the forward reads (discarding the reverse reads) and running those through dada2 to see how many reads you have.
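A sketch of what I mean (if I recall correctly, qiime dada2 denoise-single will accept a paired-end artifact and use only the forward reads, but double-check on your version; the output names here are hypothetical, and the trim/trunc values are carried over from your paired run):

```
# output names are hypothetical; denoise-single reads only the forward sequences
qiime dada2 denoise-single --verbose \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left 10 \
  --p-trunc-len 220 \
  --o-representative-sequences rep-seqs-single.qza \
  --o-table table-single.qza
```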
The best first course of action may be to determine the distribution of read lengths in your data. You can do this with something a bit more formal like vsearch --fastq_stats if you already have a preferred way; otherwise, try the quick one-liner below (inserting the path to your actual raw fastq).
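Something along these lines should work (a sketch; it assumes gzipped fastq, so swap zcat for cat if your files are uncompressed):

```
# tally how many reads there are of each length; the path is a placeholder
zcat /path/to/your_raw_reads.fastq.gz |
  awk 'NR % 4 == 2 { counts[length($0)]++ } END { for (len in counts) print len, counts[len] }' |
  sort -n
```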
This will tell us how many reads should actually be long enough to merge.
Though this gets me back to an earlier question: what marker gene/primers are you using, and what is the expected amplicon length? It occurs to me that my thoughts on raw read length may be misguided, as those short sequences are probably dropped earlier by dada2 (probably at the "filtered" stage, not at the "merged" stage), which may indicate that 220 + 220 is not quite long enough, or that the paired-read alignments are failing for some reason. (Sorry, I just don't know enough about where dada2 actually does this length exclusion step to be sure about this.) Are you sequencing ITS by any chance?