Low sequence counts

I looked into the file table.qzv to determine --p-sampling-depth for the qiime diversity core-metrics-phylogenetic analysis.
I found that the sequence counts per sample are very low.
I have 70 samples, but the total count is only 3,839, and some samples have counts lower than 10.
Is it because my samples have low quality?
Is it okay to continue further analyses with these files?

I am attaching table.qzv file.
table.qzv (322.8 KB)

That’s really low. Unless you started with very low sequence counts, chances are you are losing too many sequences due to sub-optimal parameter settings in your denoising protocol.

  1. How many starting sequences do you have? When you run dada2, use the --verbose flag so that filtering summaries are reported (then you will know where the sequences are being dropped)
  2. What marker gene/primers are you using? What is the total length of the amplicon? Since you are using paired reads, you want to make especially sure that the trimmed reads still overlap with at least 20 nt, or you will lose reads that fail to join.
  3. What does the quality of the demultiplexed sequences look like? Please share your demux quality QZV if you want us to take a look.

Chances are sequences are being dropped because:

  1. you are trimming too much, so sequences fail to overlap and join successfully.
  2. you are trimming too little, and noisy, error-filled sequences (probably the reverse reads) are being dropped because they are too noisy.
  3. both. If reads are noisy, you want to trim appropriately, but if so much is trimmed that the pairs can no longer join, you are stuck. This sad state of affairs has happened to me many times; in that case the best course of action is to discard the reverse reads (or whichever set is noisier) and analyze the less noisy reads as single-end data.
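To make the overlap arithmetic concrete, here is a minimal sketch. The amplicon length used below is a hypothetical placeholder (use your own primer-to-primer length); DADA2's own minimum overlap is lower, but ~20 nt leaves a safety margin:

```python
# Sketch: check whether truncated paired reads can still overlap and join.
def merge_overlap(amplicon_len, trunc_len_f, trunc_len_r):
    """Bases of overlap between forward and reverse reads after truncation.

    Note: trim-left does not reduce the overlap, since it removes bases at
    the 5' ends, away from the region where the two reads meet.
    """
    return trunc_len_f + trunc_len_r - amplicon_len

# Hypothetical ~460 nt amplicon with --p-trunc-len-f/r 220:
print(merge_overlap(460, 220, 220))  # -20: reads cannot join at all
print(merge_overlap(400, 220, 220))  # 40: comfortably above a 20 nt margin
```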

Definitely not: your most highly sequenced sample still contains fewer than 1,000 sequences, which is just way too low. But the good news is that this is probably a problem with your workflow/parameters, and more sequences should be recoverable if you work through the steps I’ve outlined above.

Good luck!

> How many starting sequences do you have? When you run dada2, use the --verbose flag so that filtering summaries are reported (then you will know where the sequences are being dropped)
I am attaching dada2 summaries.
After removing chimeras, the merged sequence counts seem very low, but somebody said it's fine. (qiime dada2 denoise-paired)
Do you think I should rerun dada2?

(qiime2-2018.2) Jeongsus-MacBook-Pro:SM_raw Jeongsu$ qiime dada2 denoise-paired --verbose --i-demultiplexed-seqs demux-paired-end.qza --p-trunc-len-f 220 --p-trunc-len-r 220 --p-trim-left-f 10 --p-trim-left-r 10 --o-representative-sequences rep-seqs-dada2.qza --o-table table-dada2.qza
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /var/folders/vh/l4ckn9311hx8yrvvbfpnh6zh0000gn/T/tmp9isl57vm/forward /var/folders/vh/l4ckn9311hx8yrvvbfpnh6zh0000gn/T/tmp9isl57vm/reverse /var/folders/vh/l4ckn9311hx8yrvvbfpnh6zh0000gn/T/tmp9isl57vm/output.tsv.biom /var/folders/vh/l4ckn9311hx8yrvvbfpnh6zh0000gn/T/tmp9isl57vm/filt_f /var/folders/vh/l4ckn9311hx8yrvvbfpnh6zh0000gn/T/tmp9isl57vm/filt_r 220 220 10 10 2.0 2 consensus 1.0 1 1000000

R version 3.4.1 (2017-06-30)
Loading required package: Rcpp
DADA2 R package version: 1.6.0

  1. Filtering ......................................................................

  2. Learning Error Rates
    2a) Forward Reads
    Initializing error rates to maximum possible estimate.
    Sample 1 - 170760 reads in 40057 unique sequences.
    Sample 2 - 72976 reads in 14057 unique sequences.
    Sample 3 - 80375 reads in 17519 unique sequences.
    Sample 4 - 91867 reads in 19942 unique sequences.
    Sample 5 - 36676 reads in 7725 unique sequences.
    Sample 6 - 40860 reads in 6791 unique sequences.
    Sample 7 - 74639 reads in 16195 unique sequences.
    Sample 8 - 51861 reads in 8882 unique sequences.
    Sample 9 - 57126 reads in 12585 unique sequences.
    Sample 10 - 59322 reads in 13621 unique sequences.
    Sample 11 - 59887 reads in 12912 unique sequences.
    Sample 12 - 53675 reads in 15168 unique sequences.
    Sample 13 - 67513 reads in 14291 unique sequences.
    Sample 14 - 46384 reads in 7620 unique sequences.
    Sample 15 - 91381 reads in 15922 unique sequences.
    selfConsist step 2
    selfConsist step 3
    selfConsist step 4
    selfConsist step 5
    selfConsist step 6
    selfConsist step 7
    Convergence after 7 rounds.
    2b) Reverse Reads
    Initializing error rates to maximum possible estimate.
    Sample 1 - 170760 reads in 41084 unique sequences.
    Sample 2 - 72976 reads in 17249 unique sequences.
    Sample 3 - 80375 reads in 20455 unique sequences.
    Sample 4 - 91867 reads in 22107 unique sequences.
    Sample 5 - 36676 reads in 7559 unique sequences.
    Sample 6 - 40860 reads in 7705 unique sequences.
    Sample 7 - 74639 reads in 17660 unique sequences.
    Sample 8 - 51861 reads in 9785 unique sequences.
    Sample 9 - 57126 reads in 12761 unique sequences.
    Sample 10 - 59322 reads in 14496 unique sequences.
    Sample 11 - 59887 reads in 14477 unique sequences.
    Sample 12 - 53675 reads in 14563 unique sequences.
    Sample 13 - 67513 reads in 15693 unique sequences.
    Sample 14 - 46384 reads in 8816 unique sequences.
    Sample 15 - 91381 reads in 17443 unique sequences.
    selfConsist step 2
    selfConsist step 3
    selfConsist step 4
    selfConsist step 5
    selfConsist step 6
    Convergence after 6 rounds.

  3. Denoise remaining samples .......................................................
    The sequences being tabled vary in length.

  4. Remove chimeras (method = consensus)
    sample                       input   filtered denoised merged non-chimeric
    A1_S68_L001_R1_001.fastq.gz  199077  170760   170760   128    122
    A10_S3_L001_R1_001.fastq.gz  80574   72976    72976    40     40
    A11_S1_L001_R1_001.fastq.gz  92433   80375    80375    42     42
    A12_S72_L001_R1_001.fastq.gz 103531  91867    91867    69     69
    A14_S75_L001_R1_001.fastq.gz 41777   36676    36676    33     33
    A15_S61_L001_R1_001.fastq.gz 46939   40860    40860    49     49

  5. Write output
    Saved FeatureTable[Frequency] to: table-dada2.qza
    Saved FeatureData[Sequence] to: rep-seqs-dada2.qza
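To quantify the loss in the summary above, here is a quick retention calculation (the counts are copied from the log; the script itself is just illustrative):

```python
# Fraction of input reads surviving each stage, per sample, using the
# counts from the DADA2 summary above: (input, denoised, merged, non-chimeric).
stats = {
    "A1":  (199077, 170760, 128, 122),
    "A10": (80574, 72976, 40, 40),
    "A11": (92433, 80375, 42, 42),
    "A12": (103531, 91867, 69, 69),
    "A14": (41777, 36676, 33, 33),
    "A15": (46939, 40860, 49, 49),
}
for name, (inp, denoised, merged, nonchim) in stats.items():
    print(f"{name}: {100 * nonchim / inp:.2f}% retained overall; "
          f"merge step kept {100 * merged / denoised:.2f}% of denoised reads")
```

Nearly all of the loss happens at the merge step, e.g. sample A1 keeps only 128 of its 170,760 denoised read pairs.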

> What marker gene/primers are you using? What is the total length of the amplicon? Since you are using paired reads, you want to make especially sure that the trimmed reads still overlap with at least 20 nt, or you will lose reads that fail to join.
I am working with demultiplexed paired-end fastq files, so I don't think there are overlapping reads between the two paired sequences. Is this right? If so, could this be the reason I lose too many sequences while running dada2?

> What does the quality of the demultiplexed sequences look like? Please share your demux quality QZV if you want us to take a look.
Here's the demux.qzv file.
demux.qzv (289.5 KB)

It is expected that dada2 will filter out a large number of sequences at each of these steps — but this is only okay to a certain point. In my experience, dada2 will usually throw out 25-75% of the reads in a given dataset. If dada2 starts filtering out more than 90% of the input reads, there is either an issue with the input data (low quality) or with the processing steps. You are losing way too many reads at the merge stage, and this is very diagnostic:

This indicates that your reads are not long enough to join. Your read quality profiles look very good, and trimming each side at 220 nt seems reasonable (quality is good enough that you could probably go longer on both).

At this point I would usually be asking how long your amplicons actually are, to determine if 220 + 220 is enough to provide sufficient overlap. But I do not think that’s the issue. (Still, please answer the question; it will be good to know whether this exercise is futile because your amplicons are too long.)

It looks like the issue is that your reads are not all full-length. Many of them appear to be truncated to as little as 38 nt. What is going on there!? Have these reads been pre-filtered and trimmed prior to importing into QIIME 2? This is going to cause problems with read joining; I would highly recommend giving dada2 the rawest form of the sequences possible.

You should check out the raw reads — determine the read length frequencies if you can and figure out why so many truncated reads exist in there.

Then your possible options are:

  1. Use the rawest reads available and follow the same steps that you already have. The untrimmed reads may provide dada2 a little more fodder for joining reads, even if they are riddled with errors.
  2. Use only the forward reads for your analysis. The fact that reads are not overlapping will not be an issue here so you should still retain your reads (though note that you will lose reads that are shorter than trunc-len)
  3. Use a shorter trunc-len so that shorter reads are not dropped, or set trunc-len to 0. That way, no reads will be truncated or dropped. The lack of truncation will risk losing some reads that have errors at the 3’ end (probably a low risk — your quality profiles look really good), but will mean that shorter reads are not dropped at the truncation step, and hence may be retained until the joining stage (after dada2 has done its quality filtering). Even if these reads are shorter than 220 nt, they may still be long enough to join with their pairs.
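As a sketch of option 3, the rerun would look something like this. The input file name reuses the one from your earlier command; the output names are placeholders, and --p-trunc-len 0 disables truncation entirely:

```shell
# Rerun DADA2 without truncation so that shorter reads survive to the
# merge step. Adjust file names to your own data.
qiime dada2 denoise-paired \
  --verbose \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-trim-left-f 10 \
  --p-trim-left-r 10 \
  --o-representative-sequences rep-seqs-notrunc.qza \
  --o-table table-notrunc.qza
```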

Good luck!


> Then your possible options are:
>
> 1. Use the rawest reads available and follow the same steps that you already have. The untrimmed reads may provide dada2 a little more fodder for joining reads, even if they are riddled with errors.

The input files for dada2 were raw fastq files, which were demultiplexed paired-end reads. No other trimming/filtering was done before dada2. Do you think those raw fastq files are the issue here? In that case, what options do I have?

> Use only the forward reads for your analysis. The fact that reads are not overlapping will not be an issue here so you should still retain your reads (though note that you will lose reads that are shorter than trunc-len)

Do you mean I should run dada2 on the forward reads and reverse reads separately and merge them after dada2?

Thank you so much for your help!

These trimmed reads might be the issue, but maybe not (see below). Since you did not do this trimming, I wonder if your sequencing center/service did. Who performed the sequencing? You should contact them to find out what quality control, if any, has already been done, since these unexplained short reads seem to be preventing reads from overlapping enough to join.

No. That is effectively what dada2 already does (denoise each read separately, then join). Instead, I propose using only the forward reads (discarding the reverse reads) and running those through dada2 to see how many reads you have.
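If you go the forward-reads-only route, the run would look something like this. This assumes you re-import just the R1 files as a single-end (SampleData[SequencesWithQuality]) artifact; the file names here are hypothetical:

```shell
# Denoise only the forward reads. demux-forward.qza is a placeholder name
# for a single-end import of your R1 files; outputs are placeholders too.
qiime dada2 denoise-single \
  --verbose \
  --i-demultiplexed-seqs demux-forward.qza \
  --p-trunc-len 220 \
  --p-trim-left 10 \
  --o-representative-sequences rep-seqs-forward.qza \
  --o-table table-forward.qza
```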

The best first course of action may be to determine the distribution of read lengths in your data. You can use something more formal like vsearch --fastq_stats if you already have a preferred tool; otherwise, try this (inserting the path to your actual raw fastq):

gunzip -c path-to-file.fastq.gz | awk 'NR % 4 == 2 { print length($0); }' | sort -n | uniq -c

This will tell us how many reads should actually be long enough to merge.
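If you prefer Python, here is an equivalent sketch (the file path is a placeholder):

```python
# Tally read lengths in a gzipped FASTQ: the sequence is every 4th line,
# starting from the 2nd line of each record.
import gzip
from collections import Counter

def read_length_counts(path):
    counts = Counter()
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence lines
                counts[len(line.rstrip("\n"))] += 1
    return counts

# Example usage:
# for length, n in sorted(read_length_counts("path-to-file.fastq.gz").items()):
#     print(length, n)
```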

Though this gets me back to an earlier question: what marker gene/primers are you using, and what is the expected read length? It occurs to me that my thoughts on raw read length may be misguided, as those short sequences are probably dropped earlier by dada2 (at the “filtered” stage, not at the “merged” stage), which may indicate that 220 + 220 is not quite long enough, or that the paired-read alignments are failing for some reason. (Sorry, I just don’t know enough about where dada2 actually performs this length exclusion to be sure.) Are you sequencing ITS by any chance?

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.