MiSeq paired-end workflow

ksn · April 28, 2018, 2:50pm

Hi,
I have demultiplexed paired-end data (two .fastq files). I did the following steps:

Import

qiime tools import \
 --type 'SampleData[PairedEndSequencesWithQuality]' \
 --input-path raw \
 --source-format CasavaOneEightSingleLanePerSampleDirFmt \
 --output-path demux.qza

summarize

qiime demux summarize \
 --i-data demux.qza \
 --o-visualization demux.qzv

Per-sample sequence counts: min ~16k, mean ~30k, max ~60k.

denoise

qiime dada2 denoise-paired \
 --i-demultiplexed-seqs demux.qza \
 --p-trunc-len-f 0 \
 --p-trunc-len-r 210 \
 --p-n-threads 24 \
 --o-table table.qza \
 --o-representative-sequences rep-seqs.qza

summarize denoise

qiime feature-table summarize \
 --i-table table.qza \
 --o-visualization table.qzv \
 --m-sample-metadata-file st.csv

qiime feature-table tabulate-seqs \
 --i-data rep-seqs.qza \
 --o-visualization rep-seqs.qzv

When I checked the outputs, I noticed that the sequence counts were quite low (max 1236 and min 86, mean 488).

I am wondering whether I missed some steps or parameters that may have led to such low counts, or whether these are normal results.

The reverse reads have very low quality towards the end (maybe stricter trimming could help?). Would you even suggest using only the forward reads in the analysis?

SoilRotifer · April 29, 2018, 1:56pm

Hi @ksn, hopefully I can help a bit, though I’d need more information to answer your questions.

1. What primer pair are you using?
2. What is the expected amplicon size (in base pairs)? I assume ~488 bp? If so, there will be a limited window of overlap given the specified trim settings, which may be a problem given the quality scores.
3. As you pointed out, the quality at the end of the reads can limit your ability to merge as well. So there is a balance between length trimming, the quality of the bases near the end of the reads (which affects merging), and the number of reads you would like to retain post-merging.
4. I’ve processed quite a few data sets like this, more often encountering these issues with 2 x 300 kits. In those cases I’ve continued with only the forward reads. You can feed the output of your demux step directly to denoise-single (it will know to use only the forward reads).
5. Given #4, I try to sanity-check how much data I may be losing or gaining by running the downstream analysis on both the forward-read data and the merged-read data.
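The overlap window can be estimated with simple arithmetic: the forward and reverse bases kept after truncation, minus the amplicon length. The sketch below is only a rough check under stated assumptions: 301-base reads (so --p-trunc-len-f 0 keeps about 301 bases), the guessed ~488 bp amplicon, and a minimum overlap of 12 bases, which is the default of DADA2's mergePairs. The `overlap` helper is hypothetical and all three numbers should be adjusted to the actual run.

```shell
# Overlap left after truncation = kept forward + kept reverse - amplicon length.
overlap() {
  # $1 = forward bases kept, $2 = reverse bases kept, $3 = amplicon length (bp)
  echo $(( $1 + $2 - $3 ))
}

MIN_OVERLAP=12   # assumed: default minOverlap of DADA2's mergePairs

# Assumed values: 301-base forward reads, reverse truncated at 210, ~488 bp amplicon.
ov=$(overlap 301 210 488)
if [ "$ov" -ge "$MIN_OVERLAP" ]; then
  echo "overlap: $ov bases (enough to attempt merging)"
else
  echo "overlap: $ov bases (too short to merge)"
fi
```

A longer amplicon shrinks the window quickly: `overlap 301 210 525` gives -14, meaning the truncated reads would no longer meet at all.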

-Best wishes

ksn · April 29, 2018, 2:09pm

Thanks a lot, Mike.

The primers used for sequencing were Ba9F and Ba515Rmod1, targeting a 525 bp region.

I did not remove chimeras while denoising (I realised this only after getting the results), and now I am unsure whether pooled or consensus-based removal is better.

I will try only the forward reads and compare the results (I was already planning to do that, but thanks for the suggestion).

Best

SoilRotifer · April 29, 2018, 2:35pm

Glad I was able to help @ksn.

The choice of chimera removal method largely depends on the biological question you are trying to address, especially if that question depends on analyzing low-diversity samples. I found this post about sample diversity on the DADA2 GitHub help page. There is also a discussion about sample "error history". The following quote is taken from that discussion:

First, it is not advised to pool samples that don’t share an “error history”, in particular samples that come from different sequencing runs or different PCR protocols. Samples from different runs should typically be run through the dada(...) function separately, so that the correct run-specific error rates can be learned.

Finally, pooled processing can take longer to run.

Certainly quite a bit to consider. Hopefully, other users in the forum can share their experiences with the two approaches.

-Best wishes

ksn · April 30, 2018, 5:58am

Excluding the reverse reads significantly improved the results (even the minimum frequency is now more than 12k, compared to 86 from the paired-end run).
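For reference, a forward-only run along these lines might look like the sketch below. It reuses demux.qza and the thread count from the paired-end command above; the output names and the --p-trunc-len 0 value (no truncation) are placeholders, not the settings actually used here. The `QIIME` variable and the `denoise_forward_only` wrapper are just conveniences so the command can be previewed with QIIME=echo. Newer QIIME 2 releases also expect an --o-denoising-stats output for this action.

```shell
QIIME="${QIIME:-qiime}"   # set QIIME=echo to print the command instead of running it

denoise_forward_only() {
  # $1 = demultiplexed paired-end artifact, $2 = truncation length (0 = none)
  "$QIIME" dada2 denoise-single \
    --i-demultiplexed-seqs "$1" \
    --p-trunc-len "$2" \
    --p-n-threads 24 \
    --o-table table-fwd.qza \
    --o-representative-sequences rep-seqs-fwd.qza
}

# Run only where QIIME 2 and the input artifact are actually present.
if command -v "$QIIME" >/dev/null 2>&1 && [ -f demux.qza ]; then
  denoise_forward_only demux.qza 0
fi
```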

I will remove chimeras outside DADA2, i.e., using q2-vsearch.

One additional question: I have a technical replicate for one of the samples. What is the recommended way to treat it in downstream analyses such as differential abundance comparisons?

SoilRotifer · April 30, 2018, 1:40pm

As you know, DADA2 will remove chimeras for you using a de novo approach.

Be sure to check the output of uchime-ref. There can be issues with the false-positive detection of chimeras when using usearch / vsearch with default settings. The reference sequence database being used and the parameter settings can affect how well uchime-ref can detect chimeras. There have been a few cases in which some of the most abundant OTUs were removed from the data even though they were clearly not chimeric based on follow-up checks.

To combat this, I've found that increasing --p-minh to ~1.0 - 2.0 works well for 16S (it decreases the sensitivity of chimera detection). But you should play with the parameters and sanity-check that the sequences flagged as chimeric are reasonable.

One additional question: I have a technical replicate for one of the samples. What is the recommended way to treat it in downstream analyses such as differential abundance comparisons?
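A minimal sketch of what that could look like with q2-vsearch, assuming a reference database has already been imported as ref-seqs.qza (a hypothetical file name) and starting from --p-minh 1.0. The output names and the two wrapper functions are placeholders for illustration; the tabulated stats are there so the flagged sequences can be checked by hand.

```shell
QIIME="${QIIME:-qiime}"   # set QIIME=echo to preview the commands

MINH=1.0   # assumed starting point within the 1.0 - 2.0 range mentioned above

flag_chimeras() {
  # $1 = feature table, $2 = representative sequences, $3 = reference sequences
  "$QIIME" vsearch uchime-ref \
    --i-table "$1" \
    --i-sequences "$2" \
    --i-reference-sequences "$3" \
    --p-minh "$MINH" \
    --o-chimeras chimeras.qza \
    --o-nonchimeras nonchimeras.qza \
    --o-stats uchime-stats.qza
}

# Tabulate the per-sequence chimera statistics for manual review.
review_chimera_stats() {
  "$QIIME" metadata tabulate \
    --m-input-file uchime-stats.qza \
    --o-visualization uchime-stats.qzv
}

if command -v "$QIIME" >/dev/null 2>&1 && [ -f table.qza ]; then
  flag_chimeras table.qza rep-seqs.qza ref-seqs.qza && review_chimera_stats
fi
```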

As for your replicate sample, I'd consider the advice here.

-Best

ksn · April 30, 2018, 3:32pm

Thanks Mike.

Still on the chimera removal method: would you recommend the DADA2 de novo approach or the vsearch method (reference-based or de novo)?

Best

SoilRotifer · April 30, 2018, 4:01pm

No worries @ksn, glad I was able to help.

De novo chimera removal happens by default when using DADA2 via QIIME (i.e., --p-chimera-method consensus), unless you tell it to use the pooled approach (--p-chimera-method pooled) or to skip chimera removal (--p-chimera-method none). I’ve not had any issues with the default. Again, you need to consider whether the pooled or the consensus approach we discussed earlier suits your data.

I did not mean to deter you from using uchime-denovo. In fact, many (myself included) occasionally make use of both de novo and reference-based chimera removal. I only meant to point out the considerations at each step.
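If it helps to compare the three settings side by side, one option is to denoise the same input once per method and inspect the resulting tables. This is only a sketch: it uses denoise-single with no truncation, and the `denoise_with_chimera_method` wrapper and the output file names are made up for illustration.

```shell
QIIME="${QIIME:-qiime}"   # set QIIME=echo to preview the commands

denoise_with_chimera_method() {
  # $1 = demultiplexed reads, $2 = consensus | pooled | none
  "$QIIME" dada2 denoise-single \
    --i-demultiplexed-seqs "$1" \
    --p-trunc-len 0 \
    --p-chimera-method "$2" \
    --o-table "table-$2.qza" \
    --o-representative-sequences "rep-seqs-$2.qza"
}

# One run per method, so the tables can be compared afterwards.
if command -v "$QIIME" >/dev/null 2>&1 && [ -f demux.qza ]; then
  for method in consensus pooled none; do
    denoise_with_chimera_method demux.qza "$method"
  done
fi
```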

-Best

ksn · April 30, 2018, 5:27pm

Hi @SoilRotifer ,

I was trying an alternative method because I didn't want to give up on the reverse reads. This time, I joined the reads using vsearch, and the quality plot looks like below.

The sequence quality score statistics don't look promising, though.

vsearch v2.7.0_linux_x86_64, 126.0GB RAM, 24 cores
GitHub - torognes/vsearch: Versatile open-source tool for microbiome analysis

Merging reads 100%
18634 Pairs
13608 Merged (73.0%)
5026 Not merged (27.0%)

Pairs that failed merging due to various reasons:
25 too few kmers found on same diagonal
1 potential tandem repeat
2624 too many differences
2336 alignment score too low, or score drop to high
40 staggered read pairs

Statistics of all reads:
301.00 Mean read length

Statistics of merged reads:
512.85 Mean fragment length
15.16 Standard deviation of fragment length
0.61 Mean expected error in forward sequences
3.78 Mean expected error in reverse sequences
0.83 Mean expected error in merged sequences
0.40 Mean observed errors in merged region of forward sequences
3.90 Mean observed errors in merged region of reverse sequences
4.30 Mean observed errors in merged region
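For context, the "expected error" values in the log above are sums of per-base error probabilities derived from the Phred scores, EE = sum over bases of 10^(-Q/10). Below is a minimal sketch of that calculation for a single quality string, assuming Phred+33 encoding; the `expected_error` helper is hypothetical and not part of vsearch.

```shell
# Expected errors of one read: sum of 10^(-Q/10) over its quality characters.
expected_error() {
  # $1 = FASTQ quality string (assumed Phred+33 encoded)
  printf '%s\n' "$1" | awk '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
      ee = 0
      for (j = 1; j <= length($0); j++) {
        q = ord[substr($0, j, 1)] - 33   # Phred score of base j
        ee += 10 ^ (-q / 10)             # error probability of base j
      }
      printf "%.4f\n", ee
    }'
}

# Four Q40 bases ("I") followed by four Q20 bases ("5").
expected_error 'IIII5555'   # 4 * 0.0001 + 4 * 0.01 = 0.0404
```

A handful of low-quality bases dominates the sum, which is why the reverse reads score so much worse than the forward reads.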

I am planning to follow the deblur method this time. Is there anything I need to consider, especially for trimming? When we take into account the joined reads, the quality is low in the middle of the sequence.

Alternatively, even if I continue to use DADA2, do you think that trimming off around 50 bases from the 5' end (--p-trim-left-r 50), instead of truncating the reads at around 250 bases (--p-trunc-len-r 250), will help them join properly?

I am currently trying multiple options, and I am sorry to ask about every issue.
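To make the difference between the two options concrete: --p-trim-left-r removes bases from the 5' start of the reverse read, while --p-trunc-len-r cuts the read at a given position and discards the 3' tail. Read quality usually drops toward the 3' end, so only truncation removes that tail. The toy sketch below mimics both operations on a made-up 12-base string; the `trim_left` and `trunc_len` helpers are illustrative and are not how DADA2 is implemented.

```shell
# Mimic trimming and truncation on a single read (positions are 1-based).
trim_left() { printf '%s\n' "$1" | cut -c "$(( $2 + 1 ))-"; }   # drop the first N bases (5' end)
trunc_len() { printf '%s\n' "$1" | cut -c "1-$2"; }             # keep the first N bases, drop the 3' tail

read_seq="AAACCCGGGTTT"    # toy 12-base read written 5' -> 3'
trim_left "$read_seq" 3    # prints CCCGGGTTT
trunc_len "$read_seq" 9    # prints AAACCCGGG
```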

Best

SoilRotifer · April 30, 2018, 8:42pm

Hi @ksn,

Given your quality score plot, I would not suggest using these merged reads for downstream analysis.

There may be negative consequences of passing low-quality sequence data into denoising / exact sequence variant methods like deblur and DADA2. Some reasoning is outlined by @benjjneb here.

-Mike

wym199633 · May 13, 2018, 9:43pm

Hi Mike,
You mentioned that denoise-single will use only the forward reads. Do you know of any method to use only the reverse reads? My data have high-quality reverse reads but low-quality forward reads. Thanks in advance!

Oliver

SoilRotifer · May 13, 2018, 10:41pm

Hi @wym199633,

I suspect that you should be able to trick QIIME 2 into importing the reverse reads as forward reads by using the manifest format approach. That is, just set the direction column to forward for all your R2 reads, and do not include your R1 reads.

I cannot speak to any sequence data validation steps that occur via this or other import approaches. Perhaps @ebolyen, @thermokarst, or someone else can provide some insight on whether this will work, or if there is a more appropriate option.
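A rough sketch of that idea, assuming CASAVA-style file names (SAMPLE_..._R2_001.fastq.gz) in a directory called raw, sample IDs taken from the text before the first underscore, and the CSV manifest layout with a sample-id,absolute-filepath,direction header. The `make_r2_manifest` helper and the output names are hypothetical. Newer QIIME 2 releases use --input-format instead of --source-format and also accept a different manifest layout.

```shell
# Write a single-end manifest that lists every R2 file as a "forward" read.
make_r2_manifest() {
  # $1 = directory with *_R2_*.fastq.gz files, $2 = manifest path to write
  echo "sample-id,absolute-filepath,direction" > "$2"
  for f in "$1"/*_R2_*.fastq.gz; do
    [ -e "$f" ] || continue                    # no R2 files found: header only
    sample=$(basename "$f" | cut -d_ -f1)      # assumed: sample ID precedes the first "_"
    abs="$(cd "$(dirname "$f")" && pwd)/$(basename "$f")"
    echo "$sample,$abs,forward" >> "$2"
  done
}

if [ -d raw ]; then
  make_r2_manifest raw manifest-r2.csv
  # Import the R2 reads as single-end data only where QIIME 2 is installed.
  if command -v qiime >/dev/null 2>&1; then
    qiime tools import \
      --type 'SampleData[SequencesWithQuality]' \
      --input-path manifest-r2.csv \
      --source-format SingleEndFastqManifestPhred33 \
      --output-path demux-r2.qza
  fi
fi
```

The resulting demux-r2.qza could then be passed to denoise-single like any other single-end artifact.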

-Best
-Mike

Nicholas_Bokulich · May 14, 2018, 1:50pm

Hi @wym199633,
@SoilRotifer is correct — you can trick QIIME2 into using the reverse reads as forward reads by using the manifest file format. With the exception of the CASAVA1.8 paired-end format, which expects specific filename patterns to differentiate the forward and reverse reads, all other formats in QIIME2 are direction-agnostic.

wym199633 · May 14, 2018, 2:04pm

Thank you and @SoilRotifer! Why didn’t I think of this before? Smart!

Oliver

SoilRotifer · May 14, 2018, 2:06pm

Hi @wym199633, glad I was able to help! Thanks @Nicholas_Bokulich for confirming!

-Best
