comparison of single-indexed and dual-indexed MiSeq libraries - has anyone noticed big diversity differences?

ctekellogg · September 10, 2019, 10:06pm

Hi there,
A year ago, I generated 3 MiSeq libraries (515F-926R) using barcoded 515F primers and nonbarcoded 926R. I have since switched to a dual barcoding approach with both the forward and reverse primers having barcodes. All run at the same sequencing facility. When I run them through dada2, i get vastly different feature counts, with the dual-indexed samples have 2x as many features as the single-indexed libraries. This makes some sense since dual-indexing is supposed to reduce crosstalk and help increase the apparent diversity during the MiSeq run, but this magnitude of difference is quite surprising, especially given that range sequence depths is pretty similar among all runs. Has anyone experienced anything like this? Even when I do a PCoA, the 3 single-indexed runs group together, with the dual-indexed runs more different even though they all from roughly the same site, just over time (time series). I'm really hoping to analyze all of these samples together.....

Thank you for any insights.

Colleen

Nicholas_Bokulich · September 10, 2019, 10:27pm

By any chance are the single-indexed reads in a single orientation, and the dual-indexed reads are in mixed orientations? Depending on how the library preps were performed, this could be the case, which would explain both the 2X higher diversity and the dramatically different ordination results, since reverse-complements of the same read will be registered as a unique OTU!

ctekellogg · September 10, 2019, 10:35pm

Hmmm. I am actually not sure.

The single-indexed reads were performed using the EMP 515F barcoded primer constructs. The dual indexed reads were prepared as described here: https://msystems.asm.org/content/2/1/e00127-16 (and Microbiome Amplicon Sequencing Workflow · LangilleLab/microbiome_helper Wiki · GitHub for the nitty gritty).

In the lab it was a pretty similar workflow; just different primer constructs and different methods for sample cleanup before pooling. But, I am not sure if something different would happen at the sequencing end (I didn't personally run the MiSeq...). I suppose perhaps the sequencing center would know the answer to this question? Or do you have a sense, given the detail here.

Thank you for your quick response! I am keen to get the bottom of this!

Nicholas_Bokulich · September 10, 2019, 10:43pm

Hm... both of those protocols include the Illumina adapters in the primer constructs, so the reads should all be running in the same direction. HOWEVER, the two protocols could have the forward adapter attached to different primers (forward vs. reverse), so that even though the PCR primers are identical the reads are reversed. Check it out! There are probably some easy ways to check (aside from the best method: digging into the protocol and consulting the sequencing center) — RC a handful of reads from one run and attempt to align them to the other run using your favorite method.

If that's the issue, QIIME 2 does not (yet) have a method to reverse-complement reads, but VSEARCH (installed as part of the QIIME 2 installation) does have a reverse-complement fastx command that you can use to RC the fastq (before import) or fasta (export, RC, re-import to QIIME 2).

ctekellogg · September 10, 2019, 10:54pm

They both seem to have the same adapter on the forward primer (sadly, i guess):

EMP Primers:

Langille Lab Constructs

So, this should mean they are in the same orientation, correct?

I thought maybe it had to do with slightly better read quality with the dual indexing (so fewer reads get filtered out), but I don't think that is it. Especially since the forward reads of all runs were pretty good, so I have also used qiime dada2 denoise-single to see if that would remedy the issue, but it doesn't seem to. (still trying a few more things...but odds are, no).

ctekellogg · September 10, 2019, 11:17pm

Also, I should mention that for the dada2-paired denoising I get ~3k-4.5k features for the single-indexed samples (6k-7k if I do dada2-single) and closer to 13k for the dual-indexed libraries (and ~20k, if using dada2-single for these). These are pretty substantial differences!

Lengths between the single and dual-indexed libraries are different when denoising and merging (still haven't to tinker with the trimming and truncation settings there to get them to better align because I expect this might cause them to not merge well in the end, which could also influence how things look in the PCoA). But regardless, I find the giant number of ASVs in the dual-indexed libraries is puzzling, because from what i have read, it seems like people often get ASVs amounts much closer to what I got for my single-indexed libraries.

thanks again for your help!

Nicholas_Bokulich · September 11, 2019, 2:18am

yes — though I suppose the sequencing primer may be what determines read direction? Perhaps you should check with the sequencing center and/or spot check some of the sequences in each run just to confirm.

I agree, those ASV counts sound phenomenal so something is not adding up. @gmdouglas do you have any clue what could be going on here re: different processing methods required between EMP vs. Langille protocols?

ctekellogg · September 11, 2019, 6:06am

I was also actually thinking of writing @gmdouglas - since all of these amplicon pools were sequenced at the IMR - just to see if he or anyone there has run into this. I've pretty much followed the SOP you've written (Amplicon SOP v2 (qiime2 2019.7) · LangilleLab/microbiome_helper Wiki · GitHub) - which is awesome. Thank you.

For example, for a particular run.

Import reads

mkdir reads_qza

qiime tools import \
   --type SampleData[PairedEndSequencesWithQuality] \
   --input-path demux/ \
   --output-path reads_qza/kellogg_16S_IMR4sequences.qza \
   --input-format CasavaOneEightSingleLanePerSampleDirFmt

How's the imported data look:

qiime demux summarize \
   --i-data reads_qza/kellogg_16S_IMR4sequences.qza \
   --o-visualization reads_qza/kellogg_16S_IMR4reads_untrimmed_summary.qzv

Trim primers:

qiime cutadapt trim-paired \
   --i-demultiplexed-sequences reads_qza/kellogg_16S_IMR4sequences.qza \
   --p-cores 4 \
   --p-front-f GTGYCAGCMGCCGCGGTAA \
   --p-front-r CCGYCAATTYMTTTRAGTTT \
   --p-discard-untrimmed \
   --p-no-indels \
   --o-trimmed-sequences reads_qza/kellogg_16S_IMR4reads_trimmed.qza

Hows the data look after trimming (19-20bp shorter - good news!):

qiime demux summarize \
   --i-data reads_qza/kellogg_16S_IMR4reads_trimmed.qza \
   --o-visualization reads_qza/kellogg_16S_IMR4reads_trimmed_summary.qzv

And then run dada2 (i've tried a whole bunch of settings here to try and compromise between my old runs and these new ones; here is one of the many):

qiime dada2 denoise-paired --i-demultiplexed-seqs reads_qza/kellogg_16S_IMR4reads_trimmed.qza \
                               --p-trim-left-f 30 \
                               --p-trunc-len-f 270 \
                               --p-trim-left-r 30 \
                               --p-trunc-len-r 210 \
                               --p-max-ee-f 3 \
                               --p-max-ee-r 5 \
                               --p-n-threads 8 \
                               --output-dir dada2_output

Any thoughts would be so appreciated! Seems to work fine for the old runs (though, I import them differently using qiime demux emp-paired), but then yields so many more features for the dual-indexed libraries run a few weeks ago.

gmdouglas · September 11, 2019, 2:12pm

Hey @ctekellogg, thanks for reaching out. André (the IMR manager) is away until next week, but I'll pass on your message to him for when he gets back and I'm sure he'll have some ideas once he gets caught up.

In the meantime I can give my two cents although I'm certainly no expert on sequencing protocols. The read orientations should be the same between the two protocols if I understand your question correctly. In terms of the differing numbers of ASVs I would check to see how the total read depth and the proportion of rare ASVs differ between the two protocols. Either much higher read depth or more low-frequency, spurious ASVs in the dual-index protocol due to different degrees of sequencing error might help explain this observation. There could also be a batch effect due to bleed-through or other noise affecting one run more than another, but that's difficult to assess without more information.

Best,

Gavin

ctekellogg · September 11, 2019, 8:46pm

Hi @gmdouglas and @Nicholas_Bokulich - The sequencing depth was not toooo far off among all runs (on average 70-75k before denoising, with one run that is of not the best quality - it has 55k). Just to make sure I am doing everything roughly correctly (in your minds) - mind if I run a few things by you, with a few more questions?

For the single-indexed reads, I receive the data in the "normal" EMP way of 3 files: forward.fastq.gz, reverse.fastq.gz, and barcodes.fastq.gz. I then use the following (for example) to demultiplex the reads.

qiime tools import \
  --type EMPPairedEndSequences \
  --input-path reads \
  --output-path hakai_16S_IMR1sequences.qza

This brings me to (new) question 1:
If I am correct, in this process, all non-biological sequences are stripped from the data, correct? (running cutadapt with the primer sequences would suggest as much). However, this yields reads that are on average 297 bp long, which to me suggests that the primers still need to be trimmed (for a 2x300 run), but if I run cutadapt, most reads get thrown out because they don't have the primer.

Regardless, here is what the quality plots look like after importing. This run was a bit overclustered, so the turnaround is not great, as you can see.

visualization.qzv (301.8 KB)

With the dual-indexed runs, they were demultiplexed by IMR (saving me a step; yay!). As I mentioned above, I import them using the following:

qiime tools import \
   --type SampleData[PairedEndSequencesWithQuality] \
   --input-path demux/ \
   --output-path reads_qza/hakai_16S_IMR4sequences.qza \
   --input-format CasavaOneEightSingleLanePerSampleDirFmt

At this point the reads are on average 300 bp. I then use cutadapt to trim primers, which yields reads that are about ~281-282 bp, on average. Quality plot example here:

hakai_16S_IMR4reads_trimmed_summary.qzv (315.1 KB)

So, then, if I attempt to use the same settings in dada2 for each run, say:

--p-trim-left-f 30 \
--p-trunc-len-f 270 \
--p-trim-left-r 30 \
--p-trunc-len-r 210 \
--p-max-ee-f 3 \
--p-max-ee-r 8 \

I guess I would think that the merged reads will end up with different length reads because the starting reads were different lengths, but I guess this is not the case....thanks to truncating at the same basepair in each.

single-index:

dual-index:

It does seem like the average lengths are roughly the same at ~312, but they are still shorter than i would expect (411 bp -60 bp from trimming should be 352....unless i am misunderstanding something, which is entirely possible). Are these shorter than they should be?

And, I guess the dual-indexed runs have longer max-lengths because the R2 read quality is generally better so then fewer reads get denoised away?

Anyway - moving on to feature frequencies.

You can see that the single-indexed run has fewer rare ASVs than the dual-indexed run, with a higher frequency for the median.

For the dual-index library example:

So, I suppose this could be the source of some of the discrepancies in ASV# between the two, as per your suggestion, @gmdouglas. Turns out, if I filter out more of the rare reads (say, --p-min-frequency 10 and
--p-min-samples 2), then the different types of libraries start to converge on each other when I do a quick PCoA. PHEW. I had only filtered --p-min-frequency 2 at the start prior to writing this post, which perhaps wasn't enough to get rid of the higher diversity of the rare ASVs in the dual-indexed runs.

This brings me my final question: It is my understanding that with dada2, one should denoise each run separately but using the same settings, and then merge. Or, is it okay to use different settings for different runs, so long as they yield the same length merged reads? I ask because one of my runs was not so great, and in order to merge the reads, i need to do a bit more trimming and truncating than the other 4 runs. Is this okay? Or do I need to resequence this amplicon pool (with the hopes of a more successful run?)?

Thank you!

gmdouglas · September 11, 2019, 9:27pm

Thanks for sending all of these details!

Those are the 3 files you need for importing data from the EMP protocol (see the importing tutorial for example input files to compare with if you're concerned).

All non-biological sequences should definitely be removed. If you're concerned that they aren't being removed correctly it's always a good idea to look at a few raw FASTQs by eye and in a tool like FASTQC. You should be able to see the forward and reverse primers at the beginning of the sequences before running cutadapt and the barcode sequences should be removed after importing them as QIIME 2 artifacts. The Illumina adapter sequences should have been removed before you received your raw reads.

In terms of the read lengths I think it makes sense once you include the primer lengths, which have already been trimmed off. On a related note - is there a reason why you trimmed the beginning of the forward and reverse reads with DADA2? This is normally done as an alternative way to remove primer sequences, but I think in this case you're just throwing out good data. If so that might partially explain why you're getting noisier ASVs with the dual-index data. I might be missing something though!

The mean ASV frequency is quite high, which is a good thing since that means you have high depth! Although there's little consistency in the field our lab always change our depth cut-off based on the total sequencing depth (i.e. a cut-off of 10 reads doesn't make sense to apply across datasets of varying depth). You can see the approach we use here.

I'm not sure about how using different options for different runs would affect the DADA2 output. My instinct is that you should use the more conservative filtering required for that outlier run for all runs so at least they are consistent. Whether you want to do resequencing or not depends on what proportion of data would need to be thrown out to use the conservative filtering. This would probably be a good question for the DADA2 developers however.

Nicholas_Bokulich · September 11, 2019, 9:46pm

To add to @gmdouglas's exquisite advice:

You have confirmed that you are keeping all dada2/workflow parameters consistent between runs. Nevertheless:

this could simply be explained because you have different read lengths. Even slightly different lengths (e.g., because primers are still attached to reads from one run but not the other), would lead to exclusive ASVs between runs.

what distance metric are you using? If you are using a phylogenetic metric (e.g., UniFrac), I would still expect some degree of overlap between protocols, since the reads should map to similar locations on the tree even if they are different lengths. This is why I initially thought mixed or opposing orientations could explain your issue.

But non-phylogenetic methods will lead to total separation because all ASVs are mutually exclusive between protocols. So I suggest trying UniFrac if you are not already, at the very least to diagnose why you are getting disparate results.

ctekellogg · September 11, 2019, 11:08pm

Thank you to you both for your replies!

I checked the reads from the EMP-type samples and they definitely don't seem to have any primers at the start (either by cutadapt, or just by looking at a handful of the reads by eye)...it is just weird to me then that these reads are 297 bp and don't seem to have primers at the start, while the dual indexed reads are 300 bp, you can clearly see the primers at the start of the read, you trim off the primers which brings them down to 281-282 bp. It seems my EMP-type runs have always been this way - straight from the sequencer, with no primer in the reads. Where are the primers...? I guess I'll just have to stop pondering/overthinking this.

Regarding this:

I am still playing around with the settings a bit, but the 3 runs that were nearly overclustered have pretty poor quality at the start of R2, so i thought it best I trim those? (i definitely have to for the super noise run or else basically none of the reads pair). And you can see a hint of this in the quality plot for the single-indexed run above. I guess I figured it best to cut that away and then just be consistent across all runs. I might not have to cut that much for the forward reads though. What do you think?

@gmdouglas. You mention:

Does this then mean you think you should filter the unmerged tables and then merge, so that you can apply the 0.1% of the mean sample depth (so, mean frequency from the frequency per sample table below) on a per run basis? I guess I was thinking it would make sense to be consistent across the dataset as a whole....but maybe not.

Once I merged the denoised runs:
17%20PM

25%20PM

Yes! I fear this is an issue, which is why i went through double checking read lengths above, with the average being 312.

For distance metric - I was using bray curtis for the quick check (because the tree was still running). When I merged the runs a few days ago (vs. what i just did, with the bigger # of ASVs removed), even unifrac didn't completely fix the disparity. Perhaps it a combined issue of rare ASVs and read length differences after pairing.

I am keen to use ASVs because I am working with time series data, so constantly having to recluster doesn't seem ideal in the long run. But merging these ASV tables definitely has a lot of nuances and really rides on sequences being the same length (which makes sense, when you are looking at exact sequences, obviously), but it allows for little variability in how successful a MiSeq or HiSeq run is (which might require more tinkering with the denoising settings). I've been contemplating giving deblur a shot too, perhaps pairing with vsearch or pear would allow for a bit more wiggle room in run quality...?

Thanks again to both of you - this conversation has been incredible helpful to me.

colleen

gmdouglas · September 12, 2019, 1:52pm

I'm not sure why the read lengths are different or why the primers aren't present, but I haven't worked with raw EMP-type data before. I think André (the IMR manager) will have a better idea of what's going on here when he's back at work though.

Based on those plots I wouldn't trim the forward reads from the left at all and only trim the beginning of the reverse reads if it interferes with DADA2.

There's really no consistent way to do that kind of filtering in the field, but I personally would figure out the min depth cut-off for each run independently. If these cut-offs are similar then I would use the highest cut-off across all runs to be conservative, which would keep that step consistent for all samples. If there is a lot of variation then I would use a different cut-off for each run, although if you do that it would be important to make sure that your sample groupings were equally affected by this step (i.e. to make sure there's not a confounding batch effect).

I haven't tried producing identical ASVs across different sequencing runs, but I would probably restrict my analysis to the intersect of homologous DNA produced by all sequencing runs (i.e. throw away DNA in ASVs that was trimmed off by other runs). I'm not sure how feasible that is to do, especially if you'll be doing ongoing sequencing.

Gavin

ctekellogg · September 16, 2019, 10:07pm

Hi @gmdouglas - quick question for you. You mention a filter of 0.1% of mean sample depth. I just want to clarify that you are referring to the mean frequency per sample (and not the mean frequency per feature) in the tables I included above. So, you would apply a filter of 0.001*40,345 for this library? So, this would mean, using a p-min-frequency filter of 40 (so an ASV must be present in the dataset at least 40 times to be retained)? This seems pretty high! But, hey - if that is what the bleed-through rate is, then that is a nice standard to go by.

Thanks!
colleen

gmdouglas · September 17, 2019, 12:17pm

Hey @ctekellogg,

Yes that cut-off refers to the mean sample depth and yes I would use a cut-off of ~40 if the mean sample depth is 40,345. It's hard to assess the actual bleed-through rate in practice, but that's what Illumina has reported for the MiSeq.

Best,

Gavin

kizza · September 23, 2019, 5:07am

@ctekellogg and @gmdouglas I just want throw this comment in about EMP protocol and the 16S primers. I've analysed a couple of data sets that were targeting V3-V4 regions using EMP protocol, from here https://www.pnas.org/content/108/Supplement_1/4516

This protocol is a single PCR reaction protocol, meaning you can grab out 16S region of interest and include all of the necessary adaptors in a single PCR reaction. This is different to an illumina 16S protocol that has two PCR steps 1. get the 16S regions 2. include other sequencing adaptors

This an image from that paper mention above

which shows all of the steps, but importantly if you look at the bottom panel, the sequence primer, the one that you load with illuman reagents contains your degenerate - 16s primer sequence

Which should never appear in your resulting read - FASTQ file. As you only sequencing "green" part of the read.

Now that I've learned that, EMP protocol doesn't require any adaptor trimming, because they should not be any there.

I'm pretty confident of that. I've spent a quite some time thinking about and talking to the sequence facility operators as to what happens at the library prep stage, but it would be great if somebody could second this as well. I think André (the IMR manager) will have a better idea ?

Cheers

ctekellogg · September 23, 2019, 5:44pm

Thank you so much for this explanation @kizza!! Now I can stop stressing about that aspect of things.

colleen