dada2 collapseNoMismatch

ja.morillo · November 13, 2019, 1:03pm

Dear colleagues:
I am analyzing data obtained with the EMP protocol for fungi-ITS1 region (soils). I have 3 runs. Two runs were processed with the Illumina V3 kit (2x300 bp) but the third came from a run with the kit V2 (250 x2 bp). Of course the problem here is that if I ignore the shorter read lengths of the third run, I would get different ASVs from the same biological sequence in the third run.
Also, the highly variable read length of the ITS regions complicates this more. A solution for this particular problem is implemented in dada2 in R, with the function “collapseNoMismatch”, that “ collapse together variants with no mismatches or internal indels but that could differ by terminal gaps, i.e. variants that differ by their lengths and nothing else”. This would be great! Is there something equivalent in qiime2?
See: https://github.com/benjjneb/dada2/issues/716
Other 2 alternatives approaches that I tried with qiime2 are:

Trim the 300 bp raw sequences (R1 and R2) to 250 bp before DADA2, but I cant find a tool to do this in multiple fastq.gz files (I tried seqkit), and also I am not 100% if this would cause any bias. Something in qiime2 tools for this?
Denoise the three runs (I do the different runs separately and later merge ASV tables) using the option --p-trunc-len. In this example, I just use the first 175 bp of the R1 only (after cutadapt):
qiime dada2 denoise-single
–i-demultiplexed-seqs trimmed.qza
–p-trunc-len 175
–p-trim-left 0
–p-trunc-q 2
–p-max-ee 2
–p-n-threads 20
–output-dir dada2
–verbose
The rationale of this is that using the first 175 bp from the reads, I get rid of the problem. But also not 100% sure.
Any suggestion? Thanks!
Cheers,
Jose

colinbrislawn · November 13, 2019, 2:41pm

Hello Joes,

I think you are on the right track. Preprocessing your reads to the same length before running dada2 should give you consistent results.

(As far as I can tell, the collapseNoMismatch() function is not part of the dada2 qiime plugin. But that could be added in a future release!)

That's a great idea. Once the 300 bp forward reads are 250 long, they should cover exactly the same region as the 250 bp forward reads. I don't think this would cause any bias.

Ironically, this might cause less bias than using paired reads; not all ITS reads can pair so using just forwards reads will removing any potential pairing bias. Totally try that

Yep, and you are already using it! The --p-trunc-len option within dada2 is a preprocessing step that takes place before denoising, so it should be functionally identical to trimming fastq files directly. If you want to use vsearch or seqtk to trim a bunch of files, I can help you put together a bash script to loop through all of them.

I hope that helps! Let me know what you find,
Colin

ja.morillo · November 13, 2019, 5:05pm

Hello Colin. Thanks, great help!

Good to know that I am doing well, but I have an additional interesting point about --p-trunc-len option. We are using a mock community as a positive control. It is a commercial one that has 8 bacterial and 2 fungal strains (not too much but…). The two fungal are Cryptococcus neoformans and Streptomyces cerevisiae (https://files.zymoresearch.com/protocols/_d6305_d6306_zymobiomics_microbial_community_dna_standard.pdf).

The thing is: if I apply --p-trunc-len 240 bp (the quality of the data is OK and I could do it), I miss Cryptococcus in the output of DADA2. I figured out that the ITS1 region of this strain, when amplified with EMP primers is 178 bp. So…apparently --p-trunc-len X is not only truncating to X bp, but also removes all reads <X bp. That’s why I have to use --p-trunc-len 175 bp (or perhaps less) otherwise I can loose too much fungal diversity. But, doing this I am also throwing out too much data and damaging taxonomic resolution.

Am I right? Why --p-trunc-len X is also removing the shorter reads, and not just truncating? (I think that DADA2 can handle reads with variable length). Perhaps is not a big deal but it might be useful to know it.

Thanks in advance!

jose

colinbrislawn · November 13, 2019, 6:54pm

Hello Joes,

I'm glad you have got a positive control!

Let's figure out what dada2 is doing. I went to the plugins page, and opened up the one for dada2 denoise-single
https://docs.qiime2.org/2019.10/plugins/available/

Parameters:
  --p-trunc-len INTEGER  Position at which sequences should be truncated due
                         to decrease in quality. This truncates the 3' end of
                         the of the input sequences, which will be the bases
                         that were sequenced in the last cycles. Reads that
                         are shorter than this value will be discarded. If 0
                         is provided, no truncation or length filtering will
                         be performed                               [required]

Yep, you got it right. And it's the same in denoise-paired.

That's a great question, and a really important feature for variable length amplicons like ITS. Let's see what the lead dada2 dev recommends. @benjjneb?

Colin

benjjneb · November 14, 2019, 2:33pm

Yes this is correct, that is how trunc-len works. This is desirable in the 16S application, but not in ITS for the reasons you noted, and we recommend not using trunc-len for ITS, but instead relying on primer removal (borth forward and reverse) and max-ee filtering.

ja.morillo · November 14, 2019, 2:40pm

Thanks a lot! it makes sense for me.

ja.morillo · November 14, 2019, 2:51pm

Hi again Colin,

It seems then that the most correct approach to solve the issue is to truncate the raw fastqs before applying the DADA2 plugin. That would involve an script like you mentined before with a loop including a seqkit command for example. If you have something similar that might be shared, of course I would try it.
In any case, it might be great I think if the function collapseNoMismatch() is implemented in the dada2 plugin. Not only for this particular issue in this project, but in general it sounds like a good idea to collapse ASV with identical “indel” sequences, specially with ITS.
Thanks!

colinbrislawn · November 14, 2019, 5:57pm

Check out the examples of how to use gnu parallel:
https://www.msi.umn.edu/support/faq/how-can-i-use-gnu-parallel-run-lot-commands-parallel

So maybe something like

find ~/fastqfolder -name *.fastq | parallel "vsearch --fastq_filter {} --fastq_trunclen_keep 250 > trunc.250.{}"

Let me know what you try!
Colin

ja.morillo · November 15, 2019, 8:34am

Thanks Colin. Definitively I will try that way! It seems the more solid approach