DADA2 with COI primers - hard to diagnose sequence loss?

First forum post ever, so my apologies in advance for any errors. I am using qiime2-2020.11 on conda on macOS running Big Sur (11.1).

I am processing COI data from a 2x300 Illumina MiSeq run. My samples did not need demultiplexing, so after importing I went straight to primer removal with cutadapt. I am using the Leray (2013) primers, which target an expected amplicon length of 313 bp. I am now at the DADA2 filtering/merging/denoising step that so many people get hung up on, judging by this forum.

I've tried adjusting the parameters based on recommendations from this forum (varying truncation lengths, raising --p-max-ee-f and --p-max-ee-r as high as 15, trimming off 40 bp and then successively less, and making sure the reads keep enough overlap). Below is my sequence quality profile and the parameter set that seems to retain the most sequences at each step, but it also reduces the merged sequence length to 259 bp. My question: is this (approximately) the best I can expect from my data?

    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs trimmed-seqs-EB101-194.qza \
      --p-trunc-len-f 200 \
      --p-trunc-len-r 160 \
      --p-trim-left-f 40 \
      --p-trim-left-r 40 \
      --p-max-ee-f 5 \
      --p-max-ee-r 5 \
      --o-table table-EB101-194.qza \
      --o-representative-sequences rep-seqs-EB101-194.qza \
      --o-denoising-stats denoising-stats-EB101-194.qza \
      --verbose

Here's the cutadapt command I used, in case the mistake is there (is there a way to check whether cutadapt properly removed the primers?):

    qiime cutadapt trim-paired \
      --i-demultiplexed-sequences paired-end-demux-EB101-194.qza \
      --p-front-f ^GGWACWGGWTGAACWGTWTAYCCYCC...TGRTTYTTTGGTCACCCTGAAGTTTA \
      --p-adapter-r ^TAAACTTCAGGGTGACCAAARAAYCA...GGRGGRTAWACWGTTCAWCCWGTWCC \
      --o-trimmed-sequences trimmed-seqs-EB101-194.qza

Additionally, I noticed on this post that 'standard chimera detection won't work' for COI - can someone explain this and suggest how best to move forward?

Thank you all for your suggestions to my post, and the many others I've consulted! Running out of ideas on my end :sweat_smile:

denoising-stats-EB101-194.qzv (1.2 MB) table-EB101-194.qzv (1.0 MB) rep-seqs-EB101-194.qzv (2.9 MB)


Hi @elaine-shen, and welcome to the forum! :wave:

First of all, this is an excellent first post! Thanks for putting the effort into formulating a good question :smiley: :medal_sports:

Your truncation lengths might be a bit long. Given your quality profiles, you can probably "trunc more to get more": truncating reads earlier, before quality deteriorates, often lets more of them pass the filter and still merge, since a 313 bp amplicon leaves plenty of overlap from 2x300 reads.

Trimming from the left may not be necessary here: your primers are already removed and your 5' quality looks great.

An expected-error threshold of 5, rather than the default of 2, is quite relaxed. You might be better off truncating more aggressively and keeping the defaults for --p-max-ee-f and --p-max-ee-r.
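Putting those suggestions together, a revised call might look something like the sketch below. To be clear, the truncation positions here are illustrative guesses, not tested recommendations - pick yours from your own quality plot:

```shell
# Sketch only: tighter truncation, no left-trim, default max-EE of 2.
# Truncation values (190/150) are hypothetical; choose positions just before
# quality drops, keeping enough combined length to span the 313 bp amplicon
# with ~20+ bp of overlap.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-seqs-EB101-194.qza \
  --p-trunc-len-f 190 \
  --p-trunc-len-r 150 \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-max-ee-f 2 \
  --p-max-ee-r 2 \
  --o-table table-EB101-194.qza \
  --o-representative-sequences rep-seqs-EB101-194.qza \
  --o-denoising-stats denoising-stats-EB101-194.qza \
  --verbose
```

Then compare the denoising stats against your earlier runs to see where reads are being lost (filtering vs. merging).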

You can check your read-length distributions before and after trimming. You can also run qiime cutadapt trim-paired with the --verbose flag, which prints cutadapt's full report - this shows a lot of interesting and useful information, including how many reads matched each adapter. You will need to consult this section of the cutadapt docs about how to interpret a cutadapt report.
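If you'd rather inspect the reads directly, a quick sanity check (outside of QIIME 2, assuming you have the per-sample FASTQ files at hand - the filenames below are hypothetical) is to tabulate read lengths before and after trimming. Since the Leray primers are 26 nt, the post-cutadapt distribution should shift down by roughly that much from the raw 300 bp:

```shell
# Print a read-length histogram ("length  count") for a FASTQ file,
# plain or gzipped. Sequence lines are every 4th line, starting at line 2.
fastq_len_hist() {
  zcat -f "$1" \
    | awk 'NR % 4 == 2 { len[length($0)]++ } END { for (l in len) print l, len[l] }' \
    | sort -n
}

# Usage (hypothetical filenames): compare raw vs. primer-trimmed reads.
# fastq_len_hist raw_R1.fastq.gz
# fastq_len_hist trimmed_R1.fastq.gz
```

Reads still sitting at exactly 300 bp after trimming are a hint that the primer was not found and removed in those reads.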

I'm not too sure about this one, but it looks like your issue is in merging, so chimera handling might not be something you need to worry about yet.

(Thanks, @jwdebelius and @Nicholas_Bokulich!)


Hi @andrewsanchez, thanks so much for getting back to me! I didn't realize there was a "trunc more to get more" strategy - thank you for the suggestion. Per your advice, I ended up using (190, 140) as my truncation lengths with a maxEE of 2, running DADA2 in R, and also used the trimRight parameter to take 13 bp off the right side - a parameter that would be useful to have in qiime2 as well.

I also discovered MultiQC in the interim and noticed a lot of Nextera transposase contamination in my sequences, which prompted me to run cutadapt and Trimmomatic (and, subsequently, DADA2) outside of qiime; this seems to result in less sequence loss further down the line.

Now to deal with taxonomic classification of COI... hoping someone has a nice pre-trained classifier that can be run locally... one can dream (though shoutout to @devonorourke for the BOLD classifier). Guess it's time to finally learn how to use our servers :sweat_smile:


I have a few questions about this. I'll get back to you after I get some answers. :smiley:

To clarify, did you use qiime cutadapt? And out of curiosity, why both cutadapt and Trimmomatic? You probably already know this, but keep in mind that anything done outside of QIIME 2 won't be accounted for in provenance data. :wink:

Good luck! Rumor has it that Devon's classifiers are memory intensive and that the NCBI classifier might be a pinch smaller. Here are some other links that may come in handy:

It's definitely not a job for most laptops. Generating classifiers for the entire BOLD or NCBI COI sets required somewhere between ~200 and 500 GB of memory, if I remember correctly. I couldn't use the cluster's regular CPUs with 128 GB RAM, but I didn't need the 1.5 TB node either.

One option for those without access to such large memory allocations would be to filter by taxonomic group. If you're certain you're only working with fish, for example, you can drastically cut down the amount of memory required by subsetting for just those taxa on the front end, before building the classifier. If you're working with moths, well, sorry, you're kinda stuck... there are so many dang moths...
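For what it's worth, that front-end subsetting can be done inside QIIME 2 with q2-taxa before training. A minimal sketch, assuming you have the full COI reference reads and taxonomy as artifacts (the filenames and the Actinopterygii filter are hypothetical examples):

```shell
# Keep only fish reference sequences; taxonomy entries for dropped IDs are
# simply ignored downstream, so the full taxonomy artifact can be reused.
qiime taxa filter-seqs \
  --i-sequences bold-coi-ref-seqs.qza \
  --i-taxonomy bold-coi-ref-taxonomy.qza \
  --p-include Actinopterygii \
  --o-filtered-sequences bold-coi-fish-seqs.qza

# Train a classifier on the much smaller subset.
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads bold-coi-fish-seqs.qza \
  --i-reference-taxonomy bold-coi-ref-taxonomy.qza \
  --o-classifier bold-coi-fish-classifier.qza
```

A subset like this can bring the memory footprint down to something a workstation can handle, at the cost of only classifying the taxa you kept.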


At first, I ran qiime cutadapt, but since I was trying to diagnose my merging problem, I later took it out of qiime so I could view my outputs in MultiQC. I suspected that the low merging rate in DADA2 was due to residual Illumina adapters in my sequences - when I compared the MultiQC reports for my raw demultiplexed paired-end data and the cutadapt output, there still seemed to be N base calls and some 'Nextera transposase' contamination present. I decided to use Trimmomatic to remove these, followed by cutadapt (full transparency: still running that).

Thanks! I have consulted these, I'll try them out and report back!

I believe I can request more memory than 128 GB, so I'm going to try that next as well. I am trying to look at anything metazoan (not as taxon-specific as other COI uses), so all the arthropods are what's really doing me in, I think.

Thank you both for the thoughtful discussion and suggestions!


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.