q2-cutadapt output

jessica.song · October 12, 2020, 8:21am

Hi,

Firstly, thank you to the moderators for always responding so promptly and being so helpful with queries.

My question is in relation to the q2-cutadapt output (attached). I find that about 50% of reads from all my samples are removed after this step. If I understand correctly, this is because the primers were not detected in these removed reads. I work on 2 x 300bp (Illumina MiSeq V3) 16S reads. Is this 'normal' for Illumina reads or am I losing too much?

The parameters I use are:

qiime cutadapt trim-paired
--i-demultiplexed-sequences # filepath of demultiplexed artifact
--p-cores 16 \
--p-front-f Forward primer sequence
--p-front-r Reverse primer sequence
--p-overlap 10
--p-error-rate 0.2
--o-trimmed-sequences filepath/primer-trimmed-demux.qza \
--verbose

Please find attached the demux.qzv file and the output report.

Thank you and I look forward to your response!

all-demux.qzv (290.3 KB)
primer-trimming.zip (21.7 KB)

andrewsanchez · October 14, 2020, 11:37pm

I haven't done too much with cutadapt yet, but I'll do some research and see if I can point you in the right direction, @jessica.song.

jessica.song · October 15, 2020, 12:33pm

Hi @andrewsanchez,

Any help would be very much appreciated Thanks again for taking the time to respond.

I have recently also trimmed the primers off a cDNA dataset using the same parameters and primer sets and yielded very similar results (~50% reads without primers detected and removed). I wonder if it is an issue with the parameters I have chosen or if it could be the quality of my dataset (both of which were Illumina MiSeq V3 2x300bp reads). Without a proper reference, it's hard to tell what is acceptable. What concerns me is that after both adapter removal and trim/filter steps, I am left with only 30 - 40% of the total reads I started with. I understand that there is no golden rule for all of this but am I losing too many sequences here?

Looking forward to hearing what you think. Thanks!

Jessica

andrewsanchez · October 15, 2020, 9:41pm

One simple thing you can try is to go with the default maximum allowed error rate (--p-error-rate) instead of increasing it, as you have done. And if you haven't done this already, I would also suggest taking a deeper look at cutadapt's documentation regarding the minimum overlap parameter to make sure you're choosing a number there that makes sense.

Let's see how that goes. Not sure how big of a difference this will make. There is some more relevant discussion here as well:

jessica.song · October 16, 2020, 8:05am

I have actually tried running the adapter trimming also on default parameters which if I am not mistakened sets the max. error rate at 0.1 and minimum overlap at 3 nt which appears to make a marginal difference (see below: attached).

trimmed-output-report(default).zip (21.1 KB)

I, of course, have also tried other combinations of parameters but found that I always yielded a similar result. I have also tried primer trimming just using dada2. However, having executed both pipelines and comparing them, it seems that the reads trimmed with dada2 are comprised of much more 'Unidentified Bacteria' in downstream taxonomic analysis as opposed to those trimmed with cutadapt (44% vs. 0.01% of total communities, respectively).

As I am still very new to all of this, I am not equipped to make any sort of educated guess as to why this might be the case. I wonder if perhaps you could shed some light on this.

I am not quite sure how to proceed with my data or if it is okay to do so having lost so many sequences. Could the quality of my sequences be the root of my problem?

I realize that I pose quite the open question and apologize for this. As you have probably already gathered, I need a lot of help in this department but I thank you for sharing your experience and taking the time!

cherman2 · October 23, 2020, 11:09pm

Hello @jessica.song,
Sorry for the late response on our part!
Andrew is out of the office this week, so he has been un able to answer your questions. He will be back soon.

andrewsanchez · October 27, 2020, 10:05pm

Hi @jessica.song,

Thank you for your patience!

Are you able to provide the SampleData[PairedEndSequencesWithQuality] artifact output by q2-cutadapt? I'd like to take a look at the provenance for cutadapt's output if possible.

After looking at your logs again, I'm curious if disallowing indels is appropriate in your case (User guide — Cutadapt 5.0 documentation). Same question for discarding reads in which no adapter was found.

I'm also wondering if you should specify an anchored 5' adapter as seems to be suggested in the cutadapt docs (User guide — Cutadapt 5.0 documentation).

There are also some more fine grained parameters exposed for paired end data which may be worth taking a look at (User guide — Cutadapt 5.0 documentation).

Unless your adapter may also occur in a degraded form, you probably want to use an anchored 5’ adapter.

This is certainly a possibility. Sometimes it is possible to recover lost data, but it usually takes a deeper understanding of your data and the tools like cutadapt. It seems to me that there is simply no way around digging deep into cutadapt's documentation to ensure your cutadapt command accurately reflects your adapter layout.

I'm happy to try and assist further on this after you get back me about where you are at now, since it has been awhile. Looking forward to hearing back. Also, take a look at this forum search for related topics.

jessica.song · October 28, 2020, 1:39pm

Hi @andrewsanchez,

Thank YOU for the help and for brainstorming with me!

Here is the artifact from the dataset we discussed previously: https://drive.google.com/drive/folders/1zywQjHyLtP7QxJ053ZGjoClDPJQp6t3t?usp=sharing
(filename: all-demux-trimmed.qza)

Additionally, I have run a couple of tests (as per your suggestions) just to isolate the problem. This time I used the entire dataset (16S and cDNA seqs) with the same primer set, 341F CCTACGGGNGGCWGCAG and 785R GACTACHVGGGTATCTAAKCC.

The different setups are as follows:

default parameters
overlap = 10, error rate = 0.2
overlap = 10; error rate = 0.2; 5' anchored
overlap = 10; error rate = 0.2; 5' anchored; discard untrimmed
overlap = 10; error rate = 0.2; 5' anchored; no indels

I have only attached the trimming logs (see: Google Drive link: primer-trimming-logs) due to the large file sizes of all the resulting artifacts but I can of course provide them if needed.

Based on preliminary observations, there are marginal differences between the output of both Setups #3 and #5. It would appear that the --p-no-indels condition definitely limits the trimming just to sequences at the target positions but the output nevertheless appears to be similar for both. The --p-discard-untrimmed command seems to be what is causing half of the reads to be lost but given my lack of experience in the area I cannot quite understand why, even after having read the cutadapt guide multiple times over.

I wonder what could be the reason behind not having detected the primers in 50% of total reads and what the effect would be of retaining those untrimmed sequences. Perhaps you could help me understand this a little better. I am also keen to learn what you think of the different outputs and what looks more 'acceptable' of an outcome to your more experienced eyes.

Thank you again for your guidance. As you can probably tell, I need plenty! Looking forward to hearing back from you!

Jessica

SoilRotifer · October 29, 2020, 4:11pm

The likely reason you are loosing so many reads is due to excessive mismatches as a result of IUPAC ambiguity codes in your primers. Try adding the --p-match-adapter-wildcards flag. More information can be found here:

andrewsanchez · October 29, 2020, 7:00pm

FYI, the relevant part of the discussion linked to above was moved to Cutadapt failing to remove all primers.

jessica.song · October 30, 2020, 1:01pm

Hi @SoilRotifer and @andrewsanchez,

I have actually tried including those flags but again, the output changes marginally.

cutadapt-output.zip (19.1 KB)

Interestingly enough, I have been in discussion with a colleague and he suggested that I try the commands for linked adapters. Having done so, it would appear that he was right and I managed to improve the primer detection rate for the forward reads substantially but not so much for the reverse (still only detected in about 50% of reverse reads). As I do not want to retain untrimmed reads, I am left at the end again with about half my read count per sample. I wonder though if this should provide any tells about what the problem might be here?

link-trimmed.zip (81.5 KB)

Attached in the second .zip file are the log files for the first run I tried with linked adapters on default parameters and the second with the added flags to match adapter and read wildcards.

I realize this might be, at this point, quite a drawn out question and I apologize for that but I worry I may have exhausted myself of all ideas as to what the issue might be. For that, I am grateful for any insight you can provide me!

Thanks again!

Jessica

SoilRotifer · October 30, 2020, 2:15pm

This is quite perplexing indeed.

Usually when I see such consistent loss across all the samples, it usually implies that the primer sequences are slightly incorrect. Or something else is consistently wrong... What sequencing protocol was used to generate this data? Do you have a reference?

These primers are identical to the ones below, except for the T in place of K in the reverse primer. Which should not be an issue as K == G or T (Herlemann et al. 2011).

Bakt_341F: CCTACGGGNGGCWGCAG
Bakt_805R: GACTACHVGGGTATCTAATCC

I would rerun with the following changes. I know you've tried much of this already, I am only adding 2 new ones:

try using the --p-anywhere-f instead of --p-front-f
try using the --p-anywhere-r instead of --p-front-r
do not use the --no-indels
remove the ^ from the beginning.
keep --p-match-adapter-wildcards
keep --p-match-read-wildcards
keep --p-discard-untrimmed

Sometimes odd things happen with sequencing. I think at one point I simply had to trim some bases off either end of the primers... but you'd think this should be solved by simply increasing --p-error-rate as you've done. Otherwise, if you are not using staggered primers (Lundberg et al. 2013), then you can forgo cutadapt and just use the trimming options provided in dablur or DADA2. But it is nice to use cutadapt as a form of quality control.

Would it be possible to share the demuxed data? You can DM me if you do not want to share it publicly.