q2-cutadapt output

Hi @jessica.song,

Thank you for your patience! :pray:

Are you able to provide the SampleData[PairedEndSequencesWithQuality] artifact output by q2-cutadapt? I'd like to take a look at the provenance for cutadapt's output if possible.

After looking at your logs again, I'm curious whether disallowing indels (--p-no-indels) is appropriate in your case (see the Cutadapt 4.7 user guide). Same question for discarding reads in which no adapter was found (--p-discard-untrimmed).

I'm also wondering if you should specify an anchored 5' adapter, as the Cutadapt 4.7 user guide seems to suggest.

There are also some more fine-grained parameters exposed for paired-end data (see the Cutadapt 4.7 user guide) which may be worth a look.

As the guide puts it: "Unless your adapter may also occur in a degraded form, you probably want to use an anchored 5’ adapter."
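
For reference, anchoring a 5' adapter in q2-cutadapt just means prefixing the sequence with ^. A minimal sketch, where the primer sequences and file names are placeholders rather than your actual values:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f ^FWDPRIMER \
  --p-front-r ^REVPRIMER \
  --o-trimmed-sequences demux-trimmed.qza \
  --verbose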

This is certainly a possibility. Sometimes it is possible to recover lost data, but it usually takes a deeper understanding of your data and of tools like cutadapt. It seems to me that there is simply no way around digging deep into cutadapt's documentation to ensure your command accurately reflects your adapter layout.

I'm happy to try and assist further after you get back to me about where you are at now, since it has been a while. Looking forward to hearing back. Also, take a look at this forum search for related topics.

1 Like

Hi @andrewsanchez,

Thank YOU for the help and for brainstorming with me!

Here is the artifact from the dataset we discussed previously: https://drive.google.com/drive/folders/1zywQjHyLtP7QxJ053ZGjoClDPJQp6t3t?usp=sharing
(filename: all-demux-trimmed.qza)

Additionally, I have run a couple of tests (as per your suggestions) just to isolate the problem. This time I used the entire dataset (16S and cDNA seqs) with the same primer set, 341F CCTACGGGNGGCWGCAG and 785R GACTACHVGGGTATCTAAKCC.

The different setups are as follows (setup #4 is sketched as a command after this list):

  1. default parameters
  2. overlap = 10; error rate = 0.2
  3. overlap = 10; error rate = 0.2; 5’ anchored
  4. overlap = 10; error rate = 0.2; 5’ anchored; discard untrimmed
  5. overlap = 10; error rate = 0.2; 5’ anchored; no indels
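
For clarity, setup #4 corresponds to a command roughly like the following (file names are placeholders for my actual ones):

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences all-demux.qza \
  --p-front-f ^CCTACGGGNGGCWGCAG \
  --p-front-r ^GACTACHVGGGTATCTAAKCC \
  --p-overlap 10 \
  --p-error-rate 0.2 \
  --p-discard-untrimmed \
  --o-trimmed-sequences all-demux-trimmed.qza \
  --verbose > trimming-log.txt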

I have only attached the trimming logs (see: Google Drive link: primer-trimming-logs) due to the large file sizes of the resulting artifacts, but I can of course provide them if needed.

Based on preliminary observations, there are only marginal differences between the outputs of Setups #3 and #5. The --p-no-indels condition does appear to limit trimming to sequences at the target positions, but the output nevertheless looks similar for both. The --p-discard-untrimmed option seems to be what is causing half of the reads to be lost, but given my lack of experience in the area I cannot quite understand why, even after having read the cutadapt guide multiple times over.

I wonder what the reason could be for the primers not being detected in 50% of total reads, and what the effect of retaining those untrimmed sequences would be. Perhaps you could help me understand this a little better. I am also keen to hear what you think of the different outputs and which outcome looks more ‘acceptable’ to your more experienced eyes.

Thank you again for your guidance. As you can probably tell, I need plenty! Looking forward to hearing back from you!

Jessica

1 Like

The likely reason you are losing so many reads is excessive mismatches caused by the IUPAC ambiguity codes in your primers. Try adding the --p-match-adapter-wildcards flag. More information can be found in the cutadapt user guide.
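
Roughly, the flag just gets added to your existing call. A minimal sketch using your primers, with hypothetical file names (keep your other parameters as they were):

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAAKCC \
  --p-match-adapter-wildcards \
  --o-trimmed-sequences demux-trimmed.qza \
  --verbose > cutadapt-log.txt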

1 Like

FYI, the relevant part of the discussion linked to above was moved to Cutadapt failing to remove all primers.

Hi @SoilRotifer and @andrewsanchez,

I have actually tried including those flags, but again the output changed only marginally.

cutadapt-output.zip (19.1 KB)

Interestingly enough, I have been discussing this with a colleague, and he suggested that I try the commands for linked adapters (sketched after the attachments below). It would appear that he was right: I managed to improve the primer detection rate substantially for the forward reads, but not so much for the reverse (primers are still detected in only about 50% of reverse reads). As I do not want to retain untrimmed reads, I am again left with about half my read count per sample. I wonder, though, if this provides any tells about what the problem might be?

link-trimmed.zip (81.5 KB)

The second .zip file contains the logs for two linked-adapter runs: the first on default parameters, and the second with the added flags to match adapter and read wildcards.
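
For reference, the second of those runs was along these lines; this is a sketch rather than my exact command, the file names are placeholders, and the second sequence in each linked pair is my reverse complement of the opposite primer:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences all-demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG...GGMTTAGATACCCBDGTAGTC \
  --p-front-r GACTACHVGGGTATCTAAKCC...CTGCWGCCNCCCGTAGG \
  --p-match-adapter-wildcards \
  --p-match-read-wildcards \
  --o-trimmed-sequences all-demux-link-trimmed.qza \
  --verbose > link-trimming-log.txt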

I realize this has become quite a drawn-out question, and I apologize for that, but I worry I have exhausted all my ideas as to what the issue might be. I am grateful for any insight you can provide!

Thanks again!

Jessica

1 Like

This is quite perplexing indeed. :thinking:

When I see such consistent loss across all the samples, it usually implies that the primer sequences are slightly incorrect. Or something else is consistently wrong… :man_shrugging: What sequencing protocol was used to generate this data? Do you have a reference?

These primers are identical to the ones below, except for the K in your reverse primer in place of the T in Bakt_805R. That should not be an issue, as K == G or T (Herlemann et al. 2011).

  • Bakt_341F: CCTACGGGNGGCWGCAG
  • Bakt_805R: GACTACHVGGGTATCTAATCC

I would rerun with the following changes. I know you’ve tried much of this already; I am only adding two new ones (a sketch of the full command follows the list):

  • try using the --p-anywhere-f instead of --p-front-f
  • try using the --p-anywhere-r instead of --p-front-r
  • do not use the --p-no-indels flag
  • remove the ^ from the beginning of the primer sequences.
  • keep --p-match-adapter-wildcards
  • keep --p-match-read-wildcards
  • keep --p-discard-untrimmed
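
Putting those together, something along these lines (a sketch, with hypothetical file names):

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --p-anywhere-f CCTACGGGNGGCWGCAG \
  --p-anywhere-r GACTACHVGGGTATCTAAKCC \
  --p-match-adapter-wildcards \
  --p-match-read-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences anywhere-trimmed-demux.qza \
  --verbose > cutadapt-anywhere-log.txt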

Sometimes odd things happen with sequencing. I think at one point I simply had to trim some bases off either end of the primers… but you’d think this should be solved by simply increasing --p-error-rate as you’ve done. Otherwise, if you are not using staggered primers (Lundberg et al. 2013), then you can forgo cutadapt and just use the trimming options provided in deblur or DADA2 (a DADA2 sketch follows below). But it is nice to use cutadapt as a form of quality control.
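
For the DADA2 route, primer removal would just be the trim-left options sized to the primer lengths (17 nt for 341F, 21 nt for 785R). A minimal sketch, assuming the primers sit at the very start of every read; file names are hypothetical, and the trunc-len values of 0 (no truncation) are only for illustration:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza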

Would it be possible to share the demuxed data? You can DM me if you do not want to share it publicly.

5 Likes

Hi @SoilRotifer,

Thank you for walking me through this. I have tried the settings you suggested and sent the files, as per your request, through a DM. As you can see, the output did not change much. I have also tried trimming the primers with DADA2 alone; having run it both with and without cutadapt, trimming only with DADA2 seems to result in approximately 50% of reads being classified downstream as Unidentified Bacteria.

Given that the ‘linked adapters’ setup with cutadapt has given me the most promising output of all the configurations, would you say it is safe to proceed with that, but without discarding untrimmed reads? I am uncomfortable retaining these untrimmed reads, mostly because I am not sure what they could introduce to downstream analyses.

What do you think?

Jessica

Good news @jessica.song! I think I solved this. :fireworks:

First, thank you for sending me the PDF describing the protocol. I noticed a key statement:

  • pairs of primers (Fw-Rev or Rev-Fw) had to be present in the sequence fragments

That is, this particular sequencing facility must expect mixed-orientation reads. :rotating_light: Therefore you need to enter both primers for each of the --p-front-* parameters (in 5' - 3' orientation). :dna:
Like so:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --p-cores 4 \
  --p-front-f CCTACGGGNGGCWGCAG  GACTACHVGGGTATCTAAKCC \
  --p-front-r GACTACHVGGGTATCTAAKCC  CCTACGGGNGGCWGCAG \
  --p-overlap 3 \
  --p-error-rate 0.1 \
  --p-match-read-wildcards \
  --p-match-adapter-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences primer-trimmed-demux-2.qza \
  --verbose > cutadapt-log-2.txt

The output from the first sample looks like this:

=== Summary ===

Total read pairs processed: 133,428
  Read 1 with adapter: 133,350 (99.9%)
  Read 2 with adapter: 132,063 (99.0%)
Pairs written (passing filters): 131,987 (98.9%)

Total basepairs processed: 77,399,028 bp
  Read 1: 38,699,572 bp
  Read 2: 38,699,456 bp
Total written (filtered): 71,548,932 bp (92.4%)
  Read 1: 35,774,760 bp
  Read 2: 35,774,172 bp

:boom:
-Cheers!
-Mike

7 Likes

@jessica.song, I forgot to mention that you’ll still need to orient :left_right_arrow: this output so all reads face the same direction prior to any other downstream analyses. You can do this via the RESCRIPt action orient-seqs.
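
A minimal sketch of that step; note that orient-seqs operates on FeatureData[Sequence], so this assumes it is applied to representative sequences against a reference, and the file names are hypothetical:

qiime rescript orient-seqs \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences reference-seqs.qza \
  --o-oriented-seqs rep-seqs-oriented.qza \
  --o-unmatched-seqs rep-seqs-unmatched.qza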

-Good luck! :shamrock:

2 Likes

Hi @SoilRotifer,

You did it! I just ran the command on all my samples and it looks perfect! A terrible oversight on my part. I cannot thank you enough for your time and for providing the commands! :pray: :pray: :pray:

Jessica

4 Likes

Glad we could help! :smiley:

2 Likes
