Trimming sequencing adapters, barcodes, pad, linker, and primer sequences

Hello,

I am facing the prospect of redoing all of my analysis because I did not properly clean up my data. I ran a 2x250 bp MiSeq paired-end run and amplified the V4 region using the Caporaso primer set. Could anyone let me know:

  1. Does my 16S raw data need trimming of sequencing adapters, barcodes, pad, linker, and primer sequences? When I got the data from the sequencing center, I was told the sequencing adapters had NOT been trimmed off.
  2. If trimming is necessary, how is this normally done in QIIME 2? What do dada2 and deblur actually do? Is “denoising” equivalent to a “quality control and filtering” step?
  3. 16S primers are degenerate primers that contain N, V, M. Will this be a problem when it comes to trimming?
  4. How do I evaluate the quality of the trimming?
  5. If a relatively big chunk of sequence (~20-30 bp) in the middle of the amplicon has low scores, should I cut it off directly and proceed?

The related topics showed up while I was typing my questions, and I read them one by one. But I think I am asking something very basic, and I was not able to find answers there. I would appreciate it if anyone could help me with these. Thank you very much.

All the best!

Hi!

I suppose the barcodes were removed by the sequencing center, so now you need to remove the adapters and primers from the reads. You can do that with the trimmomatic plugin.

Yes

In my experience it was better to replace all those symbols with “N”.

You can visualize the output as a qzv file. You can also manually check the output files for adapter sequences.

It depends. If the quality is low and, after removing that region, you still have enough length for the reads to overlap, you can remove it and proceed.

Thanks for that, @timanix.

I am not aware of a QIIME 2 plugin for trimmomatic (can you share a link here?) but, @arlandan can use q2-cutadapt, instead.

I always confuse these names, sorry.

Hi @timanix,
Thanks so much for your answers. They solved my concerns.

Can you please give more details on this? How are they replaced, and in which file? Should I replace the bases manually in each .gz file? That sounds a bit overwhelming… I am sorry, I might be asking something silly.

Can I proceed if the quality is low for the overlapped region?

Thank you again. Much appreciated.

Hi @thermokarst,

Thank you very much for helping me with this. I am going to give it a try.
Does that mean trimmomatic is an independent tool for cutting sequences and does not work in the QIIME 2 environment?

Yep!

It might work in a QIIME 2 environment, but, there is no QIIME 2 plugin for working with it.

Nope, I changed all these symbols to N in the adapter sequence. I don’t know if it is necessary; I just got better output with my data.

Just check that, if you delete this region, you will still have enough length for your reads to overlap.
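
A quick way to run that check is to add the two read lengths you keep and subtract the amplicon length. The sketch below is only an illustration: the `overlap` helper is hypothetical, and the 253 bp amplicon length and the shortened read lengths are assumed values, not numbers taken from this dataset.

```shell
# Hypothetical helper: expected overlap (in bases) between a read pair.
# $1 = forward read length kept, $2 = reverse read length kept,
# $3 = amplicon length (assumed here; use the value for your own primers).
overlap() {
  echo $(( $1 + $2 - $3 ))
}

overlap 250 250 253   # full-length 2x250 reads
overlap 200 150 253   # after cutting low-quality ends (made-up lengths)
```

If the second number comes out small or negative, the shortened reads can no longer be merged.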

I wrote Trimmomatic, but that was a mistake. I meant to write Cutadapt.

Alright. Thank you @thermokarst for your timely help.
You all are so responsive.

Hi @timanix,
Thank you very much for your timely reply. I will let you know once I am done with this. Greatly appreciated.

Best

Hi @thermokarst,

Can I ask one more question?
If I understand correctly, dada2 and q2-cutadapt both cut sequences. Does q2-cutadapt specifically cut adapters only, while dada2 can cut whatever region is specified? Thank you!

Hi @arlandan, yes, q2-cutadapt targets and removes substrings of nts (adapters, barcodes, primers, etc.). q2-dada2’s trimming functionality, by contrast, is limited to x-length regions at either the 3’ or 5’ end.
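
To make the difference between the two kinds of trimming concrete, here is a small plain-shell sketch. It does not call QIIME 2, cutadapt, or DADA2; `trim_fixed` and `trim_primer` are hypothetical helpers, and both reads are made up. The first helper drops a fixed number of leading bases whatever they are (the position-based style of DADA2's trim and truncate parameters). The second removes a primer only when the read actually starts with it (the sequence-based style of cutadapt).

```shell
# Position-based: drop the first $2 bases, regardless of content.
trim_fixed() {
  printf '%s\n' "$1" | cut -c "$(( $2 + 1 ))-"
}

# Sequence-based: drop the pattern $2 only if the read starts with it.
trim_primer() {
  printf '%s\n' "$1" | sed "s/^$2//"
}

primer='GTGCCAGC[ACGT]GCCGCGGTAA'          # forward primer from this thread, N written as [ACGT]
with_primer='GTGCCAGCAGCCGCGGTAATACGGAGGGT'  # made-up read that starts with the primer
without_primer='TACGGAGGGTGCAAGCGTTAATCGG'   # made-up read with no primer

trim_fixed  "$with_primer" 19          # TACGGAGGGT
trim_primer "$with_primer" "$primer"   # TACGGAGGGT
trim_fixed  "$without_primer" 19       # AATCGG (19 real bases lost)
trim_primer "$without_primer" "$primer"  # unchanged
```

The last two calls show why fixed-length trimming is only safe when you know every read begins with the same number of non-biological bases.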

@timanix wrote:

“Nope, I changed all these symbols to N in the adapter sequence. I don’t know if it is necessary; I just got better output with my data.”

Hi Timanix, I am not sure what “better output” means. Can you explain more? According to the cutadapt docs, adapter sequences may also contain any IUPAC wildcard character (https://cutadapt.readthedocs.io/en/stable/guide.html#wildcards).

Hi! Yes, I read the manual on the cutadapt home page.
Still, I found a few reads in which the primers were not removed, and they were gone after I replaced the wildcards with “N”. But I did not check whether the base in those reads actually corresponded to the IUPAC symbol in my primers, so it could be due to errors either in the primers or in the PCR.
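
One possible explanation can be shown with a plain-shell sketch that does not use cutadapt at all. It assumes the degenerate form of the forward primer has M (A or C) at the position written as N in this thread. The `iupac_to_regex` and `matches` helpers are hypothetical, and the read is made up to carry a T at that position, as a primer synthesis or PCR error might produce.

```shell
# Expand IUPAC wildcard codes into regex character classes.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g'  -e 's/M/[AC]/g'  -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Print 1 if read $2 starts with primer $1, otherwise 0.
matches() {
  printf '%s\n' "$2" | grep -c -E "^$(iupac_to_regex "$1")" || true
}

primer_m='GTGCCAGCMGCCGCGGTAA'          # assumed degenerate form (M = A or C)
primer_n='GTGCCAGCNGCCGCGGTAA'          # form with N, as used in this thread
seq_read='GTGCCAGCTGCCGCGGTAATACGGAGG'  # made-up read with T at the wildcard position

matches "$primer_m" "$seq_read"   # 0: T is not allowed by M
matches "$primer_n" "$seq_read"   # 1: N accepts any base
```

With exact matching, a read like this keeps its primer under the M form but loses it under the N form. That would be consistent with the behaviour described above, although it is not confirmed for that dataset.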

Thank you @thermokarst
I ran the following command, hoping to cut the adapters and primers.

qiime cutadapt trim-paired
  --i-demultiplexed-sequences /Users/Kan/Desktop/qiime2_2/pilot_imported.qza \
  --p-front-f GTGCCAGCNGCCGCGGTAA \
  --p-adapter-f ATTAGANACCCNNGTAGTCC \
  --p-front-r GGACTACNNGGGTNTCTAAT \
  --p-adapter-r TTACCGCGGCNGCTGGCAC \
  --p-error-rate 0 \
  --o-trimmed-sequences pilot_trimmed.qza \
  --output-dir /Users/Kan/Desktop/qiime2_2 \
  --verbose

But I keep getting the following error:

  Search demultiplexed paired-end sequences for adapters and remove them.
  The parameter descriptions in this method are adapted from the official
  cutadapt docs - please see those docs at https://cutadapt.readthedocs.io
  for complete details.

Options:
  --i-demultiplexed-sequences ARTIFACT PATH SampleData[PairedEndSequencesWithQuality]
                                  The paired-end sequences to be trimmed.
                                  [required]
  --p-cores INTEGER RANGE         Number of CPU cores to use.  [default: 1]
  --p-adapter-f MULTIPLE TEXT     Sequence of an adapter ligated to the 3'
                                  end. The adapter and any subsequent bases
                                  are trimmed. If a `$` is appended, the
                                  adapter is only found if it is at the end of
                                  the read. Search in forward read. If your
                                  sequence of interest is "framed" by a 5' and
                                  a 3' adapter, use this parameter to define a
                                  "linked" primer - see
                                  https://cutadapt.readthedocs.io for complete
                                  details.  [optional]
  --p-front-f MULTIPLE TEXT       Sequence of an adapter ligated to the 5'
                                  end. The adapter and any preceding bases are
                                  trimmed. Partial matches at the 5' end are
                                  allowed. If a `^` character is prepended,
                                  the adapter is only found if it is at the
                                  beginning of the read. Search in forward
                                  read.  [optional]
  --p-anywhere-f MULTIPLE TEXT    Sequence of an adapter that may be ligated
                                  to the 5' or 3' end. Both types of matches
                                  as described under `adapter` and `front` are
                                  allowed. If the first base of the read is
                                  part of the match, the behavior is as with
                                  `front`, otherwise as with `adapter`. This
                                  option is mostly for rescuing failed library
                                  preparations - do not use if you know which
                                  end your adapter was ligated to. Search in
                                  forward read.  [optional]
  --p-adapter-r MULTIPLE TEXT     Sequence of an adapter ligated to the 3'
                                  end. The adapter and any subsequent bases
                                  are trimmed. If a `$` is appended, the
                                  adapter is only found if it is at the end of
                                  the read. Search in reverse read. If your
                                  sequence of interest is "framed" by a 5' and
                                  a 3' adapter, use this parameter to define a
                                  "linked" primer - see
                                  https://cutadapt.readthedocs.io for complete
                                  details.  [optional]
  --p-front-r MULTIPLE TEXT       Sequence of an adapter ligated to the 5'
                                  end. The adapter and any preceding bases are
                                  trimmed. Partial matches at the 5' end are
                                  allowed. If a `^` character is prepended,
                                  the adapter is only found if it is at the
                                  beginning of the read. Search in reverse
                                  read.  [optional]
  --p-anywhere-r MULTIPLE TEXT    Sequence of an adapter that may be ligated
                                  to the 5' or 3' end. Both types of matches
                                  as described under `adapter` and `front` are
                                  allowed. If the first base of the read is
                                  part of the match, the behavior is as with
                                  `front`, otherwise as with `adapter`. This
                                  option is mostly for rescuing failed library
                                  preparations - do not use if you know which
                                  end your adapter was ligated to. Search in
                                  reverse read.  [optional]
  --p-error-rate FLOAT            Maximum allowed error rate.  [default: 0.1]
  --p-indels / --p-no-indels      Allow insertions or deletions of bases when
                                  matching adapters.  [default: True]
  --p-times INTEGER RANGE         Remove multiple occurrences of an adapter if
                                  it is repeated, up to `times` times.
                                  [default: 1]
  --p-overlap INTEGER RANGE       Require at least `overlap` bases of overlap
                                  between read and adapter for an adapter to
                                  be found.  [default: 3]
  --p-match-read-wildcards / --p-no-match-read-wildcards
                                  Interpret IUPAC wildcards (e.g., N) in
                                  reads.  [default: False]
  --p-match-adapter-wildcards / --p-no-match-adapter-wildcards
                                  Interpret IUPAC wildcards (e.g., N) in
                                  adapters.  [default: True]
  --o-trimmed-sequences ARTIFACT PATH SampleData[PairedEndSequencesWithQuality]
                                  The resulting trimmed sequences.  [required
                                  if not passing --output-dir]
  --output-dir DIRECTORY          Output unspecified results to a directory
  --cmd-config FILE               Use config file for command options
  --verbose                       Display verbose output to stdout and/or
                                  stderr during execution of this action.
                                  [default: False]
  --quiet                         Silence output if execution is successful
                                  (silence is golden).  [default: False]
  --citations                     Show citations and exit.
  --help                          Show this message and exit.

Error: Missing option: --i-demultiplexed-sequences
Error: Missing option: --o-trimmed-sequences
Note: When only providing names for a subset of the output Artifacts or
Visualizations, you must specify an output directory through use of the
--output-dir DIRECTORY flag.

I added the --output-dir flag, but I am still getting this. Could anyone advise me on this? Thanks again!

This is an easy one - you are missing the \ on your first line:

qiime cutadapt trim-paired

vs

qiime cutadapt trim-paired \
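
Why this matters: without the trailing backslash, the shell ends the command at the newline. `qiime cutadapt trim-paired` then runs with no options at all, which is why it prints its help text and the "Missing option" errors. The sketch below uses `show_args`, a hypothetical stand-in for `qiime` that only echoes what it receives, to show that backslash-continued lines arrive as a single argument list.

```shell
# Hypothetical stand-in for `qiime`: print each received argument on its own line.
show_args() {
  printf '[%s]\n' "$@"
}

# Every line except the last ends in a backslash, so this is ONE command
# with five arguments.
show_args cutadapt trim-paired \
  --p-error-rate 0 \
  --verbose
```

Dropping the backslash after `trim-paired` would pass only two arguments to the command. The shell would then try to run `--p-error-rate` as a separate command.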

OMG… Thank you so much for your instant response! Sharp eyes :cold_face:

Hi @thermokarst,
Thanks again for the previous reply. Could you please let me know if there is a link or tutorial on interpreting the primer/adapter trimming results? I am having a hard time understanding the output. It is super long, but it follows a certain format and keeps repeating. The only thing I can tell from the results is that my sequences apparently do not need trimming, but please correct me if I am wrong. I would still like to know whether I have missed anything. Thank you very much!

Here is an example of the results after I entered the commands mentioned above:

=== Summary ===

Total read pairs processed:              3,991
  Read 1 with adapter:                       3 (0.1%)
  Read 2 with adapter:                       0 (0.0%)
Pairs written (passing filters):         3,991 (100.0%)

Total basepairs processed:     2,003,482 bp
  Read 1:     1,001,741 bp
  Read 2:     1,001,741 bp
Total written (filtered):      2,003,473 bp (100.0%)
  Read 1:     1,001,732 bp
  Read 2:     1,001,741 bp

=== First read: Adapter 1 ===

Sequence: ATTAGANACCCNNGTAGTCC; Type: regular 3'; Length: 20; Trimmed: 0 times.

=== First read: Adapter 2 ===

Sequence: GTGCCAGCNGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 3 times.

No. of allowed errors:
0-19 bp: 0

Overview of removed sequences
length	count	expect	max.err	error counts
3	3	62.4	0	3

=== Second read: Adapter 3 ===

Sequence: TTACCGCGGCNGCTGGCAC; Type: regular 3'; Length: 19; Trimmed: 0 times.

=== Second read: Adapter 4 ===

Sequence: GGACTACNNGGGTNTCTAAT; Type: regular 5'; Length: 20; Trimmed: 0 times.

No problem!

Your best option is https://cutadapt.readthedocs.io/en/stable/ — we don’t develop cutadapt (only q2-cutadapt, the QIIME 2 plugin). There should be a bit of discussion in those docs about how the program works, and how to interpret the log results. Hope that helps!
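
As a small aid for skimming a long log in the meantime, the per-adapter lines can be condensed to one line per adapter. This is only a sketch: `trim_counts` is a hypothetical helper, and it assumes the `Sequence: ...; ... Trimmed: N times.` layout seen in the log posted above. Other cutadapt versions may format these lines differently.

```shell
# Read a cutadapt log on stdin; print "<adapter sequence> <times trimmed>".
trim_counts() {
  awk '/^Sequence:/ { seq = $2; sub(/;$/, "", seq); print seq, $(NF-1) }'
}

# Two lines copied from the log in this thread.
trim_counts <<'EOF'
Sequence: ATTAGANACCCNNGTAGTCC; Type: regular 3'; Length: 20; Trimmed: 0 times.
Sequence: GTGCCAGCNGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 3 times.
EOF
```

On a saved log you would run `trim_counts < cutadapt.log`, where `cutadapt.log` is a placeholder file name. In the "Overview of removed sequences" table, the `expect` column is cutadapt's rough estimate of how many matches of that length would occur by chance. Three matches of length 3 against an expected 62.4 therefore look like random hits rather than real primers.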

Thank you @thermokarst for sharing the link!
