I am facing the prospect of redoing all of my analysis because I did not properly clean up my data. I ran a 2x250 bp MiSeq paired-end run, amplifying the V4 region with the Caporaso primer set. Could anyone let me know:
Does my raw 16S data need trimming of sequencing adapters, barcodes, pads, linkers, and primer sequences? When I got the data from the sequencing center, I was told the sequencing adapters had NOT been trimmed off.
If trimming is necessary, how is it normally done in QIIME 2? What do dada2 and deblur actually do? Is "denoising" equivalent to the "quality control and filtering" step?
16S primers are degenerate and contain codes such as N, V, and M. Will this be a problem when it comes to trimming?
How do I evaluate the quality of trimming?
If a relatively big chunk of sequence (~20-30 bp) in the middle of the amplicon has low quality scores, should I cut it off directly and proceed?
The related topics showed up while I was typing my questions, and I read them one by one. But I think I am asking something very basic, and I was not able to find answers there. I would appreciate it if anyone could help me with these. Thank you very much.
I suppose the barcodes were removed by the sequencing center, so now you need to remove the adapters and primers from the reads. You can do it with the trimmomatic plugin.
Yes
In my experience it was better to replace all of those symbols with "N".
You can visualize the output as a .qzv file. You can also manually check the output files for adapter sequences.
It depends. If the quality is low and you will still have enough length for the reads to overlap after removing it, you can remove it and proceed.
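To make the "enough length for overlapping" point concrete, here is a minimal sketch of the arithmetic. The 253 bp amplicon length (a typical V4 length once primers are removed) and the 12 nt minimum overlap are assumptions for illustration, not values from this thread; check them against your own data and denoiser settings.

```python
# Rough overlap check for paired-end reads after truncation.
# ASSUMPTIONS (not from this thread): the amplicon length and minimum
# overlap defaults below are illustrative and should be adjusted.

AMPLICON_LEN = 253  # assumed typical V4 length after primer removal
MIN_OVERLAP = 12    # assumed minimum overlap needed to merge read pairs


def merged_overlap(trunc_len_f: int, trunc_len_r: int,
                   amplicon_len: int = AMPLICON_LEN) -> int:
    """Number of bases shared by forward and reverse reads after truncation."""
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(trunc_len_f: int, trunc_len_r: int,
              amplicon_len: int = AMPLICON_LEN,
              min_overlap: int = MIN_OVERLAP) -> bool:
    """True if the truncated reads still overlap by at least min_overlap."""
    return merged_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap


if __name__ == "__main__":
    for f_len, r_len in [(250, 250), (200, 150), (150, 110)]:
        print(f_len, r_len, merged_overlap(f_len, r_len), can_merge(f_len, r_len))
```

With full 250 bp reads there is plenty of room to truncate low-quality tails, but truncating both reads aggressively (for example to 150 and 110) would leave too little overlap under these assumptions.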
Hi @timanix,
Thanks so much for your answers. They solved my concerns.
Can you please give more details on this? How are they replaced, and in which file? Should I replace the bases manually in each .gz file? That sounds a bit overwhelming... I am sorry, I might be asking something silly.
Can I proceed if the quality is low in the overlapping region?
Thank you very much for helping me with this. I am going to give it a try.
Does that mean trimmomatic is an independent tool for cutting sequences and does not work in the QIIME 2 environment?
Can I ask one more question?
If I understand correctly, dada2 and q2-cutadapt both cut sequences. Does q2-cutadapt specifically cut adapters only, while dada2 can cut whatever region is specified? Thank you!
Hi @arlandan, yes, q2-cutadapt targets and removes substrings of nucleotides: adapters, barcodes, primers, etc. dada2's trimming functionality, by contrast, is limited to x-length regions at either the 3' or 5' end.
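As a rough illustration of the two kinds of trimming being compared here, the sketch below contrasts sequence-based trimming (find a known primer and cut it out, as cutadapt does) with fixed-length positional trimming (cut at given positions regardless of content, as dada2's trim and truncate options do). The function names are hypothetical and the matching is exact only; real cutadapt also tolerates errors and partial matches.

```python
from typing import Optional


def trim_front_exact(read: str, primer: str) -> str:
    """Sequence-based trimming: drop the primer and everything before it.

    Exact matching only (a simplification of what cutadapt does).
    Reads without the primer are returned unchanged.
    """
    idx = read.find(primer)
    if idx == -1:
        return read
    return read[idx + len(primer):]


def trim_positional(read: str, trim_left: int = 0,
                    trunc_len: int = 0) -> Optional[str]:
    """Positional trimming: cut at fixed positions, ignoring content.

    trunc_len=0 means "do not truncate". Reads shorter than trunc_len
    are discarded (returned as None), mirroring how dada2 handles them.
    """
    if trunc_len and len(read) < trunc_len:
        return None
    end = trunc_len if trunc_len else len(read)
    return read[trim_left:end]


if __name__ == "__main__":
    print(trim_front_exact("TTGTGCCAACGT", "GTGCC"))   # primer found, removed
    print(trim_positional("ACGTACGTAC", trim_left=2, trunc_len=8))
```

The practical difference: positional trimming removes the same number of bases from every read, so it only removes primers cleanly if they sit at the same position in every read.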
No, I changed all of these symbols to N in the adapter sequence. I don't know whether it is necessary; I just got better output with my data.
Hi Timanix, I am not sure what "better output" means. Can you explain more? According to the cutadapt user guide, adapter sequences may also contain any IUPAC wildcard character.
Hi! Yes, I read the manual on the cutadapt home page.
Still, I found a few reads in which the primers were not removed, and they were gone after I replaced the wildcards with "N". But I did not check whether the base in those reads actually corresponded to the IUPAC symbol in my primers, so it could be due to errors either in the primers or in the PCR.
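One way to picture why replacing wildcards with N can catch a few extra reads: an IUPAC code such as M only matches A or C, while N matches any base. The sketch below expands a degenerate primer into a regular expression. It is only an illustration of wildcard matching, not how cutadapt is implemented, and the 515F sequence with its M code is an assumption inferred from the N-substituted sequence shown in this thread.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def primer_to_regex(primer: str) -> "re.Pattern[str]":
    """Expand a degenerate primer into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def starts_with_primer(read: str, primer: str) -> bool:
    """True if the read begins with a sequence compatible with the primer."""
    return primer_to_regex(primer).match(read.upper()) is not None


PRIMER_M = "GTGCCAGCMGCCGCGGTAA"  # assumed original 515F, with M = A or C
PRIMER_N = "GTGCCAGCNGCCGCGGTAA"  # N-substituted version from this thread

if __name__ == "__main__":
    expected = "GTGCCAGCAGCCGCGGTAATACG"  # A at the degenerate position
    odd = "GTGCCAGCTGCCGCGGTAATACG"       # T there, e.g. a PCR or read error
    for read in (expected, odd):
        print(starts_with_primer(read, PRIMER_M),
              starts_with_primer(read, PRIMER_N))
```

A read carrying an unexpected base at the degenerate position matches the N version but not the M version, which would be consistent with the few extra trimmed reads described above.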
Search demultiplexed paired-end sequences for adapters and remove them.
The parameter descriptions in this method are adapted from the official
cutadapt docs - please see those docs at https://cutadapt.readthedocs.io
for complete details.
Options:
--i-demultiplexed-sequences ARTIFACT PATH SampleData[PairedEndSequencesWithQuality]
The paired-end sequences to be trimmed.
[required]
--p-cores INTEGER RANGE Number of CPU cores to use. [default: 1]
--p-adapter-f MULTIPLE TEXT Sequence of an adapter ligated to the 3'
end. The adapter and any subsequent bases
are trimmed. If a `$` is appended, the
adapter is only found if it is at the end of
the read. Search in forward read. If your
sequence of interest is "framed" by a 5' and
a 3' adapter, use this parameter to define a
"linked" primer - see
https://cutadapt.readthedocs.io for complete
details. [optional]
--p-front-f MULTIPLE TEXT Sequence of an adapter ligated to the 5'
end. The adapter and any preceding bases are
trimmed. Partial matches at the 5' end are
allowed. If a `^` character is prepended,
the adapter is only found if it is at the
beginning of the read. Search in forward
read. [optional]
--p-anywhere-f MULTIPLE TEXT Sequence of an adapter that may be ligated
to the 5' or 3' end. Both types of matches
as described under `adapter` and `front` are
allowed. If the first base of the read is
part of the match, the behavior is as with
`front`, otherwise as with `adapter`. This
option is mostly for rescuing failed library
preparations - do not use if you know which
end your adapter was ligated to. Search in
forward read. [optional]
--p-adapter-r MULTIPLE TEXT Sequence of an adapter ligated to the 3'
end. The adapter and any subsequent bases
are trimmed. If a `$` is appended, the
adapter is only found if it is at the end of
the read. Search in reverse read. If your
sequence of interest is "framed" by a 5' and
a 3' adapter, use this parameter to define a
"linked" primer - see
https://cutadapt.readthedocs.io for complete
details. [optional]
--p-front-r MULTIPLE TEXT Sequence of an adapter ligated to the 5'
end. The adapter and any preceding bases are
trimmed. Partial matches at the 5' end are
allowed. If a `^` character is prepended,
the adapter is only found if it is at the
beginning of the read. Search in reverse
read. [optional]
--p-anywhere-r MULTIPLE TEXT Sequence of an adapter that may be ligated
to the 5' or 3' end. Both types of matches
as described under `adapter` and `front` are
allowed. If the first base of the read is
part of the match, the behavior is as with
`front`, otherwise as with `adapter`. This
option is mostly for rescuing failed library
preparations - do not use if you know which
end your adapter was ligated to. Search in
reverse read. [optional]
--p-error-rate FLOAT Maximum allowed error rate. [default: 0.1]
--p-indels / --p-no-indels Allow insertions or deletions of bases when
matching adapters. [default: True]
--p-times INTEGER RANGE Remove multiple occurrences of an adapter if
it is repeated, up to `times` times.
[default: 1]
--p-overlap INTEGER RANGE Require at least `overlap` bases of overlap
between read and adapter for an adapter to
be found. [default: 3]
--p-match-read-wildcards / --p-no-match-read-wildcards
Interpret IUPAC wildcards (e.g., N) in
reads. [default: False]
--p-match-adapter-wildcards / --p-no-match-adapter-wildcards
Interpret IUPAC wildcards (e.g., N) in
adapters. [default: True]
--o-trimmed-sequences ARTIFACT PATH SampleData[PairedEndSequencesWithQuality]
The resulting trimmed sequences. [required
if not passing --output-dir]
--output-dir DIRECTORY Output unspecified results to a directory
--cmd-config FILE Use config file for command options
--verbose Display verbose output to stdout and/or
stderr during execution of this action.
[default: False]
--quiet Silence output if execution is successful
(silence is golden). [default: False]
--citations Show citations and exit.
--help Show this message and exit.
Error: Missing option: --i-demultiplexed-sequences
Error: Missing option: --o-trimmed-sequences
Note: When only providing names for a subset of the output Artifacts or
Visualizations, you must specify an output directory through use of the
--output-dir DIRECTORY flag.
I added the --output-dir flag, but I am still getting this. Could anyone advise me on this? Thanks again!
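Reading the help text above, the input option is always required, while the output can be given either as --o-trimmed-sequences or through --output-dir. A small sketch of assembling the call with that check is below. The file names are hypothetical placeholders, the subcommand name `qiime cutadapt trim-paired` is assumed from the help description, and the primer sequences are the ones shown in this thread's log.

```python
from typing import Dict, List

INPUT_FLAG = "--i-demultiplexed-sequences"
OUTPUT_FLAGS = ("--o-trimmed-sequences", "--output-dir")


def build_trim_paired_cmd(options: Dict[str, str]) -> List[str]:
    """Build an argument list for the trim-paired action.

    Raises ValueError if the required input is missing, or if neither
    an explicit output artifact nor an output directory is given.
    """
    if INPUT_FLAG not in options:
        raise ValueError(f"Missing option: {INPUT_FLAG}")
    if not any(flag in options for flag in OUTPUT_FLAGS):
        raise ValueError(f"Missing option: one of {', '.join(OUTPUT_FLAGS)}")
    cmd = ["qiime", "cutadapt", "trim-paired"]  # assumed subcommand name
    for flag, value in options.items():
        cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_trim_paired_cmd({
        "--i-demultiplexed-sequences": "demux.qza",   # hypothetical file name
        "--p-front-f": "GTGCCAGCNGCCGCGGTAA",
        "--p-front-r": "GGACTACNNGGGTNTCTAAT",
        "--o-trimmed-sequences": "trimmed.qza",       # hypothetical file name
    })
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

If both "Missing option" errors still appear after adding --output-dir, it may be worth checking that the flags are spelled exactly as in the help text and that line-continuation characters in the shell command are not broken.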
Hi @thermokarst,
Thanks again for the previous reply. Could you please let me know if there is a link or tutorial on interpreting the primer/adapter trimming results? I am having a hard time understanding the output. It is very long and follows a fixed, repeating format. The only thing I can tell from the results is that my sequences do not seem to need trimming, but please correct me if I am wrong. I would also like to know whether I have missed anything. Thank you very much!
Here is an example of the results after I entered the commands mentioned above:
=== Summary ===
Total read pairs processed: 3,991
Read 1 with adapter: 3 (0.1%)
Read 2 with adapter: 0 (0.0%)
Pairs written (passing filters): 3,991 (100.0%)
Total basepairs processed: 2,003,482 bp
Read 1: 1,001,741 bp
Read 2: 1,001,741 bp
Total written (filtered): 2,003,473 bp (100.0%)
Read 1: 1,001,732 bp
Read 2: 1,001,741 bp
=== First read: Adapter 1 ===
Sequence: ATTAGANACCCNNGTAGTCC; Type: regular 3'; Length: 20; Trimmed: 0 times.
=== First read: Adapter 2 ===
Sequence: GTGCCAGCNGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 3 times.
No. of allowed errors:
0-19 bp: 0
Overview of removed sequences
length count expect max.err error counts
3 3 62.4 0 3
=== Second read: Adapter 3 ===
Sequence: TTACCGCGGCNGCTGGCAC; Type: regular 3'; Length: 19; Trimmed: 0 times.
=== Second read: Adapter 4 ===
Sequence: GGACTACNNGGGTNTCTAAT; Type: regular 5'; Length: 20; Trimmed: 0 times.
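For anyone reading a log like this, a minimal sketch of pulling out the headline numbers is below, together with a back-of-the-envelope check of the "expect" column. The calculation assumes each base matches at random with probability 1/4, which is my reading of that column rather than something stated in this thread; the log lines are copied from the output above.

```python
import re

LOG = """Total read pairs processed: 3,991
Read 1 with adapter: 3 (0.1%)
Read 2 with adapter: 0 (0.0%)
Pairs written (passing filters): 3,991 (100.0%)
"""


def parse_summary(log: str) -> dict:
    """Extract read-pair counts from the summary part of a cutadapt log."""
    def grab(label: str) -> int:
        match = re.search(rf"{re.escape(label)}:\s+([\d,]+)", log)
        if match is None:
            raise ValueError(f"label not found: {label}")
        return int(match.group(1).replace(",", ""))

    return {
        "pairs": grab("Total read pairs processed"),
        "read1_with_adapter": grab("Read 1 with adapter"),
        "read2_with_adapter": grab("Read 2 with adapter"),
        "pairs_written": grab("Pairs written (passing filters)"),
    }


def expected_random_matches(n_reads: int, match_len: int) -> float:
    """Matches of a given length expected by chance (assumed p = 1/4 per base)."""
    return n_reads * 0.25 ** match_len


if __name__ == "__main__":
    summary = parse_summary(LOG)
    print(summary)
    print(round(expected_random_matches(summary["pairs"], 3), 1))
```

Under that assumption, 3,991 reads would give about 62.4 chance matches of length 3, which agrees with the "expect" value in the log. Only 3 such matches were trimmed, all just 3 bp long, so this output looks like random hits rather than real primers, which would support the impression that these reads do not contain the primers.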
Your best option is the cutadapt documentation (https://cutadapt.readthedocs.io). We don't develop cutadapt itself, only q2-cutadapt, the QIIME 2 plugin. There should be a bit of discussion in those docs about how the program works and how to interpret the log results. Hope that helps!