Trim/trunc length for ITS

Fabs · July 18, 2018, 9:17pm

Yes, they are already demultiplexed.

Running everything now, so hopefully I won't bug you guys as much anymore. Afterwards I will submit my graphs to verify the trunc/trim parameters, just so that I can feel comfortable with what I am doing.

Fabs · July 19, 2018, 12:40am

Hi Nicholas,

So, after running cutadapt, I received the error message below. Can you please let me know if I do, in fact, need to add the T to it? and why?

WARNING:
One or more of your adapter sequences may be incomplete.
Please see the detailed output above.

=== Second read: Adapter 4 ===

Sequence: TTACTTCCTCTAAATGACCAAG; Type: regular 3'; Length: 22; Trimmed: 17038 times.

No. of allowed errors:
0-22 bp: 0

Bases preceding removed adapters:
A: 0.9%
C: 0.3%
G: 0.1%
T: 98.7%
none/other: 0.0%
WARNING:
The adapter is preceded by "T" extremely often.
The provided adapter sequence may be incomplete.
To fix the problem, add "T" to the beginning of the adapter sequence.

Fabs · July 19, 2018, 1:43am

Update: I tried adding the extra T, and it made no difference. I received the same error.

thermokarst · July 19, 2018, 3:23am

Yep!

Please copy and paste the entire error message, and please provide the complete command you ran. Thanks!

Fabs · July 19, 2018, 3:43am

I have uploaded the full output containing the errors, as it is quite long; the error occurs repeatedly with adapter #4.
qiime2_cutadapt-error_ITS-seqs.txt (117.5 KB)

Copy of the error for Adapter 4:
Sequence: TTACTTCCTCTAAATGACCAAG; Type: regular 3'; Length: 22; Trimmed: 17836 times.

No. of allowed errors:
0-22 bp: 0

Bases preceding removed adapters:
A: 0.4%
C: 0.3%
G: 0.1%
T: 99.2%
none/other: 0.0%

WARNING:
The adapter is preceded by "T" extremely often.
The provided adapter sequence may be incomplete.
To fix the problem, add "T" to the beginning of the adapter sequence.

Overview of removed sequences

len count expect max.err error counts
3 136 488.0 0 136
4 42 122.0 0 42
5 20 30.5 0 20
6 11 7.6 0 11
7 134 1.9 0 134
8 42 0.5 0 42
9 29 0.1 0 29
10 6 0.0 0 6
11 4 0.0 0 4
12 2 0.0 0 2
13 729 0.0 0 729
14 13 0.0 0 13
15 27 0.0 0 27
16 10 0.0 0 10
17 1 0.0 0 1
18 73 0.0 0 73
20 12 0.0 0 12
21 9 0.0 0 9
22 2 0.0 0 2
25 24 0.0 0 24
26 11 0.0 0 11
27 59 0.0 0 59
28 41 0.0 0 41
29 2 0.0 0 2
30 5 0.0 0 5
31 1 0.0 0 1
32 516 0.0 0 516
33 13 0.0 0 13
36 41 0.0 0 41
37 54 0.0 0 54
39 3 0.0 0 3
40 128 0.0 0 128
41 37 0.0 0 37
42 14 0.0 0 14
43 111 0.0 0 111
44 5 0.0 0 5
45 250 0.0 0 250
46 69 0.0 0 69
47 22 0.0 0 22
48 58 0.0 0 58
49 4 0.0 0 4
50 5 0.0 0 5
51 13 0.0 0 13
52 5 0.0 0 5
53 22 0.0 0 22
54 88 0.0 0 88
55 185 0.0 0 185
56 22 0.0 0 22
57 294 0.0 0 294
58 147 0.0 0 147
59 35 0.0 0 35
60 702 0.0 0 702
61 194 0.0 0 194
62 240 0.0 0 240
63 18 0.0 0 18
64 4296 0.0 0 4296
65 77 0.0 0 77
66 440 0.0 0 440
67 443 0.0 0 443
68 540 0.0 0 540
69 64 0.0 0 64
70 23 0.0 0 23
71 27 0.0 0 27
72 39 0.0 0 39
73 450 0.0 0 450
74 98 0.0 0 98
75 139 0.0 0 139
76 874 0.0 0 874
77 69 0.0 0 69
78 961 0.0 0 961
79 1138 0.0 0 1138
80 251 0.0 0 251
81 448 0.0 0 448
82 320 0.0 0 320
83 114 0.0 0 114
84 28 0.0 0 28
85 120 0.0 0 120
86 58 0.0 0 58
87 2 0.0 0 2
89 1 0.0 0 1
90 7 0.0 0 7
91 7 0.0 0 7
92 9 0.0 0 9
93 6 0.0 0 6
94 20 0.0 0 20
95 17 0.0 0 17
96 25 0.0 0 25
98 2 0.0 0 2
99 1 0.0 0 1
100 164 0.0 0 164
101 8 0.0 0 8
102 50 0.0 0 50
103 2 0.0 0 2
104 440 0.0 0 440
105 285 0.0 0 285
106 14 0.0 0 14
107 151 0.0 0 151
108 1 0.0 0 1
110 18 0.0 0 18
111 5 0.0 0 5
113 2 0.0 0 2
114 6 0.0 0 6
116 9 0.0 0 9
118 2 0.0 0 2
121 41 0.0 0 41
122 22 0.0 0 22
123 3 0.0 0 3
124 46 0.0 0 46
125 6 0.0 0 6
126 4 0.0 0 4
127 3 0.0 0 3
129 3 0.0 0 3
131 36 0.0 0 36
132 1 0.0 0 1
133 35 0.0 0 35
134 212 0.0 0 212
135 123 0.0 0 123
136 90 0.0 0 90
137 24 0.0 0 24
138 85 0.0 0 85
139 3 0.0 0 3
140 21 0.0 0 21
141 14 0.0 0 14
142 22 0.0 0 22
144 2 0.0 0 2
149 1 0.0 0 1
150 28 0.0 0 28
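
For readers wondering about the `expect` column in the table above: it appears to estimate how many reads would match the first `len` bases of the adapter purely by chance. The sketch below is a minimal illustration, assuming uniformly random bases, zero allowed errors, and a read count of 31,232 that is inferred from the table itself (488.0 × 4^3) rather than taken from the log. `expected_random_matches` is a hypothetical helper, not part of cutadapt.

```python
def expected_random_matches(n_reads: int, match_len: int) -> float:
    """Expected number of reads matching `match_len` adapter bases by chance.

    Assumes each base is A, C, G or T with probability 1/4 and that no
    mismatches are allowed, so one position matches with probability 1/4.
    """
    return n_reads / 4 ** match_len


# Assumption: inferred from the first table row (488.0 * 4**3), not read from the log.
N_READS = 31_232

for length in (3, 4, 5, 6, 7):
    print(length, round(expected_random_matches(N_READS, length), 1))
```

Under these assumptions, the 136 matches of length 3 are fewer than the roughly 488 expected by chance, whereas the 134 matches of length 7 far exceed the roughly 1.9 expected, which suggests that the longer matches are real adapter hits.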

WARNING:
One or more of your adapter sequences may be incomplete.
Please see the detailed output above.

thermokarst · July 19, 2018, 12:26pm

Thanks for sending that, @Fabs! Not to put too fine a point on things, but I don't see an error, just a warning message from cutadapt - in fact, there is a success save message at the bottom of your log:

Saved SampleData[PairedEndSequencesWithQuality] to: /media/sf_Desktop/NGSPractice/cassava-18-paired-end-demultiplexed-1/trimmed_sequences.qza

That sounds like things technically worked to me. Now, the warning message from cutadapt:

WARNING:
    One or more of your adapter sequences may be incomplete.
    Please see the detailed output above.

This sounds like cutadapt is trying to be helpful. You mentioned the warning is still there when you add the T - what does the new warning say? I imagine you could keep adding nucleotides until the warning goes away, but really, you or your sequencing center would know best which barcodes and primers are still present in the reads - cutadapt is just trying to warn you that you might not have trimmed everything.

Fabs · July 19, 2018, 6:34pm

I see, sorry I guess I took the warning as an error since not everything was "removed". After adding the extra T, the exact same warning came up, suggesting I add another T.

So you think I should contact them and see what they think? Or could I assume they are technically okay, and this will be removed further down the pipeline when I trim/trunc?

Once again, thank you

Nicholas_Bokulich · July 19, 2018, 6:50pm

No need to contact them; the warning is just a suggestion. You know what your adapters were, so you have much better information than the computer in this case.

Yes, everything is okay. That warning makes complete sense here — the 16S rRNA gene is fairly well conserved and the primers are nestled in highly conserved sites so having highly conserved bases next to the primer/adapter sequence is not a surprise.

As far as I know, cutadapt is not specific to microbiome studies, so this warning may be there for more general uses, e.g., whole genome sequencing where highly conserved bases next to the adapter would rightly set red flags flying!
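
To make the logic behind the warning concrete, here is a minimal sketch of the kind of check that could produce it: tally the base immediately preceding each removed adapter and flag the case where one base dominates. The function name `preceding_base_report` and the 0.8 cutoff are assumptions for illustration only; this is not cutadapt's actual code or threshold.

```python
from collections import Counter

# Assumption: 0.8 is an illustrative cutoff, not necessarily the value cutadapt uses.
DOMINANCE_THRESHOLD = 0.8


def preceding_base_report(preceding_bases, threshold=DOMINANCE_THRESHOLD):
    """Return (fractions per base, dominant base or None).

    `preceding_bases` holds the single base found just before each
    removed adapter, one entry per trimmed read.
    """
    counts = Counter(preceding_bases)
    total = sum(counts.values())
    if total == 0:
        return {}, None
    fractions = {base: n / total for base, n in counts.items()}
    top_base, top_fraction = max(fractions.items(), key=lambda item: item[1])
    return fractions, (top_base if top_fraction > threshold else None)


# Hypothetical sample of 1,000 trimmed reads mirroring the percentages in the log.
example = ["T"] * 992 + ["A"] * 4 + ["C"] * 3 + ["G"] * 1
fractions, dominant = preceding_base_report(example)
print(fractions, dominant)
```

If the bases upstream of the primer are themselves conserved, prepending one base only shifts the check to the next conserved position, which would be consistent with the warning reappearing after a T was added.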

I hope that helps!

Fabs · July 20, 2018, 12:52am

Perfect, I am actually working with the ITS1F/ITS2 regions.

As previously mentioned, I wanted to run the graphs by you guys before choosing a trunc/trim length, just to feel better about my decision. I have attached the information below and would really love some input.

Based on these graphs and the Phred scores, if I am understanding them correctly, I chose the following values. I feel more confident about my forward reads, but I am not 100% confident about my reverse reads. The values are dropping, but at a steady pace up until ~298, so I want to truncate somewhere between 290 and 298, but I don't know if I need to be more conservative.
Again, your input is highly appreciated.

Forward
--p-trim-left = 6
--p-trunc-len = 299

Reverse
--p-trim-left = 6
--p-trunc-len = 290-298?

OUTPUT SUMMARY.QZV
Demultiplexed sequence counts summary

Minimum: 1 (only 1 sample; the next minimum is 4844)
Median: 19601.0
Mean: 21193.5654206
Maximum: 55062
Total: 4535423

Forward reads

Zoom in on forward reads

Reverse reads

Zoom in on reverse reads

Thanks again

Fabs · July 20, 2018, 6:45am

Hi Nicholas,

Additional question.

I decided to run the trunc/trim as 6/299 for the forward reads and 6/290 for the reverse reads, and after looking at table.qzv, I noticed that I have terribly low frequencies (see below).

I am currently running two other sets of parameters to see if maybe I needed to trim more to remove added noise, but in the meantime I wanted to ask if you had any idea of what this could mean.

Note: I read another post stating that, for ITS data, I would run filter-table-feature-filter after DADA2. Is this correct, and would this be okay to do with my samples? Links here (ITS) and here (using filter table-feature-filter on low frequency table.qzv).

Based on this information, could you tell me the best step for me to take and/or refer me to another link?

Nicholas_Bokulich · July 20, 2018, 12:13pm

oh right of course! reading/writing too fast.

same idea, though: the ITS primers are couched in highly conserved sections of 18S and 5.8S rRNA genes that flank the ITS.

Those look like very well-chosen settings, given the high quality of your data.

There is no reason to do that with your workflow. That link you provided was specific to that user's workflow — they are exporting their data and running through an external program before re-importing, so the filtering was done to remove dropped features if I understand correctly. (and an ITSx-like method ITSxpress is also available in a 3rd-party plugin now so there is no reason to follow that workflow any more).

could you please share the actual qzv files? this will be easier to peruse. Please also send a QZV of the dada2 stats. I suspect you probably have a read merging issue; the second link you sent above is probably the most relevant for troubleshooting, though your data quality looks very high.

One problem is that (if I recall correctly) ITS1F (fungi-specific, I'm assuming F does not just stand for "forward" and this is the standard ITS1 primer) sits pretty far back in the 18S, so your pairs might not be quite long enough to bridge the full ITS region. Do you want to figure out the expected amplicon length?
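
As a rough way to reason about this, the sketch below checks two things for a given amplicon length: whether the truncated forward and reverse reads still overlap enough to merge, and which reads survive truncation at all (QIIME 2 documents that reads shorter than the trunc-len value are discarded). The helper names are hypothetical, the minimum overlap of 12 is my assumption about the DADA2 default, and the read lengths in the example are made up for illustration.

```python
# Assumption: 12 nt is taken here as DADA2's default minimum overlap for merging.
DEFAULT_MIN_OVERLAP = 12


def merge_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Bases shared by the forward and reverse reads after truncation.

    A read cannot cover more than the amplicon itself, so each
    truncation length is capped at the amplicon length.
    """
    covered_f = min(trunc_len_f, amplicon_len)
    covered_r = min(trunc_len_r, amplicon_len)
    return covered_f + covered_r - amplicon_len


def can_merge(trunc_len_f, trunc_len_r, amplicon_len,
              min_overlap=DEFAULT_MIN_OVERLAP) -> bool:
    """True if the truncated pair overlaps by at least `min_overlap` bases."""
    return merge_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap


def surviving_reads(read_lengths, trunc_len: int):
    """Read lengths kept by truncation: reads shorter than trunc_len are dropped."""
    return [n for n in read_lengths if n >= trunc_len]


# With 299/290 truncation, amplicons up to 577 nt still leave 12 nt of overlap.
print(can_merge(299, 290, 577), can_merge(299, 290, 600))

# Hypothetical read lengths after 3' primer removal from short amplicons.
print(surviving_reads([230, 228, 300, 231], trunc_len=299))
```

The second check matters for ITS in particular: if many amplicons are shorter than the read length, removing the 3' primer leaves reads shorter than a trunc-len of 299, and those reads would be dropped before merging is even attempted.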

Adam_Rivers · July 20, 2018, 1:42pm

ITSxpress is now available as a QIIME 2 plugin.

To clarify, ITSxpress is not the same software as ITSx.

The two programs work a bit differently. ITSxpress merges reads and clusters them temporarily at 99.5% identity by default (exact dereplication is also an option) to identify the start and stop sites. Those sites are then applied to all reads in the cluster. Importantly, it yields FASTQ files that can be used by Deblur, DADA2, etc., rather than FASTAs.

On a typical 4-core machine, ITSxpress is about 23X faster than ITSx for an ITS2 soil sample and 14X faster for an ITS1 soil sample. With the default 99.5% clustering, ITSx and ITSxpress trim 99.8% of ITS1 sequences and 99.1% of ITS2 sequences within 2 bases of each other. With 100% clustering, ITSxpress is 6-9X faster and trims 99.99% of ITS1 and 99.86% of ITS2 sequences within 2 bases of each other.

Nicholas_Bokulich · July 20, 2018, 2:45pm

Thanks @Adam_Rivers! I was sort of aware from reading the github page that ITSxpress is intended for fastq data and does much more than ITSx but I did not know how much more! I have edited my response above.

Fabs · July 20, 2018, 8:12pm

Hi Nicholas,

I am attaching the QZV files requested, the demux, trimmed and DADA2.

demultiplexed Illumina files
demux-summary.qzv (300.8 KB)

Trimmed summary (removal of 3' end primers)
trimmed_summary.qzv (306.2 KB)

DADA2 qzv files using 6/299 and 6/290 trim parameters.
table.qzv (375.6 KB)
rep-seqs.qzv (355.0 KB)
denoising-stats.qzv (1.2 MB)

How can I figure out the expected amplicon length, and would this then be my trunc length? I do have an article, here, which gives an amplicon length of ~230 bp.

Thank you again

Fabs · July 20, 2018, 8:17pm

Thank you for the update, I wasn't sure if I needed to do this, since the tutorial did not mention it, but thanks to your response and Nicholas I know better now.

Nicholas_Bokulich · July 20, 2018, 8:27pm

it looks like the denoising stats did not attach properly (I cannot view). Could you try re-sending or upload to dropbox?

It looks like you are running an earlier version of QIIME 2, and so you don't have the length summaries in demux-summary.qzv. Maybe you found another way to sort out your earlier question, but in case you were wondering, I just wanted to clarify that my advice about length distributions is based on the most recent release of QIIME 2, when this feature was added: