Merging (or not) ITS2 amplicons with variable length

Hi!
I am trying to analyze the fungal amplicons, but I have troubles to find an optimal way to do it.
The primers used - ITS86F/ITS4
MiSeq 2x300
(qiime2-2018.11, conda installation)

In my pipeline I import the sequences, then cut the primers ensuring that for the shorter amplicons the non-biological ends are cut off. Then I have tried either using ITSxpress, as well as without, then merging the files with the trunc-len 0, but adding the --p-trunc-q 20 (or without). With each variation I get different number of reads kept in different stages (denoising/ merging).
In the data there can be sequences that are not merging as the amplicons were too long so there's no overlap. I have yet not found any example in my data, but theoretically this can be the case.
First question : Is there any possibility to get from dada2 denoise paired as an output as well files with reads that didn't merge (so that for the longer amplicons only forward read could be used)?

When I compare the percentage of reads kept after denoising (see attached file), it looks like some samples loose many sequences when denoising, others only when merging.
Test_ITSX_cut_denoise.tsv (2.5 KB)

Sample LB145 gets better merged when the --p-trunc-q 20 is not included, which makes me think that maybe there are many longer amplicons and with the quality trimming they loose the necessary overlap, but when I look at the taxonomy file from the only forward file (where most reads are kept), then the major taxa is Wallemia muriae, which has rather short ITS2 and should have a good overlap.
https://www.dropbox.com/s/1pg4sfhudszercv/LB145.qza?dl=0

I was hoping that someone with more experience could have a look and help to choose the best strategy or suggest more parameters or techniques to try.

Thank you in advance!!!

qiime cutadapt trim-paired
--i-demultiplexed-sequences seq_test.qza
--p-adapter-f CATATCAATAAGCGGAGGA
--p-adapter-r TCAAAGATTCGATGATTCAC
--p-cores 40
--verbose
--o-trimmed-sequences cut_seq_test.qza

cut_seq_test.qzv (297.3 KB)

qiime dada2 denoise-single
--i-demultiplexed-seqs cut_seq_test.qza
--p-trunc-len 0
--p-n-threads 40
--verbose
--output-dir dada2out_cut_Forw

cd dada2out_cut_Forw

qiime metadata tabulate
--m-input-file denoising_stats.qza
--o-visualization denoising_stats.qzv

qiime dada2 denoise-paired
--i-demultiplexed-seqs cut_seq_test.qza
--p-trunc-len-f 0
--p-trunc-len-r 0
--p-n-threads 40
--verbose
--output-dir dada2out_cut

cd dada2out_cut

qiime metadata tabulate
--m-input-file denoising_stats.qza
--o-visualization denoising_stats.qzv

qiime dada2 denoise-paired
--i-demultiplexed-seqs cut_seq_test.qza
--p-trunc-len-f 0
--p-trunc-len-r 0
--p-trunc-q 20
--p-n-threads 40
--verbose
--output-dir dada2out_cut_Q20

cd dada2out_cut_Q20

qiime metadata tabulate
--m-input-file denoising_stats.qza
--o-visualization denoising_stats.qzv

qiime itsxpress trim-pair-output-unmerged
--i-per-sample-sequences cut_seq_test.qza
--p-region ITS2
--p-taxa F
--p-threads 50
--verbose
--o-trimmed ITSX_seq_test.qza

qiime dada2 denoise-paired
--i-demultiplexed-seqs ITSX_seq_test.qza
--p-trunc-len-f 0
--p-trunc-len-r 0
--p-n-threads 40
--verbose
--output-dir dada2out_ITSX

cd dada2out_ITSX

qiime metadata tabulate
--m-input-file denoising_stats.qza
--o-visualization denoising_stats.qzv

qiime dada2 denoise-paired
--i-demultiplexed-seqs ITSX_seq_test.qza
--p-trunc-len-f 0
--p-trunc-len-r 0
--p-trunc-q 20
--p-n-threads 40
--verbose
--output-dir dada2out_ITSX_Q20

cd dada2out_ITSX_Q20

qiime metadata tabulate
--m-input-file denoising_stats.qza
--o-visualization denoising_stats.qzv

Good morning,

Thank you for your detailed description of your primers and your analysis strategy. Processing ITS data is especially hard, as the variable length of the region causes unexpected changes at different stages of the pipeline.

Let's start with the first question:

When I read the documentation for daad2 denoise paired, I don't see this option, so no this can't be done through Qiime 2. Maybe it could be done by using dada2 directly in R...

Speaking of using dada2 directly, have you read the DADA2 ITS workflow recommended by the developer? I think comparing your pipelines to his would be a really good place to start.
https://benjjneb.github.io/dada2/ITS_workflow.html


Now let's address your main question.

Now that is a very good question.

While we can compare different pipelines and their different results, the only way to choose the best strategy is if you have a positive control that is a mock community with a known composition. Then the best strategy is the one that most closely matches your expected result.

Without a mock community as a positive control, we are just comparing apples and oranges. :apple: :orange_heart:

Colin

1 Like

Hi Colin,

Thank you for your answer.
In the same run there were indeed some mock samples as well. I added them now to the previously tested samples. The mock community was made by mixing together in equal cell number 13 different yeast and extracting the DNA with the same kit as used for the rest of the samples.
The amplicon length for the mock community is ranging between 150 to 369 bp - so should normally be well covered by MiSeq 2x300 reads.
The highest percentage of reads passing for mock community are coming through when processing with ITSX and then adding the --p-trunc-q 20 when denoising. When looking at where the reads map though and comparing to the one where only forward reads are used - then the species with the longest amplicon (369) is the most abundant in forward only and almost nonexistent when using ITSX and later the --p-trunc-q 20. When the ITSX is used and --p-trunc-q 20 is not used in denoising, then the longest amplicon does appear in the table, but 10 times less abundantly compared to forward only.
I am considering to do all the analyses with forward only and then as well ITSX and without the extra --p-trunc-q 20 in denoising - and compare how much the end results differ. With forward only I think many more low-quality reads will pass and I will through away half of the data (reverse reads). While with merging there will be much less reads that will be in the final dataset and if already 369 bp amplicon seems to be long for merging ( the quality score average drops under 20 after 190 bp for the reverse reads), then I am not sure if I can trust the results.

Any extra advise on ITS/ variable length read analyses will be highly appreciated!
Thank you!

2 Likes

Hello,
I’m doing something similar but using fITS7/ITS4 which has an average amplicon size in my data of 268 bp (without primers) but ranges from under 200 to over 400 bp. So my reads should merge if the quality is good enough. I found using a --p-trunc-q 10 gave me the most reads although not much less than q 20. It took me a while to realise that the default --p-max-ee is 2, which means without any q filtering, all reads with expected errors over 2 (a lot of the reverse reads) will be removed. So the --p-trunc-q is applied first, removing those really bad end bits, and then any remaining with --p-max-ee is 2 are then removed (much less after this step). However, then the reads will become shorter meaning less will merge. I still get about 80% which I’m happy with as I know i’m not excluding over a certain size (just random low quality sequences throughout). But this could be a problem if you have a large amplicon and you could be excluding the larger end of species. Yours might just be a bit too large that you lose a lot at the merge stage. I would just stick with forward reads for your data. I’m also interested in ITSX and got similar read numbers (44k v 48k) using q 10. I ended up with a few more ASVs but I’m not sure how to really check which is best as I don’t have a mock community, so I’d be interested to find out how you get along!

3 Likes

Hello @rahel_park

:clap: :1st_place_medal: :fireworks: Well done! I'm so glad you included that control. This let's us answer LOTS of important questions.

This is one of the standing challenges with ITS reads. The variable length region can confuses pipelines that expected a constant length region like v4 16S.

This means that some reads can overlap by (300x2 - 369) = 231 overlap or totally overlap with overhang on both sides. This overhang with +100% overlapping reads can also confuse some joing steps. I'm not sure how well dada2 works with fully overlapping reads.

I wonder if this is due to the trunc option clipping the reads, so that they don't join and don't appear in the data. But even if you don't truncate your reads, they might still not join!! As I discovered here, the qiime2 dada2 plugin expects zero missmatches in the reads when joining. This would means a drop in quality could prevent joining, with or without the trunc option. :man_shrugging:

Maybe we should change the dada2 default to allow more than one missmatch so reads can actually join...


I really like this idea. You can use the mock community to choose settings that fully capture the diversity in your samples.

Let me know what you find!

Colin

1 Like

Hello @SarahH,

Thanks for qiime-ing in! This is definitely an issue that many people face with variable length ITS regions, especially regions that don't overlap very much.

I think you may have also get caught by the dada2 default to only merge reads that match exactly, but it's hard to know.

Good to mention the --p-max-ee flag. Outside any issues with joining, this flag is also likely to remove a lot of reads, especially if low quality ends have not been trimmed off.


:+1: I second this recommendation. But of course, it's best to try it both ways.

Colin

Hi @SarahH
The --p-trunc-q 10 made some miracles!!! After this all the test samples have 80-90 % reads that merge in the end! (Before in each different approach some samples were around 10%, most around 50-70%)
It’s good to have the mock community, but mine is not that perfect as I am not sure what is the most closest to the true result - the bias is introduced both by extraction kit as well as the primers. Another issue is that the mock community composition is not that representative of sample composition… In general the --p-trunc-q 10 gives me quite similar results for the mock community as just using the forward reads.
Thank you so much for the tip!!!

3 Likes

Thanks for the advice!
I was a bit surprised to read that dada2 might have as well issues with fully overlapping reads… It looks like many of my reads will be fully overlapping… And I guess it would be good to have an option to allow 1-2 mismatches in the read-joining - especially if there are very long overlapping regions, this would seem reasonable (maybe something incremental would be ideal - like 0 mismatch if the overlapping region is 10-20 bp, 1 mismatch for 20-40 bp overlap, etc…)
I am going to play around with the data a bit more and let you know what works the best. The --p-trunc-q 10 addition to denoise seems very promising :slight_smile:

2 Likes

Hi!

I tried to use just the forward reads for my fungi, but it doesn't work with the reads passed through ITSX (see error below). If I export the reads (remove the manifest file just in case) and reimport, then it works, as well as the reads that have been just passed though cutadapt only...
Does something change in the format or what goes wrong here?
Thanks in advance!

qiime dada2 denoise-single \

--i-demultiplexed-seqs ITSX_seq_$run.qza
--p-trunc-len 0
--p-n-threads 40
--verbose
--output-dir dada2out_ITSX_Forw_$run
Traceback (most recent call last):
File "/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2cli/commands.py", line 274, in call
results = action(**arguments)
File "", line 2, in denoise_single
File "/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 225, in bound_callable
spec.view_type, recorder)
File "/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/result.py", line 287, in _view
result = transformation(self._archiver.data_dir)
File "/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/core/transform.py", line 70, in transformation
new_view = transformer(view)
File "/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_types/per_sample_sequences/_transformer.py", line 98, in _5
absolute=False)
File "/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_types/per_sample_sequences/_transformer.py", line 158, in _parse_and_validate_manifest
_validate_paired_end_fastq_manifest_directions(manifest)
File "/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_types/per_sample_sequences/_transformer.py", line 240, in _validate_paired_end_fastq_manifest_directions
% ', '.join(forward_but_no_reverse))
ValueError: Forward and reverse reads must be provided exactly one time each for each sample. The following samples had forward but not reverse read fastq files: Blank23, SRA62, SR194, SR55, SRA75, SR224, SR228, Blank21, Blank28, SB37, SB104, SR234, SR384, SB208, SR404, SRA57, SB83, SRA49, SB214, SB133, SB310, SB126, SR245, SB154, SB171, SR191, SRA53, SRA52, SR108, Blank25, SRA38, SRA68, SB20, SRA39, SB72, SB46, SB94, SR440, SR219, Blank32, SB75, SRA21, SB283, SB202, SB145, SR351, SR412, Blank31, SR472, SRA61, SB44, SR307, SB80, SR352, SR426, SR413, SR241, SB109, Blank34, SB118, SB33, SR389, SR435, SB128, SR79, SRA31, Blank33b, SB98, SB14, SB304, SR265, SR248, MBmock1, SB262, SRA28, SRA36, SR255, SB264, SR460, SB92, SRA58, SR281, FNegative2, SRA59, SR317, SB276, SRA20, SB27, Blank33c, SB227, SRA14, MBmock2, SB303, SB279, FNegative3, SB66, SR491, SRA11, SB228, SR147, SR167, SR229, SRA73, SB105, SB84, SB141, SB91, SR256, SB41, Blank26, SB70, SR56, SR453, SRA51, SRA74, SB106, SB140, SB267, SB73, SR65, SB257, SR486, SRA67, SB107, SRA69, SRA29, SB224, SB100, SB139, SB8, SB307, SB152, SB82, SR174, SR179, SR468, MBmock4, SR100, Blank35, SR223, SB309, SR464, SB211, SB45, FNegative1, SR368, SR252, SRA44, SB49, SR312, SB149, SB108, SRA63, Blank24, SRA77, SRA23, SRA25, SB17, SB170, SB6, SR64, SRA46, SRA47, SB114, SB122, SR479, SB230, SB112, SB78, SB81, SR150, SR354, SR395, SB19, Blank27, SB212, Blank36, SB198, SB103, SB43, SB113, SB64, SRA76, SRA17, SR221, SRA65, Blank29, Blank30, SB115, SB260, SR231, SB77, SRA60, SB12, SRA48, SR411, SB89, Blank22, SB131, SB144, FNegative4, SB56, SB71

Plugin error from dada2:

Forward and reverse reads must be provided exactly one time each for each sample. The following samples had forward but not reverse read fastq files: Blank23, SRA62, SR194, SR55, SRA75, SR224, SR228, Blank21, Blank28, SB37, SB104, SR234, SR384, SB208, SR404, SRA57, SB83, SRA49, SB214, SB133, SB310, SB126, SR245, SB154, SB171, SR191, SRA53, SRA52, SR108, Blank25, SRA38, SRA68, SB20, SRA39, SB72, SB46, SB94, SR440, SR219, Blank32, SB75, SRA21, SB283, SB202, SB145, SR351, SR412, Blank31, SR472, SRA61, SB44, SR307, SB80, SR352, SR426, SR413, SR241, SB109, Blank34, SB118, SB33, SR389, SR435, SB128, SR79, SRA31, Blank33b, SB98, SB14, SB304, SR265, SR248, MBmock1, SB262, SRA28, SRA36, SR255, SB264, SR460, SB92, SRA58, SR281, FNegative2, SRA59, SR317, SB276, SRA20, SB27, Blank33c, SB227, SRA14, MBmock2, SB303, SB279, FNegative3, SB66, SR491, SRA11, SB228, SR147, SR167, SR229, SRA73, SB105, SB84, SB141, SB91, SR256, SB41, Blank26, SB70, SR56, SR453, SRA51, SRA74, SB106, SB140, SB267, SB73, SR65, SB257, SR486, SRA67, SB107, SRA69, SRA29, SB224, SB100, SB139, SB8, SB307, SB152, SB82, SR174, SR179, SR468, MBmock4, SR100, Blank35, SR223, SB309, SR464, SB211, SB45, FNegative1, SR368, SR252, SRA44, SB49, SR312, SB149, SB108, SRA63, Blank24, SRA77, SRA23, SRA25, SB17, SB170, SB6, SR64, SRA46, SRA47, SB114, SB122, SR479, SB230, SB112, SB78, SB81, SR150, SR354, SR395, SB19, Blank27, SB212, Blank36, SB198, SB103, SB43, SB113, SB64, SRA76, SRA17, SR221, SRA65, Blank29, Blank30, SB115, SB260, SR231, SB77, SRA60, SB12, SRA48, SR411, SB89, Blank22, SB131, SB144, FNegative4, SB56, SB71

See above for debug info.

Hello Rahel,

Here is the core of the error:

ValueError: Forward and reverse reads must be provided exactly one time each for each sample. The following samples had forward but not reverse read fastq files:

So, it looks like dada2 expects paired reads, which is strange because you are using denoise-single! I'm not sure what's going on there. Let's see what the qiime devs recommend. @Nicholas_Bokulich

Colin

Good sluething @colinbrislawn! The error is actually coming from q2-types, the format validation is making this complaint. @SarahH - it looks like you imported as SampleData[PairedEndSequencesWithQuality], when in reality you probably meant SampleData[JoinedSequencesWithQuality]. What does the following tell you:

qiime tools peek ITSX_seq_$run.qza
1 Like

Hi!
So qiime tools peek ITSX_seq_$run.qza gives:
UUID: feb98447-3069-417d-a2f6-b5c34ec4c270
Type: SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt

I do the ITSXpress on the file that has gone though cutadapt. This file doesn’t give errors with denoise-single.

qiime tools peek cut_seq_5.qza
UUID: ef036045-13cb-4d8f-a07f-c06c0e786000
Type: SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt

For itsxpress I use the command:
qiime itsxpress trim-pair-output-unmerged

The output (ITSX_seq_$run.qza) is fine for dada2 denoise-paired, but not for the denoise-single…

And in addition this file doesn’t work with summarize either:
qiime demux summarize
–i-data ITSX_seq_$run.qza
–o-visualization ITSX_seq_$run.qzv
Plugin error from demux:

list index out of range

Debug info has been saved to /home/rahel/TMPDIR_class/qiime2-q2cli-err-znnfgaka.log

cat /home/rahel/TMPDIR_class/qiime2-q2cli-err-znnfgaka.log
Traceback (most recent call last):
File “/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2cli/commands.py”, line 274, in call
results = action(**arguments)
File “”, line 2, in summarize
File “/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py”, line 231, in bound_callable
output_types, provenance)
File “/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/qiime2/sdk/action.py”, line 424, in callable_executor
ret_val = self._callable(output_dir=temp_dir, **view_args)
File “/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_demux/_summarize/_visualizer.py”, line 158, in summarize
for file in link]
File “/home/rahel/miniconda3/envs/qiime2-2018.11/lib/python3.5/site-packages/q2_demux/_summarize/_visualizer.py”, line 158, in
for file in link]
IndexError: list index out of range

What could be happening with these files?

As I mentioned above, it looks like you don't actually have paired-end data anymore, but rather, joined-reads --- the semantic type you imported as is technically incorrect.

The visualizer is reading the semantic type of the file (in this case SampleData[PairedEndSequencesWithQuality]), but, the data contained within is actually single-ended (as in, joined paired-end reads, which boils down to dual orientation single-end reads), so the visualizer is breaking because it is looking for the files that contain the reverse reads, but can't find them.

1 Like

I went back and looked at the results you included in your original post - what does the contents of your manifest (ITSX_den_test.txt) look like? Are there any reverse reads specified?

Would you be able to share ITSX_seq_$run.qza (you can DM a link in private if you prefer) - it would be really helpful for us to look at the provenance, to help make sense of exactly what was done. Thanks!

1 Like

Yes, in the manifest files I have specified both forward and reverse reads.
I am a bit confused - if the reads are already joined then how the denoise-paired still work on it? And if I export the ITSxpress passed qza, then there are both R1 and R2 present... And the command for itsxpress starts "qiime itsxpress trim-pair-output-unmerged".

Here's the ITSX_seq_test.qza Dropbox - Error - Simplify your life
It's been treated the same, just less samples... gives the same errors..

Thank you for looking into it!!

2 Likes

Aha, I think we found the problem @rahel_park! It looks like pre-joined reads maybe aren't the case here, my mistake!

It looks like maybe q2-itsxpress (cc @Adam_Rivers) cooked up an invalid SampleData[PairedEndSequencesWithQuality] QZA. Check out the screenshot of the extracted archive, after being run through q2-itsxpress, as well as the MANIFEST (a file used internally by QIIME 2 to provide some bookkeeping about the files):

If you notice here, the MANIFEST is missing the reverse read entries. When you run dada2 denoise-single, an extra validation step is called when "viewing" the paired-end data as single-end data --- this validation step notices the missing reverse reads in the MANIFEST file.

One immediate option for you to move forward is export the file, delete MANIFEST and metadata.yml, and re-import the data. This will break provenance, but, will allow you to keep on moving.


Looking at the q2-itsxpress code, it looks like the MANIFEST writer currently ignores reverse reads when generating the MANIFEST. An easy shortcut here @Adam_Rivers is to CasavaOneEightSingleLanePerSampleDirFmt, instead of SingleLanePerSamplePairedEndFastqDirFmt. Doing this, you don't need to worry about building the MANIFEST file - that format knows how to build its own, based on the files present. An example of this in action can be found in q2-cutadapt. @Adam_Rivers - I am happy to write up an issue about this if you wish, just let me know!

2 Likes

Thanks for identifying problem!
I will export and re-import the files then.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.