itsxpress does not extract sequences where ITSx does

ewmorr · July 23, 2019, 8:16pm

I'm having a problem with itsxpress qiime2 plugin wherein no sequences are produced. This may be related to the questions here and here .

q2-itsxpress plugin AND standalone itsxpress produce no sequences in output file. However, replicating the analysis using vsearch and ITSx finds viable ITS2 sequences for 100% of vsearch merged and dereplicated sequences.

I've also previously processed this data using a separate pipepline (USEARCH/UPARSE and ITSx) and gotten reasonable results in terms of expected community membership (our target fungi were there)

Here is the workflow I used to run itsxpress within qiime2, and then confirming the results using standalone itsxpress and standalone vsearch and ITSx

Import data. These are two files from one sample that have already been trimmed of primers and filtered for expected errors.


qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path itsxpress_test \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path test.qza

qiime tools peek test.qza

qiime demux summarize --i-data test.qza --o-visualization test.qzv

Run itsxpress plugin


qiime itsxpress trim-pair-output-unmerged --i-per-sample-sequences test.qza --p-region ALL --p-taxa ALL --o-trimmed itsx_trimmed.qza

qiime demux summarize --i-data itsx_trimmed.qza --o-visualization itsx_trimmed.qzv --verbose

demux summarize gives an error.

Plugin error from demux:

Cannot describe a DataFrame without columns

See above for debug info.

There are no reads in the files.


tar -xf itsx_trimmed.qza

gunzip -c 05871b17-6917-425a-8b08-22e751e3bb55/data/test_1_L001_R1_001.fastq.gz | wc -l
	0
gunzip -c 05871b17-6917-425a-8b08-22e751e3bb55/data/test_1_L001_R2_001.fastq.gz | wc -l
	0

However, I have previously processed this dataset using a combination of USEARCH, and standalone ITSx. I know from this analysis that there are valid fungal ITS2 sequences in this data set. To confirm this I ran through standalone itsxpress, and standalone vsearch+ITSx to replicate the itsxpress process.

standalone itsxpress


itsxpress --fastq itsxpress_test/test_1_L001_R1_001.fastq.gz --fastq2 itsxpress_test/test_1_L001_R2_001.fastq.gz --outfile test_ITSx_R1.fastq --outfile2 test_ITSx_R2.fastq --log test_ITSx.log --region ALL --taxa 'Fungi'

output

Pairs:               	8912
Joined:              	8567       	96.129%
Ambiguous:           	342       	3.838%
No Solution:         	3       	0.034%
Too Short:           	0       	0.000%

Avg Insert:          	333.2
Standard Deviation:  	9.2
Mode:                	328

Insert range:        	280 - 426
90th percentile:     	346
75th percentile:     	338
50th percentile:     	330
25th percentile:     	328
10th percentile:     	325

...

Dereplicating 100%
Sorting 100%
1569 unique sequences, avg cluster 5.5, median 1, max 1281
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%

2019-07-23 11:29:39,354: INFO     Searching for ITS start and stop sites using HMMSearch. This step takes a while.
2019-07-23 11:29:41,269: INFO     Parsing HMM results.
2019-07-23 11:29:41,484: INFO     Writing out sequences
2019-07-23 11:29:42,847: INFO     ITSxpress ran in 00:00:05

There are no sequences in the output files...

wc -l test_ITSx_R1.fastq
0 test_ITSx_R1.fastq
wc -l test_ITSx_R2.fastq
0 test_ITSx_R2.fastq

There is not much informative (to me) in the log file aside from `DEBUG No ITS stop or start sites were identified for sequence ...`

Replicating the itsxpress pipeline as I understand it with standalone vsearch and ITSx

vsearch --fastq_mergepairs itsxpress_test/test_1_L001_R1_001.fastq.gz --reverse itsxpress_test/test_1_L001_R2_001.fastq.gz --fastaout test_ITSx_standalone.merged.fasta

output

Merging reads 100%
8912  Pairs
8853  Merged (99.3%)
59  Not merged (0.7%)

Pairs that failed merging due to various reasons:
41  too few kmers found on same diagonal
16  alignment score too low, or score drop to high
2  overlap too short

Statistics of all reads:
226.70  Mean read length

Statistics of merged reads:
333.19  Mean fragment length
9.11  Standard deviation of fragment length
0.17  Mean expected error in forward sequences
0.22  Mean expected error in reverse sequences
0.12  Mean expected error in merged sequences
0.09  Mean observed errors in merged region of forward sequences
0.11  Mean observed errors in merged region of reverse sequences
0.20  Mean observed errors in merged region

Derep with vsearch

vsearch --derep_fulllength test_ITSx_standalone.merged.fasta --output test_ITSx_standalone.merged.derep.fasta

output

2949765 nt in 8853 seqs, min 280, max 426, avg 333
Dereplicating 100%
Sorting 100%
1678 unique sequences, avg cluster 5.3, median 1, max 1315
Writing output file 100%

Run standalone ITSx

ITSx -i test_ITSx_standalone.merged.derep.fasta -o test_ITSx_standalone_derep --preserve T

ITSx runs to completion and finds ITS2 regions for 1678 of the dereplicated reads (i.e. 100%)

grep ">" test_ITSx_standalone_derep.ITS2.fasta | wc -l
1678

Thoughts?

Nicholas_Bokulich · July 23, 2019, 8:25pm

Welcome @ewmorr!

I am not the developer and do not know the answer to this problem — just cc:ing @Adam_Rivers to see if he can shed some light onto this issue.

One issue that has been reported here in the past is that itsxpress does not operate on sequences that are in the reverse or RC orientation (relative to the reference sequences). This results in empty outputs, as you described. Could this possibly be the cause of this disparity?

ewmorr · July 23, 2019, 9:26pm

Hi @Nicholas_Bokulich, thanks for your response!

Whoops, silly me, but that's probably it. Sequences are RC given the sequencing strategy. Will have to see about a fix.

Nicholas_Bokulich · July 23, 2019, 9:28pm

Thanks for confirming! You can use VSEARCH to reverse-complement your reads, then import to QIIME 2 etc.

If you go that route, would you mind confirming on this thread that itsxpress successfully works on the RC seqs? Thanks!

ewmorr · July 24, 2019, 5:52pm

An update on this @Nicholas_Bokulich and @Adam_Rivers . First, a quick note on the sequencing design. For this dataset we used the 5.8S-Fun and ITS4-Fun primers described by Taylor et al. (2016), with forward Illumina adapter in the reverse (ITS4-Fun) primer construct to reduce hairpin formation as described by Taylor. Just wanted to note because this issue may crop up for others if using similar protocols.

Reverse complementing the fastq files should work in theory, but results in a staggered merge (because the reads now start from the middle of the sequence). Both vsearch (with --fastq_allowmergestagger) and BBmerge (in background of itsxpress) then clip the overhangs, which is expected behavior, but in this case means clipping biologically valid sequences, not technical sequences. The remaining fragment size is reduced to median 121 bp (the overlap) in this dataset as opposed to median 333 bp from merging the original, non-RC, reads. There was a 68% ITSx extraction success rate on the vsearch derep'd reads (880 ITS2 seqs extracted from the 1291 input rep seqs), and in addition the ITSx positions file indicates that the 5.8S is severely truncated with only 30-40 bp remaining and 5.8S: No start for all 880 extracted sequences, and, more worrying, the ITS2 is truncated at the 3' end (LSU not found).

Running standalone itsxpress on the RC fastq files had similar results to vsearch+ITSx – no sequences written and several No ITS found errors in the log (though 312 No ITS errors written in this case compared to 8567 from the non-RC reads).


#RC fastq files
mkdir itsxpress_test-RC
vsearch --fastx_revcomp itsxpress_test/test_1_L001_R1_001.fastq.gz \
--fastqout itsxpress_test-RC/test-RC_1_L001_R1_001.fastq
vsearch --fastx_revcomp itsxpress_test/test_1_L001_R2_001.fastq.gz \
--fastqout itsxpress_test-RC/test-RC_1_L001_R2_001.fastq

#run itsxpress on RC
itsxpress --fastq itsxpress_test-RC/test-RC_1_L001_R1_001.fastq \
--fastq2 itsxpress_test-RC/test-RC_1_L001_R2_001.fastq \
--outfile test-RC_ITSx_R1.fastq --outfile2 test-RC_ITSx_R2.fastq \
--log test-RC_ITSx.log --region ALL --taxa 'All'

grep "No ITS" test-RC_ITSx.log | wc -l
312

wc -l test-RC_ITSx_R1.fastq
0 test-RC_ITSx_R1.fastq

wc -l test-RC_ITSx_R2.fastq
0 test-RC_ITSx_R2.fastq

So it seems like reverse complementing the fastq files before ITS extraction (with either method) is not a viable approach, unless we accept the truncated ITS2 (with variable trunc position based on ITS2 length) and low extraction success (presumably biased against longer ITS2).

Further, I understand from here that there may be other downstream problems with reversing the quality scores, e.g., in assumptions of dada2 (which is ultimately my goal for this analysis).

However, thinking about this further – giving the reads to the merging program in reverse order (e.g., vsearch --fastq_mergepairs seqs_R2.fastq.gz --reverse seqs_R1.fastq.gz ... should result in correct orientation (unless I'm being completely dense – possible!), because reverse read is sense strand 5'-3' and fwd is antisense 5'-3'. Trying this with the vsearch+ITSx protocol results in ITS2 sequences extracted in the correct orientation with 100% success. (Yay)

Trying this same stategy of reversing the sequence input order with standalone itsxpress results in fragment size of correct length (median 333) and does not give any No ITS stop or start sites... errors. But... still no sequences in the output files.


itsxpress --fastq itsxpress_test/test_1_L001_R2_001.fastq.gz \
--fastq2 itsxpress_test/test_1_L001_R1_001.fastq.gz \
--outfile test_ITSx_R2.fastq --outfile2 test_ITSx_R1.fastq \
--log test_ITSx_R1_R2_switched.log --region ALL --taxa 'All'

grep "No ITS" test_ITSx_R1_R2_switched.log | wc -l
0

wc -l test_ITSx_R2.fastq
0 test_ITSx_R2.fastq

wc -l test_ITSx_R1.fastq
0 test_ITSx_R1.fastq

The q2-itsxpress plugin similarly appears to produce no sequences using this method based on Cannot describe a DataFrame without columns error from demux summarize.

Just to confirm again that itsxpress is not locating ITS at all with the sequences merged on antisense strand (i.e., in RC orientation):


itsxpress --fastq itsxpress_test/test_1_L001_R1_001.fastq.gz \
--fastq2 itsxpress_test/test_1_L001_R2_001.fastq.gz \
--outfile test_ITSx_R1.fastq --outfile2 test_ITSx_R2.fastq \
--log test_ITSx_correct_read_order.log --region ALL --taxa 'All'

grep "No ITS" test_ITSx_correct_read_order.log | wc -l
8567

So, it looks to me like:

can confirm itsxpress is not handling the antisense sequences;
RC of input fastq files is not a viable solution unless read loss and truncation is acceptable;
itsxpress is finding ITS sequences using the (hacky) workaround of reversing read input order, but is still not writing sequences.

I also tried modifying the fastq sequence headers to indicate the R1 sequences as R2 and vice versa before input to itsxpress, but no luck there either. Thoughts? At this point I think I'm probably missing something silly...

ewmorr · July 24, 2019, 6:03pm

...and also just now seeing the issue regarding RC orientation sequences discussed in the comments here. Sorry for the repeat of the issue @Adam_Rivers, but maybe useful to know potential workarounds don't seem to be functioning...

Nicholas_Bokulich · July 24, 2019, 6:18pm

Thanks for following up @ewmorr!

Right — of course, if the goal here is to denoise with dada2 then RC is not a viable option. I think I was imaging using VSEARCH to RC joined reads, since RC before joining brings up a whole host of problems.

You could just reverse the order of your imports (i.e., import your forward read as reverse and vice versa) when importing to QIIME 2. This would allow you to use dada2 downstream (which requires unjoined reads). But this would replicate the latest problem you reported: where itsxpress finds the correct fragment size but outputs no reads...

But I just want to bring up that importing trick as a more streamlined option for going from raw data to q2-itsxpress, provided we can clear up what is occurring in itsxpress.

@Adam_Rivers please let us know what you make of this when you get a chance — another forum user recently had this same issue with q2-itsxpress when using this same primer pair, so I am very keen to figure out what is going wrong.

Adam_Rivers · July 24, 2019, 7:44pm

@ewmorr Thanks for the detailed description. ITSxpress does not currently support reverse orientation primer sets like those in Taylor et al. 2016. This is documented in a few spots but I should add it to the tutorial as well. Support for them is on the development roadmap however it is a more complicated issue than it appears due to the way that dereplication occurs on merged reads and the desire to retain the correct orientation for Dada2's learning model. I do not recommend doing a simple RC on the reads for the reasons you discuss.

To summarize the ITSxpress process:

step	software	methods
1	bbmerge	merge reads
2	vsearch	the options --cluster_size \| --derep_fulllength are used to create representative sequences for end identification
3	hmmsearch	on representative sequences and all the HMM's from /ITSx ( orientation specific)
4	ITSxpress	Parsing hmmsearch report in Python
5	ITSXpress	apply the coordinates to from hmmsearch to trim the original, unmerged fastq reads

One thing I see in your analysis above is that you selected ALL as the region and you should select --region ITS2 for this primer set. selecting --taxa 'Fungi' will be faster too. that may be why you were not getting results.

Your idea of simple reversing the input strands (or importing them reversed is an interesting one, that we had considered back in December. I would need to think about the parsing of this a bit more because when you reverse the forward and the reverse reads the positions we cut from based on the HMM coordinates change.

I was planning to deal with this when I have time to do a bigger release that combines ITSxpress and q2-itsxpress, but I could play around with it a bit now and see if i can add a -rc flag to change the parsing behavior.

Would you be willing to share some example data for me to work with? Maybe 2 samples with 5000 thousand read pairs?

Nicholas_Bokulich · July 24, 2019, 8:03pm

What about just swapping the forward and reverse reads at importing, as I proposed above — I do not see how this is functionally different from importing reads in the correct orientation, except that (I think) the reads would be forward complement to what standard ITS primers yield:

Is that right? Does itsxpress not support forward complement? So the reads need to be in the correct orientation and on the correct strand?

A quicker test dataset may be to use Taylor's own mock community data, which are used in this tutorial:

Adam_Rivers · July 24, 2019, 8:21pm

Good Idea to use Taylor's mock data. I think switching the inputs as you suggest would work but if I recall from trying that a while back I needed to adjust the way I did the parsing. I don't remember the details, I was going to tackle this before the December-January government shutdown and I never got back to it. I'll give it a try and see.

ewmorr · July 24, 2019, 8:51pm

Hi @Adam_Rivers.

Sure! Happy to if it's still useful. test_AR.qza (1.3 MB) Attached is a qiime2 artifact with two samples, 5K reads each. These are not the files I was testing on FYI (one of the same samples, but subsampled).

Yes, agreed... I think this was left over for troubleshooting and trying to keep most permissive settings.

I was thinking about this earlier and convinced myself that inputting the reads in reverse order should result in "correct" orientation... because the sequencing is still reading 5'-3' on either strand? I think the Taylor PCR primers are still constructed as normal, just that the fwd Illumina adaptor is on the reverse primer (which is complementary to the main strand). BUT inputting the files to qiime in reverse (or standalone itsxpress for that matter) doesn't solve the problem so maybe they are complemented.

For example, below still gives a no columns error when visualized.

gunzip -c itsxpress_test/test_1_L001_R1_001.fastq.gz > itsxpress_test_switched_R1-R2/test_1_L001_R2_001.fastq
gunzip -c itsxpress_test/test_1_L001_R2_001.fastq.gz > itsxpress_test_switched_R1-R2/test_1_L001_R1_001.fastq

sed -i .bak 's/2:N:0:/1:N:0:/g' itsxpress_test_switched_R1-R2/test_1_L001_R1_001.fastq
sed -i .bak 's/1:N:0:/2:N:0:/g' itsxpress_test_switched_R1-R2/test_1_L001_R2_001.fastq
rm itsxpress_test_switched_R1-R2/*.bak

gzip itsxpress_test_switched_R1-R2/test_1_L001_R1_001.fastq
gzip itsxpress_test_switched_R1-R2/test_1_L001_R2_001.fastq

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path itsxpress_test_switched_R1-R2 \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path test.qza

qiime itsxpress trim-pair-output-unmerged --i-per-sample-sequences test.qza --p-region ALL --p-taxa ALL --o-trimmed itsx_trimmed.qza

qiime demux summarize --i-data itsx_trimmed.qza --o-visualization itsx_trimmed.qzv --verbose

ewmorr · July 25, 2019, 2:37pm

One more update on this @Adam_Rivers and @Nicholas_Bokulich (sorry if this is getting spammy).

I tried the reverse read input strategy to qiime import, then itsxpress with the Taylor dataset this morning, and it seems to work. Except the read quality tails so badly in that dataset only ~1300 sequences are extracted.

Trying again with the test data I sent you yesterday, @Adam_Rivers, we do indeed end up with reasonable results with the reverse input strategy. demux-ITS.qzv (292.1 KB) 96% of reads pass (and the fails are probably at merging based on the BBmerge logs). Again, I think this should result in main strand, fwd orientation, so I would think no problem for parsing (shouldn't the tool essentially be blind to this?), but please let me know if you see otherwise.

As for why I was not getting output with this strategy yesterday...

It appears you are correct. Running with ITS2 selected results in output, whereas ALL gives no output. Thanks for pointing that out, but why? I would have expected just reporting ITS1 as missing rather than failing entirely... but now that I know this is fine.

Anyways, thanks for all the pointers on this. I'll work the reverse input reads through some additional analysis in the coming days .

Adam_Rivers · July 25, 2019, 3:11pm

Running with ITS2 selected results in output, whereas ALL gives no output. Thanks for pointing that out, but why? I would have expected just reporting ITS1 as missing rather than failing entirely… but now that I know this is fine.

All selects the whole region (ITS1 the 5.8s and ITS2) not both ITS regions. Each merged read is searched against HMMs for all ITS1 and ITS 2 beginning and end positions, but if we are looking for ITS 1 both the ITS 1 beginning hmm and ITS1 end hmm need to be present. All looks for the ITS1 beginning hmm and the ITS2 end hmm. since the ITS2 end is not present it returns nothing.

Practically speaking, All only works for long read technologies which need to be processed with the single end method. I should probably remove it as an option on paired end analysis.

ewmorr · July 25, 2019, 3:25pm

I see. Thanks for the clarification!

Adam_Rivers · July 31, 2019, 5:14pm

@ewmorr I have added a --reversed-primers option to itsxpress and q2-itsxpress. I still need to write some tests before I version it and release it. If you want to try it out prerelease you can. Create a new Qiime conda environment according to their instructions then run conda install hmmer bbmap vsearch. Install itsxpress and q2-itsxpress with git (git clone https://github.com/USDA-ARS-GBRU/q2_itsxpress.git and git clone https://github.com/USDA-ARS-GBRU/itsxpress.git) and switch to the branch issue_11 for itsxpress and reversed_primers for q2-itsxpress. Finally cd into the q2-itsxpress and itsxpress directories and run pip install -e . in each one.

ewmorr · July 31, 2019, 6:24pm

Awesome, thanks @Adam_Rivers. I'll try this out over the next couple of days.

Adam_Rivers · August 12, 2019, 1:24am

Did you have a chance to try this out?

ewmorr · August 12, 2019, 3:36pm

Hi @Adam_Rivers, yes, sorry I got pulled away on other things, but I did try this.

Seems like it is working fine – using the --reversed_primers flag itsxpress has a near 100% extraction rate on a test sample, with 8580 sequences returned from 8581 merged read pairs. ITSx has similar extraction rate (100%) on merged dereplicated reads on the same sample, but on complementary strand. Also, the results between the built in method and a simple reverse order input are the same, with same numbers of reads extracted across 25 samples.

Thanks for implementing this!

itsxpress does not extract sequences where ITSx does

I'm having a problem with itsxpress qiime2 plugin wherein no sequences are produced. This may be related to the questions here and here .

Run itsxpress plugin

demux summarize gives an error.

There are no reads in the files.

standalone itsxpress

output

...

There are no sequences in the output files...

There is not much informative (to me) in the log file aside from DEBUG No ITS stop or start sites were identified for sequence ...

Replicating the itsxpress pipeline as I understand it with standalone vsearch and ITSx

output

Derep with vsearch

output

Run standalone ITSx

ITSx runs to completion and finds ITS2 regions for 1678 of the dereplicated reads (i.e. 100%)

So it seems like reverse complementing the fastq files before ITS extraction (with either method) is not a viable approach, unless we accept the truncated ITS2 (with variable trunc position based on ITS2 length) and low extraction success (presumably biased against longer ITS2).

So, it looks to me like:

There is not much informative (to me) in the log file aside from `DEBUG No ITS stop or start sites were identified for sequence ...`