ITSxpress performance in different ITS1 amplicons


I am using q2-amplicon-2024.2 and installed ITSxpress as described in the corresponding q2-library document. The versions are:

QIIME 2 Plugin 'q2_itsxpress' version 1.8.1 (from package 'q2-itsxpress' version 1.8.1)
QIIME 2 Plugin 'itsxpress' version 2.0.0 (from package 'itsxpress' version 2.0.0)

I am using qiime itsxpress (not qiime q2-itsxpress, right?) within the conda env on FASTQ datasets obtained with the ITS1F/ITS2 primer pair, or with an alternative primer combination, ITS1-30F/ITS1-217R (Usyk et al. 2017).
Each forward primer is located in the 18S rDNA and each reverse primer in the 5.8S rDNA (see screenshot for yeast). The reads are 300 bp R1 reads running from the forward primers into the ITS1 region (in some taxa they reach the R2 reverse primer).
When I run qiime itsxpress trim-single on these two datasets, the ITS1 region is extracted quite successfully from the ITS1F/ITS2 amplicons (80% of reads in the output with --p-cluster-id 0.995); however, most of the reads from the ITS1-30F/ITS1-217R amplicons are filtered out (also with --p-cluster-id 0.995).

The ITSxpress description in q2 states that this tool is 'identifying the start and stop sites using HMMSearch'. Is it important that both the start AND the stop site of, e.g., ITS1 are identifiable in the amplicon reads? The ITS1-30F site lies over 100 bp upstream (towards the 5' end) of ITS1F, so a large proportion of the 300 bp R1 reads will not reach the end of ITS1 (in contrast to the ITS1F/ITS2 300 bp R1 reads).
Could a failure to detect the ITS1 stop site be responsible for itsxpress filtering out most reads in the ITS1-30F/ITS1-217R dataset, but at a much lower rate in the ITS1F/ITS2 dataset?
In other words: Can I use ITSxpress if the reads do not cover the entire ITS1 region?

Thanks for your comments,

@Adam_Rivers, are you able to help out with this question?

@arwqiime ITSxpress does search for both the start and the end of the ITS region, so if the end isn't found, that may be the reason. We can try and confirm this using the log file and dom table output from HMMsearch. This output is only exported in the standalone version of ITSxpress.

You should have itsxpress already installed as a standalone version, so you can try:

itsxpress --fastq single-end.fastq.gz --single_end --region ITS1 --taxa TAXA \
--log logfile.txt --outfile trimmed_reads.fastq.gz --threads 16 --keeptemp --tempdir ./ --cluster_id 0.995

This will give you a domtable file and a logfile. Can you send me both of them? [email protected]

Alternatively, if you're willing to send me maybe 1000 reads of a sample, I can run some tests and give you a more detailed answer.


Did you receive some of my data by direct mail?

Hi @arwqiime,
If you want to send your data privately, please send it via a Direct Message on the forum to @seinarsson.

I used the email provided above. Do you want me to send the data again to @seinarsson?

Hi @arwqiime, sorry for the delay. Have you not gotten my responses to your emails? I think my responses might be getting filtered by one of our email servers. I tried emailing again, but in case that doesn't arrive either, see below:

Can you describe to me why you are using the single-end run with the ITS1-30F primers? If we could just trim the "start" off of the reads, would that give you enough information for accurate taxonomic identification if the ITS region isn't fully covered, and is that what you're looking to do?
The "start" position is found in the temporary files, in domtbl.txt, for all reads. This large file is parsed into a dictionary, with the highest-scoring hit (if it meets the scoring threshold) being chosen as the most accurate "start" position. ITSxpress then does the same for the "end" position. However, if the "end" position isn't in the dictionary because the read doesn't reach the start of the 5.8S gene, it returns a warning and moves on to the next read. So you already have that data, but the package isn't using it.
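
To check how many reads have a "start" hit but no "end" hit, the domtbl.txt file can be tallied directly. The following is only a rough sketch and not ITSxpress's own parser. It assumes the standard HMMER --domtblout column layout (read ID in column 1, model name in column 4, per-domain score in column 14, alignment coordinates in columns 18 and 19). The function names, the min_score threshold and the model names passed to summarize are placeholders made up for illustration; the real model names should be read from column 4 of your own domtbl.txt.

```python
from collections import defaultdict


def best_hits(domtbl_lines, min_score=0.0):
    """Keep the top-scoring hit per (read, model).

    Returns {read_id: {model_name: (score, ali_from, ali_to)}}.
    Column positions assume the standard HMMER --domtblout layout.
    """
    hits = defaultdict(dict)
    for line in domtbl_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        if len(fields) < 22:
            continue  # not a complete domtblout row
        read_id, model = fields[0], fields[3]
        score = float(fields[13])  # per-domain bit score
        if score < min_score:
            continue  # hypothetical threshold, not ITSxpress's value
        ali_from, ali_to = int(fields[17]), int(fields[18])
        previous = hits[read_id].get(model)
        if previous is None or score > previous[0]:
            hits[read_id][model] = (score, ali_from, ali_to)
    return dict(hits)


def summarize(hits, start_model, end_model):
    """Count reads with both anchors, only one of them, or neither."""
    counts = {"both": 0, "start_only": 0, "end_only": 0, "neither": 0}
    for models in hits.values():
        has_start = start_model in models
        has_end = end_model in models
        if has_start and has_end:
            counts["both"] += 1
        elif has_start:
            counts["start_only"] += 1
        elif has_end:
            counts["end_only"] += 1
        else:
            counts["neither"] += 1
    return counts
```

For example, `with open("domtbl.txt") as fh: hits = best_hits(fh)` followed by `summarize(hits, start_model, end_model)`, with the two model names taken from the file. A large "start_only" count in the ITS1-30F/ITS1-217R data would be consistent with reads ending before the 5.8S gene.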
